CN104391941A - Method for rapidly establishing full-text retrieval tool for common files - Google Patents

Method for rapidly establishing full-text retrieval tool for common files

Info

Publication number
CN104391941A
Authority
CN
China
Prior art keywords
full
module
text
retrieval
index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410684418.9A
Other languages
Chinese (zh)
Inventor
刘粉粉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2014-11-25
Filing date
2014-11-25
Publication date
2015-03-04
Application filed by Inspur Electronic Information Industry Co Ltd
Priority to CN201410684418.9A
Publication of CN104391941A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/83 Querying
    • G06F16/835 Query processing
    • G06F16/8358 Query translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/83 Querying
    • G06F16/838 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/84 Mapping; Conversion

Abstract

The invention discloses a method for rapidly building a full-text retrieval tool for common files, belonging to the field of retrieval tools. The method comprises the following steps: (1) a document parsing module reads and parses all files, composes HTTP requests from the parsed results, and sends the requests to a Chinese word segmentation module; (2) the Chinese word segmentation module segments the attribute content in the received HTTP requests; (3) a full-text index building module customizes the index service type; (4) after parsing a retrieval command, a retrieval module performs the corresponding operation, completing the construction of the retrieval tool; (5) after a user submits search terms, the retrieval module segments the search terms, generates a query request, queries the index library, and presents the results to the user. With this method, individuals and enterprises can build their own dedicated search engines, meet their own retrieval needs with relatively little time and effort, and easily manage large numbers of internal files.

Description

A method for rapidly building a full-text retrieval tool for common files
Technical field
The present invention discloses a method for rapidly building a retrieval tool and belongs to the field of retrieval tools; specifically, it is a method for rapidly building a full-text retrieval tool for common files.
Background
Full-text retrieval finds arbitrary content within the full text of stored books or articles. It can return the relevant chapters, paragraphs, sentences, and words from the full text as needed, which is analogous to attaching a label to every word of a book, and it also supports various kinds of statistics and analysis. Solr is a standalone enterprise-level search application server that exposes an API similar to a web service. Users can submit XML files in a specified format to the search server through HTTP requests to generate an index, and can issue search requests through HTTP GET operations and receive the results in XML format.
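The two interactions described above, submitting an XML document for indexing and issuing an HTTP GET search, can be sketched roughly as follows. This is a minimal illustration that only builds the request payload and URL without contacting a server; the host, the core name `files`, and the field names `id` and `content` are assumptions for the example, not details from the patent.

```python
from urllib.parse import urlencode
from xml.etree import ElementTree as ET

# Hypothetical Solr location and core name, used only for illustration.
SOLR_BASE = "http://localhost:8983/solr/files"


def build_add_xml(doc_id: str, content: str) -> str:
    """Build a Solr XML update message of the form <add><doc><field .../></doc></add>."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # Field names "id" and "content" are assumed; a real schema defines its own.
    for name, value in (("id", doc_id), ("content", content)):
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


def build_select_url(query: str) -> str:
    """Build an HTTP GET search URL that asks for XML-formatted results."""
    return f"{SOLR_BASE}/select?" + urlencode({"q": query, "wt": "xml"})


xml_body = build_add_xml("/data/docs/report.txt", "全文检索 full-text retrieval")
select_url = build_select_url("content:检索")
print(xml_body)
print(select_url)
```

In a real deployment the XML body would be posted to the core's `/update` handler and the URL fetched with an HTTP client; both steps are left out here so that the sketch runs without a Solr instance.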
Many users' search needs are still served at the database level, but database performance becomes a limitation when the search workload is very heavy. Searching the contents of a large number of files is something a database can hardly do, or can do only with considerable difficulty. Choosing a mature open-source search engine as the core and building a usable retrieval tool on top of it is a good option, but building a practical text retrieval tool is complicated, and there is no unified, simple construction method. The present invention provides a method for rapidly building a full-text retrieval tool for common files. The tool is based on the open-source search engine Solr: files are stored in the search engine, a full-text index is built over them, and all related content can be retrieved quickly by search keyword and presented to the user. With this method, individuals and enterprises can build their own dedicated search engines, meet their own search needs with relatively little time and effort, and easily manage large numbers of internal files.
Summary of the invention
To address the deficiencies and problems of the prior art, the present invention provides a method for rapidly building a full-text retrieval tool for common files. It is suitable for individuals who want a tool for quickly retrieving the various files they have accumulated over time, and is even better suited to enterprises that need to manage large numbers of internal files and quickly find the files they need.
The specific scheme proposed by the method of the present invention is as follows:
A system for rapidly building a full-text retrieval tool for common files, implemented on Solr, comprises a document parsing module, a Chinese word segmentation module, a full-text index building module, a full-text index library, and a retrieval module;
The document parsing module is responsible for parsing files;
The Chinese word segmentation module is responsible for using a Chinese word segmentation algorithm to segment the full text of the file content so that a full-text index can be built;
The full-text index building module is responsible for building a full-text index over the words produced by the Chinese word segmentation module;
The full-text index library is responsible for data storage;
The retrieval module is responsible for carrying out the user's various retrieval operations.
A method for rapidly building a full-text retrieval tool for common files, implemented on Solr, has the following specific steps:
(1) The document parsing module reads and parses all files and converts them into XML format, parsing each file into two attributes, and composes an HTTP request that is sent to the Chinese word segmentation module;
(2) The Chinese word segmentation module segments the attribute content in the received HTTP request; after all attributes have been segmented, the full-text index building module builds the index; the segmentation algorithm is set in a configuration file;
(3) The full-text index building module customizes the index service type, specifies in the configuration file which fields are to be stored and which are to be preserved, and then stores all built indexes and data in the full-text index library;
(4) After parsing a retrieval command, the retrieval module obtains the index from the full-text index library and performs the corresponding retrieval, deletion, or index modification operation, completing the construction of the retrieval tool;
(5) After query terms are submitted, the retrieval module processes them (for example, by word segmentation), generates a query request, queries the index library, and presents the results to the user.
In step (1), the two attributes into which each file is parsed are the file name and the entire content of the file, where the file name includes the absolute path at which the file is stored.
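The two-attribute record described above can be sketched as follows. This is a minimal illustration for plain-text files only; the function name `parse_document` and the dictionary keys `filename` and `content` are hypothetical, and Word or PDF input would need a text extraction library that is not shown here.

```python
from pathlib import Path
import tempfile


def parse_document(path: str) -> dict:
    """Return the two attributes: the file name (as an absolute path) and the full content."""
    p = Path(path).resolve()
    # Only UTF-8 plain text is handled in this sketch.
    return {"filename": str(p), "content": p.read_text(encoding="utf-8")}


# Create a throwaway sample file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "notes.txt"
    sample.write_text("全文检索工具", encoding="utf-8")
    record = parse_document(str(sample))

print(record)
```

Keeping the absolute path in the file name attribute means a search hit can be traced straight back to the file on disk.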
In step (2), the full-text index building module builds an index with an inverted data structure (an inverted index).
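An inverted index maps each term to the set of files that contain it, so a keyword lookup does not have to scan every file. The sketch below is an in-memory illustration of that idea, not Solr's actual on-disk structure; the file paths and the pre-segmented terms are made up for the example.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map every term to the set of file names whose segmented content contains it."""
    index = defaultdict(set)
    for filename, terms in docs.items():
        for term in terms:
            index[term].add(filename)
    return dict(index)


# Hypothetical output of the word segmentation step for two files.
segmented = {
    "/docs/a.txt": ["全文", "检索", "工具"],
    "/docs/b.txt": ["检索", "模块"],
}
index = build_inverted_index(segmented)
print(index["检索"])
```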
In step (4), after parsing the retrieval command, the retrieval module can also sort the retrieval results, highlight keywords, and apply weights to search keywords.
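The three extras named above (result sorting, keyword highlighting, keyword weighting) can be illustrated with a toy scoring scheme. This is a sketch under simple assumptions and not how Solr computes relevance: the score is the sum of the weights of every keyword occurrence, and highlighting wraps each keyword in `<em>` tags. All names, weights, and data are hypothetical.

```python
def score(terms_in_doc: list[str], weights: dict[str, float]) -> float:
    """Sum the weight of every query keyword occurrence in the document."""
    return sum(weights.get(term, 0.0) for term in terms_in_doc)


def highlight(text: str, keywords: list[str]) -> str:
    """Wrap each keyword occurrence in <em> tags."""
    for kw in keywords:
        text = text.replace(kw, f"<em>{kw}</em>")
    return text


docs = {
    "/docs/a.txt": ["全文", "检索", "工具"],
    "/docs/b.txt": ["检索", "模块", "检索"],
}
# Assumed keyword weights: a match on "工具" counts twice as much as one on "检索".
weights = {"检索": 1.0, "工具": 2.0}

# Sort file names by descending score.
ranked = sorted(docs, key=lambda name: score(docs[name], weights), reverse=True)
snippet = highlight("全文检索工具", list(weights))
print(ranked)
print(snippet)
```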
The common files are in Word, PDF, or TXT format.
The beneficial effects of the present invention are as follows: the common-file retrieval tool is based on the open-source search engine Solr; files are stored in the search engine and a full-text index is built over them, so that all related content can be retrieved quickly by search keyword and presented to the user. With this method, individuals and enterprises can build their own dedicated search engines, meet their own search needs with relatively little time and effort, and easily manage large numbers of internal files.
Brief description of the drawings
Fig. 1 is a schematic flow diagram of the method for rapidly building a full-text retrieval tool for common files.
Detailed description of the embodiments
The present invention is further described below with reference to the accompanying drawing:
Embodiment 1
Based on the search engine Solr, a system for rapidly building a full-text retrieval tool for common files is constructed, comprising a document parsing module, a Chinese word segmentation module, a full-text index building module, a full-text index library, and a retrieval module; the Chinese word segmentation module, the full-text index building module, the full-text index library, and the retrieval module all run on the search engine Solr;
The document parsing module is responsible for parsing files;
The Chinese word segmentation module is responsible for using a Chinese word segmentation algorithm to segment the full text of the file content so that a full-text index can be built;
The full-text index building module is responsible for building a full-text index over the words produced by the Chinese word segmentation module;
The full-text index library is responsible for data storage;
The retrieval module is responsible for carrying out the user's various retrieval operations.
A method for rapidly building a full-text retrieval tool for common files has the following specific steps:
(1) The document parsing module reads and parses Word documents and converts them into XML format, parsing each document into two attributes, namely the file name and the entire content of the file, where the file name includes the absolute path at which the file is stored, and composes an HTTP request that is sent to the Chinese word segmentation module;
(2) The Chinese word segmentation module segments the attribute content in the received HTTP request; after all attributes have been segmented, the full-text index building module builds an inverted index; the segmentation algorithm is set in a configuration file;
(3) The full-text index building module customizes the index service type, specifies in the configuration file which fields are to be stored and which are to be preserved, and then stores all built indexes and data in the full-text index library;
(4) After parsing a retrieval command, the retrieval module obtains the index from the full-text index library and performs the corresponding retrieval, deletion, or index modification operation, completing the construction of the retrieval tool;
(5) After query terms are submitted, the retrieval module processes them (for example, by word segmentation), generates a query request, queries the index library, and presents the results to the user.
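The patent does not name a particular segmentation algorithm and states only that it is set in a configuration file. As a purely illustrative stand-in, the sketch below uses forward maximum matching, a simple dictionary-based technique; the vocabulary, the `max_len` limit, and the function name are assumptions made for the example.

```python
def forward_max_match(text: str, vocab: set[str], max_len: int = 4) -> list[str]:
    """Greedy dictionary segmentation: at each position take the longest known word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Tiny hypothetical dictionary.
vocab = {"全文", "检索", "工具", "全文检索"}
tokens = forward_max_match("全文检索工具", vocab)
print(tokens)
```

Because the longest match wins, "全文检索" is kept as one token here rather than being split into "全文" and "检索"; a different algorithm or dictionary could reasonably choose otherwise.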
Embodiment 2
Based on the search engine Solr, a system for rapidly building a full-text retrieval tool for common files is constructed, comprising a document parsing module, a Chinese word segmentation module, a full-text index building module, a full-text index library, and a retrieval module; the Chinese word segmentation module, the full-text index building module, the full-text index library, and the retrieval module all run on the search engine Solr;
The document parsing module is responsible for parsing files;
The Chinese word segmentation module is responsible for using a Chinese word segmentation algorithm to segment the full text of the file content so that a full-text index can be built;
The full-text index building module is responsible for building a full-text index over the words produced by the Chinese word segmentation module;
The full-text index library is responsible for data storage;
The retrieval module is responsible for carrying out the user's various retrieval operations.
A method for rapidly building a full-text retrieval tool for common files has the following specific steps:
(1) The document parsing module reads and parses PDF documents and converts them into XML format, parsing each document into two attributes, namely the file name and the entire content of the file, where the file name includes the absolute path at which the file is stored, and composes an HTTP request that is sent to the Chinese word segmentation module;
(2) The Chinese word segmentation module segments the attribute content in the received HTTP request; after all attributes have been segmented, the full-text index building module builds an inverted index; the segmentation algorithm is set in a configuration file;
(3) The full-text index building module customizes the index service type, specifies in the configuration file which fields are to be stored and which are to be preserved, and then stores all built indexes and data in the full-text index library;
(4) After parsing a retrieval command, the retrieval module obtains the index from the full-text index library and performs the corresponding retrieval, deletion, or index modification operation; it can also sort the retrieval results, highlight keywords, and apply weights to search keywords, completing the construction of the retrieval tool;
(5) After query terms are submitted, the retrieval module processes them (for example, by word segmentation), generates a query request, queries the index library, and presents the results to the user.
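On a Solr-backed tool, the sorting, highlighting, and weighting mentioned in step (4) are typically requested through query parameters. The sketch below builds such a query URL without sending it. The parameters `defType`, `qf`, `sort`, `hl`, `hl.fl`, and `wt` are standard Solr request parameters, while the host, the core name `files`, the field names, and the boost value are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical Solr search endpoint.
SOLR_SELECT = "http://localhost:8983/solr/files/select"


def build_search_url(user_query: str) -> str:
    """Build a search URL that asks Solr to weight, sort, and highlight results."""
    params = {
        "q": user_query,
        "defType": "edismax",        # query parser that supports per-field boosts
        "qf": "filename^2 content",  # assumed fields; file name matches count double
        "sort": "score desc",        # most relevant results first
        "hl": "true",                # enable keyword highlighting
        "hl.fl": "content",          # highlight matches in the content field
        "wt": "xml",                 # XML-formatted response
    }
    return SOLR_SELECT + "?" + urlencode(params)


url = build_search_url("全文检索")
parsed = parse_qs(urlsplit(url).query)
print(url)
print(parsed["qf"])
```

Segmentation of the query terms themselves would be carried out by the analyzer configured for the queried fields, so it does not appear in the URL.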

Claims (6)

1. A system for rapidly building a full-text retrieval tool for common files, implemented on Solr, characterized in that it comprises a document parsing module, a Chinese word segmentation module, a full-text index building module, a full-text index library, and a retrieval module;
The document parsing module is responsible for parsing files;
The Chinese word segmentation module is responsible for using a Chinese word segmentation algorithm to segment the full text of the file content so that a full-text index can be built;
The full-text index building module is responsible for building a full-text index over the words produced by the Chinese word segmentation module;
The full-text index library is responsible for data storage;
The retrieval module is responsible for carrying out the user's various retrieval operations.
2. A method for rapidly building a full-text retrieval tool for common files, using the system for rapidly building a full-text retrieval tool for common files according to claim 1, characterized in that the specific steps are:
(1) The document parsing module reads and parses all files and converts them into XML format, parsing each file into two attributes, and composes an HTTP request that is sent to the Chinese word segmentation module;
(2) The Chinese word segmentation module segments the attribute content in the received HTTP request; after all attributes have been segmented, the full-text index building module builds the index; the segmentation algorithm is set in a configuration file;
(3) The full-text index building module customizes the index service type, specifies in the configuration file which fields are to be stored and which are to be preserved, and then stores all built indexes and data in the full-text index library;
(4) After parsing a retrieval command, the retrieval module obtains the index from the full-text index library and performs the corresponding retrieval, deletion, or index modification operation, completing the construction of the retrieval tool;
(5) After query terms are submitted, the retrieval module processes them (for example, by word segmentation), generates a query request, queries the index library, and presents the results to the user.
3. The method for rapidly building a full-text retrieval tool for common files according to claim 2, characterized in that in step (1) the two attributes into which each file is parsed are the file name and the entire content of the file, where the file name includes the absolute path at which the file is stored.
4. The method for rapidly building a full-text retrieval tool for common files according to claim 2 or 3, characterized in that in step (2) the full-text index building module builds an index with an inverted data structure.
5. The method for rapidly building a full-text retrieval tool for common files according to claim 4, characterized in that in step (4), after parsing the retrieval command, the retrieval module can also sort the retrieval results, highlight keywords, and apply weights to search keywords.
6. The method for rapidly building a full-text retrieval tool for common files according to any one of claims 2, 3, or 5, characterized in that the common files are in Word, PDF, or TXT format.
CN201410684418.9A 2014-11-25 2014-11-25 Method for rapidly establishing full-text retrieval tool for common files Pending CN104391941A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410684418.9A CN104391941A (en) 2014-11-25 2014-11-25 Method for rapidly establishing full-text retrieval tool for common files

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410684418.9A CN104391941A (en) 2014-11-25 2014-11-25 Method for rapidly establishing full-text retrieval tool for common files

Publications (1)

Publication Number Publication Date
CN104391941A true CN104391941A (en) 2015-03-04

Family

ID=52609845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410684418.9A Pending CN104391941A (en) 2014-11-25 2014-11-25 Method for rapidly establishing full-text retrieval tool for common files

Country Status (1)

Country Link
CN (1) CN104391941A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021625A (en) * 2016-07-26 2016-10-12 浪潮软件集团有限公司 Mixed application method of two word segmenters based on SOLR search engine
CN106649529A (en) * 2016-10-21 2017-05-10 天津海量信息技术股份有限公司 Full-text retrieval method applied during transmission through HTTP protocol
CN106649800A (en) * 2016-12-29 2017-05-10 南威软件股份有限公司 Solr-based Chinese search method
CN106844700A (en) * 2017-02-03 2017-06-13 山东浪潮商用系统有限公司 It is a kind of to ask tax system based on Sorl
CN106951419A (en) * 2016-01-06 2017-07-14 北京仿真中心 A kind of isomery manufacturing service of facing cloud manufacture finds system and method
CN108255972A (en) * 2017-12-27 2018-07-06 浪潮通用软件有限公司 A kind of text searching method and system
WO2020097997A1 (en) * 2018-11-14 2020-05-22 山东大学 Search result display method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103488702A (en) * 2013-09-06 2014-01-01 云南电力试验研究院(集团)有限公司电力研究院 SorlCloud based unstructured data retrieval method and system
CN103729463A (en) * 2014-01-14 2014-04-16 赛特斯信息科技股份有限公司 Method for implementing full-text retrieval based on Lucene and Solr
CN103778202A (en) * 2014-01-10 2014-05-07 江苏哲勤科技有限公司 Enterprise electronic document managing server side and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150304