CN106294695A

CN106294695A - Implementation method for a real-time big data search engine

Info

Publication number: CN106294695A
Application number: CN201610640922.8A
Authority: CN
Inventors: 张剑
Original assignee: Net Peace Computer Security Detection Technique Co Ltd Of Shenzhen
Current assignee: Net Peace Computer Security Detection Technique Co Ltd Of Shenzhen
Priority date: 2016-08-08
Filing date: 2016-08-08
Publication date: 2017-01-04

Abstract

The invention discloses an implementation method for a real-time big data search engine and relates to the technical field of search engines. A ROSE search engine system is built on HTTP and Apache Lucene, and an index is created for it. Once the index has been created, the user can retrieve file information by entering query conditions: the query is first put through text analysis, the index database is then searched, and the results are finally returned to the user. The method provides full-text search over real-time streaming data and completes computing tasks jointly with a distributed system, making full use of the cluster's high-speed computation and storage and improving the response speed of data analysis and processing.

Description

Implementation method for a real-time big data search engine

Technical field

The present invention relates to the technical field of search engines, and in particular to an implementation method for a real-time big data search engine.

Background technology

Many web applications involve the analysis and processing of massive data. Structured data is generally stored in a database, while unstructured data is stored as files, or the two forms of storage are mixed. When a database or file system holds data at the TB scale or larger, analysis and processing become very slow and the response speed cannot meet user demand.

Traditional network application architectures mainly follow the C/S model (or B/S), where S stands for Server, B for Browser, and C for Client. The two differ only in whether the main business logic is placed on the client or on the server. As shown in Fig. 1, taking the C/S model as an example, data produced by the user's interaction with the client UI is generally submitted over the network to the server for business processing, and the processed business data is stored in a database or file system for later use, such as data queries, statistics, and data mining. With big data (usually meaning data volumes at the TB level), the analysis and processing bottlenecks of this architecture are concentrated mainly in database and file system I/O, memory, and CPU capacity, so the system responds too slowly or not at all. The architecture is also not scalable: adding storage and computing resources does not improve its performance.

Apache Hadoop is a distributed computing software framework implemented in Java. It runs distributed computations over massive data on clusters made up of large numbers of computers, and allows applications to scale to thousands of nodes and PB-level data. It mainly addresses the problem of data volume and has advantages in storing large amounts of data and in simple computation. It is suited to batch processing of massive data files, but not to scenarios with strict real-time requirements or to scenarios in which users frequently operate on and modify data.

Summary of the invention

The technical problem to be solved by the present invention is to provide an implementation method for a real-time big data search engine. The method provides full-text search over real-time streaming data and completes computing tasks jointly with a distributed system, making full use of the cluster's high-speed computation and storage and improving the response speed of data analysis and processing. It achieves scalable analysis and processing of real-time big data: data produced by the system does not need to be stored first, but can be processed directly in real time and reflected in the response.

To solve the above technical problem, the technical solution adopted by the present invention is an implementation method for a real-time big data search engine, comprising the following steps:

1) Build the ROSE search engine system on HTTP and Apache Lucene;

2) Create the index of the ROSE search engine system: extract information from documents of various formats and from database data, select a text analyzer according to the file type to perform text analysis, create the index, and generate the index database;

3) Once the index has been created, the user can retrieve file information by entering query conditions. When the user enters a query condition, text analysis is performed first, the index database is then searched, and the results are finally returned to the user.

In a further optimized technical solution, creating the index in step 2) comprises the following steps:

A. Specify the directory in which the index is to be created;

B. Create a Directory object;

C. Create the index-writing object IndexWriter;

D. Obtain the array of source File objects to determine the content to be indexed;

E. Loop over the files and write each one into the index: first create a Document object and Field objects, which correspond respectively to a row in a database table and the column attributes of that row; then add the Fields to the Document; finally have the IndexWriter call addDocument to write the document into the index database;

F. Close the index-writing object IndexWriter.
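The index-creation steps can be illustrated with a minimal Python sketch. It is a toy in-memory model, not the Lucene API: the class `ToyIndexWriter`, its method names, and the sample file names are hypothetical, and the "directory" of steps A and B is reduced to plain Python objects.

```python
from collections import defaultdict


class ToyIndexWriter:
    """Hypothetical stand-in for the IndexWriter role; not the Lucene API."""

    def __init__(self):
        self.documents = []                # stored documents, addressed by position
        self.postings = defaultdict(set)   # term -> set of document ids
        self.closed = False

    def add_document(self, fields):
        """Store a document (dict of field name -> text) and index its terms."""
        if self.closed:
            raise RuntimeError("writer is closed")
        doc_id = len(self.documents)
        self.documents.append(dict(fields))
        for value in fields.values():
            for term in value.lower().split():
                self.postings[term].add(doc_id)
        return doc_id

    def close(self):
        self.closed = True


# Step D: the "source files", modelled here as (name, content) pairs.
sources = [
    ("a.txt", "real time search engine"),
    ("b.txt", "distributed search cluster"),
]

writer = ToyIndexWriter()                  # steps A to C collapsed into one object
for name, content in sources:              # step E: one document per source file
    writer.add_document({"name": name, "content": content})
writer.close()                             # step F
```

Each dictionary passed to `add_document` plays the part of a Document, and each key/value pair the part of a Field, mirroring the row and column analogy in step E.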

In a further optimized technical solution, the retrieval in step 2) comprises the following steps:

A. Create the index-reading object IndexReader;

B. Create the search object IndexSearcher;

C. Create the lexical analysis object Analyzer;

D. Create the syntax analysis object QueryParser;

E. The QueryParser invokes the parser to perform syntax analysis, generating a query syntax tree that is placed in a Query;

F. The IndexSearcher calls the search method to search with the query syntax tree Query, obtaining the result set TopDocs;

G. Obtain the corresponding ScoreDoc entries from the TopDocs;

H. Obtain the corresponding Document from each ScoreDoc;

I. Obtain the corresponding Field attributes from the Document.
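The retrieval steps can be sketched in the same spirit. The Python below is a toy model under stated assumptions, not the Lucene API: the hand-built `postings` table, the functions `analyze`, `parse_query`, and `search`, and the scoring rule (summed term frequency with OR semantics) are all illustrative choices.

```python
# A tiny hand-built index: term -> {doc_id: term frequency}.
postings = {
    "real":   {0: 1},
    "time":   {0: 1},
    "search": {0: 1, 1: 2},
    "engine": {1: 1},
}
documents = [
    {"name": "a.txt", "content": "real time search"},
    {"name": "b.txt", "content": "search engine search"},
]


def analyze(text):
    """Step C stand-in: lowercase and split on whitespace."""
    return text.lower().split()


def parse_query(text):
    """Steps D and E stand-in: the 'query tree' is a flat list of terms."""
    return analyze(text)


def search(query_terms, top_n):
    """Step F stand-in: returns (doc_id, score) pairs, best score first."""
    scores = {}
    for term in query_terms:
        for doc_id, tf in postings.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0) + tf
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_n]


top_docs = search(parse_query("Search Engine"), top_n=10)   # steps E and F
hits = [(documents[doc_id]["name"], score)                   # steps G to I
        for doc_id, score in top_docs]
```

Here `top_docs` corresponds loosely to TopDocs, each `(doc_id, score)` pair to a ScoreDoc, and the final lookup of the `name` field to steps H and I.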

In a further optimized technical solution, the ROSE search engine system provides a standard HTTP interface for adding, deleting, modifying, and querying the index of the data.

In a further optimized technical solution, the ROSE search engine system can quickly set up a cluster through Zookeeper. A hash operation is performed on the ID value of the current index record; the cluster state maintained on the server is then used to find which Range the hash value falls into, and thereby the corresponding shard. The Leader of that shard builds the index, and once the Leader node has finished updating, the version number and the document are forwarded to the replica nodes belonging to the same shard.
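The hash-range routing described here can be sketched as follows. This is a minimal illustration under assumptions: the source does not name the hash function, so CRC32 is used as a placeholder, and the two-shard `SHARD_RANGES` table, the dictionary-based leader and replica stores, and the per-record version counter are hypothetical.

```python
import zlib

# Hypothetical cluster state: each shard owns a contiguous hash range [low, high].
SHARD_RANGES = [
    ("shard1", 0x00000000, 0x7FFFFFFF),
    ("shard2", 0x80000000, 0xFFFFFFFF),
]


def hash_id(record_id):
    """Hash a record ID to an unsigned 32-bit value (CRC32 is an assumption)."""
    return zlib.crc32(record_id.encode("utf-8")) & 0xFFFFFFFF


def find_shard(record_id):
    """Return the name of the shard whose range contains the ID's hash."""
    h = hash_id(record_id)
    for name, low, high in SHARD_RANGES:
        if low <= h <= high:
            return name
    raise LookupError("no shard covers hash %#x" % h)


def index_document(record_id, document, shards):
    """The leader indexes first, then forwards version and document to replicas."""
    shard = shards[find_shard(record_id)]
    version = shard["leader"].get(record_id, (0, None))[0] + 1
    shard["leader"][record_id] = (version, document)
    for replica in shard["replicas"]:
        replica[record_id] = (version, document)
    return version
```

Because the ranges partition the whole 32-bit hash space, every record ID maps to exactly one shard, and adding a shard amounts to splitting one of the ranges in the cluster state.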

The beneficial effects of the above technical solution are as follows. The present invention has the following advantages:

(1) Support for full-text search of real-time streaming data

ROSE is implemented mainly on HTTP and Apache Lucene and provides full-text search over real-time streaming data. Fields that have just changed in the database can be queried: for example, when one or more records are inserted into a table in the database, ROSE can index the newly inserted data in real time. It also allows the latest version of any document to be looked up by its unique key without reopening the searcher.

(2) Support for analysis and processing of real-time streaming data

ROSE supports not only full-text search of real-time streaming data but also analysis and processing of the data found. While searching for keywords, ROSE can group and count results by Facet fields. This does not modify the query results themselves; it only adds per-category count information to them, which the user can then use to refine the query.
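Facet counting of this kind can be illustrated with a short Python sketch. The function `search_with_facets`, the document layout, and the `category` field are hypothetical; the point is only that the hit list is returned unchanged while counts per facet value are computed alongside it.

```python
from collections import Counter


def search_with_facets(documents, keyword, facet_field):
    """Return matching documents unchanged, plus counts per facet value."""
    hits = [doc for doc in documents if keyword in doc["text"].lower().split()]
    facet_counts = Counter(doc[facet_field] for doc in hits)
    return hits, dict(facet_counts)


docs = [
    {"id": 1, "text": "search engine design", "category": "software"},
    {"id": 2, "text": "search cluster hardware", "category": "hardware"},
    {"id": 3, "text": "real time search", "category": "software"},
    {"id": 4, "text": "storage array", "category": "hardware"},
]

hits, counts = search_with_facets(docs, "search", "category")
```

A follow-up query could then be narrowed to one category by filtering `hits` on the facet value the user selects.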

(3) Extensible plug-in system for full-text search

ROSE can implement specific functions by integrating plug-ins. For example, IKAnalyzer and mmseg4j provide Chinese word segmentation for full-text search, and solr_pager can be integrated to paginate search results. The extensible plug-in system makes ROSE faster and more convenient to use.

Brief description of the drawings

Fig. 1 is an architecture diagram of a traditional network application system;

Fig. 2 is a structure diagram of the ROSE search engine system of the present invention;

Fig. 3 is an architectural framework diagram of the ROSE search engine system of the present invention;

Fig. 4 is a diagram of the index creation and search process of the present invention.

Detailed description of the invention

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

As shown in Fig. 2, the invention discloses an implementation method for a real-time big data search engine, comprising the following steps:

1) Build the ROSE (Real-time OceanData Search Engine) search engine system on HTTP and Apache Lucene;

Creating the index comprises the following steps (see Fig. 3 and Fig. 4):

A. Specify the directory in which the index is to be created;

B. Create a Directory object;

C. Create the index-writing object IndexWriter;

D. Obtain the array of source File objects to determine the content to be indexed;

F. Close the index-writing object IndexWriter.

Once the index has been created, the user can search the index files by entering query conditions. Retrieval comprises the following steps:

A. Create the index-reading object IndexReader;

B. Create the search object IndexSearcher;

C. Create the lexical analysis object Analyzer;

D. Create the syntax analysis object QueryParser;

G. Obtain the corresponding ScoreDoc entries from the TopDocs;

H. Obtain the corresponding Document from each ScoreDoc;

I. Obtain the corresponding Field attributes from the Document.

Lucene itself is made up of seven package modules: analysis, document, index, QueryParser, search, store, and util. The packages work together, and each has its own specific function. The analysis module is mainly responsible for language processing and lexical analysis; it includes the analyzers that ship with Lucene by default, such as StopAnalyzer, which filters out stop words, the commonly used StandardAnalyzer, and WhitespaceAnalyzer, which splits text on whitespace. The document module is mainly used to manage document structure, which is comparable to the table structure of a relational database; a document can contain multiple information fields (Field) and is analogous to a row in a relational table. The index module is mainly responsible for index management, including creating, deleting, reading, writing, merging, and optimizing indexes. The store module is mainly responsible for reading, writing, and storing the index. QueryParser is mainly responsible for syntax analysis and is used to parse and execute query statements. The search module is mainly responsible for searching, finding the result set in the index files according to the query conditions. The util module is a toolkit, a collection of common utility classes and methods.
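The difference between whitespace splitting and stop-word filtering can be shown with a small Python sketch. The two functions are only rough models of the behaviour described for WhitespaceAnalyzer and StopAnalyzer, not Lucene code, and the `STOP_WORDS` set is an illustrative list rather than Lucene's default.

```python
STOP_WORDS = {"a", "an", "the", "of", "and"}   # illustrative list, not Lucene's


def whitespace_analyze(text):
    """Split on whitespace only; case and punctuation are left untouched."""
    return text.split()


def stop_analyze(text):
    """Lowercase, split on non-letters, and drop stop words."""
    tokens, current = [], []
    for ch in text.lower():
        if ch.isalpha():
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return [t for t in tokens if t not in STOP_WORDS]
```

The same analyzer has to be applied at indexing time and at query time; otherwise a query term such as `Lucene` would not match an index entry stored as `lucene`.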

In a further optimized embodiment, the ROSE search engine system provides a standard HTTP interface for adding, deleting, modifying, and querying the index of the data. In ROSE, the user starts indexing and searching by sending HTTP requests to the ROSE web application deployed in a servlet container. ROSE accepts the request, determines the appropriate ROSE RequestHandler to use, and processes the request. The response is returned over HTTP in the same way; the default configuration returns ROSE's standard XML response, and alternative response formats can also be configured.

Four different index requests can be sent to the ROSE index servlet:

add/update: adds documents to ROSE or updates existing documents. The additions and updates cannot be searched until they have been committed.

commit: tells ROSE that all changes made since the last commit should be made searchable.

optimize: restructures the Lucene files to improve search performance. It is usually best to optimize after indexing has finished. If updates are frequent, optimization should be scheduled for periods of low usage. An index also works normally without being optimized. Optimization is a time-consuming process.

delete: can be specified by id or by query. Delete by id removes the document with the specified id; delete by query removes all documents returned by the query.
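The visibility rule behind add/update, commit, and delete can be modelled in a few lines of Python. `ToyIndex` and its methods are hypothetical and do not represent the ROSE interface; the sketch only shows changes being buffered until a commit makes them searchable.

```python
class ToyIndex:
    """Hypothetical model of add/commit/delete visibility; not the ROSE API."""

    def __init__(self):
        self.searchable = {}   # documents visible to searches
        self.pending = {}      # id -> text, or None for a pending delete

    def add(self, doc_id, text):
        self.pending[doc_id] = text        # add or update, not yet searchable

    def delete_by_id(self, doc_id):
        self.pending[doc_id] = None        # takes effect at the next commit

    def commit(self):
        for doc_id, text in self.pending.items():
            if text is None:
                self.searchable.pop(doc_id, None)
            else:
                self.searchable[doc_id] = text
        self.pending.clear()

    def search(self, term):
        return sorted(doc_id for doc_id, text in self.searchable.items()
                      if term in text.split())
```

For example, after `add("1", "real time")` a search for `real` returns nothing until `commit()` is called.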

To add a document to the index, it is only necessary to call the search interface and submit an XML message via HTTP POST.
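A minimal Python sketch of preparing such a request is given below. The source does not specify the message schema or the endpoint, so both are assumptions: the `<add><doc><field name="...">` layout is modelled on Solr-style update messages, and the URL `http://localhost:8080/rose/update` is a placeholder. The request is built but not sent.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_message(fields):
    """Serialize one document as an <add><doc><field/></doc></add> message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = str(value)
    return ET.tostring(add, encoding="utf-8")


def build_post_request(url, body):
    """Prepare (but do not send) an HTTP POST carrying the XML body."""
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


body = build_add_message({"id": "doc-1", "title": "real-time search"})
# Placeholder URL; sending would be urllib.request.urlopen(request).
request = build_post_request("http://localhost:8080/rose/update", body)
```

Under the commit rule described for add/update requests, a document posted this way would become searchable only after a subsequent commit.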

In a further optimized embodiment, the ROSE search engine system can quickly set up a cluster through Zookeeper. A hash operation is performed on the ID value of the current index record; the cluster state maintained on the server is then used to find which Range the hash value falls into, and thereby the corresponding shard. The Leader of that shard builds the index, and once the Leader node has finished updating, the version number and the document are forwarded to the replica nodes belonging to the same shard.

The present invention is equally applicable to systems already in production: only minor changes to the application source code are needed, and one index server or an index server cluster is added to the deployment according to the volume of historical data.

The main flow of an application using the ROSE search engine system is:

(1) The user sends an add request through the client and submits the corresponding document;

(2) The server-side application receives the document submitted by the client, stores the file in the file system, and updates the relevant record in the database;

(3) The index server calls the analysis and processing application to analyze the data submitted by the user, and indexes the processed data;

(4) The user sends a Query, Update, or Delete request through the client;

(5) After receiving the client's request, the server-side application queries the index server directly, and the index server returns the requested data or performs the update or delete operation.

The advantages of the present invention are:

1) Full-text search of real-time streaming data and distributed computing

ROSE is implemented mainly on HTTP and Apache Lucene and provides full-text search over real-time streaming data. ROSE is a standalone enterprise-grade search application server. In principle, documents are added to a search collection as XML over HTTP, and the collection is queried over HTTP as well, with the response received as XML/JSON. Its key features include: efficient and flexible caching, vertical search, highlighted search results, improved availability through index replication, a powerful data schema for defining fields and types and configuring text analysis, and a web-based administration interface.

The core idea of the distributed computing function is that ROSE completes computing tasks jointly with a distributed system, making full use of the cluster's high-speed computation and storage. The system is highly fault tolerant and is designed to be deployed on low-cost hardware. It also provides high throughput for accessing application data, which suits applications with very large data sets.

2) Extensible distributed computing architecture

ROSE can quickly set up a cluster through Zookeeper and provides a simple sharding algorithm: a hash operation is performed on the ID value of the current index record, and the cluster state maintained on the server is then used to find which Range the hash value falls into, and thereby the corresponding shard. The Leader of that shard builds the index, and once the Leader node has finished updating, the version number and the document are forwarded to the replica nodes belonging to the same shard. The architecture can therefore be deployed dynamically: hardware can be added so that machines work concurrently, and multiple servers can be configured to manage the data at the same time.

3) Extensible plug-in system

To process real-time big data as a stream, fast data access and fast return of result sets are problems that must be addressed. ROSE's full-text search, built on HTTP and Apache Lucene, can be extended with other plug-ins to perform specific functions. For example, segmenters such as IKAnalyzer, mmseg4j, and paoding provide Chinese word segmentation, and solr_pager can be integrated to paginate search results. These features allow data to be processed and analyzed more quickly.

Claims

1. An implementation method for a real-time big data search engine, characterized in that it comprises the following steps:

1) Build the ROSE search engine system on HTTP and Apache Lucene;

2) Create the index of the ROSE search engine system: extract information from documents of various formats and from database data, select a text analyzer according to the file type to perform text analysis, create the index, and generate the index database;

3) Once the index has been created, the user can retrieve file information by entering query conditions. When the user enters a query condition, text analysis is performed first, the index database is then searched, and the results are finally returned to the user.

2. The implementation method for a real-time big data search engine according to claim 1, characterized in that creating the index in said step 2) comprises the following steps:

A. Specify the directory in which the index is to be created;

B. Create a Directory object;

C. Create the index-writing object IndexWriter;

D. Obtain the array of source File objects to determine the content to be indexed;

E. Loop over the files and write each one into the index: first create a Document object and Field objects, which correspond respectively to a row in a database table and the column attributes of that row; then add the Fields to the Document; finally have the IndexWriter call addDocument to write the document into the index database;

F. Close the index-writing object IndexWriter.

3. The implementation method for a real-time big data search engine according to claim 1, characterized in that the retrieval in said step 2) comprises the following steps:

A. Create the index-reading object IndexReader;

B. Create the search object IndexSearcher;

C. Create the lexical analysis object Analyzer;

D. Create the syntax analysis object QueryParser;

G. Obtain the corresponding ScoreDoc entries from the TopDocs;

H. Obtain the corresponding Document from each ScoreDoc;

I. Obtain the corresponding Field attributes from the Document.

4. The implementation method for a real-time big data search engine according to claim 1, characterized in that the ROSE search engine system provides a standard HTTP interface for adding, deleting, modifying, and querying the index of the data.

5. The implementation method for a real-time big data search engine according to claim 1, characterized in that the ROSE search engine system can quickly set up a cluster through Zookeeper; a hash operation is performed on the id value of the current index record, and the cluster state maintained on the server is then used to find which Range the hash value falls into, and thereby the corresponding shard; the Leader of that shard builds the index, and once the Leader node has finished updating, the version number and the document are forwarded to the replica nodes belonging to the same shard.