CN105069101A

CN105069101A - Distributed index construction and search method

Info

Publication number: CN105069101A
Application number: CN201510481248.9A
Authority: CN
Inventors: 强保华; 曾冰; 王玉峰; 王勇; 张学庆
Original assignee: Guilin University of Electronic Technology; CETC 54 Research Institute
Current assignee: Guilin University of Electronic Technology; CETC 54 Research Institute
Priority date: 2015-08-07
Filing date: 2015-08-07
Publication date: 2015-11-18

Abstract

The invention discloses a distributed index construction and search method that aims to make retrieval fast and efficient. The index construction method comprises the following steps: executing a Map process, which reads a pre-processed file on HDFS, extracts valid data with regular expressions, and packages it; executing a Reduce process, which reads the data output by a Combine process, initializes Lucene, packages the valid information into an index data structure, and constructs indexes with the full-text search engine tool; and storing the index file of each block separately. The search method comprises the following steps: I, acquiring original data from the internet, carrying out clustering and duplicate removal on the original data, and uploading the data to a distributed file system; II, constructing indexes in parallel for the pre-processed data blocks by using the distributed index construction method; III, storing the index files on the respective nodes of the cluster; IV, distributing a search request to each node; V, executing the search on each node according to the request and returning the search results; and VI, sorting the returned results.

Description

Distributed index construction and search method

Technical field

The present invention relates to the field of search engines, and in particular to a distributed index construction and search method.

Background art

With the rapid spread of networked information, and especially with the arrival of the big data era, heterogeneous unstructured data on the internet has begun to grow explosively. Search engine technology offers a good solution for retrieving useful information quickly and effectively from massive data. However, as the volume of web data and the number of users surge in the big data era, traditional centralized full-text search can no longer meet current retrieval performance requirements. The most prominent problem is that the index file grows rapidly as the data scale increases, so the index can no longer be stored and organized in a centralized manner; at the same time, a huge index file severely degrades retrieval efficiency.

The introduction of distributed computing technology has provided significant support for processing massive data, and it also offers a good solution to the difficulties encountered in traditional full-text search technology. The most mature distributed computing technologies at present are HDFS (a distributed file system) and MapReduce (a distributed computing platform).

Summary of the invention

The invention provides a fast index construction and distributed search method based on Hadoop (a distributed system architecture). The system uses distributed computing technology to build indexes over massive raw data in parallel, block by block, and uses a distributed search method during retrieval to retrieve data quickly and efficiently. Fast index construction and efficient retrieval are ensured by the MapReduce (distributed computing platform) framework and the RMI (remote method invocation) framework, respectively.

A distributed index construction method comprises: performing a Map process (the first part of the MapReduce framework), which reads a pre-processed file on HDFS, extracts valid data with regular expressions, and encapsulates it; performing a Reduce process (the third part of the MapReduce framework), which reads the data output by the Combine process (the second part of the MapReduce framework), initializes Lucene (a full-text search engine), packages the valid information into an index data structure, and builds the index with the full-text search engine tool; and storing the index file of each block separately.
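The Map, Combine, and Reduce stages described above can be illustrated with a minimal single-process sketch. It is not the patented implementation: the tab-separated record layout, the regular expressions, and all function names are assumptions made for illustration, and a plain in-memory inverted index stands in for the Lucene index that the Reduce process would build on a Hadoop cluster.

```python
import re
from collections import defaultdict

# Hypothetical record layout (an assumption): "id<TAB>title<TAB>body" per line.
RECORD_RE = re.compile(r"^(?P<doc_id>\d+)\t(?P<title>[^\t]*)\t(?P<body>.*)$")
TOKEN_RE = re.compile(r"[a-z0-9]+")


def map_phase(block_lines):
    """Map: extract valid records with a regular expression, emit (term, doc_id)."""
    for line in block_lines:
        match = RECORD_RE.match(line)
        if match is None:
            continue  # skip lines that are not valid records
        doc_id = int(match.group("doc_id"))
        text = (match.group("title") + " " + match.group("body")).lower()
        for term in TOKEN_RE.findall(text):
            yield term, doc_id


def combine_phase(pairs):
    """Combine: collapse repeated (term, doc_id) pairs into term frequencies."""
    counts = defaultdict(int)
    for term, doc_id in pairs:
        counts[(term, doc_id)] += 1
    for (term, doc_id), tf in counts.items():
        yield term, (doc_id, tf)


def reduce_phase(combined):
    """Reduce: group postings by term to form the inverted index of one block."""
    index = defaultdict(list)
    for term, posting in combined:
        index[term].append(posting)
    return {term: sorted(postings) for term, postings in index.items()}


def build_block_index(block_lines):
    return reduce_phase(combine_phase(map_phase(block_lines)))


blocks = [
    ["1\tHadoop\tdistributed file system", "malformed line", "2\tLucene\tfull text search"],
    ["3\tSearch\tdistributed search with hadoop"],
]
# One index per block, kept separately (here: one dict per block).
block_indexes = [build_block_index(block) for block in blocks]
```

Each entry of `block_indexes` maps a term to a sorted list of `(doc_id, term_frequency)` postings for one data block, mirroring the idea that every block gets its own separately stored index file.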

A distributed search method comprises the following steps:

(1) Obtain raw data from the internet, perform clustering and deduplication on it, and upload it to the distributed file system.

(2) Use the aforementioned distributed index construction method to build indexes in parallel over the pre-processed data blocks.

(3) Store the index files on the respective nodes of the cluster.

(4) The system distributes the retrieval request to each node.

(5) Each node performs the retrieval according to the request and returns its retrieval results.

(6) The system sorts the results returned by the nodes.

Brief description of the drawings

Fig. 1 is a flow block diagram of the present invention.

Fig. 2 is a flow block diagram of distributed index construction in the present invention.

Embodiment

As shown in Fig. 1, the Hadoop-based fast index construction and distributed search system comprises the index construction and distributed search processes of the whole system. The pre-processing module performs clustering and deduplication on the raw data to provide valid source data to the index construction module. The index construction module builds indexes over the pre-processed data in a distributed manner and stores them on the respective nodes of the cluster; by incorporating MapReduce into index construction, the module improves the efficiency of building indexes.

The distributed search module obtains the user's retrieval request through a page request, and the system distributes the retrieval request to each node in the cluster. After a node receives the request, its retrieval service searches the local index file and returns the retrieval results to the system. The system merges and sorts the data returned by all nodes and finally returns the result to the user.
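The request distribution and result merging described above follow a scatter-gather pattern, which can be sketched as below. This is only an illustration under stated assumptions: the node names, the per-node index contents, the scores, and the functions `search_node` and `distributed_search` are hypothetical, and a local thread pool stands in for the RMI calls that the invention uses to reach the retrieval service on each node.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical node-local indexes (an assumption): term -> list of (doc_id, score).
NODE_INDEXES = {
    "node-1": {"hadoop": [(1, 0.9), (4, 0.4)]},
    "node-2": {"hadoop": [(7, 0.7)], "lucene": [(8, 0.8)]},
    "node-3": {"lucene": [(9, 0.5)]},
}


def search_node(node, term):
    """Runs on one node: search the local index, return hits sorted by score (descending)."""
    hits = NODE_INDEXES[node].get(term, [])
    return sorted(((score, doc_id, node) for doc_id, score in hits), reverse=True)


def distributed_search(term, top_k=10):
    """Distribute the request to every node, then merge the sorted partial results."""
    nodes = sorted(NODE_INDEXES)
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda node: search_node(node, term), nodes))
    # Each partial list is already sorted in descending order, so a k-way merge suffices.
    merged = heapq.merge(*partials, reverse=True)
    return [(doc_id, score, node) for score, doc_id, node in merged][:top_k]


results = distributed_search("hadoop")
```

Because every node returns its hits already sorted, the coordinating system only needs a k-way merge rather than a full re-sort of all returned results.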

The concrete steps of distributed index construction are:

(1) Create HDFSDocument: a class that encapsulates the valid information in the raw data. This class implements the Writable interface, so its instances can be transmitted within the Hadoop cluster.

(2) Create HDFSOutputFormat: a class that extends FileOutputFormat, the formatted output class of the open-source Hadoop framework. It is used in the MapReduce distributed computing framework to define a custom output format, namely the output format used to create the index.

(3) Create HDFSIndexer: the class that actually generates the index files. It encapsulates the IndexWriter class of the open-source Lucene framework and the FileSystem class of Hadoop, which makes distributed index construction convenient.

(4) The map function reads the source file, extracts the valid information with regular expressions and encapsulates it, and finally transfers the encapsulated data to the Reduce node.

(5) The reduce function reads the data transmitted by map and merges it, then calls the output format class to initialize the index construction resources, and then calls the index building function to write the merged data to the index file.

(6) The writeIndex function generates the index file: it parses the data passed from the reduce function, packages it into the data format required by the open-source Lucene framework, and finally writes the index file.
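Step (1) relies on the document class being serializable so that it can travel between nodes. Hadoop's Writable interface does this through a `write` method and a `readFields` method; the sketch below mimics that pair in Python to show the round trip. The field names and the length-prefixed byte layout are assumptions chosen for illustration and are not taken from the patent.

```python
import io
import struct
from dataclasses import dataclass, field


@dataclass
class HDFSDocument:
    """Hypothetical stand-in for the HDFSDocument class: a bag of named text fields."""
    fields: dict = field(default_factory=dict)

    def write(self, out):
        """Serialize as: field count, then length-prefixed UTF-8 name/value pairs."""
        out.write(struct.pack(">I", len(self.fields)))
        for name, value in self.fields.items():
            for text in (name, value):
                data = text.encode("utf-8")
                out.write(struct.pack(">I", len(data)))
                out.write(data)

    @classmethod
    def read_fields(cls, inp):
        """Rebuild a document from the byte layout produced by write()."""
        def read_text():
            (length,) = struct.unpack(">I", inp.read(4))
            return inp.read(length).decode("utf-8")

        (count,) = struct.unpack(">I", inp.read(4))
        doc = cls()
        for _ in range(count):
            name = read_text()
            doc.fields[name] = read_text()
        return doc


original = HDFSDocument({"id": "42", "title": "分布式索引", "body": "distributed index"})
buffer = io.BytesIO()
original.write(buffer)   # what the map side would send
buffer.seek(0)
restored = HDFSDocument.read_fields(buffer)  # what the reduce side would receive
```

Length-prefixing each string lets the reader recover field boundaries without delimiters, so field values may contain any characters.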

The distributed search method mainly consists of the following steps:

(1) Obtain raw data from the internet, perform clustering and deduplication on it, and upload it to HDFS.

(3) Store the index files on the respective nodes of the cluster.

(4) The system distributes the retrieval request to each node.

(6) The system merges and sorts the results returned by the nodes.
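The patent does not specify how the deduplication in step (1) is performed. As one possible illustration, exact and near-exact duplicates can be removed by hashing a normalized form of each document before upload. The normalization rule (lowercasing and collapsing whitespace) and the function names below are assumptions, and the clustering part of the pre-processing is not covered by this sketch.

```python
import hashlib
import re


def fingerprint(text):
    """Hash the text after lowercasing and collapsing runs of whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(documents):
    """Keep the first document for each fingerprint, preserving input order."""
    seen = set()
    unique = []
    for doc in documents:
        key = fingerprint(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


crawled = [
    "Hadoop is a distributed system.",
    "hadoop  is a   distributed system.",
    "Lucene is a full-text search library.",
]
cleaned = deduplicate(crawled)
```

Here the second crawled string differs from the first only in case and spacing, so it is dropped and `cleaned` keeps the first and third documents.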

By building a search system, the present invention makes full use of the efficiency of distributed computing technology and combines the core tasks of index construction and data retrieval with distributed computing technology. The system obtains raw data from the internet with an open-source crawler tool; the pre-processing module pre-processes the raw data accordingly; the index construction module builds indexes in parallel over the pre-processed data blocks; the index file of each block is stored on a node of the cluster; and a retrieval service is started on each node to process retrieval requests. The search system thereby achieves fast and efficient retrieval.

Claims

1. A distributed index construction method, the method comprising:

performing a Map process, reading a pre-processed file on HDFS, and extracting valid data with regular expressions and encapsulating it;

performing a Reduce process, reading the data output by the Combine process, initializing Lucene, packaging the valid information into an index data structure, and building the index with the full-text search engine tool;

storing the index file of each block separately;

wherein: Map refers to the first part of the distributed computing platform framework; Combine refers to the second part of the distributed computing platform framework; Reduce refers to the third part of the distributed computing platform framework; HDFS refers to a distributed file system; and Lucene refers to a full-text search engine.

2. A distributed search method, the method comprising the following steps:

(1) obtaining raw data from the internet, performing clustering and deduplication on it, and uploading it to a distributed file system;

(2) using the method of claim 1 to build indexes in parallel over the pre-processed data blocks;

(3) storing the index files on the respective nodes of the cluster;

(4) distributing, by the system, a retrieval request to each node;

(5) performing, by each node, the retrieval according to the request and returning the retrieval results;

(6) sorting, by the system, the results returned by the nodes.