CN105069101A - Distributed index construction and search method - Google Patents

Info

Publication number
CN105069101A
Authority
CN
China
Prior art keywords
index
distributed
data
node
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510481248.9A
Other languages
Chinese (zh)
Inventor
强保华
曾冰
王玉峰
王勇
张学庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
CETC 54 Research Institute
Original Assignee
Guilin University of Electronic Technology
CETC 54 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2015-08-07
Filing date
2015-08-07
Publication date
2015-11-18
Application filed by Guilin University of Electronic Technology, CETC 54 Research Institute
Priority to CN201510481248.9A
Publication of CN105069101A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/13: File access structures, e.g. distributed indices
    • G06F16/134: Distributed indices

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a distributed index construction and search method that aims to make search fast and efficient. The index construction method comprises the following steps: executing a Map process, which reads the pre-processed files on HDFS, extracts the valid data with regular expressions, and packages it; executing a Reduce process, which reads the data processed by the Combine process, initializes Lucene, packages the valid information into an index data structure, and constructs the index with the full-text search engine tool; and storing the index files of the individual blocks separately. The search method comprises the following steps: I, acquiring original data from the internet, clustering and de-duplicating it, and uploading it to a distributed file system; II, constructing indexes in parallel for the pre-processed data blocks by using the distributed index construction method; III, storing the index files on the respective nodes of the cluster; IV, distributing a search request from the system to each node; V, executing the search on each node according to the request and returning the search results; and VI, sorting the returned results in the system.

Description

Distributed index construction and search method
Technical field
The present invention relates to the field of search engines, and in particular to a distributed index construction and search method.
Background
With the rapid spread of networked information, and especially the arrival of the big data era, the heterogeneous unstructured data on the internet has begun to grow explosively. Search engine technology gives people a good way to retrieve useful information quickly and effectively from massive data. However, as the volume of web data and the number of users surge in the big data era, traditional centralized full-text retrieval can no longer meet current retrieval performance requirements. The most prominent problem is that the index file grows rapidly as the data scale increases, so the index can no longer be stored and organized centrally; at the same time, a huge index file also severely degrades retrieval efficiency.
The introduction of distributed computing technology provides significant support for processing massive data, and it also offers a good solution to the difficulties encountered in traditional full-text retrieval. The most mature distributed computing technologies at present are HDFS (a distributed file system) and MapReduce (a distributed computing platform).
Summary of the invention
The invention provides a method for fast index construction and distributed search based on Hadoop (a distributed system architecture). The system uses distributed computing technology to build indexes over massive raw data block by block in parallel, and uses a distributed search method during retrieval to retrieve data quickly and efficiently. MapReduce (a distributed computing platform) and the RMI (remote method invocation) framework ensure the efficiency of index construction and of retrieval, respectively.
A distributed index construction method comprises: executing the Map process (the first part of the MapReduce framework), which reads the preprocessed files on HDFS, extracts the valid data with regular expressions, and encapsulates it; executing the Reduce process (the third part of the MapReduce framework), which reads the data processed by the Combine process (the second part of the MapReduce framework), initializes Lucene (a full-text search engine), packages the valid information into an index data structure, and builds the index with the full-text search engine tool; and storing the index files of the individual blocks separately.
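The Map, Combine and Reduce stages named in this method can be illustrated with a minimal single-process Python sketch. It is not the patented implementation: the record layout (document id, title and body separated by tabs), the regular expressions, and the function names `map_record`, `combine`, `reduce_to_index` and `build_block_index` are assumptions made for illustration, and a plain dictionary stands in for the Lucene index.

```python
import re
from collections import defaultdict

# Hypothetical record layout "<doc id>\t<title>\t<body>"; the patent does not specify one.
RECORD_RE = re.compile(r"^(?P<id>\d+)\t(?P<title>[^\t]*)\t(?P<body>.*)$")
TOKEN_RE = re.compile(r"\w+")


def map_record(line):
    """Map: extract the valid fields with a regular expression and emit (term, doc_id) pairs."""
    match = RECORD_RE.match(line)
    if match is None:  # malformed line: there is no valid data to emit
        return []
    doc_id = int(match.group("id"))
    text = match.group("title") + " " + match.group("body")
    return [(token.lower(), doc_id) for token in TOKEN_RE.findall(text)]


def combine(pairs):
    """Combine: collapse repeated (term, doc_id) pairs into local term frequencies."""
    counts = defaultdict(int)
    for term, doc_id in pairs:
        counts[(term, doc_id)] += 1
    return counts


def reduce_to_index(counts):
    """Reduce: group the combined pairs by term into postings (a dict replaces Lucene here)."""
    index = defaultdict(dict)
    for (term, doc_id), freq in counts.items():
        index[term][doc_id] = freq
    return dict(index)


def build_block_index(lines):
    """Run Map, Combine and Reduce over one data block and return that block's index."""
    pairs = [pair for line in lines for pair in map_record(line)]
    return reduce_to_index(combine(pairs))


block = [
    "1\tHadoop index\tbuild index with Hadoop",
    "2\tLucene search\tsearch the index",
    "not a valid record",
]
index = build_block_index(block)
```

In this sketch each block yields its own independent index, which mirrors the statement that the index files of the blocks are stored separately; the third line of `block` is skipped because it does not match the assumed record pattern.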
A distributed search method comprises the following steps:
(1) Obtain raw data from the internet, cluster and deduplicate it, and upload it to the distributed file system.
(2) Build indexes over the preprocessed data blocks in parallel, using the distributed index construction method described above.
(3) Store the index files on the respective nodes of the cluster.
(4) The system distributes the retrieval request to each node.
(5) Each node performs the retrieval according to the request and returns its results.
(6) The system sorts the results returned by the nodes.
Description of the drawings
Fig. 1 is a flow diagram of the present invention.
Fig. 2 is a flow diagram of distributed index construction in the present invention.
Embodiment
Fig. 1 shows the index construction and distributed search processes of the whole system for Hadoop-based fast index construction and distributed search. The preprocessing module clusters and deduplicates the raw data, providing valid source data for the index construction module. The index construction module builds indexes over the preprocessed data in a distributed manner and stores them on the respective nodes of the cluster; by incorporating MapReduce into index construction, it improves the efficiency of building the index.
The distributed search module obtains the user's retrieval request from a page request. The system distributes the retrieval request to every node in the cluster. After a node receives the request, its retrieval service searches the local index file and returns the results to the system. The system merges and sorts the data returned by all nodes and finally returns it to the user.
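The request distribution and result merging performed by the distributed search module can be sketched as follows. This is an illustrative stand-in only: a thread pool replaces the RMI calls, the `SearchNode` class and `distributed_search` function are hypothetical names, the local index is a dictionary rather than a Lucene index, and ranking by raw term frequency is an assumption, since the patent does not state a scoring rule.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


class SearchNode:
    """One cluster node holding a local index: term -> {doc_id: term frequency}."""

    def __init__(self, name, local_index):
        self.name = name
        self.local_index = local_index

    def search(self, term, limit):
        """Search the local index and return up to `limit` hits, highest frequency first."""
        postings = self.local_index.get(term.lower(), {})
        hits = [(freq, doc_id, self.name) for doc_id, freq in postings.items()]
        hits.sort(key=lambda hit: (-hit[0], hit[1]))
        return hits[:limit]


def distributed_search(nodes, term, limit=10):
    """Send the request to every node in parallel, then merge and sort the partial results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partial_results = list(pool.map(lambda node: node.search(term, limit), nodes))
    # Each partial list is already sorted, so a k-way merge yields the global order.
    merged = heapq.merge(*partial_results, key=lambda hit: (-hit[0], hit[1]))
    return list(merged)[:limit]


nodes = [
    SearchNode("node-a", {"index": {1: 2, 2: 1}}),
    SearchNode("node-b", {"index": {7: 5}, "hadoop": {8: 1}}),
    SearchNode("node-c", {"lucene": {9: 3}}),
]
results = distributed_search(nodes, "index", limit=2)
```

Because every node returns its hits already sorted, the system side only needs a merge rather than a full sort, which keeps the final step cheap even when many nodes respond.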
The concrete steps of distributed index construction are:
(1) Create HDFSDocument: a class that encapsulates the valid information in the raw data. The class implements the Writable interface, so its instances can be transferred between the nodes of the Hadoop cluster.
(2) Create HDFSOutputFormat: a class that extends FileOutputFormat, the formatted-output class of the open-source Hadoop framework. It can be used in the MapReduce distributed computing framework to define a custom output format, namely the output format used to create the index.
(3) Create HDFSIndexer: the class that actually generates the index files. It wraps the IndexWriter class of the open-source framework Lucene and the FileSystem class of Hadoop, which makes distributed index construction convenient.
(4) The map function reads the source documents, extracts the valid information with regular expressions, encapsulates it, and finally passes the encapsulated data to the Reduce node.
(5) The reduce function reads and merges the data passed from map, then calls the formatted-output class to initialize the index construction resources, and then calls the index-building function to write the merged data to the index file.
(6) The writeIndex function generates the index file: it parses the data passed from the reduce function, packages it into the data format required by the open-source framework Lucene, and finally writes the index file.
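The roles of the document class and the indexer class in the steps above can be mimicked in a short Python sketch. It is only an analogy: the real classes are Java classes built on Hadoop and Lucene, whereas the `Document` and `Indexer` names, the three fields, and the binary layout below are assumptions made for illustration. The `write` and `read_fields` methods imitate the `write` and `readFields` methods of Hadoop's Writable interface, and a dictionary replaces the Lucene IndexWriter.

```python
import io
import struct
from dataclasses import dataclass


@dataclass
class Document:
    """Plays the role of HDFSDocument: holds the valid fields extracted from one raw record."""
    doc_id: int
    title: str
    body: str

    def write(self, out):
        """Serialize to a binary stream, analogous to Writable.write(DataOutput)."""
        out.write(struct.pack(">q", self.doc_id))
        for text in (self.title, self.body):
            data = text.encode("utf-8")
            out.write(struct.pack(">i", len(data)))  # length prefix, then the bytes
            out.write(data)

    @classmethod
    def read_fields(cls, stream):
        """Deserialize from a binary stream, analogous to Writable.readFields(DataInput)."""
        (doc_id,) = struct.unpack(">q", stream.read(8))
        texts = []
        for _ in range(2):
            (length,) = struct.unpack(">i", stream.read(4))
            texts.append(stream.read(length).decode("utf-8"))
        return cls(doc_id, texts[0], texts[1])


class Indexer:
    """Plays the role of HDFSIndexer; a dict of term -> set of doc ids replaces Lucene."""

    def __init__(self):
        self.index = {}

    def write_index(self, document):
        """Parse one document and add its terms to the index, like the writeIndex step."""
        for term in (document.title + " " + document.body).lower().split():
            self.index.setdefault(term, set()).add(document.doc_id)


# Round-trip a document through its binary form, as would happen between map and reduce.
buffer = io.BytesIO()
Document(42, "Distributed index", "built with MapReduce").write(buffer)
buffer.seek(0)
restored = Document.read_fields(buffer)

indexer = Indexer()
indexer.write_index(restored)
```

The round trip through `buffer` corresponds to the transfer of encapsulated data from the map side to the reduce side; the indexer then receives the restored object, much as the reduce function hands merged data to the index-writing step.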
The distributed search method consists mainly of the following steps:
(1) Obtain raw data from the internet, cluster and deduplicate it, and upload it to HDFS.
(2) Build indexes over the preprocessed data blocks in parallel, using the distributed index construction method described above.
(3) Store the index files on the respective nodes of the cluster.
(4) The system distributes the retrieval request to each node.
(5) Each node performs the retrieval according to the request and returns its results.
(6) The system merges and sorts the results returned by the nodes.
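Steps (1) to (3) above, the preparation that precedes any query, can be sketched in a few lines of Python. The sketch makes several assumptions that the patent does not state: duplicates are detected by hashing whitespace- and case-normalized text, blocks have a fixed record count, and blocks are placed on nodes round-robin. The clustering part of preprocessing is omitted, and the function names are hypothetical.

```python
import hashlib


def deduplicate(records):
    """Drop records whose whitespace- and case-normalized text was already seen."""
    seen = set()
    unique = []
    for record in records:
        normalized = " ".join(record.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


def split_into_blocks(records, block_size):
    """Cut the record list into consecutive blocks of at most `block_size` records."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def assign_blocks(blocks, node_names):
    """Assign block k to node k modulo the number of nodes (round-robin placement)."""
    placement = {name: [] for name in node_names}
    for k, block in enumerate(blocks):
        placement[node_names[k % len(node_names)]].append(block)
    return placement


raw = ["Hadoop  index", "hadoop index", "Lucene search", "RMI call", "HDFS block", "Lucene search"]
unique = deduplicate(raw)
blocks = split_into_blocks(unique, block_size=2)
placement = assign_blocks(blocks, ["node-a", "node-b"])
```

In a real Hadoop deployment, HDFS decides block boundaries and replica placement itself; the explicit `split_into_blocks` and `assign_blocks` helpers only make visible the idea that each node ends up responsible for the index of its own share of the data.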
By building a search system, the present invention makes full use of the efficiency of distributed computing technology and combines the core operations, index construction and data retrieval, with distributed computing technology. The system obtains raw data from the internet with an open-source crawler tool; the preprocessing module preprocesses the raw data; the index construction module builds indexes over the preprocessed data blocks in parallel; the index files of the blocks are stored on the respective nodes of the cluster; and a retrieval service is started on each node to process retrieval requests. The search system thereby achieves fast and efficient retrieval.

Claims (2)

1. A distributed index construction method, the method comprising:
executing a Map process that reads the preprocessed files on HDFS, extracts the valid data with regular expressions, and encapsulates it;
executing a Reduce process that reads the data processed by the Combine process, initializes Lucene, packages the valid information into an index data structure, and builds the index with the full-text search engine tool; and
storing the index files of the individual blocks separately;
wherein: Map refers to the first part of the distributed computing platform framework; Combine refers to the second part of the distributed computing platform framework; Reduce refers to the third part of the distributed computing platform framework; HDFS refers to the distributed file system; and Lucene refers to the full-text search engine.
2. A distributed search method, the method comprising the following steps:
(1) obtaining raw data from the internet, clustering and deduplicating it, and uploading it to the distributed file system;
(2) building indexes over the preprocessed data blocks in parallel, using the method of claim 1;
(3) storing the index files on the respective nodes of the cluster;
(4) distributing, by the system, the retrieval request to each node;
(5) performing, by each node, the retrieval according to the request and returning the retrieval results; and
(6) sorting, by the system, the results returned by the nodes.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510481248.9A CN105069101A (en) 2015-08-07 2015-08-07 Distributed index construction and search method

Publications (1)

Publication Number Publication Date
CN105069101A (en) 2015-11-18

Family

ID=54498471

Country Status (1)

Country Link
CN (1) CN105069101A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102682036A (en) * 2011-03-18 2012-09-19 新奥特(北京)视频技术有限公司 Non-editing based method and system for searching media assets
CN102799622A (en) * 2012-06-19 2012-11-28 北京大学 Distributed structured query language (SQL) query method based on MapReduce expansion framework
CN102819592A (en) * 2012-08-08 2012-12-12 河海大学 Lucene-based desktop searching system and method
CN103049575A (en) * 2013-01-05 2013-04-17 华中科技大学 Topic-adaptive academic conference searching system
CN103631922A (en) * 2013-12-03 2014-03-12 南通大学 Hadoop cluster-based large-scale Web information extraction method and system
CN103678490A (en) * 2013-11-14 2014-03-26 桂林电子科技大学 Deep Web query interface clustering method based on Hadoop platform

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105787097A (en) * 2016-03-16 2016-07-20 中山大学 Distributed index establishment method and system based on text clustering
CN106326387A (en) * 2016-08-17 2017-01-11 电子科技大学 Distributive data storage architecture, data storage method and data inquiry method
CN106326387B (en) * 2016-08-17 2019-06-04 电子科技大学 A kind of Distributed Storage structure and date storage method and data query method
CN107463670A (en) * 2017-08-03 2017-12-12 北京奇艺世纪科技有限公司 Data storage, read method and device
CN107463670B (en) * 2017-08-03 2020-06-05 北京奇艺世纪科技有限公司 Data storage and reading method and device
CN107679248A (en) * 2017-10-30 2018-02-09 江苏鸿信系统集成有限公司 A kind of intelligent data search method
CN111695001A (en) * 2020-06-17 2020-09-22 科技谷(厦门)信息技术有限公司 Mixed data management system in big data scene
CN111695001B (en) * 2020-06-17 2023-05-30 科技谷(厦门)信息技术有限公司 Mixed data management system under big data scene
CN113297205A (en) * 2020-07-30 2021-08-24 阿里巴巴集团控股有限公司 Index construction and data access processing method, device, equipment and medium
CN117763109A (en) * 2023-12-21 2024-03-26 湖南领众档案管理有限公司 Data checking method for file full-text retrieval
CN117763109B (en) * 2023-12-21 2024-06-11 湖南领众档案管理有限公司 Data checking method for file full-text retrieval

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20151118
