WO2014180411A1 - Method and apparatus for generating a distributed index - Google Patents

Method and apparatus for generating a distributed index

Info

Publication number
WO2014180411A1
WO2014180411A1 PCT/CN2014/078696 CN2014078696W WO2014180411A1 WO 2014180411 A1 WO2014180411 A1 WO 2014180411A1 CN 2014078696 W CN2014078696 W CN 2014078696W WO 2014180411 A1 WO2014180411 A1 WO 2014180411A1
Authority
WO
WIPO (PCT)
Prior art keywords
reduce
index
file system
jobs
index library
Prior art date
Application number
PCT/CN2014/078696
Other languages
English (en)
French (fr)
Inventor
韩丙卫
Original Assignee
中兴通讯股份有限公司
Priority date
2013-12-17
Filing date
2014-05-28
Publication date
2014-11-13
Application filed by 中兴通讯股份有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures

Definitions

  • The present invention relates to the field of communications, and in particular to a method and an apparatus for generating a distributed index.
  • "Big data" is often used to describe the large volume of unstructured and semi-structured data that a company creates, data that would cost too much time and money to load into a relational database for analysis.
  • Big data analytics is often associated with cloud computing, because real-time analysis of large data sets requires a MapReduce-like framework to distribute work across dozens, hundreds, or even thousands of computers. In the Internet industry, big data usually refers to the user behavior data that Internet companies generate and accumulate in their daily operations.
  • The present invention provides a method and an apparatus for generating a distributed index, so as to at least solve the problem in the related art that an efficient and fast index cannot be created for massive data. According to one aspect of the present invention, a method for generating a distributed index is provided.
  • The method for generating a distributed index includes: determining the number of map jobs in Hadoop according to the data amount of the original data; distributing the data processed by each map job to a plurality of reduce jobs, and generating an index library corresponding to each reduce job, wherein the number of reduce jobs and the correspondence between each reduce job and one or more map jobs are pre-configured; and merging the index libraries corresponding to the reduce jobs.
  • Generating the index library corresponding to each reduce job includes: obtaining the type of the currently supported file system; determining, according to the type of the file system, the generation manner of the index library corresponding to each reduce job; and generating the index library corresponding to each reduce job according to that generation manner.
  • Generating the index library corresponding to each reduce job according to the generation manner includes: when the file system type is the Hadoop Distributed File System (HDFS), generating the index library corresponding to each reduce job on the local disk, and then uploading the index libraries generated on the local disk to HDFS; or, when the file system type is a distributed file system (DFS) other than HDFS that supports sharing, generating the index library corresponding to each reduce job directly in that DFS.
  • Merging the index libraries corresponding to the reduce jobs includes: when the file system type is HDFS, downloading the index library corresponding to each reduce job from HDFS to the local disk; merging those index libraries on the local disk; uploading the merged index library to HDFS; and deleting the index libraries corresponding to the reduce jobs from the local disk.
  • Alternatively, merging the index libraries corresponding to the reduce jobs includes: when the file system type is another DFS that supports sharing, merging the index libraries corresponding to the reduce jobs that were generated in that DFS, and then deleting the index libraries corresponding to the reduce jobs from that DFS.
  • According to another aspect of the present invention, an apparatus for generating a distributed index is provided.
  • The apparatus for generating a distributed index includes: a determining module configured to determine the number of map jobs in Hadoop according to the data amount of the original data; a generating module configured to distribute the data processed by each map job to a plurality of reduce jobs and to generate an index library corresponding to each reduce job, wherein the number of reduce jobs and the correspondence between each reduce job and one or more map jobs are pre-configured; and a merging module configured to merge the index libraries corresponding to the reduce jobs.
  • The generating module includes: an obtaining unit configured to obtain the type of the currently supported file system; a determining unit configured to determine, according to the type of the file system, the generation manner of the index library corresponding to each reduce job; and a generating unit configured to generate the index library corresponding to each reduce job according to the generation manner.
  • The generating unit is configured to: when the file system type is HDFS, generate the index library corresponding to each reduce job on the local disk and then upload the index libraries generated on the local disk to HDFS; or, when the file system type is another DFS that supports sharing, generate the index library corresponding to each reduce job directly in that DFS.
  • The merging module includes: a downloading unit configured to download the index library corresponding to each reduce job from HDFS to the local disk when the file system type is HDFS; a first merging unit configured to merge the index libraries corresponding to the reduce jobs on the local disk; and a first processing unit configured to upload the merged index library to HDFS and to delete the index libraries corresponding to the reduce jobs from the local disk.
  • The merging module includes: a second merging unit configured to merge the index libraries corresponding to the reduce jobs that were generated in another DFS that supports sharing, when the file system type is that DFS; and a second processing unit configured to delete the index libraries corresponding to the reduce jobs from that DFS.
  • According to the embodiments of the present invention, the number of map jobs in Hadoop is determined according to the data amount of the original data; the data processed by each map job is distributed to a plurality of reduce jobs, and an index library corresponding to each reduce job is generated, the number of reduce jobs and the correspondence between each reduce job and one or more map jobs being pre-configured; and the index libraries corresponding to the reduce jobs are merged. In other words, the original data is processed with Hadoop map jobs and reduce jobs, an index library is generated for each reduce job, and these index libraries are then merged. This solves the problem in the related art that an efficient and fast index cannot be created for massive data, and enables efficient and fast indexing of massive data.
  • FIG. 1 is a flowchart of a method for generating a distributed index according to an embodiment of the present invention
  • FIG. 2 is a flowchart of a method for generating a distributed index according to a preferred embodiment of the present invention
  • FIG. 4 is a block diagram showing the structure of a distributed index generating apparatus according to a preferred embodiment of the present invention.
  • Detailed description: the present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments. It should be noted that, provided there is no conflict, the embodiments in the present application and the features in the embodiments may be combined with each other.
  • FIG. 1 is a flowchart of a method for generating a distributed index according to an embodiment of the present invention. As shown in FIG. 1, the method may include the following processing steps: Step S102: determining the number of map jobs in Hadoop according to the data amount of the original data; Step S104: distributing the data processed by each map job to a plurality of reduce jobs, and generating an index library corresponding to each reduce job, wherein the number of reduce jobs and the correspondence between each reduce job and one or more map jobs are pre-configured; Step S106: merging the index libraries corresponding to the reduce jobs.
  • In step S104, generating the index library corresponding to each reduce job may include the following operations: Step S1: obtaining the type of the currently supported file system; Step S2: determining, according to the type of the file system, the generation manner of the index library corresponding to each reduce job; Step S3: generating the index library corresponding to each reduce job according to the generation manner.
  • In a preferred embodiment, the size of the original data to be acquired is determined first, and the data is divided into M parts (M is a positive integer), each part corresponding to one map job.
  • The amount of data processed by each map job is dynamically configurable, and the map data processing plug-in is set accordingly.
  • The intermediate key-value pairs produced by each map job are periodically written to the local disk, which can be divided into N partitions (N is a positive integer defined by the user); each partition corresponds to one reduce job.
  • Index creation supports the Hadoop Distributed File System (HDFS) as well as other distributed file systems (DFS) that support sharing.
  • In step S3, generating the index library corresponding to each reduce job according to the generation manner may include one of the following steps: Step S31: when the file system type is HDFS, generating the index library corresponding to each reduce job on the local disk, and then uploading the index libraries generated on the local disk to HDFS; Step S32: when the file system type is a DFS other than HDFS that supports sharing, generating the index library corresponding to each reduce job directly in that DFS.
  • In a preferred embodiment, if the currently supported file system type is HDFS, each reduce job generates a temporary index library in the local file system (that is, on the local disk); then, during the final cleanup of the reduce job, the temporary index library generated in the local file system is uploaded to HDFS.
  • If the currently supported file system type is another DFS that supports sharing, the temporary index library can be generated directly in that DFS.
  • In step S106, merging the index libraries corresponding to the reduce jobs may include the following operations: Step S4: when the file system type is HDFS, downloading the index library corresponding to each reduce job from HDFS to the local disk; Step S5: merging the index libraries corresponding to the reduce jobs on the local disk; Step S6: uploading the merged index library to HDFS, and deleting the index libraries corresponding to the reduce jobs from the local disk.
  • In a preferred embodiment, if the currently supported file system type is HDFS, the Hadoop index master node first downloads all temporary index libraries from HDFS to its local file system; second, it merges all temporary index libraries in the local file system into one complete index library; third, it uploads the complete index library to HDFS; it then deletes the temporary index libraries from its local file system; finally, the Hadoop index slave nodes download the complete index library from HDFS to their local file systems for retrieval.
  • In step S106, merging the index libraries corresponding to the reduce jobs may alternatively include the following steps: Step S7: when the file system type is another DFS that supports sharing, merging the index libraries corresponding to the reduce jobs that were generated in that DFS; Step S8: deleting the index libraries corresponding to the reduce jobs from that DFS.
  • In a preferred embodiment, if the currently supported file system type is another DFS that supports sharing, the Hadoop index master node first merges the temporary index libraries in the DFS into one complete index library for retrieval, and then deletes the temporary index libraries from the DFS.
  • As shown in FIG. 2, the process may include the following processing stages:
  • The first stage is the data collection stage, that is, Hadoop's map job phase. It is the preparatory stage for building the index and supplies the data needed to create it. The map job phase uses a distributed implementation that processes data in parallel, and the number of map jobs is determined dynamically by the amount of data collected.
  • Hadoop map jobs process the collected text files or database files and generate the content of each field (a set of (key, value) pairs) required for creating the index, which greatly improves data processing performance. Because plug-in processing is supported during collection, different processing methods can be customized according to the amount of data.
  • The second stage is the index creation stage, that is, Hadoop's reduce job phase, in which the distributed index library is created.
  • Setting the number of reduce jobs determines reduceNum, the maximum number of reduce jobs processed in parallel.
  • The data generated in the data collection stage is assigned to the individual reduce jobs by HashCode() % reduceNum, and each reduce job generates its own temporary index library file.
  • The third stage is the index merging stage. The index master node invokes index merging to combine the temporary index libraries generated by the reduce jobs in the index creation stage into one complete index library. During merging, the temporary index libraries can be read one by one and merged into a separate main index library; the temporary index libraries are then deleted, and the main index library serves retrieval requests.
  • FIG. 3 is a structural block diagram of an apparatus for generating a distributed index according to an embodiment of the present invention. As shown in FIG. 3, the apparatus may include: a determining module 10 configured to determine the number of map jobs in Hadoop according to the data amount of the original data; a generating module 20 configured to distribute the data processed by each map job to a plurality of reduce jobs and to generate an index library corresponding to each reduce job, wherein the number of reduce jobs and the correspondence between each reduce job and one or more map jobs are pre-configured; and a merging module 30 configured to merge the index libraries corresponding to the reduce jobs.
  • The apparatus shown in FIG. 3 solves the problem in the related art that an efficient and fast index cannot be created for massive data, thereby enabling efficient and fast indexing of massive data.
  • As shown in FIG. 4, the generating module 20 may include: an obtaining unit 200 configured to obtain the type of the currently supported file system; a determining unit 202 configured to determine, according to the type of the file system, the generation manner of the index library corresponding to each reduce job; and a generating unit 204 configured to generate the index library corresponding to each reduce job according to the generation manner.
  • The generating unit 204 is configured to: when the file system type is HDFS, generate the index library corresponding to each reduce job on the local disk and then upload the index libraries generated on the local disk to HDFS; or, when the file system type is another DFS that supports sharing, generate the index library corresponding to each reduce job directly in that DFS.
  • The merging module 30 may include: a downloading unit 300 configured to download the index library corresponding to each reduce job from HDFS to the local disk when the file system type is HDFS; a first merging unit 302 configured to merge the index libraries corresponding to the reduce jobs on the local disk; and a first processing unit 304 configured to upload the merged index library to HDFS and to delete the index libraries corresponding to the reduce jobs from the local disk.
  • The merging module 30 may include: a second merging unit 306 configured to merge the index libraries corresponding to the reduce jobs that were generated in another DFS that supports sharing, when the file system type is that DFS; and a second processing unit 308 configured to delete the index libraries corresponding to the reduce jobs from that DFS.
  • The above embodiments achieve the following technical effects (it should be noted that these are effects that certain preferred embodiments can achieve): with the technical solution provided by the embodiments of the present invention, the original data is processed using the map-reduce programming model in Hadoop, an index library corresponding to each reduce job is generated, and these index libraries are then merged into one complete index library for retrieval.
  • The modules or steps of the present invention can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network composed of multiple computing devices. Optionally, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases the steps shown or described may be performed in an order different from the one given here, or the modules or steps may each be fabricated as an individual integrated circuit module, or several of them may be fabricated as a single integrated circuit module.
  • The invention is therefore not limited to any specific combination of hardware and software.
  • The above are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, various modifications and changes can be made to the present invention.
  • The method and apparatus for generating a distributed index provided by the embodiments of the present invention have the following beneficial effects: the original data is processed by using map jobs and reduce jobs in Hadoop, an index library corresponding to each reduce job is generated, and the index libraries are then merged, thereby enabling efficient and fast indexing of massive data.
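
The two sizing rules stated in the definitions above (the number of map jobs follows from the data amount, and records are routed to reduce jobs by HashCode() % reduceNum) can be illustrated with a minimal Python sketch. This is not code from the patent: the function names, the `bytes_per_map` parameter, and the use of `zlib.crc32` as the hash are all assumptions made for illustration, since the patent does not say which hash function or split size is used.

```python
import zlib


def num_map_jobs(total_bytes: int, bytes_per_map: int) -> int:
    """Number of map jobs so that each handles at most bytes_per_map bytes."""
    if bytes_per_map <= 0:
        raise ValueError("bytes_per_map must be positive")
    # Ceiling division; keeping at least one job for empty input is an assumption.
    return max(1, (total_bytes + bytes_per_map - 1) // bytes_per_map)


def reduce_partition(key: str, reduce_num: int) -> int:
    """Pick the reduce job for a key, in the spirit of HashCode() % reduceNum."""
    # zlib.crc32 is a stand-in hash chosen because it is stable across runs;
    # it is not the hash named in the patent.
    return zlib.crc32(key.encode("utf-8")) % reduce_num


def assign_to_reduce_jobs(pairs, reduce_num: int):
    """Group (key, value) pairs into reduce_num buckets, one per reduce job."""
    buckets = [[] for _ in range(reduce_num)]
    for key, value in pairs:
        buckets[reduce_partition(key, reduce_num)].append((key, value))
    return buckets


if __name__ == "__main__":
    # 10 GiB of input split into 128 MiB parts (both values are hypothetical).
    print(num_map_jobs(10 * 1024**3, 128 * 1024**2))
    pairs = [("title", "doc1"), ("body", "doc1"), ("title", "doc2")]
    for job, bucket in enumerate(assign_to_reduce_jobs(pairs, 4)):
        print(job, bucket)
```

Because the partition depends only on the key, every pair with the same key lands in the same reduce job, so each temporary index library holds a disjoint slice of the keys.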

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

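The present invention discloses a method and an apparatus for generating a distributed index. In the method, the number of map jobs in Hadoop is determined according to the data amount of the original data; the data processed by each map job is distributed to a plurality of reduce jobs, and an index library corresponding to each reduce job is generated, wherein the number of reduce jobs and the correspondence between each reduce job and one or more map jobs are pre-configured; and the index libraries corresponding to the reduce jobs are merged. The technical solution provided by the present invention enables efficient and fast indexing of massive data.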

Description

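Method and apparatus for generating a distributed index

Technical Field

The present invention relates to the field of communications, and in particular to a method and an apparatus for generating a distributed index.

Background

With the arrival of the cloud era, big data has attracted more and more attention. Big data is often used to describe the large volume of unstructured and semi-structured data that a company creates, data that would cost too much time and money to load into a relational database for analysis. Big data analytics is often associated with cloud computing, because real-time analysis of large data sets requires a framework like MapReduce to distribute work across dozens, hundreds, or even thousands of computers. In the Internet industry, big data usually refers to the user network behavior data that Internet companies generate and accumulate in their daily operations. The scale of this data is so large that it can no longer be measured in gigabytes or terabytes.

How big is big data? In a single day, the content generated on the Internet could fill 168 million DVDs; 294 billion emails are sent; 2 million community posts are published; 378,000 mobile phones are sold...

By 2012, data volumes had jumped from the TB level (1 TB = 1024 GB) to the PB (1 PB = 1024 TB), EB (1 EB = 1024 PB), and even ZB (1 ZB = 1024 EB) levels. Research by International Data Corporation (IDC) shows that the amount of data generated worldwide was 0.49 ZB in 2008, 0.8 ZB in 2009, 1.2 ZB in 2010, and as much as 1.82 ZB in 2011, equivalent to more than 200 GB for every person in the world. As of 2012, all printed material ever produced by humankind amounted to 200 PB, and all the words ever spoken in human history amounted to about 5 EB. IBM research shows that 90% of all the data obtained by human civilization was generated in the past two years, and that by 2020 the scale of data generated worldwide will reach 44 times today's.

In the big data era, how to quickly and effectively find the data users care about has become an increasingly important problem. Creating an index efficiently and quickly is a precondition for user search. The index creation schemes commonly used in the related art are all single-threaded, so they hit performance bottlenecks when facing massive data; they place high demands on the system, and system scalability is limited, so they can no longer meet users' need for fast and effective data retrieval in massive data.

Summary of the Invention

The present invention provides a method and an apparatus for generating a distributed index, so as to at least solve the problem in the related art that an efficient and fast index cannot be created for massive data.

According to one aspect of the present invention, a method for generating a distributed index is provided.

The method for generating a distributed index according to the present invention includes: determining the number of map jobs in Hadoop according to the data amount of the original data; distributing the data processed by each map job to a plurality of reduce jobs, and generating an index library corresponding to each reduce job, wherein the number of reduce jobs and the correspondence between each reduce job and one or more map jobs are pre-configured; and merging the index libraries corresponding to the reduce jobs.

Preferably, generating the index library corresponding to each reduce job includes: obtaining the type of the currently supported file system; determining, according to the type of the file system, the generation manner of the index library corresponding to each reduce job; and generating the index library corresponding to each reduce job according to the generation manner.

Preferably, generating the index library corresponding to each reduce job according to the generation manner includes: when the file system type is the Hadoop Distributed File System (HDFS), generating the index library corresponding to each reduce job on the local disk, and then uploading the index libraries generated on the local disk to HDFS; or, when the file system type is a distributed file system (DFS) other than HDFS that supports sharing, generating the index library corresponding to each reduce job directly in that DFS.

Preferably, merging the index libraries corresponding to the reduce jobs includes: when the file system type is HDFS, downloading the index library corresponding to each reduce job from HDFS to the local disk; merging the index libraries corresponding to the reduce jobs on the local disk; uploading the merged index library to HDFS, and deleting the index libraries corresponding to the reduce jobs from the local disk.

Preferably, merging the index libraries corresponding to the reduce jobs includes: when the file system type is another DFS that supports sharing, merging the index libraries corresponding to the reduce jobs that were generated in that DFS; and deleting the index libraries corresponding to the reduce jobs from that DFS.

According to another aspect of the present invention, an apparatus for generating a distributed index is provided.

The apparatus for generating a distributed index according to the present invention includes: a determining module configured to determine the number of map jobs in Hadoop according to the data amount of the original data; a generating module configured to distribute the data processed by each map job to a plurality of reduce jobs and to generate an index library corresponding to each reduce job, wherein the number of reduce jobs and the correspondence between each reduce job and one or more map jobs are pre-configured; and a merging module configured to merge the index libraries corresponding to the reduce jobs.

Preferably, the generating module includes: an obtaining unit configured to obtain the type of the currently supported file system; a determining unit configured to determine, according to the type of the file system, the generation manner of the index library corresponding to each reduce job; and a generating unit configured to generate the index library corresponding to each reduce job according to the generation manner.

Preferably, the generating unit is configured to, when the file system type is HDFS, generate the index library corresponding to each reduce job on the local disk and then upload the index libraries generated on the local disk to HDFS; or the generating unit is configured to, when the file system type is another DFS that supports sharing, generate the index library corresponding to each reduce job directly in that DFS.

Preferably, the merging module includes: a downloading unit configured to download the index library corresponding to each reduce job from HDFS to the local disk when the file system type is HDFS; a first merging unit configured to merge the index libraries corresponding to the reduce jobs on the local disk; and a first processing unit configured to upload the merged index library to HDFS and to delete the index libraries corresponding to the reduce jobs from the local disk.

Preferably, the merging module includes: a second merging unit configured to merge the index libraries corresponding to the reduce jobs that were generated in another DFS that supports sharing, when the file system type is that DFS; and a second processing unit configured to delete the index libraries corresponding to the reduce jobs from that DFS.

According to the embodiments of the present invention, the number of map jobs in Hadoop is determined according to the data amount of the original data; the data processed by each map job is distributed to a plurality of reduce jobs, and an index library corresponding to each reduce job is generated, the number of reduce jobs and the correspondence between each reduce job and one or more map jobs being pre-configured; and the index libraries corresponding to the reduce jobs are merged. In other words, the original data is processed with Hadoop map jobs and reduce jobs, an index library is generated for each reduce job, and these index libraries are then merged. This solves the problem in the related art that an efficient and fast index cannot be created for massive data, and enables efficient and fast indexing of massive data.

Brief Description of the Drawings

The drawings described here are provided for a further understanding of the present invention and form part of the present application. The illustrative embodiments of the present invention and their description are used to explain the present invention and do not unduly limit it. In the drawings:

FIG. 1 is a flowchart of a method for generating a distributed index according to an embodiment of the present invention;

FIG. 2 is a flowchart of a method for generating a distributed index according to a preferred embodiment of the present invention;

FIG. 3 is a structural block diagram of an apparatus for generating a distributed index according to an embodiment of the present invention;

FIG. 4 is a structural block diagram of an apparatus for generating a distributed index according to a preferred embodiment of the present invention.

Detailed Description

The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments. It should be noted that, provided there is no conflict, the embodiments in the present application and the features in the embodiments may be combined with each other.

FIG. 1 is a flowchart of a method for generating a distributed index according to an embodiment of the present invention. As shown in FIG. 1, the method may include the following processing steps:

Step S102: determining the number of map jobs in Hadoop according to the data amount of the original data;

Step S104: distributing the data processed by each map job to a plurality of reduce jobs, and generating an index library corresponding to each reduce job, wherein the number of reduce jobs and the correspondence between each reduce job and one or more map jobs are pre-configured;

Step S106: merging the index libraries corresponding to the reduce jobs.

In the related art, an efficient and fast index cannot be created for massive data. With the method shown in FIG. 1, the original data is processed with Hadoop map jobs and reduce jobs, an index library is generated for each reduce job, and these index libraries are then merged. This solves the problem in the related art and enables efficient and fast indexing of massive data.

Preferably, in step S104, generating the index library corresponding to each reduce job may include the following operations:

Step S1: obtaining the type of the currently supported file system;

Step S2: determining, according to the type of the file system, the generation manner of the index library corresponding to each reduce job;

Step S3: generating the index library corresponding to each reduce job according to the generation manner.

In a preferred embodiment, the size of the original data to be acquired is determined first, and the data is divided into M parts (M is a positive integer), each part corresponding to one map job. The amount of data processed by each map job can of course be configured dynamically, and the map data processing plug-in is set accordingly. In addition, the set of intermediate key-value pairs produced by each map job is periodically written to the local disk, which can in turn be divided into N partitions (N is a positive integer set by the user), each partition corresponding to one reduce job. The maximum number of reduce jobs is configured to improve the efficiency of creating the distributed index, and the reduce data processing plug-in is set according to the number of reduce jobs configured by the user. In this preferred embodiment, index creation supports the Hadoop Distributed File System (HDFS) as well as other distributed file systems (DFS) that support sharing. The generation manner of the index library corresponding to each reduce job can therefore be determined according to the type of file system supported during index creation, and the index library corresponding to each reduce job is then generated according to that manner.

Preferably, in step S3, generating the index library corresponding to each reduce job according to the generation manner may include one of the following steps:

Step S31: when the file system type is HDFS, generating the index library corresponding to each reduce job on the local disk, and then uploading the index libraries generated on the local disk to HDFS;

Step S32: when the file system type is a DFS other than HDFS that supports sharing, generating the index library corresponding to each reduce job directly in that DFS.

In a preferred embodiment, if the currently supported file system type is HDFS, each reduce job generates a temporary index library in the local file system (that is, on the local disk); then, during the final cleanup of the reduce job, the temporary index library generated in the local file system can be uploaded to HDFS. If the currently supported file system type is another DFS that supports sharing, the temporary index library can be generated directly in that DFS.

Preferably, in step S106, merging the index libraries corresponding to the reduce jobs may include the following operations:

Step S4: when the file system type is HDFS, downloading the index library corresponding to each reduce job from HDFS to the local disk;

Step S5: merging the index libraries corresponding to the reduce jobs on the local disk;

Step S6: uploading the merged index library to HDFS, and deleting the index libraries corresponding to the reduce jobs from the local disk.

In a preferred embodiment, if the currently supported file system type is HDFS, the Hadoop index master node first downloads all temporary index libraries from HDFS to its local file system; second, it merges all temporary index libraries in the local file system into one complete index library; third, it uploads the complete index library to HDFS; it then deletes the temporary index libraries from its local file system; finally, the Hadoop index slave nodes download the complete index library from HDFS to their local file systems for retrieval.

Preferably, in step S106, merging the index libraries corresponding to the reduce jobs may include the following steps:

Step S7: when the file system type is another DFS that supports sharing, merging the index libraries corresponding to the reduce jobs that were generated in that DFS;

Step S8: deleting the index libraries corresponding to the reduce jobs from that DFS.

In a preferred embodiment, if the currently supported file system type is another DFS that supports sharing, the Hadoop index master node first merges the temporary index libraries in the DFS into one complete index library for retrieval, and then deletes the temporary index libraries from the DFS.

The preferred implementation process is described further below in conjunction with the preferred embodiment shown in FIG. 2.

FIG. 2 is a flowchart of a method for generating a distributed index according to a preferred embodiment of the present invention. As shown in FIG. 2, the process may include the following processing stages:

The first stage is the data collection stage, that is, Hadoop's map job phase. It is the preparatory stage for building the index and supplies the data needed to create it. The map job phase uses a distributed implementation that processes data in parallel, and the number of map jobs is determined dynamically by the amount of data collected. Hadoop map jobs process the collected text files or database files and generate the content of each field (a set of (key, value) pairs) required for creating the index, which greatly improves data processing performance. Because plug-in processing is supported during collection, different processing methods can be customized according to the amount of data.

The second stage is the index creation stage, that is, Hadoop's reduce job phase, in which the distributed index library is created. Setting the number of reduce jobs determines reduceNum, the maximum number of reduce jobs processed in parallel. The data generated in the data collection stage is assigned to the individual reduce jobs by HashCode() % reduceNum, and each reduce job generates its own temporary index library file.

It should be noted that index creation supports HDFS as well as other DFSs that support sharing.

The third stage is the index merging stage. The index master node invokes index merging to combine the temporary index libraries generated by the reduce jobs in the index creation stage into one complete index library. During merging, the temporary index libraries can be read one by one and merged into a separate main index library; the temporary index libraries are then deleted, and the main index library serves retrieval requests.

FIG. 3 is a structural block diagram of an apparatus for generating a distributed index according to an embodiment of the present invention. As shown in FIG. 3, the apparatus may include: a determining module 10 configured to determine the number of map jobs in Hadoop according to the data amount of the original data; a generating module 20 configured to distribute the data processed by each map job to a plurality of reduce jobs and to generate an index library corresponding to each reduce job, wherein the number of reduce jobs and the correspondence between each reduce job and one or more map jobs are pre-configured; and a merging module 30 configured to merge the index libraries corresponding to the reduce jobs.

The apparatus shown in FIG. 3 solves the problem in the related art that an efficient and fast index cannot be created for massive data, and enables efficient and fast indexing of massive data.

Preferably, as shown in FIG. 4, the generating module 20 may include: an obtaining unit 200 configured to obtain the type of the currently supported file system; a determining unit 202 configured to determine, according to the type of the file system, the generation manner of the index library corresponding to each reduce job; and a generating unit 204 configured to generate the index library corresponding to each reduce job according to the generation manner.

Preferably, as shown in FIG. 4, the generating unit 204 is configured to, when the file system type is HDFS, generate the index library corresponding to each reduce job on the local disk and then upload the index libraries generated on the local disk to HDFS; or the generating unit 204 is configured to, when the file system type is another DFS that supports sharing, generate the index library corresponding to each reduce job directly in that DFS.

Preferably, as shown in FIG. 4, the merging module 30 may include: a downloading unit 300 configured to download the index library corresponding to each reduce job from HDFS to the local disk when the file system type is HDFS; a first merging unit 302 configured to merge the index libraries corresponding to the reduce jobs on the local disk; and a first processing unit 304 configured to upload the merged index library to HDFS and to delete the index libraries corresponding to the reduce jobs from the local disk.

Preferably, as shown in FIG. 4, the merging module 30 may include: a second merging unit 306 configured to merge the index libraries corresponding to the reduce jobs that were generated in another DFS that supports sharing, when the file system type is that DFS; and a second processing unit 308 configured to delete the index libraries corresponding to the reduce jobs from that DFS.

From the above description it can be seen that the above embodiments achieve the following technical effects (it should be noted that these are effects that certain preferred embodiments can achieve): with the technical solution provided by the embodiments of the present invention, the original data is processed using the map-reduce programming model in Hadoop, an index library corresponding to each reduce job is generated, and these index libraries are then merged into one complete index library for retrieval. This solves the problem in the related art that an efficient and fast index cannot be created for massive data, and enables efficient and fast indexing of massive data.

Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network composed of multiple computing devices. Optionally, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases the steps shown or described may be performed in an order different from the one given here, or the modules or steps may each be fabricated as an individual integrated circuit module, or several of them may be fabricated as a single integrated circuit module. The present invention is therefore not limited to any specific combination of hardware and software.

The above are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, various modifications and changes can be made to the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Industrial Applicability

As described above, the method and apparatus for generating a distributed index provided by the embodiments of the present invention have the following beneficial effects: the original data is processed by using map jobs and reduce jobs in Hadoop, an index library corresponding to each reduce job is generated, and the index libraries are then merged, thereby enabling efficient and fast indexing of massive data.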
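
As an illustration of the index creation and merge stages described above (steps S4 to S8), the following Python sketch models each temporary index library as a small in-memory inverted index and merges them into one complete index. It is a simplified sketch, not the patented implementation: the dictionary-based index format, the function names, and the file system labels ("HDFS" versus any other string) are assumptions, and no real Hadoop or HDFS calls are made.

```python
from collections import defaultdict


def build_temp_index(records):
    """Build one reduce job's temporary index: term -> sorted list of doc ids."""
    index = defaultdict(set)
    for doc_id, text in records:
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def merge_indexes(temp_indexes):
    """Merge the temporary indexes one by one into a single complete index."""
    merged = defaultdict(set)
    for temp in temp_indexes:
        for term, ids in temp.items():
            merged[term].update(ids)
    return {term: sorted(ids) for term, ids in merged.items()}


def merge_plan(fs_type):
    """List the merge steps for a file system type, following S4-S6 or S7-S8."""
    if fs_type == "HDFS":
        return [
            "download temporary index libraries from HDFS to the local disk",
            "merge the temporary index libraries on the local disk",
            "upload the merged index library to HDFS",
            "delete the temporary index libraries from the local disk",
        ]
    # Any other value is treated as a shared DFS (an assumption of this sketch).
    return [
        "merge the temporary index libraries in place on the shared DFS",
        "delete the temporary index libraries from the shared DFS",
    ]


if __name__ == "__main__":
    job0 = build_temp_index([(1, "distributed index"), (2, "index merge")])
    job1 = build_temp_index([(3, "merge stage"), (4, "distributed stage")])
    print(merge_indexes([job0, job1]))
    for step in merge_plan("HDFS"):
        print(step)
```

The HDFS branch has two extra steps because, in the described embodiment, merging happens on the index master node's local disk rather than in HDFS itself; on a shared DFS the merge can run in place.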

Claims

权 利 要 求 书
1. 一种分布式索引的生成方法, 包括:
根据原始数据的数据量确定 Hadoop中的映射 map作业的数量; 将经过各个 map作业处理后的数据分配至多个规约 reduce作业,并生成与 每个 reduce作业对应的索引库, 其中, 所述 reduce作业的数量以及所述每个 reduce作业与一个或多个 map作业之间的对应关系均为预先配置完成的; 对与所述每个 reduce作业对应的索引库进行合并。
2. 根据权利要求 1所述的方法, 其中, 生成与所述每个 reduce作业对应的索引库 包括:
获取当前支持的文件系统的类型;
根据所述文件系统的类型确定与所述每个 reduce作业对应的索引库的生成 方式;
按照所述生成方式生成与所述每个 reduce作业对应的索引库。
3. 根据权利要求 2所述的方法, 其中, 按照所述生成方式生成与所述每个 reduce 作业对应的索引库包括:
当所述文件系统的类型为 Hadoop分布式文件系统 HDFS时, 在本地磁盘 中生成与所述每个 reduce作业对应的索引库, 然后将在所述本地磁盘中生成的 索引库均上传至所述 HDFS; 或者,
当所述文件系统的类型为除所述 HDFS之外的其余支持共享的分布式文件 系统 DFS时, 直接在所述其余支持共享的 DFS中生成与所述每个 reduce作业 对应的索引库。
4. 根据权利要求 3所述的方法, 其中, 对与所述每个 reduce作业对应的索引库进 行合并包括:
当所述文件系统的类型为所述 HDFS 时, 将所述 HDFS 中的与所述每个 reduce作业对应的索引库下载至所述本地磁盘;
在所述本地磁盘对与所述每个 reduce作业对应的索引库进行合并; 将合并后得到的索引库上传至所述 HDFS, 并将所述本地磁盘中的与所述 每个 reduce作业对应的索引库进行删除。
5. 根据权利要求 3所述的方法, 其中, 对与所述每个 reduce作业对应的索引库进 行合并包括:
当所述文件系统的类型为所述其余支持共享的 DFS时,对所述其余支持共 享的 DFS中生成的与所述每个 reduce作业对应的索引库进行合并;
将所述其余支持共享的 DFS中生成的与所述每个 reduce作业对应的索引库 进行删除。
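6. An apparatus for generating a distributed index, comprising:
a determination module, configured to determine the number of map jobs in Hadoop according to the data volume of the raw data;
a generation module, configured to distribute the data processed by each map job to a plurality of reduce jobs and to generate an index library corresponding to each reduce job, wherein the number of reduce jobs and the correspondence between each reduce job and one or more map jobs are both configured in advance; and
a merging module, configured to merge the index libraries corresponding to the reduce jobs.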
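7. The apparatus according to claim 6, wherein the generation module comprises:
an acquisition unit, configured to acquire the type of the currently supported file system;
a determination unit, configured to determine, according to the type of the file system, the generation manner of the index library corresponding to each reduce job; and
a generation unit, configured to generate the index library corresponding to each reduce job in the generation manner.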
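8. The apparatus according to claim 7, wherein the generation unit is configured to, when the type of the file system is the Hadoop Distributed File System (HDFS), generate the index library corresponding to each reduce job on a local disk and then upload all index libraries generated on the local disk to the HDFS; or the generation unit is configured to, when the type of the file system is another distributed file system (DFS) that supports sharing, other than the HDFS, generate the index library corresponding to each reduce job directly in the other sharing-capable DFS.
9. The apparatus according to claim 8, wherein the merging module comprises:
a download unit, configured to, when the type of the file system is the HDFS, download the index libraries corresponding to the reduce jobs from the HDFS to the local disk;
a first merging unit, configured to merge the index libraries corresponding to the reduce jobs on the local disk; and
a first processing unit, configured to upload the merged index library to the HDFS and to delete the index libraries corresponding to the reduce jobs from the local disk.
10. The apparatus according to claim 8, wherein the merging module comprises:
a second merging unit, configured to, when the type of the file system is the other sharing-capable DFS, merge the index libraries corresponding to the reduce jobs that were generated in the other sharing-capable DFS; and
a second processing unit, configured to delete the index libraries corresponding to the reduce jobs that were generated in the other sharing-capable DFS.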
PCT/CN2014/078696 2013-12-17 2014-05-28 Method and apparatus for generating a distributed index WO2014180411A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310695615.6 2013-12-17
CN201310695615.6A CN104714983B (zh) 2013-12-17 2013-12-17 Method and apparatus for generating a distributed index

Publications (1)

Publication Number Publication Date
WO2014180411A1 (zh) 2014-11-13

Family

ID=51866791

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/078696 WO2014180411A1 (zh) 2013-12-17 2014-05-28 Method and apparatus for generating a distributed index

Country Status (2)

Country Link
CN (1) CN104714983B (zh)
WO (1) WO2014180411A1 (zh)

Also Published As

Publication number Publication date
CN104714983B (zh) 2019-02-19
CN104714983A (zh) 2015-06-17


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14795393

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14795393

Country of ref document: EP

Kind code of ref document: A1