CN103399944A - Implementation method and implementation device for data duplication elimination query - Google Patents

Implementation method and implementation device for data duplication elimination query

Info

Publication number
CN103399944A
Authority
CN
China
Prior art keywords
query
database
results
node
plurality
Prior art date
Application number
CN2013103539781A
Other languages
Chinese (zh)
Inventor
宋怀明
王勇
苗艳超
刘新春
邵宗有
Original Assignee
曙光信息产业(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 曙光信息产业(北京)有限公司
Priority to CN2013103539781A priority Critical patent/CN103399944A/en
Publication of CN103399944A publication Critical patent/CN103399944A/en

Abstract

The invention discloses an implementation method and an implementation device for data deduplication queries. The method comprises: querying each database node of a plurality of database nodes to obtain query results, wherein, for each database node at which multiple query results are found, a deduplication operation is performed on those results and the deduplicated result is taken as the query result of that node; and merging the query results of the plurality of database nodes. The method and device achieve deduplication queries over large-scale data in a database cluster, are not limited to the column by which the data is partitioned for deduplication, support deduplication queries on arbitrary data columns, avoid repeated computation during the query, and improve deduplication query efficiency.

Description

Implementation method and implementation device for data deduplication query

Technical Field

[0001] The present invention relates to the field of database storage, and in particular to an implementation method and an implementation device for data deduplication queries.

Background

[0002] Eliminating duplicate records is a common type of query operation in current database systems; such queries are usually called deduplication queries. For example, a database application often needs to list all distinct records, compute statistics over distinct records, or count the distinct records.

[0003] Within a single database system, the mature methods for duplicate elimination are mainly the sort-merge method and the hash-merge method. In a database cluster composed of multiple mutually independent database systems, however, duplicate records may be distributed across different database servers, and the overhead of network transmission and communication between database nodes makes cross-node deduplication queries harder to process, so the existing sort-merge and hash-merge methods can no longer be used as they are. This led to the idea of partitioning the data by the deduplication column. A hash partitioning strategy keyed on the deduplication column does reduce data exchange between nodes, but a deduplication query on any other attribute inevitably introduces a large amount of data exchange between nodes and increases the complexity of query processing, so this approach cannot handle deduplication queries on arbitrary data columns. Moreover, when query results are aggregated under the existing partition-by-deduplication-column approach, computations may be repeated, so the efficiency of deduplication queries is unsatisfactory.

[0004] No effective solution has yet been proposed for these problems in the related art: deduplication queries on arbitrary data columns cannot be handled, and repeated computation during result aggregation makes deduplication queries inefficient.

Summary of the Invention

[0005] In view of the problems in the related art that deduplication queries on arbitrary data columns cannot be handled and that repeated computation during result aggregation makes deduplication queries inefficient, the present invention proposes an implementation method and an implementation device for data deduplication queries. They address the inability of the related art to perform deduplication queries over large-scale data on a database cluster, and they support deduplication queries on arbitrary data columns.

[0006] The technical solution of the present invention is implemented as follows:

[0007] According to one aspect of the present invention, an implementation method for data deduplication queries is provided.

[0008] The implementation method comprises:

[0009] querying each database node of a plurality of database nodes to obtain query results, wherein, for each database node at which multiple query results are found, a deduplication operation is performed on the multiple query results obtained from that database node, and the result of the deduplication operation is taken as the query result of that database node;

[0010] merging the query results of the plurality of database nodes.

[0011] When the deduplication operation is performed on multiple query results, the query results to be deduplicated may be sorted according to a predetermined ordering, and the deduplication operation performed on the sorted query results.

[0012] When the query results of the plurality of database nodes are merged, the query results of at least two database nodes may be merged first and the merged query result stored in a predetermined storage area; the query result of each database node that has not yet been merged is then merged with the query result in the storage area.

[0013] Optionally, when the query results of not-yet-merged database nodes are merged with the query result in the storage area and there are multiple such query results, they are merged with the query result in the storage area one after another.

[0014] Optionally, when the query results of not-yet-merged database nodes are merged with the query result in the storage area and there are multiple such query results, they are merged with the query result in the storage area in batches.

[0015] Optionally, when the query results of not-yet-merged database nodes are merged with the query result in the storage area and there are multiple such query results, the remaining space in the storage area is divided in advance into multiple corresponding storage sub-areas, and the query results of the not-yet-merged database nodes are stored into the corresponding storage sub-areas according to a predetermined correspondence; once they are stored, the query results in the storage sub-areas are merged with the original query result in the storage area.

[0016] Optionally, for each database node, after the result of the deduplication operation is taken as the query result of that database node, the query result is range-partitioned according to a predetermined policy to obtain multiple query result groups, and each query result group is sent to a corresponding other database node, so that each of those other database nodes deduplicates its query results based on the query result group it receives.

[0017] When the query result of a database node is range-partitioned according to the predetermined policy, the query result may be sorted according to a predetermined ordering and the sorted query result divided into equal parts.

[0018] Optionally, after the query results of the plurality of database nodes are merged, a deduplication operation may be performed on the merged query result.

[0019] According to another aspect of the present invention, an implementation device for data deduplication queries is provided.

[0020] The implementation device comprises:

[0021] a first deduplication module, configured to query each database node of a plurality of database nodes to obtain query results, wherein, for each database node at which multiple query results are found, a deduplication operation is performed on the multiple query results obtained from that database node, and the result of the deduplication operation is taken as the query result of that database node;

[0022] a result merging module, configured to merge the query results of the plurality of database nodes.

[0023] The first deduplication module comprises a first sorting submodule and a first deduplication submodule. The first sorting submodule is configured to sort the query results to be deduplicated according to a predetermined ordering; the first deduplication submodule is configured to perform the deduplication operation on the sorted query results.

[0024] The result merging module comprises a first merging submodule and a second merging submodule. The first merging submodule is configured to merge the query results of at least two database nodes and to store the merged query result in a predetermined storage area; the second merging submodule is configured to merge the query result of each not-yet-merged database node with the query result in the storage area.

[0025] Optionally, when merging the query results of not-yet-merged database nodes with the query result in the storage area, and where there are multiple such query results, the second merging submodule is further configured to merge them with the query result in the storage area one after another.

[0026] Optionally, when merging the query results of not-yet-merged database nodes with the query result in the storage area, and where there are multiple such query results, the second merging submodule is further configured to merge them with the query result in the storage area in batches.

[0027] Optionally, when merging the query results of not-yet-merged database nodes with the query result in the storage area, and where there are multiple such query results, the second merging submodule is further configured to divide the remaining space in the storage area in advance into multiple storage sub-areas, to store the query results of the not-yet-merged database nodes into the corresponding storage sub-areas according to a predetermined correspondence, and, once they are stored, to merge the query results in the storage sub-areas with the original query result in the storage area.

[0028] Optionally, the implementation device further comprises a second deduplication module, configured to, for each database node, after the result of the deduplication operation is taken as the query result of that database node, range-partition the query result according to a predetermined policy to obtain multiple query result groups, and send each query result group to a corresponding other database node, so that each of those other database nodes deduplicates its query results based on the query result group it receives.

[0029] The second deduplication module comprises a second sorting submodule and an equal-division module. The second sorting submodule is configured to sort the query result of the database node according to a predetermined ordering; the equal-division module is configured to divide the sorted query result into equal parts.

[0030] Optionally, the implementation device further comprises a third deduplication module, configured to perform a deduplication operation on the merged query result after the query results of the plurality of database nodes have been merged.

[0031] The present invention takes each database node in the database cluster as its target, performs a unified deduplication query over all data within the node, and then merges the deduplication query results of all database nodes in the cluster. This achieves deduplication queries over large-scale data in a database cluster, avoids being limited to the column by which the data is partitioned for deduplication, and supports deduplication queries on arbitrary data columns. In addition, because the deduplication query operates uniformly on all data within a database node, repeated computation during the deduplication query is avoided and the efficiency of deduplication queries is improved.

Brief Description of the Drawings

[0032] To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

[0033] FIG. 1 is a schematic flowchart of the implementation method for data deduplication queries according to an embodiment of the present invention;

[0034] FIG. 2 is a schematic diagram of the principle of deduplication query processing in a database cluster according to an embodiment of the present invention;

[0035] FIG. 3 is a schematic diagram of the performance obtained when merging in batch mode according to an embodiment of the present invention;

[0036] FIG. 4 is a schematic diagram of the performance obtained when merging with both storage-area partitioning and batch mode according to an embodiment of the present invention;

[0037] FIG. 5 is a schematic diagram of range-partitioning the query results according to an embodiment of the present invention;

[0038] FIG. 6 is a schematic structural diagram of the implementation device for data deduplication queries according to an embodiment of the present invention.

Detailed Description

[0039] The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of those embodiments. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments that a person of ordinary skill in the art obtains on the basis of the embodiments of the present invention fall within the scope of protection of the present invention.

[0040] According to an embodiment of the present invention, an implementation method for data deduplication queries is provided.

[0041] As shown in FIG. 1, the implementation method for data deduplication queries according to an embodiment of the present invention comprises:

[0042] Step S101: query each database node of a plurality of database nodes to obtain query results, wherein, for each database node at which multiple query results are found, a deduplication operation is performed on the multiple query results obtained from that database node, and the result of the deduplication operation is taken as the query result of that database node.

[0043] Step S103: merge the query results of the plurality of database nodes.

[0044] In the steps above, when the deduplication operation is performed on multiple query results, the results to be deduplicated may be sorted according to a predetermined ordering, and the deduplication operation performed on the sorted query results.

[0045] The ordering can be chosen freely. For example, when call records are queried, they can be sorted by the time at which the call took place or by the duration of the call; any ordering works as long as it places identical records next to each other, so that they can be deduplicated easily.

[0046] Likewise, the method used to deduplicate the sorted query results can be chosen as needed: for example, multiple records can be compared together in a single pass, or records can be compared one by one in sequence.
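The sort-then-deduplicate step of paragraphs [0044] to [0046] can be sketched as follows. This is a minimal illustration and not the patent's implementation: the function name `sort_and_dedup`, the tuple layout of a call record, and the use of the full row as a tie-breaker are assumptions made for the example.

```python
def sort_and_dedup(rows, order_by=0):
    """Sort rows so that identical rows are adjacent, then drop adjacent repeats.

    `order_by` is the index of the column used as the primary sort key
    (hypothetical parameter). The full row is used as a tie-breaker so that
    identical rows always end up next to each other.
    """
    ordered = sorted(rows, key=lambda r: (r[order_by], r))
    result = []
    for row in ordered:
        # Compare one by one with the last row kept.
        if not result or result[-1] != row:
            result.append(row)
    return result


# Hypothetical call records: (caller, callee, duration in seconds).
calls = [("alice", "bob", 30), ("carol", "dave", 12), ("alice", "bob", 30)]
distinct_calls = sort_and_dedup(calls, order_by=2)
```

Sorting by duration alone would not be enough, because two different records with the same duration could separate two identical records; the tie-breaker on the whole row avoids that.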

[0047] In addition, in the steps above, when the query results of the plurality of database nodes are merged, the query results of at least two database nodes may be merged first and the merged query result stored in a predetermined storage area; the query result of each not-yet-merged database node is then merged with the query result in the storage area.

[0048] With this way of merging, result merging does not have to wait until all database nodes have finished executing. As soon as at least two database nodes have returned query results, a partial computation can start; when a new database node returns its query result, that result is merged with the partially computed (that is, already merged) query result. This can greatly improve merge performance.

[0049] In practice, the number of new database nodes (that is, database nodes whose query results have not yet been merged) may be one or more. With one new database node, its query result is merged directly with the query result in the storage area. With several new database nodes, their query results can be merged with the query result in the storage area in the following ways.

[0050] When the query results of not-yet-merged database nodes are merged with the query result in the storage area and there are multiple such query results, they may be merged with the query result in the storage area one after another. For example, once two database nodes have returned query results, the two results are merged and the merged query result is stored in the predetermined storage area (for example, a shared data area). If several new database nodes then return query results, the results are handled in order of arrival: the query result of the first node to return is merged with the query result in the storage area to form a new query result, the query result of the second node to return is then merged with that new query result, and so on.
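The one-after-another merging of paragraph [0050] can be sketched like this. It is a simplified, single-threaded illustration under assumptions not stated in the source: each node's result is already a sorted, duplicate-free list, and the names `merge_dedup` and `merge_in_arrival_order` are hypothetical.

```python
def merge_dedup(a, b):
    """Merge two sorted, duplicate-free lists into one sorted, duplicate-free list."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])
            i += 1
        elif b[j] < a[i]:
            out.append(b[j])
            j += 1
        else:
            # Same record on both sides: keep a single copy.
            out.append(a[i])
            i += 1
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def merge_in_arrival_order(results):
    """Merge node results in the order they arrive (needs at least two results)."""
    arrivals = iter(results)
    # The first two results to arrive start the merge; the outcome plays the
    # role of the content of the shared storage area.
    shared = merge_dedup(next(arrivals), next(arrivals))
    for node_result in arrivals:
        shared = merge_dedup(shared, node_result)
    return shared


merged = merge_in_arrival_order([[1, 3, 5], [2, 3], [5, 6], [1, 7]])
```

Because every intermediate result stays sorted, each later arrival can be folded in with a single linear pass.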

[0051] Alternatively, when the query results of not-yet-merged database nodes are merged with the query result in the storage area and there are multiple such query results, they may be merged with the query result in the storage area in batches. For example, once two database nodes have returned query results, the two results are merged and the merged query result is stored in the predetermined storage area (for example, a shared data area). If several new database nodes then return query results, their results can be gathered together and merged with the query result in the storage area in one operation.

[0052] As a further alternative, when the query results of not-yet-merged database nodes are merged with the query result in the storage area and there are multiple such query results, the remaining space in the storage area may be divided in advance into multiple corresponding storage sub-areas, and the query results of the not-yet-merged database nodes stored into the corresponding storage sub-areas according to a predetermined correspondence; once they are stored, the query results in the storage sub-areas are merged with the original query result in the storage area.

[0053] Of the three merging approaches above, both batch merging and merging with a partitioned storage area reduce merge conflicts in practice.

[0054] To improve the accuracy of deduplication queries, after performing the deduplication operation within each of the database nodes, the present invention also performs a deduplication operation across the database nodes. There are two ways of deduplicating across database nodes.

[0055] In the first, for each database node, after the result of the deduplication operation is taken as the query result of that database node, the query result is range-partitioned according to a predetermined policy to obtain multiple query result groups, and each query result group is sent to a corresponding other database node, so that each of those other database nodes deduplicates its own query result based on the query result group it receives.

[0056] When the query result of a database node is range-partitioned according to the predetermined policy, the query result may be sorted according to a predetermined ordering and the sorted query result divided into equal parts.

[0057] The ordering here can be specified freely, and the partitioning is not limited to equal division; other partitioning schemes may be used as needed and are not described one by one here.
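As an illustration of the equal division described in [0056], the sketch below cuts an already sorted result into a given number of contiguous groups whose sizes differ by at most one. The function name `split_equally` and the rule for distributing the remainder are assumptions; the source only says that the sorted result is divided equally.

```python
def split_equally(sorted_rows, n_groups):
    """Cut a sorted list into n_groups contiguous groups of near-equal size."""
    size, extra = divmod(len(sorted_rows), n_groups)
    groups, start = [], 0
    for g in range(n_groups):
        # The first `extra` groups take one additional row each.
        end = start + size + (1 if g < extra else 0)
        groups.append(sorted_rows[start:end])
        start = end
    return groups


groups = split_equally([1, 2, 3, 4, 5, 6, 7], 3)
```

Since the input is sorted and the groups are contiguous, each group covers one range of values, which is what makes this a range partition.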

[0058] In the second, after the query results of the plurality of database nodes are merged, a deduplication operation is performed on the merged query result.

[0059] The technical solution of the present invention is described in detail below through specific examples.

[0060] In practice, the technical solution of the present invention is as shown in FIG. 2, a schematic diagram of the principle of deduplication query processing in a database cluster. As FIG. 2 shows, the main idea of the present invention is to divide the query between one master node (Master) and multiple slave nodes (Slave). The master node controls the whole query process and aggregates the final query results, while the slave nodes perform the deduplication query on the individual database nodes. For an arbitrary deduplication query, execution therefore has two main steps: 1) perform the deduplication query on each database node; 2) merge the results returned by all database nodes.

[0061] Because duplicate records may be distributed across different database nodes, after each database node has eliminated its own duplicates the data from all database nodes must be deduplicated once more. Care is needed to keep result aggregation from becoming the bottleneck of the system, for example when the data volume is very large or when execution times differ widely between database nodes.

[0062] To keep a large data volume from turning result aggregation into the bottleneck of the system, each database node can use merge sort to eliminate duplicates in its query results. In FIG. 2, for example, the merge thread in a slave node deduplicates and merges the query results, forming the local result set of that database node. On top of these sorted results, the master node then merges the results using a multi-way merge.
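The multi-way merge on the master node mentioned in [0062] can be sketched with Python's standard `heapq.merge`, which lazily merges several sorted inputs. This is a hedged illustration, assuming each slave returns a sorted, locally deduplicated list; the function name `multiway_merge_dedup` is hypothetical.

```python
import heapq


def multiway_merge_dedup(sorted_lists):
    """Merge several sorted lists and drop records that repeat across lists."""
    merged = []
    # heapq.merge yields all input items in sorted order, so records that
    # occur on several nodes come out next to each other.
    for row in heapq.merge(*sorted_lists):
        if not merged or merged[-1] != row:
            merged.append(row)
    return merged


# Hypothetical local result sets from three slave nodes.
final_result = multiway_merge_dedup([[1, 4, 9], [2, 4], [1, 9, 10]])
```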

[0063] To improve merge efficiency and keep database nodes from waiting for one another, data merging can begin as soon as two database nodes have returned results (the merged result remains sorted), and the merged data is written to a shared data area. Thereafter, whenever a database node returns a result, a merge thread can be created to merge the data in the shared data area with the newly returned result.

[0064] Because a single shared data area holds the merged data, each merge thread must take a lock when it writes to that area. The more database nodes there are, the more often the data area is locked and the higher the probability of conflicts. In an implementation, batch writing and partitioning of the shared area can be used to reduce the number of lock acquisitions and therefore the write conflicts.

[0065] Batch writing means that a merge thread does not write to the shared area each time it has merged a single result; instead it writes multiple records at a time. This greatly reduces the number of lock acquisitions and effectively reduces lock conflicts. The performance achieved with batch writing is shown in FIG. 3, a schematic diagram of the performance obtained when merging in batch mode, in which the horizontal axis is the batch size parameter and the vertical axis is the write time. As FIG. 3 shows, with a batch size of 100, performance improves by more than 90% compared with writing single records.
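The effect of batch writing on the number of lock acquisitions, as described in [0065], can be illustrated with a small sketch. The class `SharedArea`, its acquisition counter, and the helper `write_in_batches` are assumptions made for the example; the sketch only counts lock acquisitions and does not reproduce the timings of FIG. 3.

```python
import threading


class SharedArea:
    """A shared result buffer protected by a single lock (hypothetical)."""

    def __init__(self):
        self.rows = []
        self.lock = threading.Lock()
        self.lock_acquisitions = 0

    def write_batch(self, batch):
        # One lock acquisition covers the whole batch.
        with self.lock:
            self.lock_acquisitions += 1
            self.rows.extend(batch)


def write_in_batches(area, rows, batch_size):
    """Write rows to the shared area, batch_size records per lock acquisition."""
    for start in range(0, len(rows), batch_size):
        area.write_batch(rows[start:start + batch_size])


single = SharedArea()
write_in_batches(single, list(range(1000)), batch_size=1)

batched = SharedArea()
write_in_batches(batched, list(range(1000)), batch_size=100)
```

In this example the same 1000 records cost 1000 lock acquisitions when written one at a time and 10 when written in batches of 100.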

[0066] Reducing lock acquisitions by partitioning the shared area means dividing the shared area into several independent, smaller shared areas, so that during writes each area can be locked separately, which makes the lock granularity finer. This allows multiple merge threads to lock different shared areas at the same time, effectively reducing lock conflicts.
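
Partitioning the shared area might look like the following sketch. It assumes records are routed to a partition by `hash(record) % num_partitions`, which is one possible routing rule and not necessarily the one used in the patent; `PartitionedArea` and `run_writers` are hypothetical names. Each partition is kept as a set for brevity, so ordering is restored only when a snapshot is taken.

```python
import threading


class PartitionedArea:
    """Shared area split into independent partitions, one lock each."""

    def __init__(self, num_partitions):
        self.partitions = [set() for _ in range(num_partitions)]
        self.locks = [threading.Lock() for _ in range(num_partitions)]

    def write(self, records):
        # Group records by target partition first, so each partition
        # lock is taken at most once per call.
        buckets = {}
        for record in records:
            index = hash(record) % len(self.partitions)
            buckets.setdefault(index, []).append(record)
        for index, bucket in buckets.items():
            with self.locks[index]:  # only this partition is blocked
                self.partitions[index].update(bucket)

    def snapshot(self):
        """Return all deduplicated records in sorted order."""
        return sorted(set().union(*self.partitions))


def run_writers(area, node_results):
    """Start one writer thread per node result and wait for all of them."""
    threads = [threading.Thread(target=area.write, args=(result,))
               for result in node_results]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

Two threads block each other only when they write to the same partition at the same moment, which is the reduction in contention the paragraph describes.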

[0067] Of course, in practice the two approaches can be combined. The performance of the combination is shown in FIG. 4, a schematic diagram of the performance of merging with both storage-area partitioning and batch writing. In FIG. 4, the horizontal axis is the hash partition number and the vertical axis is the write time. As FIG. 4 shows, with batch writing already in use, dividing the shared area into 32 smaller areas improves performance by a further 15% or more.

[0068] In addition, as FIG. 2 shows, each database node deduplicates and merges only its own data, and the results of the multiple nodes are then merged on the master node. When the results are large, the merge on the master node still becomes a bottleneck even though it runs in multiple parallel threads. The result merge can therefore be optimized further, as follows:

[0069] 1) After each database node finishes its deduplication query, it does not send the result directly to the master node. Instead, it splits the result by data range and sends the pieces to all the other slave nodes. FIG. 5 is a schematic diagram of range-partitioning the query results.

[0070] 2) After each slave node receives the results from the other slave nodes, it deduplicates and merges them locally, and then returns the result to the master node.

[0071] 3) After the master node receives the deduplicated result from any node, it can return that result to the user without any further merging.

[0072] Because each slave node performs a merge sort first when querying, range-partitioning the result is relatively easy. Equal-width division can usually be used. For example, if the maximum value in the results is Max, the minimum value is Min, and the number of nodes is n, the results of each node can be partitioned as follows:

[0073] Slave 1: [Min, Min+(Max-Min)/n);

[0074] Slave 2: [Min+(Max-Min)/n, Min+2*(Max-Min)/n);

[0075] ...

[0076] Slave n: [Min+(n-1)*(Max-Min)/n, Max);

[0077] Because the data ranges held by the slave nodes do not overlap, the master node does not need to deduplicate or merge again after receiving the merged result of each slave node, and can return the results directly to the user. This optimization takes the deduplicating merge previously performed by the single master node and distributes it across multiple slave nodes for parallel execution, which can improve deduplicating-merge performance several-fold.
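
The equal-width range partitioning and redistribution described in paragraphs [0069] to [0077] can be sketched as below. The sketch rests on several assumptions that the text does not spell out: the global Min and Max are taken over all node results, values are numeric, and Max itself is placed in the last range so that it is not dropped. The names `range_index` and `redistribute` are hypothetical, and the network exchange between slave nodes is replaced by in-memory lists.

```python
def range_index(value, lo, hi, n):
    """Index of the equal-width range [lo + i*w, lo + (i+1)*w) holding value."""
    if hi == lo:
        return 0  # all values identical: a single range is enough
    index = int((value - lo) * n // (hi - lo))
    return min(index, n - 1)  # assumption: Max goes into the last range


def redistribute(node_results, n):
    """Route every value to the slave that owns its range, then dedup there."""
    non_empty = [result for result in node_results if result]
    lo = min(min(result) for result in non_empty)
    hi = max(max(result) for result in non_empty)
    inbox = [set() for _ in range(n)]  # one receiving slave per range
    for result in node_results:
        for value in result:
            inbox[range_index(value, lo, hi, n)].add(value)
    # Each slave returns its own sorted, duplicate-free range.
    return [sorted(values) for values in inbox]
```

Since the ranges are disjoint and ordered, concatenating the per-slave lists in slave order already gives the final deduplicated result, which is why the master node has nothing left to merge.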

[0078] According to an embodiment of the present invention, an apparatus for implementing data deduplication queries is also provided.

[0079] As shown in FIG. 6, the apparatus for implementing data deduplication queries according to an embodiment of the present invention includes:

[0080] a first deduplication module 61, configured to query each of a plurality of database nodes to obtain query results, wherein, for each database node that returns a plurality of query results, a deduplication operation is performed on the plurality of query results obtained from that database node, and the result of the deduplication operation is taken as the query result of that database node;

[0081] a result merging module 62, configured to merge the query results of the plurality of database nodes.

[0082] In the above structure, the first deduplication module 61 includes a first sorting submodule (not shown) and a first deduplication submodule (not shown). The first sorting submodule is configured to sort the query results to be deduplicated according to a predetermined ordering; the first deduplication submodule is configured to perform the deduplication operation on the sorted query results.
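
As an illustration of how the two submodules could cooperate, the sketch below sorts first and then removes duplicates in one linear pass. The function name `sort_dedup` is hypothetical, and Python's default ordering stands in for the "predetermined ordering", which the text does not specify.

```python
def sort_dedup(rows):
    """Sort rows, then drop duplicates by comparing neighbours."""
    ordered = sorted(rows)  # role of the sorting submodule
    unique = []
    for row in ordered:     # role of the deduplication submodule
        # After sorting, equal rows are adjacent, so comparing with
        # the last kept row is enough to detect a duplicate.
        if not unique or unique[-1] != row:
            unique.append(row)
    return unique
```

Sorting first means the per-node result is already ordered when it is handed on, which the later merge and range-partitioning steps rely on.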

[0083] In the above structure, the result merging module 62 includes a first merging submodule (not shown) and a second merging submodule (not shown). The first merging submodule is configured to merge the query results of at least two database nodes and store the merged query results in a predetermined storage area; the second merging submodule is configured to merge the query results of database nodes that have not yet been merged with the query results in the storage area.

[0084] In practice, there may be one or more database nodes whose query results have not yet been merged. When there is one such node, its query result is merged directly with the query results in the storage area. When there are several, the second merging submodule may additionally provide the following functions, depending on the actual situation.

[0085] The second merging submodule is further configured so that, when merging the query results of unmerged database nodes with the query results in the storage area, and when there are multiple such unmerged query results, it merges them one after another with the query results in the storage area.

[0086] Alternatively, the second merging submodule is further configured so that, when merging the query results of unmerged database nodes with the query results in the storage area, and when there are multiple such unmerged query results, it merges them in batches with the query results in the storage area.

[0087] Alternatively, the second merging submodule is further configured so that, when merging the query results of unmerged database nodes with the query results in the storage area, and when there are multiple such unmerged query results, it divides the remaining space in the storage area into a plurality of storage sub-areas in advance, stores the query results of the multiple unmerged database nodes in the corresponding storage sub-areas according to a predetermined correspondence, and, once they are stored, merges the query results in the storage sub-areas with the original query results in the storage area.

[0088] To improve the accuracy of deduplication queries, the apparatus of the present invention may further include the following structures.

[0089] The apparatus further includes a second deduplication module (not shown), configured, for each database node, after the result of the deduplication operation has been taken as the query result of that database node, to range-partition the query result of that database node according to a predetermined strategy to obtain a plurality of query result groups, and to send each query result group to the corresponding other database node, so that the other database nodes perform a deduplication operation on their own query results according to the query result groups they receive.

[0090] The second deduplication module includes a second sorting submodule (not shown) and an equal-division module (not shown). The second sorting submodule is configured to sort the query results of a database node according to a predetermined ordering; the equal-division module is configured to divide the sorted query results into equal parts.

[0091] Alternatively, the apparatus further includes a third deduplication module (not shown), configured to perform a deduplication operation on the merged query results after the query results of the plurality of database nodes have been merged.

[0092] In summary, with the above technical solution of the present invention, each database node in the database cluster is taken as a target, a unified deduplication query is run over all the data within that node, and the deduplication query results of all database nodes in the cluster are then merged. This enables deduplication queries over large-scale data in a database cluster, removes the restriction of deduplication queries to the deduplication partition column, and supports deduplication queries on any data column. In addition, because the deduplication query operates uniformly on all data within a database node, repeated computation during deduplication queries is avoided and the efficiency of deduplication queries is improved.

[0093] Furthermore, with the above technical solution of the present invention, by range-partitioning the query results of each database node directly and sending the results in each range to other database nodes for a second deduplication pass, or by deduplicating again the merged query results of the plurality of database nodes, the accuracy of deduplication queries in the database cluster is improved and the quality of deduplication queries is largely ensured.

[0094] The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. A method for implementing a data deduplication query, characterized by comprising: querying each of a plurality of database nodes to obtain query results, wherein, for each database node that returns a plurality of query results, a deduplication operation is performed on the plurality of query results obtained from that database node, and the result of the deduplication operation is taken as the query result of that database node; and merging the query results of the plurality of database nodes.
2. The method according to claim 1, characterized in that performing the deduplication operation on the plurality of query results comprises: sorting the query results to be deduplicated according to a predetermined ordering; and performing the deduplication operation on the sorted query results.
3. The method according to claim 1, characterized in that merging the query results of the plurality of database nodes comprises: merging the query results of at least two database nodes and storing the merged query results in a predetermined storage area; and, for the query results of database nodes that have not yet been merged, merging those query results with the query results in the storage area.
4. The method according to claim 3, characterized in that, when the query results of unmerged database nodes are merged with the query results in the storage area, and there are multiple such unmerged query results, the query results of the multiple unmerged database nodes are merged one after another with the query results in the storage area.
5. The method according to claim 3, characterized in that, when the query results of unmerged database nodes are merged with the query results in the storage area, and there are multiple such unmerged query results, the query results of the multiple unmerged database nodes are merged in batches with the query results in the storage area.
6. The method according to claim 3, characterized in that, when the query results of unmerged database nodes are merged with the query results in the storage area, and there are multiple such unmerged query results, the remaining space in the storage area is divided into a plurality of storage sub-areas in advance, the query results of the multiple unmerged database nodes are stored in the corresponding storage sub-areas according to a predetermined correspondence, and, once they are stored, the query results in the storage sub-areas are merged with the original query results in the storage area.
7. The method according to claim 1, characterized in that, for each database node, after the result of the deduplication operation has been taken as the query result of that database node, the query result of that database node is range-partitioned according to a predetermined strategy to obtain a plurality of query result groups, and each query result group is sent to the corresponding other database node, so that the other database nodes perform a deduplication operation on their own query results according to the query result groups they receive.
8. The method according to claim 7, characterized in that range-partitioning the query result of a database node according to the predetermined strategy comprises: sorting the query results of the database node according to a predetermined ordering; and dividing the sorted query results into equal parts.
9. The method according to claim 1, characterized in that, after the query results of the plurality of database nodes are merged, a deduplication operation is performed on the merged query results.
10. An apparatus for implementing a data deduplication query, characterized by comprising: a first deduplication module, configured to query each of a plurality of database nodes to obtain query results, wherein, for each database node that returns a plurality of query results, a deduplication operation is performed on the plurality of query results obtained from that database node, and the result of the deduplication operation is taken as the query result of that database node; and a result merging module, configured to merge the query results of the plurality of database nodes.
CN2013103539781A 2013-08-14 2013-08-14 Implementation method and implementation device for data duplication elimination query CN103399944A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013103539781A CN103399944A (en) 2013-08-14 2013-08-14 Implementation method and implementation device for data duplication elimination query

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013103539781A CN103399944A (en) 2013-08-14 2013-08-14 Implementation method and implementation device for data duplication elimination query

Publications (1)

Publication Number Publication Date
CN103399944A true CN103399944A (en) 2013-11-20

Family

ID=49563572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013103539781A CN103399944A (en) 2013-08-14 2013-08-14 Implementation method and implementation device for data duplication elimination query

Country Status (1)

Country Link
CN (1) CN103399944A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6061503A (en) * 1996-11-06 2000-05-09 Zenith Electronics Corporation Method for resolving conflicts among time-based data
US20090228433A1 (en) * 2008-03-07 2009-09-10 International Business Machines Corporation System and method for multiple distinct aggregate queries
CN102521406A (en) * 2011-12-26 2012-06-27 中国科学院计算技术研究所 Distributed query method and system for complex task of querying massive structured data
CN102799622A (en) * 2012-06-19 2012-11-28 北京大学 Distributed structured query language (SQL) query method based on MapReduce expansion framework

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杨春明,何天翔: "元搜索引擎的结果去重及排序研究", <<软件杂志>> *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104156420A (en) * 2014-08-06 2014-11-19 曙光信息产业(北京)有限公司 Method and device for managing transaction journal
CN104156420B (en) * 2014-08-06 2017-10-03 曙光信息产业(北京)有限公司 Management method and apparatus of the transaction log
CN105740264A (en) * 2014-12-10 2016-07-06 北大方正集团有限公司 Distributed XML database sorting method and apparatus
CN104778266A (en) * 2015-04-22 2015-07-15 无锡天脉聚源传媒科技有限公司 Multi-data source searching method and device
CN104850618B (en) * 2015-05-18 2018-06-01 北京京东尚科信息技术有限公司 Method of providing a system and method of ordered data
CN104850618A (en) * 2015-05-18 2015-08-19 北京京东尚科信息技术有限公司 System and method for providing sorted data
CN105550236B (en) * 2015-11-27 2019-03-01 广州华多网络科技有限公司 A kind of distributed data duplicate removal treatment method and device
CN105550236A (en) * 2015-11-27 2016-05-04 广州华多网络科技有限公司 Distributed data deduplication processing method and apparatus
CN105512268A (en) * 2015-12-03 2016-04-20 曙光信息产业(北京)有限公司 Data query method and device
CN105512268B (en) * 2015-12-03 2019-05-10 曙光信息产业(北京)有限公司 A kind of data query method and device
CN105654259A (en) * 2015-12-25 2016-06-08 中国民航信息网络股份有限公司 Mass agent freight rate search compression method
CN106649689A (en) * 2016-12-16 2017-05-10 天脉聚源(北京)科技有限公司 Full-service user statistical method and apparatus
CN106909624A (en) * 2017-01-19 2017-06-30 中国科学院信息工程研究所 Real-time ranking and optimization method of mass data

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
RJ01