WO2020215438A1 - Distributed index system and method for spatial keyword queries on electronic maps - Google Patents

Distributed index system and method for spatial keyword queries on electronic maps

Info

Publication number
WO2020215438A1
WO2020215438A1 PCT/CN2019/088772 CN2019088772W WO2020215438A1 WO 2020215438 A1 WO2020215438 A1 WO 2020215438A1 CN 2019088772 W CN2019088772 W CN 2019088772W WO 2020215438 A1 WO2020215438 A1 WO 2020215438A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
partition
index
node
data partition
Application number
PCT/CN2019/088772
Other languages
English (en)
French (fr)
Inventor
姚斌
过敏意
陈�全
林昊
张建锋
Original Assignee
上海交通大学
Priority date 2019-04-24
Filing date 2019-05-28
Publication date 2020-10-29
Application filed by 上海交通大学
Publication of WO2020215438A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases

Definitions

  • The invention belongs to the technical field of positioning, and specifically relates to a Spark-based distributed index system for spatial keyword queries on electronic maps, and to a distributed index method implemented on that system.
  • A spatial keyword query takes the user's geographic location and multiple query keywords as parameters, and returns spatial objects that are spatially and textually relevant to those parameters.
  • Constructing an effective index structure can greatly improve query efficiency.
  • A spatial index is a data structure that arranges the position, size, and shape information of objects according to a certain structure.
  • Existing spatial keyword query systems have low query throughput, and the cost of indexing text data grows rapidly as the data size increases. The problem is therefore how to develop a new distributed index system for spatial keyword queries that increases keyword query throughput, reduces index cost, and reduces system response latency.
  • R-tree: an extension of the B-tree to multi-dimensional space. It divides spatial objects by range, and each node corresponds to a region and a disk page. The disk page of a non-leaf node stores the region ranges of all its child nodes, and the regions of all child nodes of a non-leaf node fall within its own region.
  • IR-tree: built on the inverted index and the R-tree index; a model that computes text similarity through the inverted index.
  • BFIR-tree: an IR-tree implemented for massive data processing.
  • CBFIR-tree: a dynamic BFIR-tree.
  • S2I-V structure: a model structure in which keywords of different frequencies are handled differently.
  • eBRQ: a range query based on keyword containment.
  • aBRQ: a k-nearest-neighbor query based on approximate keyword containment.
  • False positive: the false detection rate.
  • KNN algorithm: the nearest-neighbor algorithm, one of the simplest methods in data-mining classification.
  • I-Node: a leaf R-tree node that stores an inverted list mapping each keyword to spatial keyword objects.
  • The technical problem to be solved by the invention is to provide a Spark-based distributed index system for spatial keyword queries on electronic maps that increases keyword query throughput, reduces index cost, and reduces system response latency.
  • A distributed index method for spatial keyword queries on electronic maps, comprising the following steps: S1, partitioning: the original data is split through the data partition abstraction interface of the Spark platform and mapped to the nodes of the cluster, forming a data partition on each node; S2, local index construction: an index file is built in each data partition, and statistics for each data partition are collected at the same time; S3, global index construction: the statistics collected during local index construction are used to build a global index on the master node.
  • Step S1 includes the following steps: S11: split the original data by spatial partitioning and determine the minimum bounding rectangle of each data partition; S12: build a temporary R-tree from the minimum bounding rectangles obtained in S11, map each data object to the corresponding cluster node, and form a data partition on each node.
  • The statistics in step S2 include spatial statistics and text statistics in the form (id, MBR, β), where id is the data partition identifier and MBR is the minimum bounding rectangle of the data partition.
  • β is the text summary data of the data partition.
  • In step S3, a Bloom filter is used as the text summary.
  • The invention also provides a distributed index system for electronic maps.
  • A distributed index system for spatial keyword queries on electronic maps, comprising: a master node, multiple slave nodes, an original data source, a partition module, a local index module, and a global index module. The partition module connects to and reads the original data source, splits the original data and maps it to the slave nodes, forming a data partition on each slave node. The local index module connects to each slave node, builds an index file for each data partition, and collects statistics for each data partition. The global index module connects the local index module and the master node, reads the statistics of each data partition collected by the local index module, and forms a global index on the master node.
  • Compared with the prior art, the invention can increase keyword query throughput, reduce index cost, and reduce system response latency.
  • Figure 1 is a schematic structural diagram of Embodiment 1;
  • Figure 2 is a schematic diagram of the workflow of Embodiment 1.
  • A distributed index system for spatial keyword queries, comprising: a master node 1, multiple slave nodes 2, an original data source 3, a partition module 4, a local index module 5, and a global index module 6. The partition module 4 connects to and reads the original data source 3, splits the original data and maps it to the slave nodes 2, forming a data partition on each slave node 2.
  • The local index module 5 connects to each slave node 2, builds an index file for each data partition, and collects statistics for each data partition.
  • The global index module 6 connects the local index module 5 and the master node 1, reads the statistics of each data partition collected by the local index module 5, and forms a global index on the master node 1.
  • S11: split the original data by spatial partitioning and determine the minimum bounding rectangle of each data partition;
  • S12: build a temporary R-tree from the minimum bounding rectangles obtained in S11, map each data object to the corresponding cluster node, and form a data partition on each node;
  • S2, local index construction: an index file is built in each data partition, and statistics for each data partition are collected at the same time.
  • The statistics include spatial statistics and text statistics in the form (id, MBR, β), where id identifies the data partition and MBR is the minimum bounding rectangle of each data partition.
  • The average processing latency on the TX-CA data set is as follows:
  • The technical solution of the invention is suitable for location-based service applications such as Dianping (大众点评).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Remote Sensing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

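The invention discloses a distributed index system and method for spatial keyword queries on electronic maps. The method comprises the following steps: S1, partitioning: the original data is split through the data partition abstraction interface of the Spark platform and mapped to the nodes of the cluster, forming a data partition on each node; S2, local index construction: an index file is built in each data partition, and statistics for each data partition are collected at the same time; S3, global index construction: the statistics collected during local index construction are used to build a global index on the master node. The invention can increase keyword query throughput, reduce index cost, and reduce system response latency.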

Description

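Distributed index system and method for spatial keyword queries on electronic maps
Technical Field
The invention belongs to the technical field of positioning, and specifically relates to a Spark-based distributed index system for spatial keyword queries on electronic maps, and to a distributed index method implemented on that system.
Background
In recent years, with the development of communication technology and the widespread use of mobile terminals, location-based social services have proliferated. A spatial keyword query takes the user's geographic location and multiple query keywords as parameters, and returns spatial objects that are spatially and textually relevant to those parameters. Building an effective index structure can greatly improve query efficiency. A spatial index is a data structure that arranges the position, size, and shape information of objects according to a certain structure. Existing spatial keyword query systems have low query throughput, and the cost of indexing text data grows rapidly as the data size increases. How to develop a new distributed index system for spatial keyword queries that increases keyword query throughput, reduces index cost, and reduces system response latency is therefore a direction that those skilled in the art need to study. The abbreviations used in this application are explained below:
R-tree: an extension of the B-tree to multi-dimensional space. It divides spatial objects by range, and each node corresponds to a region and a disk page. The disk page of a non-leaf node stores the region ranges of all its child nodes, and the regions of all child nodes of a non-leaf node fall within its own region.
IR-tree: built on the inverted index and the R-tree index; a model that computes text similarity through the inverted index.
BFIR-tree: an IR-tree implemented for massive data processing.
CBFIR-tree: a dynamic BFIR-tree.
S2I-V structure: a model structure in which keywords of different frequencies are handled differently.
eBRQ: a range query based on keyword containment.
aBRQ: a k-nearest-neighbor query based on approximate keyword containment.
False positive: the false detection rate.
KNN algorithm: the nearest-neighbor algorithm, one of the simplest methods in data-mining classification.
I-Node: a leaf R-tree node that stores an inverted list mapping each keyword to spatial keyword objects.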
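Summary of the Invention
The technical problem to be solved by the invention is to provide a Spark-based distributed index system for spatial keyword queries on electronic maps that increases keyword query throughput, reduces index cost, and reduces system response latency.
The technical solution adopted is as follows:
A distributed index method for spatial keyword queries on electronic maps, comprising the following steps: S1, partitioning: the original data is split through the data partition abstraction interface of the Spark platform and mapped to the nodes of the cluster, forming a data partition on each node; S2, local index construction: an index file is built in each data partition, and statistics for each data partition are collected at the same time; S3, global index construction: the statistics collected during local index construction are used to build a global index on the master node.
Preferably, in the above method, step S1 includes the following steps: S11: split the original data by spatial partitioning and determine the minimum bounding rectangle of each data partition; S12: build a temporary R-tree from the minimum bounding rectangles obtained in S11, map each data object to the corresponding cluster node, and form a data partition on each node.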
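Steps S11 and S12 can be illustrated with a minimal single-machine sketch. It rests on several assumptions that are not in the patent text: objects are cut into strips along the x axis (one simple form of spatial partitioning), and a linear scan over the partition MBRs stands in for the temporary R-tree. The names `SpatialObject`, `split_into_strips`, and `route` are hypothetical, and no Spark API is used.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]               # (x, y) location of a spatial object
MBR = Tuple[float, float, float, float]   # (min_x, min_y, max_x, max_y)


@dataclass
class SpatialObject:
    oid: int
    loc: Point
    keywords: frozenset


def mbr_of(points: List[Point]) -> MBR:
    """Minimum bounding rectangle of a non-empty list of points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def split_into_strips(objects: List[SpatialObject], k: int) -> List[MBR]:
    """S11 (simplified): sort by location, cut into at most k strips, return one MBR per strip."""
    ordered = sorted(objects, key=lambda o: o.loc)
    size = -(-len(ordered) // k)  # ceiling division
    chunks = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    return [mbr_of([o.loc for o in chunk]) for chunk in chunks]


def contains(m: MBR, p: Point) -> bool:
    return m[0] <= p[0] <= m[2] and m[1] <= p[1] <= m[3]


def route(objects: List[SpatialObject], mbrs: List[MBR]) -> Dict[int, List[SpatialObject]]:
    """S12 (simplified): send each object to the first partition whose MBR contains it.

    A linear scan replaces the temporary R-tree; this is an assumption made for brevity.
    """
    partitions: Dict[int, List[SpatialObject]] = {pid: [] for pid in range(len(mbrs))}
    for o in objects:
        pid = next(i for i, m in enumerate(mbrs) if contains(m, o.loc))
        partitions[pid].append(o)
    return partitions
```

Because the MBRs are computed from the same objects that are later routed, every object falls inside at least one MBR. On a real cluster the lookup would run through the temporary R-tree, and the partition id would decide which node receives the object.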
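More preferably, in the above method, the statistics in step S2 include spatial statistics and text statistics in the form (id, MBR, β), where id is the data partition identifier, MBR is the minimum bounding rectangle of the data partition, and β is the text summary data of the data partition.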
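A minimal sketch of what step S2 produces for one partition follows. The patent does not specify the layout of the index file, so a plain inverted list (keyword to object ids) is used here as a stand-in, and β is represented as the set of distinct keywords purely for illustration. The function name `build_local_index` and the tuple layout of the input objects are assumptions.

```python
from collections import defaultdict


def build_local_index(partition_id, objects):
    """Build a local index and the (id, MBR, beta) statistics for one data partition.

    objects: non-empty list of (object_id, (x, y), keywords) tuples (hypothetical layout).
    """
    inverted = defaultdict(list)  # keyword -> ids of objects that contain it
    xs, ys = [], []
    for oid, (x, y), keywords in objects:
        xs.append(x)
        ys.append(y)
        for kw in keywords:
            inverted[kw].append(oid)
    mbr = (min(xs), min(ys), max(xs), max(ys))  # spatial statistics
    beta = frozenset(inverted)                  # text summary: set of distinct keywords
    return dict(inverted), (partition_id, mbr, beta)
```

Only the small `(id, MBR, β)` tuple needs to travel to the master node; the inverted list stays with the partition.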
Further preferably, in the above method, step S3 uses a Bloom filter as the text summary.
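A Bloom filter answers "might this partition contain keyword w?" in constant space, with possible false positives but no false negatives. The sketch below is a generic textbook Bloom filter, not the patent's implementation; the bit-array size, the number of hash functions, and the use of seeded SHA-256 are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter over strings (illustrative parameters)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

Used as β, one filter per partition is filled with that partition's keywords. A negative answer lets the master node skip the partition safely, while a false positive only costs one unnecessary local lookup.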
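With the above scheme, the Spark platform, which is widely used in the prior art, provides support for a distributed in-memory computing environment. A two-level index framework is built: in actual keyword query work, the global index is first used to prune irrelevant partitions and perform an initial keyword filter, and a second, exact query is then run in the selected data partitions. This frees CPU resources for other queries, significantly increases spatial keyword query throughput, reduces index cost, and reduces system response latency.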
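The two-level lookup can be sketched as follows. This is a single-process illustration under stated assumptions: the query is a rectangular range query that requires every query keyword to be present, the text summary β is any container supporting `in` (a plain set here, a Bloom filter in the preferred scheme), and the local exact query is a linear scan instead of a real local index. All function names are hypothetical.

```python
def intersects(a, b):
    """True if rectangles a and b, given as (min_x, min_y, max_x, max_y), overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def global_prune(global_index, query_rect, query_keywords):
    """Level 1: keep partitions whose MBR overlaps the query and whose summary may hold every keyword."""
    return [pid for pid, mbr, beta in global_index
            if intersects(mbr, query_rect) and all(kw in beta for kw in query_keywords)]


def local_query(objects, query_rect, query_keywords):
    """Level 2: exact check inside one partition (linear scan stands in for the local index)."""
    return [oid for oid, (x, y), kws in objects
            if query_rect[0] <= x <= query_rect[2]
            and query_rect[1] <= y <= query_rect[3]
            and query_keywords <= kws]


def range_keyword_query(global_index, partitions, query_rect, query_keywords):
    """Run the pruning step on the global index, then query only the surviving partitions."""
    hits = []
    for pid in global_prune(global_index, query_rect, query_keywords):
        hits.extend(local_query(partitions[pid], query_rect, query_keywords))
    return sorted(hits)
```

Partitions rejected at level 1 are never touched, which is where the CPU savings described above would come from.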
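To implement the above distributed indexing, the invention also provides a distributed index system for electronic maps.
The scheme adopted is as follows:
A distributed index system for spatial keyword queries on electronic maps, comprising: a master node, multiple slave nodes, an original data source, a partition module, a local index module, and a global index module. The partition module connects to and reads the original data source, splits the original data and maps it to the slave nodes, forming a data partition on each slave node. The local index module connects to each slave node, builds an index file for each data partition, and collects statistics for each data partition. The global index module connects the local index module and the master node, reads the statistics of each data partition collected by the local index module, and forms a global index on the master node.
Compared with the prior art, the invention can increase keyword query throughput, reduce index cost, and reduce system response latency.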
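Brief Description of the Drawings
The invention is described in further detail below with reference to the drawings and specific embodiments:
Figure 1 is a schematic structural diagram of Embodiment 1;
Figure 2 is a schematic diagram of the workflow of Embodiment 1.
The reference numerals correspond to the component names as follows:
1, master node; 2, slave node; 3, original data source; 4, partition module; 5, local index module; 6, global index module.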
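Detailed Description
To explain the technical solution of the invention more clearly, it is further described below with reference to the embodiments.
Embodiment 1 is shown in Figures 1-2:
A distributed index system for spatial keyword queries, comprising: a master node 1, multiple slave nodes 2, an original data source 3, a partition module 4, a local index module 5, and a global index module 6. The partition module 4 connects to and reads the original data source 3, splits the original data and maps it to the slave nodes 2, forming a data partition on each slave node 2. The local index module 5 connects to each slave node 2, builds an index file for each data partition, and collects statistics for each data partition. The global index module 6 connects the local index module 5 and the master node 1, reads the statistics of each data partition collected by the local index module 5, and forms a global index on the master node 1.
In practice, its working process is shown in Figure 2: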
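S11: split the original data by spatial partitioning and determine the minimum bounding rectangle of each data partition;
S12: build a temporary R-tree from the minimum bounding rectangles obtained in S11, map each data object to the corresponding cluster node, and form a data partition on each node;
S2, local index construction: an index file is built in each data partition, and statistics for each data partition are collected at the same time. The statistics include spatial statistics and text statistics in the form (id, MBR, β), where id identifies the data partition and MBR is the minimum bounding rectangle of each data partition.
S3, global index construction: the statistics collected during local index construction are used to build a global index on the master node, with a filter (a Bloom filter) serving as the text summary.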
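The spatial distributed system Simba was extended to support the corresponding spatial keyword queries and was used as the comparison system in the experiments. The comparison was run on the large TX-CA data set (26 million records). 500 test queries were executed concurrently using multiple threads, and the comparison focuses on two metrics: average processing latency and throughput. Average processing latency is the total time taken by the 500 queries divided by 500, and throughput is the number of queries executed per minute. The comparison data are as follows:
The average processing latency on the TX-CA data set is shown in Table 1: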
Figure PCTCN2019088772-appb-000001
Figure PCTCN2019088772-appb-000002
Table 1
The throughput on the TX-CA data set (varying the query range percentage) is shown in Table 2:
Figure PCTCN2019088772-appb-000003
Table 2
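The two metrics reported in the tables can be measured with a small harness like the one below. It is a sketch under assumptions: the "total time" in the latency definition is read as the sum of per-query durations, throughput is derived from wall-clock time, and `run_query` is a hypothetical callable standing in for the system under test. The thread count is an arbitrary illustrative value.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_benchmark(run_query, queries, num_threads=8):
    """Execute queries concurrently; return (average latency in seconds, queries per minute)."""

    def timed(query):
        start = time.perf_counter()
        run_query(query)
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        durations = list(pool.map(timed, queries))
    wall_seconds = time.perf_counter() - wall_start

    average_latency = sum(durations) / len(queries)   # total time / number of queries
    throughput = len(queries) / wall_seconds * 60.0   # queries executed per minute
    return average_latency, throughput
```

Calling `run_benchmark(my_query_fn, my_500_queries)` with the same query set against both systems would yield one latency and one throughput figure per system, matching the layout of the comparison described above.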
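Therefore, the technical solution of the invention is suitable for location-based service applications such as Dianping (大众点评).
The above is only a specific embodiment of the invention, but the scope of protection of the invention is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention is defined by the claims.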

Claims (5)

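  1. A distributed index method for spatial keyword queries on electronic maps, characterized by comprising the following steps:
    S1, partitioning: the original data is split through the data partition abstraction interface of the Spark platform and mapped to the nodes of the cluster, forming a data partition on each node;
    S2, local index construction: an index file is built in each data partition, and statistics for each data partition are collected at the same time;
    S3, global index construction: the statistics collected during local index construction are used to build a global index on the master node.
  2. The distributed index method for spatial keyword queries according to claim 1, characterized in that step S1 includes the following steps:
    S11: split the original data by spatial partitioning and determine the minimum bounding rectangle of each data partition;
    S12: build a temporary R-tree from the minimum bounding rectangles obtained in S11, map each data object to the corresponding cluster node, and form a data partition on each node.
  3. The distributed index method for spatial keyword queries on electronic maps according to claim 1, characterized in that the statistics in step S2 include spatial statistics and text statistics in the form (id, MBR, β), where id is the data partition identifier, MBR is the minimum bounding rectangle of the data partition, and β is the text summary data of the data partition.
  4. The distributed index method for spatial keyword queries on electronic maps according to claim 3, characterized in that step S3 uses a Bloom filter as the text summary.
  5. A distributed index system for spatial keyword queries on electronic maps, characterized by comprising: a master node (1), multiple slave nodes (2), an original data source (3), a partition module (4), a local index module (5), and a global index module (6); the partition module (4) connects to and reads the original data source (3), splits the original data and maps it to the slave nodes (2), forming a data partition on each slave node (2); the local index module (5) connects to each slave node (2), builds an index file for each data partition, and collects statistics for each data partition; the global index module (6) connects the local index module (5) and the master node (1), reads the statistics of each data partition collected by the local index module (5), and forms a global index on the master node (1).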
PCT/CN2019/088772 2019-04-24 2019-05-28 Distributed index system and method for spatial keyword queries on electronic maps WO2020215438A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910333878.X 2019-04-24
CN201910333878.XA CN110059149A (zh) 2019-04-24 2019-04-24 Distributed index system and method for spatial keyword queries on electronic maps

Publications (1)

Publication Number Publication Date
WO2020215438A1 (zh) 2020-10-29

Family

ID=67320479

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/088772 WO2020215438A1 (zh) 2019-04-24 2019-05-28 Distributed index system and method for spatial keyword queries on electronic maps

Country Status (2)

Country Link
CN (1) CN110059149A (zh)
WO (1) WO2020215438A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110597935A (zh) * 2019-08-05 2019-12-20 北京云和时空科技有限公司 Spatial analysis method and device
CN111026750B (zh) * 2019-11-18 2023-06-30 中南民族大学 Method and system for solving the SKQ why-not problem using an AIR tree
CN111708851A (zh) * 2020-04-26 2020-09-25 上海容易网电子商务股份有限公司 Dynamic parsing and caching method for 2D map data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9081854B2 (en) * 2012-07-06 2015-07-14 Hewlett-Packard Development Company, L.P. Multilabel classification by a hierarchy
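CN108804556A (zh) * 2018-05-22 2018-11-13 上海交通大学 Distributed processing framework system based on time-travel and temporal aggregation queries
CN108932347A (zh) * 2018-08-03 2018-12-04 东北大学 Socially aware spatial keyword query method in a distributed environment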

Also Published As

Publication number Publication date
CN110059149A (zh) 2019-07-26

Similar Documents

Publication Publication Date Title
Bouros et al. Spatio-textual similarity joins
Hariharan et al. Processing spatial-keyword (SK) queries in geographic information retrieval (GIR) systems
US9442905B1 (en) Detecting neighborhoods from geocoded web documents
WO2020215438A1 (zh) Distributed index system and method for spatial keyword queries on electronic maps
WO2017096892A1 (zh) Index construction method, query method, and corresponding apparatus, device, and computer storage medium
CN103631909B (zh) System and method for joint processing of large-scale structured and unstructured data
JP7407209B2 (ja) Information push method and device
US20170337229A1 (en) Spatial indexing for distributed storage using local indexes
CN105468605A (zh) Method and device for generating an entity information graph
WO2014113709A2 (en) Searching and determining active area
TW201905733A (zh) Multi-source data fusion method and device
Hsu et al. Key formulation schemes for spatial index in cloud data managements
US9529823B2 (en) Geo-ontology extraction from entities with spatial and non-spatial attributes
Mahmood et al. FAST: frequency-aware indexing for spatio-textual data streams
CN108932347A (zh) Socially aware spatial keyword query method in a distributed environment
Lu et al. Efficient indexing and retrieval of large-scale geo-tagged video databases
US20140370920A1 (en) Systems and methods for generating and employing an index associating geographic locations with geographic objects
Christen et al. A probabilistic geocoding system based on a national address file
CN111723161A (zh) Data processing method, apparatus, and device
CN112765405A (zh) Method and system for clustering and querying spatial data search results
Li et al. Efficient subspace skyline query based on user preference using MapReduce
CN114741570A (zh) Graph database query method, index creation method, and related device
CN104111942A (zh) Network retrieval platform for ancient Uyghur medicine literature resources
Li et al. Distributed spatio-temporal k nearest neighbors join
Tang et al. Skewness‐aware clustering tree for unevenly distributed spatial sensor nodes in smart city

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19925743; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03.02.2022))
122 Ep: pct application non-entry in european phase (Ref document number: 19925743; Country of ref document: EP; Kind code of ref document: A1)