CN109033204A

CN109033204A - Visual query method for hierarchical integral histograms based on the World Wide Web

Info

Publication number: CN109033204A
Application number: CN201810698579.1A
Authority: CN
Inventors: 陈为; 梅鸿辉
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2018-06-29
Filing date: 2018-06-29
Publication date: 2018-12-18
Anticipated expiration: 2038-06-29
Also published as: CN109033204B

Abstract

The invention discloses a visual query method for hierarchical integral histograms, comprising the following steps. Step 1: configure the raw data set, including the number of discretization intervals, the conditions for filtering data, and the dimensions on which aggregate statistics are required. Step 2: build and store a hierarchical partition tree by offline preprocessing, where the tree divides the data into multiple data subsets and the statistical features of each subset are represented by an integral histogram. Step 3: uniformly discretize the visualization space into small regions and input the coordinates of each small region into the hierarchical partition tree of step 2 to perform a range query; a range query finds the data subsets that intersect the target region and uses the integral histograms of those intersections to estimate the statistical features of the target region, and once every small region has been queried the result is a matrix of statistical features. Step 4: bind visual elements to the matrix of statistical features and issue the visualization request.

Description

A Visual Query Method of Hierarchical Integral Histogram Based on World Wide Web

Technical field

The invention relates to the field of fast visual query, and in particular to a fast query method based on hierarchical integral histograms.

Background

In visual analysis of large-scale structured data, analysts need to understand and study the distribution of the data through its statistical features, and then summarize patterns and make decisions based on that distribution. The most common aggregation operations (computing a single value from a set of values) are usually presented visually as histograms or discretized scatter plots. When the data volume is large enough, computing statistical features by directly traversing the data items cannot meet the real-time requirements of interactive visual exploration. How to quickly query a specified range of large-scale structured data, for example for real-time management and scheduling of transportation resources or real-time monitoring of financial transactions, has become an active topic in fields such as the Internet, transportation, aerospace, and commerce.

Real-world large-scale structured data has high dimensionality, many data items, diverse modalities and formats, and distinctive distributions. Visual queries on such large and complex data sets may fail to respond in time or simply take too long. Many existing methods optimize queries at the database level; to obtain exact results, they must work on the full data set while also constructing external representations that are easy for users to understand. Other work targets approximate results and adopts approximate query strategies (querying the data approximately in order to reduce response time), for example techniques based on sampling, on histogram representations, and on wavelet transforms.

Some of these approximation techniques use a fixed precomputation scheme, are limited to specific statistical features, and cannot be applied to multiple types of data such as dynamic data and streaming data; others are limited to low-dimensional cases, because the memory required for high-dimensional data sets is too large.

Summary of the invention

The invention provides a visual query method for hierarchical integral histograms that reduces search time to under 500 milliseconds, which is fast enough for interactive use, while significantly reducing storage requirements.

A visual query method for hierarchical integral histograms comprises the following steps:

Step 1: Configure the raw data set, including the number of discretization intervals, the conditions for filtering data, and the dimensions on which aggregate statistics are required;

Step 2: On the data obtained with the configuration of step 1, build and store a hierarchical partition tree by offline preprocessing, where the tree divides the data into multiple data subsets and the statistical features of each subset are represented by an integral histogram;

Step 3: Using the configuration of step 1, uniformly discretize the visualization space into small regions. For each small region, input its coordinates into the hierarchical partition tree of step 2 to perform a range query. A range query finds the data subsets that intersect the target region and uses the integral histograms of those intersections to estimate the statistical features of the target region. Once every small region has been queried, the result is a matrix of statistical features;

Step 4: Bind visual elements to the matrix of statistical features from step 3 and issue the visualization request.
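The configuration of step 1 and the cell-by-cell querying of step 3 can be sketched roughly as follows. This is a minimal illustration restricted to a two-dimensional visualization space; `QueryConfig`, its field names, `grid_cells`, and `feature_matrix` are hypothetical names, and the range query is passed in as a plain callable rather than a lookup in the partition tree.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

import numpy as np


@dataclass
class QueryConfig:
    # Hypothetical field names; the text only names the three kinds of setting.
    bins_per_axis: Tuple[int, int]             # number of discretization intervals
    keep: Callable[[np.ndarray], np.ndarray]   # filter condition, returns a boolean mask
    aggregate_dim: int                         # column on which statistics are aggregated


def grid_cells(domain, bins_per_axis):
    """Uniformly split a 2-D visualization space into rectangular cells.

    Each cell is (x_min, x_max, y_min, y_max); cells[i][j] is the i-th
    interval along x and the j-th interval along y.
    """
    (x0, x1), (y0, y1) = domain
    xs = np.linspace(x0, x1, bins_per_axis[0] + 1)
    ys = np.linspace(y0, y1, bins_per_axis[1] + 1)
    return [[(xs[i], xs[i + 1], ys[j], ys[j + 1])
             for j in range(bins_per_axis[1])]
            for i in range(bins_per_axis[0])]


def feature_matrix(cells, range_query):
    """Run one range query per cell and collect the results in a matrix."""
    return np.array([[range_query(cell) for cell in row] for row in cells])
```

In the method described here `range_query` would consult the precomputed tree; a brute-force count over the filtered points is enough to check the plumbing.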

The method shifts the time cost to the preprocessing stage and computes query results approximately, within an allowable error. Compared with existing methods, it can significantly reduce storage cost, its time complexity is independent of the number of data points, and it supports efficient online visual queries.

Based on the user's configuration parameters, the invention preprocesses the raw data set and the target visualization space, and partitions the data set hierarchically with a hierarchical partitioning algorithm, so that regions with different distributions are represented at different precisions and scales. For each sub-region, an integral histogram approximates the statistical features of that region. During a visual query, the system uses the hierarchical partition tree to traverse and find the set of target regions quickly and efficiently and returns approximate values, yielding the approximate statistical features of the target region.
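The traversal can be pictured with a small sketch. The node layout (a nested dict with a `bbox` and optional `children`) and the function names are assumptions made for illustration; the per-leaf estimate is left as a callable so the sketch does not commit to any particular histogram layout.

```python
def intersects(a, b):
    """True if two axis-aligned boxes overlap.

    Each box is (lo, hi), where lo and hi are per-dimension bounds.
    """
    return all(a_lo <= b_hi and b_lo <= a_hi
               for a_lo, a_hi, b_lo, b_hi in zip(a[0], a[1], b[0], b[1]))


def range_query(node, query, leaf_estimate):
    """Sum the leaf estimates of every leaf whose box intersects the query.

    Subtrees whose bounding box misses the query are pruned without
    being visited.
    """
    if not intersects(node["bbox"], query):
        return 0
    if "children" not in node:
        return leaf_estimate(node, query)
    return sum(range_query(child, query, leaf_estimate)
               for child in node["children"])
```

Pruning by bounding box is what keeps the cost tied to the number of intersecting leaves rather than to the number of data points.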

Compared with existing methods, this method shifts the time cost to the data preprocessing stage: data sets that may need visual queries are preprocessed offline in advance to obtain an efficient approximate representation, which then serves subsequent online visual queries. The method follows the idea of approximating first and then refining gradually, and only needs to store the integral histograms of the aggregated data. Many other visual query methods must store the raw data, incur larger time and space costs, and capture the data distribution less well, so this method is more widely applicable.

To broaden the applicability and intelligence of the invention, preferably, step 4 further includes interactively adjusting the visualization parameters and obtaining immediate visual feedback during visualization.

To further improve computational efficiency, preferably, in step 2, the raw data set is a high-dimensional data set V with n dimensions D = {D₁, …, D_n}, and the domains of the dimensions are {[a₁, b₁], …, [a_n, b_n]}.

To further improve computational efficiency, preferably, in step 2, the hierarchical partition tree divides the data into multiple data subsets as follows: the whole data space is partitioned recursively, producing a hierarchical tree structure, and the data space is restructured as V′ = {v′₁, …, v′_i, …, v′_p}, where each v′_i ∈ V′ corresponds to a leaf node of the tree.

To further improve computational efficiency, preferably, in step 2, the integral histogram is an extension of the summed area table. A summed area table is a two-dimensional table in which the value of each cell equals the sum of all values above and to the left of it, so the value of any cell can be obtained by adding and subtracting four table values.
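A small NumPy sketch of a summed area table may make the four-value lookup concrete. The function names are illustrative only, and the same four-term lookup is written for a general rectangle, of which a single cell is the special case.

```python
import numpy as np


def summed_area_table(grid):
    """Entry [r, c] holds the sum of grid[0:r+1, 0:c+1]."""
    return grid.cumsum(axis=0).cumsum(axis=1)


def rect_sum(sat, r0, c0, r1, c1):
    """Sum of grid[r0:r1+1, c0:c1+1] using at most four table entries."""
    total = sat[r1, c1]
    if r0 > 0:
        total -= sat[r0 - 1, c1]       # strip the rows above the rectangle
    if c0 > 0:
        total -= sat[r1, c0 - 1]       # strip the columns to its left
    if r0 > 0 and c0 > 0:
        total += sat[r0 - 1, c0 - 1]   # add back the corner removed twice
    return total
```

The lookup cost does not depend on the size of the rectangle, which is the property the integral histogram inherits.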

To further improve computational efficiency, preferably, in step 2, the integral histogram is computed as follows. For a d-dimensional data set binned by an N₁ × … × N_d grid and summarized by histograms with b bins, the integral histogram of a leaf node is defined as:
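The defining formula is not reproduced in this text. A form consistent with the summed area table description and with the symbol definitions that follow, offered as a reconstruction rather than as the patent's exact notation, is the prefix sum of the per-cell histograms over the grid axes:

```latex
H(x_1, \dots, x_d, b) \;=\; \sum_{x_1' \le x_1} \cdots \sum_{x_d' \le x_d} h(x_1', \dots, x_d', b)
```

Here the primed indices x′₁, …, x′_d are summation variables introduced for this reconstruction.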

where x₁, …, x_d are the bin indices along the d dimensions, b is the index of a bin in the histogram, and h(x₁, …, x_d, ·) denotes the histogram of the values in each grid cell;

The integral histogram of any rectangular region in the data space can be computed as:
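This formula is also missing from the text. A plausible reconstruction is the standard inclusion-exclusion over the 2^d corners of the region. It assumes that the region R covers bin indices l_i through u_i along dimension i, that p_i = 1 selects u_i and p_i = 0 selects l_i − 1, and that H is taken as zero whenever an index falls below the first bin; the symbols R, l_i, and u_i are introduced here and do not appear in the original.

```latex
H(R, b) \;=\; \sum_{p \in \{0,1\}^d} (-1)^{\,d - \lVert p \rVert_1}\, H(x^p, b),
\qquad
x^p_i =
\begin{cases}
u_i, & p_i = 1,\\
l_i - 1, & p_i = 0.
\end{cases}
```

For d = 2 this reduces to the familiar four-term expression H(u₁, u₂, b) − H(l₁ − 1, u₂, b) − H(u₁, l₂ − 1, b) + H(l₁ − 1, l₂ − 1, b).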

where x^p is a corner point of the rectangular region and p ∈ {0,1}^d.

Beneficial effects of the invention:

The visual query method for hierarchical integral histograms of the invention represents regions with different distributions at different precisions and scales and obtains approximate statistical features of the target region. It shifts the time cost to the data preprocessing stage: data sets that may need visual queries are preprocessed offline in advance to obtain an efficient approximate representation. Following the idea of approximating first and then refining gradually, it only needs to store the integral histograms of the aggregated data, which reduces time and space costs while capturing the data distribution well, so it is more widely applicable.

Description of the drawings

Fig. 1 is a schematic flow chart of the visual query method for hierarchical integral histograms of the invention.

Fig. 2 is a schematic diagram of a POI data set on a map after it has been divided into multiple subsets by the hierarchical partition tree.

Fig. 3 is an enlarged view of the dense region of Fig. 2.

Detailed description

As shown in Fig. 1, the visual query method for hierarchical integral histograms of this embodiment comprises the following steps:

Step 1: Consider a high-dimensional data set V with n dimensions D = {D₁, …, D_n}, where the domains of the dimensions are {[a₁, b₁], …, [a_n, b_n]}. Binning refers to a user-defined scale used to aggregate the data space. The user specifies which dimensions of the high-dimensional data set are used for binning, filtering, and aggregation, as shown in box a of Fig. 1.

Step 2: On the data obtained with the configuration of step 1, the system first applies an R-tree spatial partitioning algorithm (this embodiment uses the R*-tree, a variant of the R-tree) to partition the whole data space recursively, producing a hierarchical tree structure, as shown in box b of Fig. 1 and in Figs. 2 and 3. Fig. 3 shows that dense regions are divided into more subspaces and that the partition is of good quality. The data space is restructured as V′ = {v′₁, …, v′_p}, where each v′_i ∈ V′ corresponds to a leaf node of the R-tree. Integral histograms are then computed on all leaf nodes, as shown in box c of Fig. 1. The integral histogram is an extension of the summed area table, a two-dimensional table in which the value of each cell equals the sum of all values above and to the left of it, so the value of any cell can be obtained by adding and subtracting four table values.
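The recursive partitioning can be illustrated with a deliberately simpler stand-in. The sketch below is not an R*-tree: it splits at the median of alternating axes, in the style of a k-d tree, purely to show how recursion yields more and smaller leaves where the data is dense. The function names, the dict-based node layout, and the `max_leaf_size` parameter are all assumptions.

```python
import numpy as np


def partition(points, max_leaf_size=4, depth=0):
    """Recursively split points at the median of alternating axes.

    Stand-in for the R*-tree of the embodiment. Returns a nested dict in
    which every node carries its bounding box and leaves carry their points.
    Assumes max_leaf_size >= 1.
    """
    lo, hi = points.min(axis=0), points.max(axis=0)
    if len(points) <= max_leaf_size:
        return {"bbox": (lo, hi), "points": points}
    axis = depth % points.shape[1]
    order = np.argsort(points[:, axis], kind="stable")
    mid = len(points) // 2
    left, right = points[order[:mid]], points[order[mid:]]
    return {"bbox": (lo, hi),
            "children": [partition(left, max_leaf_size, depth + 1),
                         partition(right, max_leaf_size, depth + 1)]}


def leaves(node):
    """Collect the leaf nodes, i.e. the data subsets v'_1 ... v'_p."""
    if "children" not in node:
        return [node]
    return [leaf for child in node["children"] for leaf in leaves(child)]
```

A real R*-tree also optimizes overlap and margin when choosing splits, which this sketch ignores.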

Unlike the original summed area table, which stores a single scalar in each cell, the integral histogram summarizes the distribution of the data points that fall in each cell: a histogram is computed over all data points within each cell of a leaf node, and query results are returned in a way similar to computing the value of a rectangular region from a summed area table.

For a d-dimensional data set binned by an N₁ × … × N_d grid and summarized by histograms with b bins, the integral histogram of a leaf node is defined as:

where x₁, …, x_d are the bin indices along the d dimensions, b is the index of a bin in the histogram, and h(x₁, …, x_d, ·) denotes the histogram of the values in each grid cell. The integral histogram of any rectangular region in the data space can therefore be computed as:
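As a concrete illustration for d = 2, the following NumPy sketch builds the per-cell histograms, takes prefix sums over the two grid axes, and answers a rectangular query from four corner lookups. It is one possible realization under stated assumptions, not the patent's implementation: the function names, the equal-width binning, and the zero padding used to handle the "index below the first bin" case are all choices made here.

```python
import numpy as np


def integral_histogram(xy, values, grid_shape, n_bins, domain, value_range):
    """Return P with shape (nx + 1, ny + 1, n_bins).

    P[i + 1, j + 1, b] counts the points whose grid cell index is <= (i, j)
    and whose value falls in histogram bin b. Row 0 and column 0 are zero
    padding that stands for the empty prefix.
    """
    (x0, x1), (y0, y1) = domain
    nx, ny = grid_shape
    v0, v1 = value_range
    ix = np.clip(((xy[:, 0] - x0) / (x1 - x0) * nx).astype(int), 0, nx - 1)
    iy = np.clip(((xy[:, 1] - y0) / (y1 - y0) * ny).astype(int), 0, ny - 1)
    ib = np.clip(((values - v0) / (v1 - v0) * n_bins).astype(int), 0, n_bins - 1)
    h = np.zeros((nx, ny, n_bins), dtype=np.int64)
    np.add.at(h, (ix, iy, ib), 1)                 # per-cell histograms h(x1, x2, .)
    H = h.cumsum(axis=0).cumsum(axis=1)           # prefix sums over grid axes only
    return np.pad(H, ((1, 0), (1, 0), (0, 0)))    # zero row/column in front


def region_histogram(P, i0, j0, i1, j1):
    """Histogram of cells i0..i1 by j0..j1 (inclusive) from four corners."""
    return P[i1 + 1, j1 + 1] - P[i0, j1 + 1] - P[i1 + 1, j0] + P[i0, j0]
```

Each query touches four vectors of length `n_bins`, so its cost depends on the number of histogram bins and not on the number of data points or the size of the region.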

Step 3: The user defines a query range R and an aggregate function A. Together they form an aggregate query, denoted Q(R, A), which aggregates the data points that lie within the range.

Once the range of each query region is known, the value of each binned region can be queried in constant time using the integral histograms of step 2, as shown in box e of Fig. 1. The query returns a result histogram, from which an approximate aggregate can be estimated.
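How an aggregate is estimated from the result histogram is not spelled out in the text. One simple possibility, sketched here as an assumption, treats every point in a bin as if it sat at the bin center; `estimate_aggregate` and its `how` parameter are hypothetical names, and the bins are assumed to be of equal width over `value_range`.

```python
import numpy as np


def estimate_aggregate(hist, value_range, how="mean"):
    """Approximate count, sum, or mean from a histogram of equal-width bins."""
    hist = np.asarray(hist, dtype=float)
    v0, v1 = value_range
    edges = np.linspace(v0, v1, len(hist) + 1)
    centers = (edges[:-1] + edges[1:]) / 2   # representative value per bin
    count = hist.sum()
    if how == "count":
        return float(count)                  # exact: no binning error
    if how == "sum":
        return float(hist @ centers)
    if how == "mean":
        return float(hist @ centers / count) if count else float("nan")
    raise ValueError(f"unsupported aggregate: {how}")
```

The count is exact, while the sum and mean carry an error of at most half a bin width per point, so a finer histogram tightens the estimate at the cost of storage.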

Step 4: With the approximate aggregation result, the user can issue visualization requests on it, as shown in box d of Fig. 1. Because the precomputed hierarchical integral histograms are kept in memory, both the visualization and the construction of aggregate queries run online: the user can interactively adjust visualization parameters and obtain immediate visual feedback during visualization.

Claims

1. A visual query method for hierarchical integral histograms, characterized by comprising the following steps:

Step 1: configuring the raw data set, including the number of discretization intervals, the conditions for filtering data, and the dimensions on which aggregate statistics are required;

Step 2: on the data obtained with the configuration of step 1, building and storing a hierarchical partition tree by offline preprocessing, wherein the hierarchical partition tree divides the data into multiple data subsets and the statistical features of each data subset are represented by an integral histogram;

Step 3: using the configuration of step 1, uniformly discretizing the visualization space into small regions and, for each small region, inputting its coordinates into the hierarchical partition tree of step 2 to perform a range query, wherein the range query is the process of finding the data subsets that intersect the target region and using the integral histograms of those intersections to estimate the statistical features of the target region, and a matrix of statistical features is obtained once every small region has been queried;

Step 4: binding visual elements to the matrix of statistical features of step 3 and issuing the visualization request.

2. The visual query method for hierarchical integral histograms according to claim 1, characterized in that step 4 further comprises interactively adjusting the visualization parameters and obtaining immediate visual feedback during visualization.

3. The visual query method for hierarchical integral histograms according to claim 1, characterized in that, in step 2, the raw data set is a high-dimensional data set V with n dimensions D = {D₁, …, D_n}, and the domains of the dimensions are {[a₁, b₁], …, [a_n, b_n]}.

4. The visual query method for hierarchical integral histograms according to claim 1, characterized in that, in step 2, the hierarchical partition tree divides the data into multiple data subsets as follows: the whole data space is partitioned recursively, producing a hierarchical tree structure, and the data space is restructured as V′ = {v′₁, …, v′_i, …, v′_p}, where each v′_i ∈ V′ corresponds to a leaf node of the tree.

5. The visual query method for hierarchical integral histograms according to claim 4, characterized in that, in step 2, the integral histogram is an extension of the summed area table, in which the value of each cell equals the sum of all values above and to the left of it, so that the value of each cell can be obtained by adding and subtracting four values.

6. The visual query method for hierarchical integral histograms according to claim 5, characterized in that, in step 2, the integral histogram is computed as follows: for a d-dimensional data set binned by an N₁ × … × N_d grid and summarized by histograms with b bins, the integral histogram of a leaf node is defined as:

where x₁, …, x_d are the bin indices along the d dimensions, b is the index of a bin in the histogram, and h(x₁, …, x_d, ·) denotes the histogram of the values in each grid cell;

the integral histogram of any rectangular region in the data space can be computed as:

where x^p is a corner point of the rectangular region and p ∈ {0,1}^d.