CN103218404B - Multi-dimensional metadata management method and system based on association characteristics


Info

Publication number: CN103218404B
Authority: CN (China)
Application number: CN201310090042.4A
Other versions: CN103218404A
Other languages: Chinese (zh)
Legal status: Active
Inventors: 华宇, 黄大彰, 冯丹, 刘进军, 聂振华, 蔡娟
Original Assignee: Huazhong University of Science and Technology
Current Assignee: Huazhong University of Science and Technology
Prior art keywords: metadata, query, data server, collection

Classification: Information retrieval; database structures and file system structures therefor

Abstract

The invention discloses a multi-dimensional metadata management method based on association characteristics. In a metadata server cluster, the metadata on each metadata server is partitioned according to its association characteristics to generate metadata sets and a set statistics file. Based on the set statistics file, the metadata servers in the cluster are grouped to generate multiple metadata server groups and a group configuration file. A local index table is built on each metadata server from the set statistics file; a group index table is built within each metadata server group from the group configuration file and the set statistics file; and a top-level index table for the metadata server cluster is built from the group index tables. Query requests from users are received and answered by querying the top-level index table, the group index table, and the local index table in turn. The invention makes full use of the association characteristics among the multi-dimensional attributes of metadata, supports complex queries, and scales well.

Description

A multi-dimensional metadata management method and system based on association characteristics

Technical Field

The invention belongs to the field of computer data storage and, more specifically, relates to a multi-dimensional metadata management method and system based on association characteristics.

Background

With the arrival of cloud computing and cloud storage, the volume of data in information storage systems is growing geometrically, which makes efficient storage, management, and querying of that data increasingly difficult. The continued growth of massive data sets makes data storage and maintenance ever harder. Studies have shown that file data in real-world mass storage systems exhibits significant association characteristics. An association characteristic is the clustering of files in their attribute space; in essence it reflects the correlation between files. The correlations most commonly used are temporal and spatial: temporal correlation means that files close together in time tend to be accessed within the same short period, and spatial correlation means that files in adjacent locations are very likely to be accessed by subsequent requests. Beyond these two, files are correlated in many other ways, for example by file size, access frequency, or creator. Existing research, however, clearly lacks study of file correlation across more attributes. Taking more attributes into account helps distinguish the correlation between files more accurately: with a distance measure in the multi-dimensional attribute space, the correlation between two files can be computed explicitly. When processing massive data, measuring the correlation between data items and using it to partition the data into multiple clustered spaces brings clear benefits to subsequent processing.

However, existing metadata management methods have the following problems:

(1) They do not make full use of the association characteristics among the multi-dimensional attributes of metadata. Existing methods typically use only the temporal and spatial attributes of metadata and do not fully mine the association characteristics between metadata.

(2) They cannot effectively support complex query requests. Query requests that involve multiple metadata attributes, such as range queries and TopK queries, are not handled effectively by existing methods.

(3) They scale poorly. As the amount of metadata grows with the system, the query response time of existing methods increases significantly.

Summary of the Invention

In view of the defects of the prior art, the object of the present invention is to provide a multi-dimensional metadata management method based on association characteristics, intended to solve the metadata management problem in mass storage systems. The method makes full use of the association characteristics among the multi-dimensional attributes of metadata, supports complex queries, and scales well.

To achieve this object, the present invention provides a multi-dimensional metadata management method based on association characteristics, comprising the following steps:

(1) in a metadata server cluster, partitioning the metadata on each metadata server according to its association characteristics to generate metadata sets and a set statistics file;

(2) grouping the metadata servers in the cluster according to the set statistics file to generate multiple metadata server groups and a group configuration file;

(3) building a local index table on each metadata server according to the set statistics file; the local index table manages the metadata sets on that metadata server, and each entry records a metadata set number from the set statistics file together with the on-disk storage address of the corresponding metadata set;

(4) building a group index table within each metadata server group according to the group configuration file and the set statistics file;

(5) building a top-level index table for the metadata server cluster according to the group index tables;

(6) receiving a query request from a user, querying the top-level index table, the group index table, and the local index table in turn according to the query request, and returning the query result; user query requests include point queries, range queries, and TopK queries.
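The three index levels built in steps (3) to (5) can be pictured as three small record types. The following is a minimal sketch only: the class names, field names, and sample addresses are illustrative assumptions, and the top-level entry layout is inferred from how the query steps use it rather than specified by the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Range = Tuple[float, float]  # (min, max) of one attribute dimension


@dataclass
class LocalEntry:
    """One local index entry, kept per metadata server (step 3)."""
    set_id: int        # metadata set number from the set statistics file
    disk_address: int  # on-disk storage address of that metadata set


@dataclass
class GroupEntry:
    """One group index entry, kept per metadata server group (step 4)."""
    set_id: int
    server_ip: str     # metadata server that holds the set
    count: int         # number of metadata items in the set
    means: List[float]
    ranges: List[Range]


@dataclass
class TopEntry:
    """One top-level index entry (step 5); this layout is an assumption."""
    set_id: int
    group_ip: str      # address used to reach the metadata server group
    count: int
    ranges: List[Range]


# Hypothetical sample data: set 7 lives on server 10.0.0.12 in group 10.0.0.1.
local_index: Dict[int, LocalEntry] = {7: LocalEntry(7, 0x4000)}
group_index: Dict[int, GroupEntry] = {
    7: GroupEntry(7, "10.0.0.12", 2, [1536.0, 3.5], [(1024.0, 2048.0), (3.0, 4.0)])
}
top_index: Dict[int, TopEntry] = {
    7: TopEntry(7, "10.0.0.1", 2, [(1024.0, 2048.0), (3.0, 4.0)])
}
```

Each level narrows the search: the top-level entry names a group, the group entry names a server, and the local entry names a disk address.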

Step (1) comprises the following sub-steps:

(1-1) determining the multi-dimensional attributes that represent the association characteristics between metadata on each metadata server;

(1-2) constructing the multi-dimensional attributes of the metadata into a fixed-length input vector, which serves as the input to a locality-sensitive hash function;

(1-3) hashing the input vectors with the same locality-sensitive hash function; the resulting hash value serves as the unique identifier of the metadata corresponding to the input vector;

(1-4) placing metadata with the same hash value in the same metadata set and using that hash value as the number of the metadata set;

(1-5) collecting statistics on how the metadata is distributed across the metadata sets to generate the set statistics file; the set statistics file contains the metadata set number, the number of metadata items, the mean of each attribute dimension, and the range of each attribute dimension, where metadata set numbers range over 1, 2, 3, …, N and N is the length of the hash table used by the locality-sensitive hash function.
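Sub-steps (1-2) to (1-5) can be sketched as follows. The text does not name a specific locality-sensitive hash family, so this sketch assumes a single random-projection hash of the form floor((a·v + b) / w), folded into the range 1..N with a modulo; the function names, the bucket width, and the seed are all illustrative assumptions. In practice the attributes would likely need scaling to comparable units first.

```python
import random
from collections import defaultdict


def make_lsh(dim, n_buckets, width=4.0, seed=0):
    """Build h(v) -> set number in 1..n_buckets from one random projection."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random projection direction
    b = rng.uniform(0.0, width)                    # random offset

    def h(v):
        proj = sum(ai * vi for ai, vi in zip(a, v))
        # Folding with % is a simplification: it can merge distant buckets.
        return int((proj + b) // width) % n_buckets + 1

    return h


def partition(vectors, h):
    """Sub-step (1-4): metadata with equal hash values share one set."""
    sets = defaultdict(list)
    for vec in vectors:
        sets[h(vec)].append(vec)
    return sets


def set_statistics(sets):
    """Sub-step (1-5): count, per-dimension mean and range for each set."""
    stats = {}
    for set_id, vecs in sets.items():
        columns = list(zip(*vecs))
        stats[set_id] = {
            "count": len(vecs),
            "mean": [sum(c) / len(c) for c in columns],
            "range": [(min(c), max(c)) for c in columns],
        }
    return stats


# Hypothetical three-attribute metadata vectors.
records = [(1.0, 2.0, 3.0), (1.1, 2.0, 3.0), (50.0, 60.0, 70.0)]
h = make_lsh(dim=3, n_buckets=16)
stats = set_statistics(partition(records, h))
```

The resulting `stats` dictionary plays the role of the set statistics file: one record per set number, holding exactly the fields listed in sub-step (1-5).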

Step (2) is specifically as follows. A bit vector is constructed on each metadata server; its length equals the length of the hash table used by the locality-sensitive hash function in step (1). The Hamming distance between every pair of metadata server bit vectors is then computed, and a hierarchical clustering algorithm clusters the metadata servers on that basis to obtain the server groups. Clustering stops when the number of groups reaches the lower limit or the distance between groups reaches the upper limit. The result is multiple metadata server groups, which are saved in the group configuration file.
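The clustering in step (2) can be sketched as a simple bottom-up (agglomerative) procedure over the bit vectors. The text does not say how the distance between two groups is derived from pairwise server distances, so single linkage (the minimum pairwise Hamming distance) is assumed here; the function names and server labels are hypothetical.

```python
def hamming(u, v):
    """Number of positions at which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(u, v))


def group_distance(g1, g2, vectors):
    """Single-linkage distance between two groups (an assumption)."""
    return min(hamming(vectors[a], vectors[b]) for a in g1 for b in g2)


def cluster_servers(vectors, min_groups, max_distance):
    """Merge the closest groups until either stop condition is met."""
    groups = [[name] for name in vectors]  # start with one server per group
    while len(groups) > max(min_groups, 1):
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = group_distance(groups[i], groups[j], vectors)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d >= max_distance:  # distance between groups reached the upper limit
            break
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups


# Hypothetical bit vectors of length 8 for four metadata servers.
vectors = {
    "mds1": [1, 1, 0, 0, 0, 0, 0, 0],
    "mds2": [1, 1, 1, 0, 0, 0, 0, 0],
    "mds3": [0, 0, 0, 0, 0, 1, 1, 1],
    "mds4": [0, 0, 0, 0, 1, 1, 1, 1],
}
groups = cluster_servers(vectors, min_groups=1, max_distance=4)
```

With these inputs the two pairs of similar servers merge first, and the remaining distance between the two groups is 5, which exceeds the upper limit of 4, so clustering stops with two groups.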

Step (4) is specifically as follows. For each group in the group configuration file, a corresponding group index table is built. Each entry in the group index table records information about a metadata set on one of the metadata servers in the group: the metadata set number, the IP address of the metadata server that holds the set, the number of metadata items, the mean of each attribute dimension, and the range of each attribute dimension.
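As a rough illustration of step (4), a group index table can be assembled by walking the servers of one group and copying each server's per-set statistics into an entry tagged with that server's IP address. The function names, dictionary keys, and sample addresses below are assumptions made for this sketch.

```python
def build_group_index(group_servers, server_stats):
    """Build one group index table.

    group_servers: IP addresses of the servers in one group.
    server_stats:  ip -> {set_id: {"count", "mean", "range"}} per server.
    """
    index = []
    for ip in group_servers:
        for set_id, s in sorted(server_stats[ip].items()):
            index.append({
                "set_id": set_id,
                "server_ip": ip,
                "count": s["count"],
                "mean": s["mean"],
                "range": s["range"],
            })
    return index


def servers_holding(group_index, set_id):
    """IP addresses of the servers in the group that hold a given set."""
    return [e["server_ip"] for e in group_index if e["set_id"] == set_id]


# Hypothetical statistics for two servers in one group.
server_stats = {
    "10.0.0.12": {3: {"count": 2, "mean": [2074.0, 4.0], "range": [(2048, 2100), (3, 5)]}},
    "10.0.0.13": {
        3: {"count": 1, "mean": [2900.0, 2.0], "range": [(2900, 2900), (2, 2)]},
        5: {"count": 1, "mean": [4200.0, 7.0], "range": [(4200, 4200), (7, 7)]},
    },
}
group_index = build_group_index(["10.0.0.12", "10.0.0.13"], server_stats)
```

Because the same set number can appear on several servers, a lookup by set number may return more than one server address.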

The point query operation in step (6) comprises the following steps:

(6-1-1) receiving a point query request, determining the multi-dimensional attributes of the metadata the request refers to, and computing the hash value of those attributes with the locality-sensitive hash function; this hash value is the number of the metadata set to be queried;

(6-1-2) looking up the entry for the metadata set number in the top-level index table to obtain the IP address of the metadata server group that holds the metadata;

(6-1-3) determining the corresponding metadata server from the IP address of the metadata server group, and looking up the entry for the metadata set number in that server's group index table to obtain the IP address of the metadata server that holds the metadata;

(6-1-4) using the metadata server IP address found, looking up the entry for the metadata set number in that server's local index table to obtain the on-disk storage address of the metadata set that holds the metadata;

(6-1-5) using the on-disk storage address found, querying the corresponding metadata set and returning the query result.
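The point query steps above amount to a chain of three lookups followed by a scan of one metadata set. The sketch below models each index as a plain dictionary and replaces network and disk access with in-memory lookups; `toy_lsh`, the sample addresses, and the exact-match filter in the last step are assumptions for illustration, not part of the described method.

```python
def toy_lsh(attrs):
    """Stand-in for the real locality-sensitive hash (assumption)."""
    return int(attrs[0] // 1024) % 8 + 1


def point_query(attrs, lsh, top_index, group_indexes, local_indexes, disks):
    set_id = lsh(attrs)                           # (6-1-1) set number
    if set_id not in top_index:
        return []
    group_ip = top_index[set_id]                  # (6-1-2) server group
    server_ip = group_indexes[group_ip][set_id]   # (6-1-3) metadata server
    address = local_indexes[server_ip][set_id]    # (6-1-4) disk address
    metadata_set = disks[server_ip][address]      # (6-1-5) read the set
    return [m for m in metadata_set if m["attrs"] == attrs]


# Hypothetical cluster state: set 3 is stored on server 10.0.0.12.
top_index = {3: "10.0.0.1"}
group_indexes = {"10.0.0.1": {3: "10.0.0.12"}}
local_indexes = {"10.0.0.12": {3: 0x4000}}
disks = {
    "10.0.0.12": {
        0x4000: [
            {"name": "a.log", "attrs": (2048, 3)},
            {"name": "b.log", "attrs": (2100, 5)},
        ]
    }
}
result = point_query((2048, 3), toy_lsh, top_index, group_indexes, local_indexes, disks)
```

Only one metadata set is ever scanned, which is where the lookup cost saving comes from.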

The range query operation in step (6) comprises the following steps:

(6-2-1) receiving a range query request, determining the multi-dimensional attribute ranges to be queried, computing the median value of each attribute range, constructing an input vector from those median values, and computing the hash value of the input vector with the locality-sensitive hash function; this hash value is the number of the metadata set to be queried;

(6-2-2) looking up the entry for the metadata set number in the top-level index table and comparing the multi-dimensional attribute ranges in the query request with the ranges stored in the entry; if the two do not intersect, returning an empty result immediately; if they intersect, obtaining the IP address of the metadata server group that holds the metadata to be queried;

(6-2-3) determining the corresponding metadata server from the IP address of the metadata server group, and looking up the entry for the metadata set number in that server's group index table to obtain the IP address of the metadata server that holds the metadata;

(6-2-4) using the metadata server IP address found, looking up the entry for the metadata set number in that server's local index table to obtain the on-disk storage address of the metadata set that holds the metadata;

(6-2-5) using the on-disk storage address found, querying the corresponding metadata set and returning all metadata that falls within the multi-dimensional attribute ranges of the query request.
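The parts of the range query that differ from the point query are the median-value input vector in (6-2-1), the intersection test in (6-2-2), and the final filter in (6-2-5). The sketch below shows those three pieces and collapses the group and local index lookups into a single dictionary for brevity; `toy_lsh` and all sample values are assumptions.

```python
def toy_lsh(attrs):
    """Stand-in for the real locality-sensitive hash (assumption)."""
    return int(attrs[0] // 1024) % 8 + 1


def midpoint_vector(query_ranges):
    """(6-2-1) input vector built from the middle of each attribute range."""
    return tuple((lo + hi) / 2 for lo, hi in query_ranges)


def ranges_intersect(a, b):
    """True when two multi-dimensional ranges overlap in every dimension."""
    return all(alo <= bhi and blo <= ahi
               for (alo, ahi), (blo, bhi) in zip(a, b))


def in_ranges(vec, ranges):
    return all(lo <= x <= hi for x, (lo, hi) in zip(vec, ranges))


def range_query(query_ranges, lsh, top_index, metadata_sets):
    set_id = lsh(midpoint_vector(query_ranges))
    entry = top_index.get(set_id)
    # (6-2-2) an unknown set or disjoint ranges give an empty result at once.
    if entry is None or not ranges_intersect(query_ranges, entry["range"]):
        return []
    # (6-2-3) and (6-2-4) are collapsed into one dictionary lookup here.
    return [v for v in metadata_sets[set_id] if in_ranges(v, query_ranges)]


# Hypothetical state: set 3 covers sizes 2048..3000 and access counts 1..9.
top_index = {3: {"group_ip": "10.0.0.1", "range": [(2048, 3000), (1, 9)]}}
metadata_sets = {3: [(2048, 3), (2100, 5), (2900, 2)]}
hits = range_query([(2000, 2200), (0, 4)], toy_lsh, top_index, metadata_sets)
```

The intersection test lets the top-level index reject a query without contacting any metadata server when the stored ranges cannot contain a match.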

The TopK query operation in step (6) comprises the following steps:

(6-3-1) receiving a TopK query request, determining the multi-dimensional attributes of the metadata the request refers to and the value of K, and computing the hash value of those attributes with the locality-sensitive hash function; this hash value is the number of the metadata set to be queried, and K is the number of metadata items closest to the multi-dimensional attributes in the query request that are to be returned;

(6-3-2) looking up the entry for the metadata set number in the top-level index table; if the number of metadata items recorded in the entry is less than K, extending the query to the entries on either side of it until the total number of metadata items in the selected entries is greater than or equal to K, and finally obtaining the IP addresses of multiple metadata server groups;

(6-3-3) determining the corresponding metadata server from the IP address of the metadata server group, and looking up the entry for the metadata set number in that server's group index table to obtain the IP address of the metadata server that holds the metadata;

(6-3-4) using the metadata server IP address found, looking up the entry for the metadata set number in that server's local index table to obtain the on-disk storage address of the metadata set that holds the metadata;

(6-3-5) using the on-disk storage address found, querying the corresponding metadata set and returning the K metadata items closest to the multi-dimensional attributes in the query request.
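The neighbour expansion in (6-3-2) and the final selection in (6-3-5) can be sketched as below. The text does not specify the distance used to rank candidates, so Euclidean distance is assumed; widening by one entry on each side per round is also an assumption about how "the entries on either side" are added. `toy_lsh` and the sample data are hypothetical, and index lookups are collapsed into dictionaries.

```python
import math


def toy_lsh(attrs):
    """Stand-in for the real locality-sensitive hash (assumption)."""
    return int(attrs[0] // 1024) % 8 + 1


def candidate_sets(set_id, k, counts, n_sets):
    """(6-3-2) widen to neighbouring set numbers until they hold >= k items."""
    chosen = [set_id]
    total = counts.get(set_id, 0)
    step = 1
    while total < k and (set_id - step >= 1 or set_id + step <= n_sets):
        for neighbour in (set_id - step, set_id + step):
            if 1 <= neighbour <= n_sets:
                chosen.append(neighbour)
                total += counts.get(neighbour, 0)
        step += 1
    return chosen


def top_k(query, k, lsh, counts, n_sets, metadata_sets):
    ids = candidate_sets(lsh(query), k, counts, n_sets)
    candidates = [v for i in ids for v in metadata_sets.get(i, [])]
    # (6-3-5) keep the k candidates nearest to the query (Euclidean, assumed).
    return sorted(candidates, key=lambda v: math.dist(v, query))[:k]


# Hypothetical per-set item counts as recorded in the top-level index.
counts = {2: 1, 3: 2, 4: 3}
metadata_sets = {
    2: [(1500, 2)],
    3: [(2048, 3), (2100, 5)],
    4: [(3100, 1), (3200, 2), (3300, 9)],
}
nearest = top_k((2060, 3), 3, toy_lsh, counts, 8, metadata_sets)
```

Here the query hashes to set 3, which holds only two items, so sets 2 and 4 are added before the three nearest items are selected.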

Compared with the prior art, the above technical solution gives the method the following beneficial effects:

1. It makes full use of the association characteristics among the multi-dimensional attributes of metadata. Because of steps (1) and (2), metadata is partitioned into multiple metadata sets according to the association characteristics among its multi-dimensional attributes, and metadata with the same or similar multi-dimensional attributes is placed in the same metadata set, so all metadata can be managed effectively in units of metadata sets.

2. It effectively supports complex query requests such as range queries and TopK queries. Because of steps (3), (4), and (5), every kind of query request queries the top-level index table, the group index table, and the local index table in turn and is finally directed to a metadata set. Querying metadata with the same or similar multi-dimensional attributes inside a metadata set returns results quickly and accurately.

3. It meets the scalability requirement. Because metadata is partitioned into multiple metadata sets according to its association characteristics, rapid growth in the amount of system metadata causes only slow growth in the number of metadata sets, which preserves the effectiveness and efficiency of metadata management.

Another object of the present invention is to provide a multi-dimensional metadata management system based on association characteristics, intended to solve the metadata management problem in mass storage systems. The system makes full use of the association characteristics among the multi-dimensional attributes of metadata, supports complex queries, and scales well.

To achieve this object, the present invention provides a multi-dimensional metadata management system based on association characteristics, comprising a metadata set generation module, a metadata server grouping module, a local index generation module, a group index generation module, a top-level index generation module, and a query module. The metadata set generation module partitions the metadata on each metadata server according to its association characteristics to generate metadata sets and a set statistics file. The metadata server grouping module groups the metadata servers in the cluster according to the set statistics file to generate multiple metadata server groups and a group configuration file. The local index generation module builds a local index table on each metadata server according to the set statistics file. The group index generation module builds a group index table within each metadata server group according to the group configuration file and the set statistics file. The top-level index generation module builds the top-level index table of the metadata server cluster according to the group index tables. The query module receives query requests from users, queries the top-level index table, the group index table, and the local index table in turn according to the request, and returns the query result.

The metadata set generation module comprises a multi-dimensional attribute determination module, an input vector construction module, a hash function calculation module, an association characteristic partitioning module, and an output module. The multi-dimensional attribute determination module determines the multi-dimensional attributes that represent the association characteristics between metadata. The input vector construction module constructs the multi-dimensional attributes of the metadata into a fixed-length input vector, which serves as the input to the locality-sensitive hash function. The hash function calculation module hashes the input vectors with the same locality-sensitive hash function; the resulting hash value serves as the unique identifier of the metadata corresponding to the input vector. The association characteristic partitioning module places metadata with the same hash value in the same metadata set and uses that hash value as the number of the metadata set. The output module collects statistics on how the metadata is distributed across the metadata sets to generate the set statistics file.

The query module comprises a point query sub-module, a range query sub-module, and a TopK query sub-module. The point query sub-module handles point query requests: given the multi-dimensional attributes of a metadata item, the query returns the details of that item. The range query sub-module handles range query requests: given ranges for the multi-dimensional attributes, the query returns all metadata in the whole system that falls within those ranges. The TopK query sub-module handles TopK query requests: given a set of multi-dimensional attributes and a value of K, the query returns the K items in the whole system that are closest to the given attributes.

Compared with the prior art, the above technical solution gives the system the following beneficial effects:

1. It makes full use of the association characteristics among the multi-dimensional attributes of metadata. Because of the metadata set generation module and the metadata server grouping module, metadata is partitioned into multiple metadata sets according to the association characteristics among its multi-dimensional attributes, and metadata with the same or similar multi-dimensional attributes is placed in the same metadata set, so all metadata can be managed effectively in units of metadata sets.

2. It effectively supports complex query requests such as range queries and TopK queries. Because of the local index generation module, the group index generation module, and the top-level index generation module, every kind of query request queries the top-level index table, the group index table, and the local index table in turn and is finally directed to a metadata set. Querying metadata with the same or similar multi-dimensional attributes inside a metadata set returns results quickly and accurately.

3. It meets the scalability requirement. Because metadata is partitioned into multiple metadata sets according to its association characteristics, rapid growth in the amount of system metadata causes only slow growth in the number of metadata sets, which preserves the effectiveness and efficiency of metadata management.

Brief Description of the Drawings

Fig. 1 is a flowchart of the multi-dimensional metadata management method based on association characteristics according to the present invention.

Fig. 2 is a schematic module diagram of the multi-dimensional metadata management system based on association characteristics according to the present invention.

Fig. 3 is a detailed flowchart of step (1) of the method of the present invention.

Fig. 4 is a schematic block diagram of the metadata set generation module of the present invention.

Fig. 5 is a schematic diagram of the query process of the present invention.

Fig. 6 is a schematic block diagram of the query module in the system of the present invention.

Fig. 7 is a flowchart of the point query process of the present invention.

Fig. 8 is a flowchart of the range query process of the present invention.

Fig. 9 is a flowchart of the TopK query process of the present invention.

Detailed Description

To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.

The present invention is a multi-dimensional metadata management method and system based on association characteristics. The method uses the association characteristics between multi-dimensional metadata to group the metadata, placing metadata with the same or similar multi-dimensional attributes in the same metadata set, and can support a variety of complex query operations on metadata.

As shown in Fig. 1, the multi-dimensional metadata management method based on association characteristics comprises the following steps:

(1) in a metadata server cluster, partitioning the metadata on each metadata server according to its association characteristics to generate metadata sets and a set statistics file;

As shown in Fig. 3, this step comprises the following sub-steps:

(1-1) determining the multi-dimensional attributes that represent the association characteristics between metadata on each metadata server; in this embodiment, the multi-dimensional attributes include file size, file creation time, file modification time, file access count, and so on;

(1-2) constructing the multi-dimensional attributes of the metadata into a fixed-length input vector, which serves as the input to a locality-sensitive hash function; specifically, the fixed length equals the number of metadata dimensions, and the constructed input vector is expressed as (file size, file creation time, file modification time, file access count, …);

(1-3) hashing the input vectors with the same locality-sensitive hash function; the resulting hash value serves as the unique identifier of the metadata corresponding to the input vector;

(1-4) placing metadata with the same hash value in the same metadata set and using that hash value as the number of the metadata set;

(1-5) collecting statistics on how the metadata is distributed across the metadata sets to generate the set statistics file; the set statistics file contains the metadata set number, the number of metadata items, the mean of each attribute dimension, and the range of each attribute dimension, where metadata set numbers range over 1, 2, 3, …, N and N is the length of the hash table used by the locality-sensitive hash function.

There are many ways to group metadata by association characteristics; the present invention chooses locality-sensitive hashing. Locality-sensitive hashing maps multi-dimensional data into a one-dimensional space while preserving the spatial relationships between the data items, that is, data items that were adjacent remain adjacent after hashing. For each metadata item, a hash value is computed by locality-sensitive hashing, and metadata with the same hash value is gathered into the same metadata set, which achieves the grouping of the metadata.

The advantage of this step is that the association characteristics among the multi-dimensional attributes of metadata are used to partition all metadata into multiple metadata sets, and the metadata in any one set has the same or similar multi-dimensional attributes. A query on metadata can therefore be confined to the corresponding metadata set, which significantly improves query efficiency.

(2) According to the set statistics file, group the servers of the metadata server cluster to produce multiple metadata server groups and a group configuration file;

Specifically, a bit vector is built on each metadata server. Its length equals the length of the hash table used by the locality-sensitive hash function in step (1). The first bit indicates whether the metadata set numbered 1 exists on that server (1 if it exists, 0 otherwise), and so on up to the N-th bit, which indicates whether the metadata set numbered N exists (1 if it exists, 0 otherwise). The metadata servers are then clustered with a hierarchical clustering algorithm based on the pairwise Hamming distances between their bit vectors. Clustering stops when the number of groups reaches a lower limit or the distance between groups reaches an upper limit, which yields multiple metadata server groups; the result is saved in the group configuration file. In this embodiment, the upper limit equals half the bit-vector length and the lower limit is 1.

For example, suppose there are five metadata servers A, B, C, D and E, each with its own bit vector. First compute the pairwise Hamming distances between the five bit vectors, then merge the two servers whose bit vectors are closest (say A and B) into a cluster F; every cluster produced in this way is a metadata server group. The procedure is then repeated over F, C, D and E, and stops as soon as the number of groups reaches the lower limit or the distance between groups reaches the upper limit.

The advantage of this step is that dividing the metadata servers into multiple groups spreads query requests across those groups, which avoids a system bottleneck.
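The clustering in the example above can be sketched as follows. The patent does not specify how the distance between two groups is measured, so this sketch assumes single linkage (the smallest Hamming distance between any two member servers). The function names and the five bit vectors are hypothetical; the stopping thresholds follow the embodiment (lower limit 1, upper limit half the bit-vector length).

```python
from itertools import combinations

def bit_vector(set_numbers, n):
    """Bit i (1-based) is 1 if the metadata set numbered i exists on the server."""
    return [1 if i in set_numbers else 0 for i in range(1, n + 1)]

def hamming(u, v):
    """Number of positions at which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(u, v))

def cluster_servers(vectors, min_groups, max_distance):
    """Agglomerative clustering of servers by Hamming distance (single linkage)."""
    groups = [[name] for name in sorted(vectors)]

    def distance(g1, g2):
        return min(hamming(vectors[a], vectors[b]) for a in g1 for b in g2)

    while len(groups) > min_groups:
        # Closest pair of groups; ties are broken by group position.
        d, i, j = min((distance(groups[i], groups[j]), i, j)
                      for i, j in combinations(range(len(groups)), 2))
        if d >= max_distance:  # groups are too far apart: stop merging
            break
        merged = groups[i] + groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return groups

N = 8  # hash table length
vectors = {
    "A": bit_vector({1, 2, 3}, N),
    "B": bit_vector({1, 2, 3, 4}, N),
    "C": bit_vector({6, 7, 8}, N),
    "D": bit_vector({5, 6, 7, 8}, N),
    "E": bit_vector({4, 5}, N),
}
groups = cluster_servers(vectors, min_groups=1, max_distance=N // 2)
```

With these vectors, A and B merge first (distance 1), then C and D (distance 1); E stays alone because its distance to either pair equals the upper limit of 4.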

(3) According to the set statistics file, build a local index table on each metadata server. The local index table manages the metadata sets on that server; each entry records a metadata set number from the set statistics file and the on-disk storage address of the corresponding metadata set.

(4) According to the group configuration file and the set statistics file, build a group index table within each metadata server group. Specifically, a group index table is built for each group listed in the group configuration file. Its entries record information about the metadata sets on all metadata servers in the group: the metadata set number, the IP address of the metadata server holding the set, the number of metadata records, the mean of each attribute dimension, and the range of each attribute dimension.

In a concrete implementation, the group index table may be stored on any one metadata server of the group, which is then called the leader metadata server of that group. Where data redundancy is a concern, the group index table may also be stored on several metadata servers of the group.

(5) According to the group index tables, build the top-level index table of the metadata server cluster. Specifically, the top-level index table manages information about the metadata server groups: the metadata set number, the IP address of the group holding the metadata set, the number of metadata records, the mean of each attribute dimension, and the range of each attribute dimension.

The top-level index table may be stored on any selected metadata server; to avoid a single point of failure, it may also be stored on several metadata servers.

Steps (3) to (5) build the three-level index of the whole system. Through it, a metadata query is first routed to a metadata server group, then to a specific metadata server, and finally to the metadata set that contains the metadata. The three-level index locates metadata sets quickly and so reduces query time.
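One way to picture the three index tables built in steps (3) to (5) is as nested dictionaries. The sketch below is an assumption about the layout, not the patent's storage format; the function names, IP addresses, and paths are hypothetical, and the per-dimension means are omitted for brevity.

```python
from collections import defaultdict

def build_local_index(set_addresses):
    """Local index table: metadata set number -> storage address on disk."""
    return dict(set_addresses)

def build_group_index(server_stats):
    """Group index table: set number -> one entry per server holding that set."""
    index = defaultdict(list)
    for server_ip, stats in server_stats.items():
        for set_no, st in stats.items():
            index[set_no].append({"server": server_ip, **st})
    return dict(index)

def build_top_index(group_indexes):
    """Top-level index table: set number -> one summary entry per server group."""
    top = defaultdict(list)
    for group_ip, group_index in group_indexes.items():
        for set_no, entries in group_index.items():
            # Per-dimension lower and upper bounds across all servers in the group.
            lows = zip(*[[lo for lo, _ in e["range"]] for e in entries])
            highs = zip(*[[hi for _, hi in e["range"]] for e in entries])
            top[set_no].append({
                "group": group_ip,
                "count": sum(e["count"] for e in entries),
                "range": [(min(l), max(h)) for l, h in zip(lows, highs)],
            })
    return dict(top)

# Hypothetical statistics for one group of two servers, two attribute dimensions.
server_stats = {
    "10.0.0.1": {3: {"count": 2, "range": [(1024, 2048), (5, 7)]}},
    "10.0.0.2": {3: {"count": 1, "range": [(512, 900), (9, 9)]},
                 7: {"count": 4, "range": [(90000, 95000), (1, 2)]}},
}
local_index = build_local_index({3: "/meta/set_0003", 7: "/meta/set_0007"})
group_index = build_group_index(server_stats)
top_index = build_top_index({"10.0.0.1": group_index})  # keyed by group leader IP
```

Keying the top-level table by the leader's IP address reflects the description's note that the group index table lives on the group's leader metadata server.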

(6) Receive a query request from the user, query the top-level index table, the group index table and the local index table in turn according to the request, and return the query result. User query requests include point queries, range queries and TopK queries.

As shown in Figure 6, the point query operation in step (6) comprises the following steps:

(6-1-1) Receive a point query request, determine the multi-dimensional attributes of the metadata that the request refers to, and compute the hash value of those attributes with the locality-sensitive hash function; this hash value is the number of the metadata set to be queried;

(6-1-2) Look up the entry for the metadata set number in the top-level index table to obtain the IP address of the metadata server group holding the metadata;

(6-1-3) Determine the corresponding metadata server from the IP address of the metadata server group, and look up the entry for the metadata set number in the group index table on that server to obtain the IP address of the metadata server holding the metadata;

(6-1-4) Using the IP address of the metadata server found, look up the entry for the metadata set number in the group index table of that metadata server to obtain the on-disk storage address of the metadata set holding the metadata;

(6-1-5) Using the on-disk storage address found, query the corresponding metadata set and return the query result;
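The point-query steps above can be sketched as a chain of dictionary lookups. This is an illustrative sketch only: the tables are simplified to hold one location per set number, the disk address is read from the local index table (the table that step (3) says records storage addresses), and `toy_lsh` is a hypothetical stand-in for the real locality-sensitive hash function. All addresses, paths, and file names are made up.

```python
def toy_lsh(attrs):
    """Stand-in hash (assumption): bucket by file size, set numbers 1..8."""
    return attrs[0] // 1000 % 8 + 1

def point_query(attrs, lsh, top_index, group_indexes, local_indexes, storage):
    """Resolve a point query through the top-level, group, and local indexes."""
    set_no = lsh(attrs)                                # (6-1-1) set number
    group_ip = top_index.get(set_no)                   # (6-1-2) server group
    if group_ip is None:
        return None
    server_ip = group_indexes[group_ip].get(set_no)    # (6-1-3) metadata server
    if server_ip is None:
        return None
    address = local_indexes[server_ip].get(set_no)     # (6-1-4) disk address
    if address is None:
        return None
    for record in storage[address]:                    # (6-1-5) scan the set
        if record["attrs"] == attrs:
            return record
    return None

top_index = {3: "10.0.0.1"}
group_indexes = {"10.0.0.1": {3: "10.0.0.2"}}
local_indexes = {"10.0.0.2": {3: "/meta/set_0003"}}
storage = {
    "/meta/set_0003": [
        {"name": "a.log", "attrs": (2100, 90, 95, 3)},
        {"name": "b.dat", "attrs": (2048, 100, 130, 7)},
    ],
}
hit = point_query((2048, 100, 130, 7), toy_lsh,
                  top_index, group_indexes, local_indexes, storage)
miss = point_query((9000, 1, 2, 3), toy_lsh,
                   top_index, group_indexes, local_indexes, storage)
```

Only the one metadata set selected by the hash value is scanned, which is the source of the efficiency gain the description claims.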

As shown in Figure 7, the range query operation in step (6) comprises the following steps:

(6-2-1) Receive a range query request, determine the multi-dimensional attribute ranges to be searched, compute the midpoint of each attribute range, build an input vector from these midpoints, and compute its hash value with the locality-sensitive hash function; this hash value is the number of the metadata set to be queried;

(6-2-2) Look up the entry for the metadata set number in the top-level index table and compare the attribute ranges in the query with the attribute ranges stored in the entry. If the two do not intersect, return an empty result immediately; if they intersect, obtain the IP address of the metadata server group that holds the metadata sought;

(6-2-3) Determine the corresponding metadata server from the IP address of the metadata server group, and look up the entry for the metadata set number in the group index table on that server to obtain the IP address of the metadata server holding the metadata;

(6-2-4) Using the IP address of the metadata server found, look up the entry for the metadata set number in the group index table of that metadata server to obtain the on-disk storage address of the metadata set holding the metadata;

(6-2-5) Using the on-disk storage address found, query the corresponding metadata set and return all metadata that fall within the multi-dimensional attribute ranges of the query;
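The range-specific parts of the steps above (midpoint vector, range intersection test, and final filtering) can be sketched as follows. The routing through the group and local indexes is collapsed into a hypothetical `fetch_set` callback, `toy_lsh` is a stand-in for the real hash function, and the two-dimensional sample data are invented for illustration.

```python
def midpoint_vector(query_range):
    """Input vector built from the midpoint of each attribute range (6-2-1)."""
    return tuple((lo + hi) / 2 for lo, hi in query_range)

def ranges_intersect(query_range, stored_range):
    """True if the two boxes overlap in every attribute dimension (6-2-2)."""
    return all(q_lo <= s_hi and s_lo <= q_hi
               for (q_lo, q_hi), (s_lo, s_hi) in zip(query_range, stored_range))

def in_range(record, query_range):
    return all(lo <= x <= hi for x, (lo, hi) in zip(record, query_range))

def range_query(query_range, lsh, top_index, fetch_set):
    set_no = lsh(midpoint_vector(query_range))
    entry = top_index.get(set_no)
    if entry is None or not ranges_intersect(query_range, entry["range"]):
        return []  # ranges do not intersect: empty result without touching disk
    # fetch_set stands for steps (6-2-3) and (6-2-4): locating and loading the set.
    return [r for r in fetch_set(set_no) if in_range(r, query_range)]

def toy_lsh(vector):
    """Stand-in hash (assumption): bucket by file size in steps of 1000."""
    return int(vector[0]) // 1000 + 1

# Two dimensions for brevity: (file size, access count).
data = {2: [(1000, 10), (1200, 40), (1900, 90)]}
top_index = {2: {"group": "10.0.0.1", "range": [(1000, 1900), (10, 90)]}}

found = range_query([(1100, 1500), (0, 100)], toy_lsh, top_index, data.get)
empty = range_query([(1950, 1990), (0, 100)], toy_lsh, top_index, data.get)
```

As in the description, only the set selected by the midpoint's hash value is examined, so a range that spans several sets would need further handling that the text does not spell out.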

As shown in Figure 8, the TopK query operation in step (6) comprises the following steps:

(6-3-1) Receive a TopK query request, determine the multi-dimensional attributes of the metadata that the request refers to and the value of K, and compute the hash value of those attributes with the locality-sensitive hash function; this hash value is the number of the metadata set to be queried. Here K is the number of metadata records closest to the multi-dimensional attributes given in the query;

(6-3-2) Look up the entry for the metadata set number in the top-level index table. If the number of metadata records stored in the entry is less than K, extend the search to the entries on either side of it until the total number of metadata records in the selected entries is greater than or equal to K; this yields the IP addresses of one or more metadata server groups;

(6-3-3) Determine the corresponding metadata server from the IP address of the metadata server group, and look up the entry for the metadata set number in the group index table on that server to obtain the IP address of the metadata server holding the metadata;

(6-3-4) Using the IP address of the metadata server found, look up the entry for the metadata set number in the group index table of that metadata server to obtain the on-disk storage address of the metadata set holding the metadata;

(6-3-5) Using the on-disk storage address found, query the corresponding metadata set and return the K metadata records closest to the multi-dimensional attributes given in the query;
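The neighbour expansion in step (6-3-2) and the final selection in step (6-3-5) can be sketched as below. Several details are assumptions: the search widens symmetrically by one set number per round, closeness is measured by Euclidean distance (the patent does not name a metric), routing through the group and local indexes is collapsed into a hypothetical `fetch_set` callback, and `toy_lsh` stands in for the real hash function.

```python
import heapq
import math

def candidate_sets(set_no, k, counts, n):
    """Widen the search to neighbouring set numbers until at least k records."""
    chosen = [set_no]
    total = counts.get(set_no, 0)
    step = 1
    while total < k and (set_no - step >= 1 or set_no + step <= n):
        for neighbour in (set_no - step, set_no + step):
            if neighbour in counts:  # skip set numbers with no entry
                chosen.append(neighbour)
                total += counts[neighbour]
        step += 1
    return chosen

def top_k(query, k, lsh, counts, fetch_set, n):
    """Return the k records nearest to `query` among the candidate sets."""
    candidates = [r for s in candidate_sets(lsh(query), k, counts, n)
                  for r in fetch_set(s)]
    return heapq.nsmallest(k, candidates, key=lambda r: math.dist(r, query))

def toy_lsh(vector):
    """Stand-in hash (assumption): bucket by file size in steps of 1000."""
    return int(vector[0]) // 1000 + 1

# Two dimensions for brevity: (file size, access count).
data = {1: [(100, 1)], 2: [(1100, 2), (1200, 3)], 3: [(2100, 4)]}
counts = {s: len(rs) for s, rs in data.items()}

nearest = top_k((1150, 2), 3, toy_lsh, counts,
                lambda s: data.get(s, []), n=4)
```

Set 2 alone holds only two records, so sets 1 and 3 are added before the three nearest records are chosen.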

Metadata consist of multi-dimensional attributes, including file size, file creation time, file modification time, file access count, and so on. In a file system, metadata records are often correlated with one another, for example by having similar sizes and access times. Traditional metadata query methods exploit only the spatial and temporal locality of metadata and ignore the correlation among the other attributes. Faced with complex queries such as TopK queries and range queries, traditional metadata management has to traverse all metadata to obtain a result. The present invention makes full use of the correlation among metadata and gathers metadata with identical or similar attributes into metadata sets. For every metadata query, the invention first narrows the query to one or a few metadata sets, which greatly reduces the number of metadata records examined and returns results faster.

As shown in Figure 2, the multi-dimensional metadata management system based on correlation characteristics of the present invention comprises a metadata set generation module 1, a metadata server grouping module 2, a local index generation module 3, a group index generation module 4, a top-level index generation module 5, and a query module 6.

The metadata set generation module 1 divides the metadata on each metadata server according to correlation characteristics to generate metadata sets and a set statistics file. As shown in Figure 4, the metadata set generation module 1 comprises a multi-dimensional attribute determination module 11, an input vector construction module 12, a hash function calculation module 13, a correlation-based division module 14, and an output module 15.

The multi-dimensional attribute determination module 11 determines the multi-dimensional attributes that represent the correlation characteristics among metadata;

The input vector construction module 12 constructs a fixed-length input vector from the multi-dimensional attributes of the metadata; this vector is the input to the locality-sensitive hash function;

The hash function calculation module 13 hashes every input vector with the same locality-sensitive hash function, and the resulting hash value serves as the unique identifier of the metadata corresponding to that input vector;

The correlation-based division module 14 places metadata with the same hash value in the same metadata set and uses that hash value as the number of the metadata set;

The output module 15 compiles statistics on how the metadata are distributed among the metadata sets to generate the set statistics file;

The metadata server grouping module 2 groups the servers of the metadata server cluster according to the set statistics file to produce multiple metadata server groups and a group configuration file;

The local index generation module 3 builds a local index table on each metadata server according to the set statistics file;

The group index generation module 4 builds a group index table within each metadata server group according to the group configuration file and the set statistics file;

The top-level index generation module 5 builds the top-level index table of the metadata server cluster according to the group index tables;

The query module 6 receives a query request from the user, queries the top-level index table, the group index table and the local index table in turn according to the request, and returns the query result. As shown in Figure 6, the query module comprises a point query sub-module 61, a range query sub-module 62, and a TopK query sub-module 63.

Figure 5 gives an overview of the query module 6. A query request is first sent to the metadata server that holds the top-level index table. The top-level index table built by the top-level index generation module 5 routes the request to a metadata server group, and the request is forwarded to the corresponding group index built by the group index generation module 4. Looking up the group index table routes the request to one or more metadata servers, and the request is forwarded to the corresponding local index built by the local index generation module 3. Finally, the metadata set to be queried is determined and the results that satisfy the query are returned.

Specifically, the point query sub-module 61 handles point query requests: given the multi-dimensional attributes of a metadata record, it returns the details of that record. The range query sub-module 62 handles range query requests: given ranges for the multi-dimensional attributes, it returns all metadata in the whole system that fall within those ranges. The TopK query sub-module 63 handles TopK query requests: given a set of multi-dimensional attributes and a value of K, it returns the K records in the whole system closest to the given attributes.

To verify the feasibility and effectiveness of the system of the present invention, the system was deployed in a real environment and query operations were run against it.

The hardware and software used to test the system are listed in Table 1:

Table 1

The system was configured as follows. First, the trace file used for the test was distributed to every node. Then each node ran the metadata set generation module 1 and the metadata server grouping module 2; in this test the 5 nodes were divided into three groups, with 1, 2 and 3 nodes respectively. The local index generation module 3 was run in each group, and one node in each group was chosen to store the group index table generated by the group index generation module 4; that node also stored the top-level index table generated by the top-level index generation module 5.

User query requests are handled by the query module 6. It first queries the top-level index table to determine the metadata server group that the request targets, then queries the group index table within that group to determine the metadata server, and finally uses the local index table to determine the metadata set to be queried. In this way a request is ultimately confined to one or a few metadata sets, which effectively reduces query time. Table 2 compares the average query time of the system of the present invention with that of a relational database system.

Table 2

Those skilled in the art will readily understand that the above are only preferred embodiments of the present invention and do not limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (8)

1. A multi-dimensional metadata management method based on correlation characteristics, characterized in that it comprises the following steps:
(1) in a metadata server cluster, dividing the metadata on each metadata server according to correlation characteristics to generate metadata sets and a set statistics file; this step comprises the following sub-steps:
(1-1) determining the multi-dimensional attributes that represent the correlation characteristics among the metadata on each metadata server;
(1-2) constructing a fixed-length input vector from the multi-dimensional attributes of the metadata, the input vector serving as the input to a locality-sensitive hash function;
(1-3) hashing every input vector with the same locality-sensitive hash function, the resulting hash value serving as the unique identifier of the metadata corresponding to that input vector;
(1-4) placing metadata with the same hash value in the same metadata set, and using that hash value as the number of the metadata set;
(1-5) compiling statistics on how the metadata are distributed among the metadata sets to generate the set statistics file; the set statistics file comprises the metadata set number, the number of metadata records, the mean of each attribute dimension, and the range of each attribute dimension, wherein metadata set numbers take the values 1, 2, 3, ..., N, and N is the length of the hash table used by the locality-sensitive hash function;
(2) grouping the servers of the metadata server cluster according to the set statistics file to produce multiple metadata server groups and a group configuration file;
(3) building a local index table on each metadata server according to the set statistics file; the local index table manages the metadata sets on that metadata server, and each entry records a metadata set number from the set statistics file and the on-disk storage address of the corresponding metadata set;
(4) building a group index table within each metadata server group according to the group configuration file and the set statistics file;
(5) building the top-level index table of the metadata server cluster according to the group index tables;
(6) receiving a query request from a user, querying the top-level index table, the group index table and the local index table in turn according to the request, and returning the query result; wherein user query requests comprise point queries, range queries and TopK queries.
2. The multi-dimensional metadata management method according to claim 1, characterized in that step (2) is specifically: a bit vector is built on each metadata server, the length of the bit vector being equal to the length of the hash table used by the locality-sensitive hash function in step (1-3); the metadata servers are then clustered with a hierarchical clustering algorithm based on the pairwise Hamming distances between the bit vectors of all metadata servers to obtain the metadata server groups; clustering stops when the number of groups reaches a lower limit or the distance between groups reaches an upper limit, which yields multiple metadata server groups, and the result is saved in the group configuration file.
3. The multi-dimensional metadata management method according to claim 1, characterized in that step (4) is specifically: a group index table is built for each group in the group configuration file, and each entry of the group index table records information about the metadata sets on all metadata servers in the group, comprising the metadata set number, the IP address of the metadata server holding the metadata set, the number of metadata records, the mean of each attribute dimension, and the range of each attribute dimension.
4. The multi-dimensional metadata management method according to claim 1, characterized in that the point query operation in step (6) comprises the following steps:
(6-1-1) receiving a point query request, determining the multi-dimensional attributes of the metadata that the request refers to, and computing the hash value of those attributes with the locality-sensitive hash function, the hash value being the number of the metadata set to be queried;
(6-1-2) looking up the entry for the metadata set number in the top-level index table to obtain the IP address of the metadata server group holding the metadata;
(6-1-3) determining the corresponding metadata server from the IP address of the metadata server group, and looking up the entry for the metadata set number in the group index table of that metadata server to obtain the IP address of the metadata server holding the metadata;
(6-1-4) using the IP address of the metadata server found, looking up the entry for the metadata set number in the group index table of that metadata server to obtain the on-disk storage address of the metadata set holding the metadata;
(6-1-5) using the on-disk storage address found, querying the corresponding metadata set and returning the query result.
5. The multi-dimensional metadata management method according to claim 1, characterized in that the range query operation in step (6) comprises the following steps:
(6-2-1) receiving a range query request, determining the multi-dimensional attribute ranges to be searched, computing the midpoint of each attribute range, building an input vector from these midpoints, and computing the hash value of the input vector with the locality-sensitive hash function, the hash value being the number of the metadata set to be queried;
(6-2-2) looking up the entry for the metadata set number in the top-level index table and comparing the attribute ranges in the query with the attribute ranges stored in the entry; if the two do not intersect, returning an empty result immediately; if they intersect, obtaining the IP address of the metadata server group that holds the metadata sought;
(6-2-3) determining the corresponding metadata server from the IP address of the metadata server group, and looking up the entry for the metadata set number in the group index table of that metadata server to obtain the IP address of the metadata server holding the metadata;
(6-2-4) using the IP address of the metadata server found, looking up the entry for the metadata set number in the group index table of that metadata server to obtain the on-disk storage address of the metadata set holding the metadata;
(6-2-5) using the on-disk storage address found, querying the corresponding metadata set and returning all metadata that fall within the multi-dimensional attribute ranges of the query.
6. The multi-dimensional metadata management method according to claim 1, characterized in that the TopK query operation in step (6) comprises the following steps:
(6-3-1) receiving a TopK query request, determining the multi-dimensional attributes of the metadata that the request refers to and the value of K, and computing the hash value of those attributes with the locality-sensitive hash function, the hash value being the number of the metadata set to be queried; wherein K is the number of metadata records closest to the multi-dimensional attributes given in the query;
(6-3-2) looking up the entry for the metadata set number in the top-level index table; if the number of metadata records stored in the entry is less than K, extending the search to the entries on either side of it until the total number of metadata records in the selected entries is greater than or equal to K, finally obtaining the IP addresses of multiple metadata server groups;
(6-3-3) determining the corresponding metadata server from the IP address of the metadata server group, and looking up the entry for the metadata set number in the group index table of that metadata server to obtain the IP address of the metadata server holding the metadata;
(6-3-4) using the IP address of the metadata server found, looking up the entry for the metadata set number in the group index table of that metadata server to obtain the on-disk storage address of the metadata set holding the metadata;
(6-3-5) using the on-disk storage address found, querying the corresponding metadata set and returning the K metadata records closest to the multi-dimensional attributes given in the query.
7. A multidimensional metadata management system based on correlation characteristics, comprising a metadata set generation module, a metadata server grouping module, a local index generation module, a group index generation module, a top-level index generation module and a query module, characterized in that:
The metadata set generation module partitions the metadata on each metadata server according to correlation characteristics, to generate metadata sets and a set statistics file; the metadata set generation module comprises a multidimensional attribute determination module, an input vector construction module, a hash function computation module, a correlation partitioning module and an output module;
The multidimensional attribute determination module determines the multidimensional attributes that represent the correlation characteristics among metadata;
The input vector construction module constructs the multidimensional attributes of a metadata item into a fixed-length input vector, which serves as the input value of a locality-sensitive hash function;
The hash function computation module applies the same locality-sensitive hash function to every input vector, and the resulting hash value serves as the unique identifier of the metadata corresponding to that input vector;
The correlation partitioning module places metadata with the same hash value into the same metadata set, and uses that hash value as the number of the metadata set;
The output module compiles statistics on how the metadata are distributed among the metadata sets, to generate the set statistics file;
The metadata server grouping module groups the metadata server cluster according to the set statistics file, to generate multiple metadata server groups and a grouping configuration file;
The local index generation module builds a local index table on each metadata server according to the set statistics file;
The group index generation module builds a group index table within each metadata server group according to the grouping configuration file and the set statistics file;
The top-level index generation module builds the top-level index table of the metadata server cluster according to the group index tables;
The query module receives a query request from a user, queries the top-level index table, the group index table and the local index table in turn according to the query request, and returns the query result.
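The work of the metadata set generation module can be sketched as follows. The claims do not name a particular locality-sensitive hash function, so this sketch swaps in a simple random-projection scheme as an assumption; the function names `make_lsh` and `partition`, the parameters `num_hashes`, `width` and `seed`, and the use of a tuple as the set number are all hypothetical.

```python
import math
import random
from collections import defaultdict

def make_lsh(dim, num_hashes=4, width=4.0, seed=0):
    """Build one locality-sensitive hash function for fixed-length vectors.

    Assumed scheme: each component is floor((a . v + b) / width) for a random
    Gaussian vector a and a random offset b.
    """
    rng = random.Random(seed)
    projections = [([rng.gauss(0, 1) for _ in range(dim)], rng.uniform(0, width))
                   for _ in range(num_hashes)]

    def lsh(vec):
        if len(vec) != dim:
            raise ValueError("input vector must have fixed length %d" % dim)
        return tuple(
            math.floor((sum(a * x for a, x in zip(proj, vec)) + b) / width)
            for proj, b in projections
        )

    return lsh

def partition(metadata, lsh):
    """Group metadata by hash value and count the members of each set.

    metadata: dict mapping a metadata name to its attribute vector.
    Returns (sets, stats); stats stands in for the set statistics file.
    """
    sets = defaultdict(list)
    for name, vec in metadata.items():
        sets[lsh(vec)].append(name)  # the hash value doubles as the set number
    stats = {set_id: len(names) for set_id, names in sets.items()}
    return dict(sets), stats
```

Because the same hash function is applied to every vector, items with identical attributes always land in the same set, and items with nearby attributes are likely to.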
8. The multidimensional metadata management system according to claim 7, characterized in that:
The query module specifically comprises a point query submodule, a range query submodule and a TopK query submodule;
The point query submodule processes a user's point query request: given the multidimensional attributes of a particular metadata item, the query result returns the specific information of that metadata item;
The range query submodule processes a user's range query request: given ranges of the multidimensional attributes, the query result returns all metadata in the whole system that fall within those ranges;
The TopK query submodule processes a user's TopK query request: given a set of multidimensional attributes and a specified K value, the query result returns the K metadata items in the whole system that are closest to the given multidimensional attributes.
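The three query types can be illustrated with a minimal in-memory sketch. It assumes each index table is a plain dictionary keyed by metadata set number and that closeness is measured by Euclidean distance; neither assumption comes from the claims, and the function names are hypothetical.

```python
import heapq
import math

def point_query(set_number, top_index, group_indexes, local_indexes):
    """Walk the top-level, group and local index tables in turn."""
    group_ip = top_index[set_number]                  # top-level: set -> group
    server_ip = group_indexes[group_ip][set_number]   # group: set -> server
    disk_address = local_indexes[server_ip][set_number]  # local: set -> disk
    return server_ip, disk_address

def range_query(records, low, high):
    """Names of records whose every attribute lies within [low, high]."""
    return [name for name, vec in records.items()
            if all(l <= x <= h for l, x, h in zip(low, vec, high))]

def topk_query(records, target, k):
    """Names of the k records closest to target (Euclidean distance assumed)."""
    return heapq.nsmallest(k, records,
                           key=lambda name: math.dist(records[name], target))
```

In the claimed system the range and TopK queries would run against metadata sets located through the index tables; here `records` stands in for one such set already loaded from disk.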
CN201310090042.4A 2013-03-20 2013-03-20 A kind of multi-dimensional metadata management method based on associate feature and system Active CN103218404B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310090042.4A CN103218404B (en) 2013-03-20 2013-03-20 A kind of multi-dimensional metadata management method based on associate feature and system

Publications (2)

Publication Number Publication Date
CN103218404A CN103218404A (en) 2013-07-24
CN103218404B true CN103218404B (en) 2015-11-18

Family

ID=48816191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310090042.4A Active CN103218404B (en) 2013-03-20 2013-03-20 A kind of multi-dimensional metadata management method based on associate feature and system

Country Status (1)

Country Link
CN (1) CN103218404B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104424240B (en) * 2013-08-27 2019-06-14 腾讯科技(深圳)有限公司 Multilist correlating method, main service node, calculate node and system
WO2015048925A1 (en) * 2013-10-03 2015-04-09 Huawei Technologies Co., Ltd. A method of optimizing queries execution on a data store
CN104657383B (en) * 2013-11-22 2017-11-24 华中科技大学 A kind of repetition video detecting method and system based on associate feature
CN103970871B (en) * 2014-05-12 2017-06-16 华中科技大学 File metadata querying method and system based on information of tracing to the source in storage system
CN103984640B (en) * 2014-05-14 2017-06-20 华为技术有限公司 Realize data prefetching method and device
CN105956122A (en) * 2016-05-03 2016-09-21 无锡雅座在线科技发展有限公司 Object attribute determining method and device
CN107818117B (en) * 2016-09-14 2022-02-15 阿里巴巴集团控股有限公司 Data table establishing method, online query method and related device
CN107562946A (en) * 2017-09-26 2018-01-09 南京哈卢信息科技有限公司 A kind of method that concordance list is created in big data system
CN110347654B (en) * 2018-03-23 2024-06-18 北京京东尚科信息技术有限公司 Method and device for online cluster characteristics
CN109067817B (en) * 2018-05-31 2021-12-07 北京五八信息技术有限公司 Media content flow distribution method and device, electronic equipment and server
CN109143017B (en) * 2018-07-31 2021-03-30 成都天衡智造科技有限公司 Production test data processing method for semiconductor industry
CN109558404B (en) * 2018-10-19 2023-12-01 中国平安人寿保险股份有限公司 Data storage method, device, computer equipment and storage medium
CN111062751A (en) * 2019-12-12 2020-04-24 镇江市第一人民医院 Charging system and method based on automatic drug correlation consumable
CN111597148B (en) * 2020-05-14 2023-09-19 杭州果汁数据科技有限公司 Distributed metadata management method for distributed file system
CN118093686A (en) * 2022-11-21 2024-05-28 华为云计算技术有限公司 Data processing method based on data warehouse system and data warehouse system
CN119691708B (en) * 2025-02-25 2025-06-17 深圳市光大照明科技有限公司 Method and system for safety management of video and audio files

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101957861A (en) * 2010-10-18 2011-01-26 江苏大学 Novel metadata server cluster and metadata management method based on reconciliation statement
CN102063486A (en) * 2010-12-28 2011-05-18 东北大学 Multi-dimensional data management-oriented cloud computing query processing method
CN102411637A (en) * 2011-12-30 2012-04-11 创新科软件技术(深圳)有限公司 Metadata management method of distributed file system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7177883B2 (en) * 2004-07-15 2007-02-13 Hitachi, Ltd. Method and apparatus for hierarchical storage management based on data value and user interest

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SmartStore: A New Metadata Organization Paradigm with Semantic-Awareness for Next-Generation File Systems; Yu Hua et al.; SC09 Conference on High Performance Computing Networking, Storage and Analysis; 2009-11-14; pp. 2-6 *

Also Published As

Publication number Publication date
CN103218404A (en) 2013-07-24

Similar Documents

Publication Publication Date Title
CN103218404B (en) A kind of multi-dimensional metadata management method based on associate feature and system
CN107423368B (en) Spatio-temporal data indexing method in non-relational database
CN103455531B (en) A kind of parallel index method supporting high dimensional data to have inquiry partially in real time
CN112287182B (en) Graph data storage and processing method and device and computer storage medium
CN106484875B (en) MOLAP-based data processing method and device
CN105354151B (en) Cache management method and equipment
CN103229173B (en) Metadata management method and system
WO2018177060A1 (en) Query optimization method and related device
CN105488043A (en) Data query method and system based on Key-Value data blocks
US9229961B2 (en) Database management delete efficiency
CN105404634A (en) Key-Value data block based data management method and system
CN103678550B (en) Mass data real-time query method based on dynamic index structure
CN105608224A (en) Orthogonal multilateral Hash mapping indexing method for improving massive data inquiring performance
CN114840487A (en) Metadata management method and device for distributed file system
WO2020125630A1 (en) File reading
CN106326309A (en) Data query method and device
CN104216962A (en) Mass network management data indexing design method based on HBase
CN106471501A (en) Data query method, data object storage method and data system
Liang et al. Mid-model design used in model transition and data migration between relational databases and nosql databases
CN116431726A (en) Graph data processing method, device, equipment and computer storage medium
CN104346444A (en) Optimum site selection method based on road network reverse spatial keyword query
CN116074201A (en) Method and system for publishing vector map data as vector tile map service
CN105550332A (en) Dual-layer index structure based origin graph query method
Hua et al. Br-tree: A scalable prototype for supporting multiple queries of multidimensional data
CN106096065B (en) A kind of similar to search method and device of multimedia object

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant