WO2022126839A1 - Cloud-computing-based adaptive storage tiering system and method - Google Patents

Cloud-computing-based adaptive storage tiering system and method

Info

Publication number
WO2022126839A1
Authority
WO
WIPO (PCT)
Prior art keywords
storage
query
model
cluster
data
Prior art date
Application number
PCT/CN2021/074308
Other languages
English (en)
French (fr)
Inventor
占绍雄
李扬
韩卿
Original Assignee
跬云(上海)信息科技有限公司
Priority date
Filing date
Publication date
Application filed by 跬云(上海)信息科技有限公司 filed Critical 跬云(上海)信息科技有限公司
Priority to US17/615,551 priority Critical patent/US20240028604A1/en
Priority to EP21801818.2A priority patent/EP4040303A4/en
Publication of WO2022126839A1 publication Critical patent/WO2022126839A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/254Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/172Caching, prefetching or hoarding of files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/14Details of searching files based on file metadata
    • G06F16/148File search processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/1805Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815Journaling file systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • G06F16/1824Distributed file systems implemented using Network-attached Storage [NAS] architecture
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/185Hierarchical storage management [HSM] systems, e.g. file migration or policies thereof

Definitions

  • The invention relates to the technical field of data analysis, in particular to a cloud-computing-based adaptive storage tiering system and method.
  • In a cloud computing environment, the big data architecture is often based on the separation of storage and computing.
  • The advantage of separating storage and computing is that it greatly improves the cost-effectiveness of big data processing on the cloud.
  • Once an ETL workflow completes, the data has already been fully saved to cloud storage.
  • Users can then directly stop or delete unused machines, freeing computing resources and reducing cloud costs.
  • Computing resources can be scaled horizontally or dynamically reduced according to demand without affecting storage.
  • Under high concurrency, the cluster can be scaled horizontally to handle concurrent requests.
  • When concurrency drops, computing nodes are dynamically reduced to lower usage costs.
  • Alluxio is the main distributed file caching system that supports multiple clouds in a cloud computing environment.
  • The advantage of this distributed file caching system is that it supports multi-level storage and multiple public clouds, but its shortcomings are also obvious.
  • When many files need to be cached, it can only replace cached files based on access patterns, and the replacement algorithm is relatively simple and unsuitable for pre-computation scenarios; elastic scaling is not supported, so when more files need to be cached, automatic expansion is often impossible.
  • For cost reasons, clusters on the cloud are often stopped when idle and started when needed. At that point, the first queries issued through the OLAP engine cannot dynamically warm up the model index files, so file scans are slow for an initial period. These are the shortcomings of Alluxio as a distributed file caching solution.
  • the present invention conceives an adaptive storage layering scheme based on cloud computing, which can significantly improve the execution efficiency of the OLAP engine on the cloud.
  • the present disclosure provides an adaptive storage layering system and method based on cloud computing, and the technical solutions are as follows:
  • The present invention provides a cloud-computing-based adaptive storage tiering system, including a data node management module, a metadata management module, an adaptive storage tiering module and a pre-aggregation query routing module.
  • The data node management module collects the running status of the storage cluster and expands and contracts it horizontally according to predefined rules.
  • The metadata management module collects the models hit by queries of the OLAP query engine and the scanned file paths, and aggregates and sorts these data.
  • The adaptive storage tiering module performs tiered loading and preloading of files according to the ranking lists of model hit counts and file scan counts maintained by the metadata management module.
  • The pre-aggregation query routing module automatically switches the query storage address according to the cache status of models and indexes in the metadata database.
  • the storage cluster operation status data collected by the data node management module includes: the capacity of each node of the storage cluster, the used capacity of each node of the storage cluster, the cache files of each node of the storage cluster and their size.
  • the cluster of the data node management module includes a storage cluster and a computing cluster
  • the storage cluster is mainly used to store data
  • the computing cluster is mainly used to provide computing functions
  • both the storage cluster and the computing cluster have a cache function
  • The storage cluster includes a memory tier (MEM), a solid-state disk tier (SSD) and a hard disk tier (HDD).
  • The expansion and contraction rules of the data node management module are: when the cache capacity in the storage cluster is less than 20% of the capacity actually required by the computing cluster, the storage cluster is expanded horizontally; when data in the storage cluster expires or is no longer used, the data node configuration is optimized and the storage cluster is shrunk.
  • The metadata management module connects to the log system of the OLAP query engine, extracts from the log files the model hit by each query and the scanned file information, stores them in the metadata database, and updates the current ranking lists of model hit counts and file scan counts.
  • The tiered-loading strategy of the adaptive storage tiering module is: the files in the list are divided into three tiers, very hot, relatively hot and hot, corresponding to the memory tier (MEM), solid-state disk tier (SSD) and hard disk tier (HDD); the three tiers of data are loaded into the cache according to the preconfigured percentage of each tier and the storage size of each tier within the cluster.
  • The preloading strategy of the adaptive storage tiering module is: after each cluster restart, the very hot tier is preloaded into memory by command.
  • The strategy by which the pre-aggregation query routing module automatically switches the query storage address is: after a user query hits a model, the metadata management module is asked whether the current model is in the cache; if it is cached, the file-loading request is redirected to the cache; otherwise, the load goes directly to the source data.
  • the present invention provides a cloud computing-based adaptive storage tiering method, which is applied in the above-mentioned cloud computing-based adaptive storage tiering system, and includes the following steps:
  • Step 1: A query request submits a distributed computing task through the pre-aggregation query routing module;
  • Step 2: The pre-aggregation query routing module automatically switches the query storage address according to the cache status of models and indexes in the metadata database;
  • Step 3: The metadata management module collects the models hit by queries of the OLAP query engine and the scanned file paths, and aggregates and sorts these data;
  • Step 4: The adaptive storage tiering module performs tiered loading and preloading of files according to the ranking lists of model hit counts and file scan counts maintained by the metadata management module;
  • Step 5: The data node management module collects the running status of the storage cluster and performs horizontal expansion and contraction according to predefined rules; Step 5 is executed throughout Steps 2, 3 and 4;
  • Step 6: The metadata management module matches the query result with the query request of the pre-aggregation query routing module and then submits the query result.
  • the present invention provides a cloud computing-based adaptive storage tiering method, comprising:
  • obtaining the operation status of the storage cluster and performing horizontal expansion and contraction according to predefined rules include:
  • An adjustment scheme for nodes is generated based on the information and preset rules, and nodes are created and/or destroyed based on the adjustment scheme.
  • Extract the log files from the log system of the OLAP query engine, process them, obtain the model hit by each query and the scanned file information, and store them in the metadata database;
  • The ranking lists of model hit counts and file scan counts are updated.
  • the hierarchical loading and preloading of the file based on the ranking list of the number of model hits and the number of file scans includes:
  • the files in the ranking list are divided into three levels: very hot, relatively hot, and hot;
  • The three tiers of data are loaded into the memory-tier, solid-state-disk-tier and hard-disk-tier caches respectively, according to the preconfigured percentage of each tier and the storage size of each tier in the cluster.
  • the automatic switching of the query storage address according to the cache condition of the model and the index in the metadata database includes:
  • a cloud computing-based adaptive storage layering device includes:
  • the data node management module is used to obtain the operation status of the storage cluster and perform horizontal expansion and contraction according to predefined rules
  • the metadata management module is used to obtain the model hit by the query of the OLAP query engine and the scanned file path, and to aggregate and sort the model and the scanned file path;
  • an adaptive storage layering module used for layered loading and preloading of files based on the ranking list of the number of model hits and the number of file scans;
  • the pre-aggregation query routing module is used to automatically switch the query storage address according to the cache situation of the model and index in the metadata database.
  • the present invention provides a readable storage medium, where a computer program is stored in the readable storage medium, and the computer program is used to implement the above method when executed by a processor.
  • The present invention provides an electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores a computer program executable by the at least one processor to cause the at least one processor to perform the above method.
  • The present invention provides a cloud-computing-based adaptive storage tiering system and method, offers a network-transfer performance optimization for the OLAP engine when it loads pre-computed content, and greatly reduces the amount of data transmitted over the network between the object storage and the OLAP engine.
  • FIG. 1 is a schematic diagram of a cloud computing-based adaptive storage layering system provided by the present invention
  • FIG. 2 is a schematic diagram of a cloud computing-based adaptive storage layering method provided by the present invention.
  • FIG. 3 is a schematic flow chart of an overall solution according to a specific embodiment of the present invention.
  • Embodiment 1 of the present invention provides a cloud-computing-based adaptive storage tiering system, as shown in FIG. 1, including a data node management module, a metadata management module, an adaptive storage tiering module, and a pre-aggregation query routing module.
  • The data node management module collects the running status of the storage cluster and expands and contracts it horizontally according to predefined rules.
  • The metadata management module collects the models hit by queries of the OLAP query engine and the scanned file paths, and aggregates and sorts these data.
  • The adaptive storage tiering module performs tiered loading and preloading of files according to the ranking lists of model hit counts and file scan counts maintained by the metadata management module.
  • The pre-aggregation query routing module automatically switches the query storage address according to the cache status of models and indexes in the metadata database.
  • the data on the running status of the storage cluster collected by the data node management module includes: the capacity of each node of the storage cluster, the used capacity of each node of the storage cluster, and the cache files and sizes of each node of the storage cluster.
  • the cluster of the data node management module includes a storage cluster and a computing cluster.
  • the storage cluster is mainly used to store data
  • the computing cluster is mainly used to provide computing functions. Both the storage cluster and the computing cluster have the cache function.
  • The storage cluster includes a memory tier (MEM), a solid-state disk tier (SSD) and a hard disk tier (HDD).
  • The expansion and contraction rules of the data node management module are: when the cache capacity in the storage cluster is much smaller than the capacity actually required by the computing cluster, the storage cluster is expanded horizontally; when data in the storage cluster expires or is no longer used, the data node configuration is optimized and the storage cluster is shrunk.
  • The data node management module first collects information such as node capacity, node used capacity, and node cache files and their sizes; then, according to the predefined expansion and contraction rules, it expands the storage cluster horizontally when the cache capacity in the storage cluster is much smaller than the capacity required by the computing cluster.
  • When data in the storage cluster expires or is no longer used, it optimizes the data node configuration and shrinks the storage cluster, generating a set of node expansion or contraction plans.
  • The actual creation and destruction of nodes is then performed according to those expansion or contraction plans.
  • The invention provides capacity-based horizontal expansion and automatic shrinkage of the distributed cache, which greatly improves the throughput of the distributed cache system and reduces users' costs.
  • The metadata management module first connects to the log system of the OLAP query engine, then extracts from the log files the model hit by each query and the scanned file information and stores them in the metadata database, and then updates the current ranking lists of model hit counts and file scan counts. It maintains the current OLAP model hit count list and file scan count ranking list in preparation for the adaptive storage tiering module, and provides a network-transfer performance optimization for the OLAP engine when loading precomputed content, greatly reducing the amount of data transferred over the network between the object store and the OLAP engine.
  • The tiered-loading strategy of the adaptive storage tiering module is: the files in the list are divided into three tiers, very hot, relatively hot and hot, corresponding to the memory tier (MEM), solid-state disk tier (SSD) and hard disk tier (HDD) on the data nodes; the three tiers of data are loaded into the cache according to the preconfigured tier percentages and the storage size of each tier within the cluster.
  • The preloading strategy of the adaptive storage tiering module is: after each cluster restart, the very hot tier is preloaded into memory by command.
  • the present invention provides a warm-up solution during initialization of the distributed cache system, and combined with the characteristics of the OLAP query engine, the query performance is greatly improved, and the performance problem existing when the amount of query data is large is solved.
  • The strategy by which the pre-aggregation query routing module automatically switches the query storage address is: after a user query hits a model, the metadata management module is asked whether the current model is in the cache; if it is cached, the file-loading request is redirected to the cache; otherwise, the load goes directly to the source data. This supports dynamic switching between loading files from different data sources and ensures that each query responds as fast as possible.
  • the second embodiment of the present invention provides an adaptive storage layering method based on cloud computing, and the method is applied in the above-mentioned cloud computing-based adaptive storage layering system, as shown in FIG. 2 , including the following steps:
  • Step 1: A query request submits a distributed computing task through the pre-aggregation query routing module;
  • Step 2: The pre-aggregation query routing module automatically switches the query storage address according to the cache status of models and indexes in the metadata database;
  • After a user query hits a model, the metadata management module is first asked whether the current model is in the cache; if it is cached, the file-loading request is redirected to the cache; otherwise, the data is loaded directly from the source.
  • Step 3: The metadata management module collects the models hit by queries of the OLAP query engine and the scanned file paths, and aggregates and sorts these data;
  • Step 4: The adaptive storage tiering module performs tiered loading and preloading of files according to the ranking lists of model hit counts and file scan counts maintained by the metadata management module;
  • The files in the list are divided into three tiers, very hot, relatively hot and hot, corresponding to the memory tier (MEM), solid-state disk tier (SSD) and hard disk tier (HDD) on the data nodes; the three tiers of data are then loaded into the cache according to the preconfigured tier percentages and the storage size of each tier within the cluster.
  • For preloading, the very hot tier is preloaded into memory by command after each cluster restart.
  • Step 5: The data node management module collects the running status of the storage cluster and performs horizontal expansion and contraction according to predefined rules; Step 5 is executed throughout Steps 2, 3 and 4;
  • When the cache capacity in the storage cluster is less than 20% of the capacity required by the computing cluster, the storage cluster is expanded horizontally; when data in the storage cluster expires or is no longer used, the data node configuration is optimized and the storage cluster is shrunk.
  • Step 6: The metadata management module matches the query result with the query request of the pre-aggregation query routing module and then submits the query result.
  • a specific embodiment of the present invention provides an adaptive storage layering system based on cloud computing.
  • the overall solution flow is shown in Figure 3.
  • A query request is submitted from the client through the pre-aggregation query routing module as a distributed computing task.
  • When a user query hits a model, the metadata module is first asked whether the current model is in the cache. If it is cached, the file-loading request is redirected to the cache and the distributed execution engine obtains the data from the distributed file caching system; otherwise, the distributed execution engine obtains the data directly from the object storage. The data node management module collects the running status of the storage cluster and performs horizontal expansion and contraction according to predefined rules for node management.
  • The metadata management module collects the models hit by the OLAP query engine and the scanned file paths, aggregates and sorts these data, and maintains the current OLAP model hit count ranking list and file scan count ranking list in preparation for the adaptive storage tiering module. The adaptive storage tiering module then performs tiered loading and preloading of files according to these ranking lists. After the query result is matched with the request, the query result is submitted.
  • the present invention provides an adaptive storage layering method based on cloud computing, including:
  • the acquiring the operation status of the storage cluster and performing horizontal expansion and contraction according to a predefined rule includes:
  • An adjustment scheme for nodes is generated based on the information and preset rules, and nodes are created and/or destroyed based on the adjustment scheme.
  • This module collects information such as node capacity, node used capacity, and node cache files and their sizes, then generates a set of node expansion or contraction plans according to predefined rules, and finally performs the actual node creation and destruction operations according to those plans.
  • the main purpose of this module is to horizontally expand the storage cluster when the cache capacity in the storage cluster is much smaller than the capacity required by the actual computing cluster. At the same time, when the data in the storage cluster expires or is no longer used, optimize the configuration of data nodes and shrink the storage cluster.
  • the model hit by the query of the OLAP query engine and the scanned file path are obtained, and the model and the scanned file path are aggregated and sorted;
  • Extract the log files from the log system of the OLAP query engine, process them, obtain the model hit by each query and the scanned file information, and store them in the metadata database;
  • The ranking lists of model hit counts and file scan counts are updated.
  • This module is to maintain the current OLAP model hits ranking list and file scanning times ranking list to prepare for the adaptive storage tiering module.
  • the hierarchical loading and preloading of files based on the ranking list of the number of model hits and the number of file scans includes:
  • the files in the ranking list are divided into three levels: very hot, relatively hot, and hot;
  • The three tiers of data are loaded into the memory-tier, solid-state-disk-tier and hard-disk-tier caches respectively, according to the preconfigured percentage of each tier and the storage size of each tier in the cluster.
  • The files in the list are divided into three tiers: very hot, relatively hot and hot, corresponding to the memory tier (MEM), solid-state disk tier (SSD) and hard disk tier (HDD) on the data nodes; the three tiers of data are then loaded into the cache according to the preconfigured tier percentages and the storage size of each tier within the cluster.
  • For preloading, the very hot tier is preloaded into memory by command after each cluster restart. This module mainly solves the performance problems that arise when querying large amounts of data.
  • the automatic switching of the query storage address according to the cache condition of the model and the index in the metadata database includes:
  • When a user query hits a model, the metadata module is first asked whether the current model is in the cache. If it is cached, the file-loading request is redirected to the cached data cluster; otherwise, the data is loaded directly from the source.
  • The main purpose of this module is to support dynamic switching between loading files from different data sources.
  • the present invention provides an adaptive storage layering device based on cloud computing, comprising:
  • the data node management module is used to obtain the operation status of the storage cluster and perform horizontal expansion and contraction according to predefined rules
  • the metadata management module is used to obtain the model hit by the query of the OLAP query engine and the scanned file path, and to aggregate and sort the model and the scanned file path;
  • an adaptive storage layering module used for layered loading and preloading of files based on the ranking list of the number of model hits and the number of file scans;
  • the pre-aggregation query routing module is used to automatically switch the query storage address according to the cache situation of the model and index in the metadata database.
  • This solution is mainly composed of four modules, data node management module, metadata management module, adaptive storage layering module and pre-aggregation query routing module.
  • Data node management module: This module mainly collects the running status of the data cluster and expands and contracts it horizontally according to predefined rules. It first collects information such as node capacity, node used capacity, and node cache files and their sizes; then generates a set of node expansion or contraction plans according to predefined rules; and finally performs the actual node creation and destruction operations according to those plans.
  • the main purpose of this module is to horizontally expand the storage cluster when the cache capacity in the storage cluster is much smaller than the capacity required by the actual computing cluster. At the same time, when the data in the storage cluster expires or is no longer used, optimize the configuration of data nodes and shrink the storage cluster.
  • Metadata management module: This module mainly collects the models hit by the OLAP query engine and the scanned file paths, and aggregates and sorts these data. It first connects to the log system of the OLAP query engine, then extracts from the log files the model hit by each query and the scanned file information and stores them in the metadata database, and finally updates the ranking lists of model hit counts and file scan counts. The main purpose of this module is to maintain the current OLAP model hit count ranking list and file scan count ranking list in preparation for the adaptive storage tiering module.
  • Adaptive storage layering module This module mainly performs layered loading and preloading of files according to the number of model hits maintained by the metadata management module and the ranking list of the number of file scans.
  • For tiered loading, the files in the list are first divided into three tiers: very hot, relatively hot and hot, corresponding to the memory tier (MEM), solid-state disk tier (SSD) and hard disk tier (HDD) on the data nodes; the three tiers of data are then loaded into the cache according to the preconfigured tier percentages and the storage size of each tier within the cluster.
  • For preloading, the very hot tier is preloaded into memory by command after each cluster restart. This module mainly solves the performance problems that arise when querying large amounts of data.
  • Pre-aggregation query routing module: This module automatically switches the query storage address according to the cache status of models and indexes in the metadata database, ensuring that each query responds as fast as possible.
  • When a user query hits a model, the module first asks the metadata module whether the current model is in the cache. If it is cached, it redirects the file-loading request to the cached data cluster; otherwise, the data is loaded directly from the source.
  • The main purpose of this module is to support dynamic switching between loading files from different data sources.
  • A specific embodiment of the present invention compares, in a typical reporting system, the query performance of the prior art (using Alluxio as the distributed cache in a cloud computing environment) with that of the adaptive storage tiering solution of the present invention.
  • The comparison shows that the latter outperforms the former.
  • Queries are generally 2-5 times faster.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a cloud-computing-based adaptive storage tiering system and method, comprising a data node management module, a metadata management module, an adaptive storage tiering module and a pre-aggregation query routing module, which expand and contract node capacity according to predefined rules, aggregate and sort the models hit by collected queries and the scanned file paths, and perform tiered loading and preloading of files. Based on the present invention, an efficient OLAP query execution engine can be built to handle the complex OLAP queries of all kinds of reporting systems, significantly improving the execution efficiency of OLAP engines on the cloud.

Description

Cloud-computing-based adaptive storage tiering system and method
Technical Field
The present invention relates to the technical field of data analysis, and in particular to a cloud-computing-based adaptive storage tiering system and method.
Background
In a cloud computing environment, big data architectures are often based on the separation of storage and computing. The benefit of this separation is that it greatly improves the cost-effectiveness of big data processing on the cloud: once an ETL workflow finishes, the data has been fully persisted to cloud storage, so users can simply stop or delete unused machines, freeing computing resources and reducing cloud costs. Likewise, with storage and computing separated, computing resources can be scaled out horizontally or reduced dynamically on demand without affecting storage; when big data concurrency is high, the cluster is scaled horizontally to handle the concurrent requests, and when concurrency drops, computing nodes are reduced dynamically to lower usage costs. This architecture also has drawbacks, however: with computing and storage separated, data travels between them over the network, and transfer speed depends on bandwidth. Although cloud infrastructure providers keep upgrading their network hardware, compared with local storage, the bandwidth-limited link between storage and computing in the cloud usually becomes the bottleneck of data analysis. To speed up access to cloud storage, one can, weighing cost, choose machines with higher bandwidth to mitigate the performance loss of network transfer; alternatively, hot data can be cached in the computing cluster as much as possible to achieve fast responses to queries on hot data.
At present, Alluxio is the main distributed file caching system that supports multiple clouds in a cloud computing environment. Its strengths are multi-level storage and support for several public clouds, but its weaknesses are also evident. When many files need to be cached, it can only replace cached files according to access patterns, and the replacement algorithm is relatively simple and ill-suited to pre-computation scenarios; it does not support elastic scaling, so when more files need caching it often cannot expand automatically. Moreover, for cost reasons clusters on the cloud are often stopped when idle and started on demand; at that point, the first queries issued through the OLAP engine cannot dynamically warm up the model index files, so file scans are slow for an initial period. These are the shortcomings of Alluxio as a distributed file caching solution.
Because of these defects in current OLAP engine and Alluxio integration schemes, it is difficult to support sub-second query responses under high concurrency. The present invention therefore conceives a cloud-computing-based adaptive storage tiering scheme that can significantly improve the execution efficiency of OLAP engines on the cloud.
Summary of the Invention
有鉴于此,本公开提供一种基于云计算的自适应存储分层系统及方法,技术方案如下:
一方面,本发明提供了一种基于云计算的自适应存储分层系统,包括数据节点管理模块、元数据管理模块、自适应存储分层模块以及预聚合查询路由模块,数据节点管理模块用于收集存储集群运行情况,按照预定义的规则进行水平扩展和收缩,元数据管理模块用于收集OLAP查询引擎的查询命中的模型以及扫描的文件路径,并对这些数据进行聚合和排序,自适应存储分层模块根据元数据管理模块维护的模型命中次数以及文件扫描次数的排行列表来对文件进行分层加载以及预加载,预聚合查询路由模块根据模型以及索引在元数据库中的缓存情况自动切换查询存储地址。
Further, the storage cluster status data collected by the data node management module includes: the capacity of each node of the storage cluster, the used capacity of each node, and the cached files on each node together with their sizes.

Further, the clusters managed by the data node management module include a storage cluster and a compute cluster. The storage cluster is mainly used to store data, the compute cluster mainly provides computing capability, and both clusters have caching capability.

Further, the storage cluster includes a memory tier (MEM), a solid-state drive tier (SSD), and a hard disk drive tier (HDD).

Further, the expansion and shrink rule of the data node management module is: when the cache capacity within the storage cluster is less than 20% of the capacity actually required by the compute cluster, the storage cluster is expanded horizontally; when data in the storage cluster has expired or is no longer used, the data node configuration is optimized and the storage cluster is shrunk.
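The expansion and shrink rule above can be sketched as a small decision function. The 20% threshold and the data-expiry condition are taken from the rule as stated; the function name, units, and return values are illustrative assumptions, not part of the invention:

```python
def plan_scaling(cache_capacity_gb, required_capacity_gb, has_expired_data,
                 threshold=0.20):
    """Return a scaling action for the storage cluster.

    Expands horizontally when the cluster's cache capacity falls below
    20% of the capacity the compute cluster actually needs; shrinks
    (after re-tuning the data node configuration) when cached data has
    expired or is no longer used.
    """
    if cache_capacity_gb < threshold * required_capacity_gb:
        return "expand"
    if has_expired_data:
        return "shrink"
    return "hold"
```

The returned action would then drive the actual node creation or destruction plan.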
Further, the metadata management module connects to the logging system of the OLAP query engine, extracts from the log files the models hit by queries and the information on the files they scanned, stores this in the metadata database, and updates the ranked hit-count lists of the current models and their scanned files.

Further, the tiered loading strategy of the adaptive storage tiering module is: the files in the list are divided into three levels, very hot, relatively hot, and hot, corresponding to the memory tier (MEM), SSD tier, and HDD tier on the data nodes, and the three levels of data are loaded into the cache according to the preconfigured percentage of each level and the storage size of each tier in the cluster.

Further, the preloading strategy of the adaptive storage tiering module is: after each cluster restart, the very hot level is preloaded into memory by command.
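The tiered loading and preloading strategies above can be sketched as follows. The three-level split by preconfigured percentages and the restart-time preload of the very hot level come from the text; the function names, the particular percentage defaults, and the `load_cmd` callback are illustrative assumptions:

```python
def assign_tiers(ranked_files, pct_mem=0.1, pct_ssd=0.3):
    """Split a hotness-ranked file list into three tiers.

    ranked_files is ordered from most to least frequently scanned.
    The hottest pct_mem fraction goes to the memory tier (MEM), the
    next pct_ssd fraction to the SSD tier, and the rest to HDD.
    """
    n = len(ranked_files)
    n_mem = int(n * pct_mem)
    n_ssd = int(n * pct_ssd)
    return {
        "MEM": ranked_files[:n_mem],
        "SSD": ranked_files[n_mem:n_mem + n_ssd],
        "HDD": ranked_files[n_mem + n_ssd:],
    }

def preload_after_restart(tiers, load_cmd):
    """On cluster restart, preload only the very-hot (MEM) tier."""
    for path in tiers["MEM"]:
        load_cmd(path)
```

In practice the percentages would be chosen so that each tier fits the corresponding storage size configured in the cluster.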
Further, the strategy by which the pre-aggregation query routing module automatically switches the query storage address is: when a user query hits a model, the module asks the metadata management module whether the current model is in the cache; if it is cached, the file-load request is redirected to the cache, otherwise the data is loaded directly from the source.

In another aspect, the present invention provides a cloud-computing-based adaptive storage tiering method applied in the above cloud-computing-based adaptive storage tiering system, comprising the following steps:

Step 1: a query request submits a distributed computing task through the pre-aggregation query routing module;

Step 2: the pre-aggregation query routing module automatically switches the query storage address according to the caching status of the model and its index in the metadata database;

Step 3: the metadata management module collects the models hit by queries of the OLAP query engine and the scanned file paths, and aggregates and sorts these data;

Step 4: the adaptive storage tiering module loads files in tiers and preloads them according to the ranked lists of model hit counts and file scan counts maintained by the metadata management module;

Step 5: the data node management module collects the operating status of the storage cluster and expands or shrinks it horizontally according to predefined rules; step 5 is executed during the execution of steps 2, 3, and 4;

Step 6: the metadata management module matches the query result with the query request of the pre-aggregation query routing module and submits the query result.
In another aspect, the present invention provides a cloud-computing-based adaptive storage tiering method, comprising:

obtaining the operating status of the storage cluster, and expanding or shrinking it horizontally according to predefined rules;

obtaining the models hit by queries of the OLAP query engine and the scanned file paths, and aggregating and sorting the models and the scanned file paths;

loading files in tiers and preloading them based on the ranked lists of model hit counts and file scan counts;

automatically switching the query storage address according to the caching status of models and indexes in the metadata database.

Further, obtaining the operating status of the storage cluster and expanding or shrinking it horizontally according to predefined rules includes:

obtaining any one or more of the capacity of each node in the storage cluster, the used capacity of each node, the cached files on each node, and the sizes of those cached files;

generating a node adjustment plan based on this information and the preset rules, and creating and/or destroying nodes based on the adjustment plan.

Further, obtaining the models hit by queries of the OLAP query engine and the scanned file paths, and aggregating and sorting them, includes:

extracting the log files from the logging system of the OLAP query engine, processing the log files, obtaining the models hit by queries and the information on their scanned files, and storing this in the metadata database;

updating the ranked hit-count lists of the current models and their scanned files based on the models hit at the current moment and the information on their scanned files.

Further, loading files in tiers and preloading them based on the ranked lists of model hit counts and file scan counts includes:

dividing the files in the ranked lists into three levels, very hot, relatively hot, and hot, based on preset rules;

storing the very hot, relatively hot, and hot levels in the memory tier, SSD tier, and HDD tier on the corresponding data nodes;

loading the data stored in the memory, SSD, and HDD tiers into the cache according to the preconfigured percentage of each level and the storage size of each tier in the cluster.

Further, automatically switching the query storage address according to the caching status of models and indexes in the metadata database includes:

when a user query hits a model, determining whether the hit model is currently in the cache;

if it is cached, directing the file-load request to the cache data cluster to obtain the corresponding model;

if it is not cached, loading directly from the source data.
In another aspect, the present invention provides a cloud-computing-based adaptive storage tiering apparatus, comprising:

a data node management module for obtaining the operating status of the storage cluster and expanding or shrinking it horizontally according to predefined rules;

a metadata management module for obtaining the models hit by queries of the OLAP query engine and the scanned file paths, and aggregating and sorting them;

an adaptive storage tiering module for loading files in tiers and preloading them based on the ranked lists of model hit counts and file scan counts;

a pre-aggregation query routing module for automatically switching the query storage address according to the caching status of models and indexes in the metadata database.

In another aspect, the present invention provides a readable storage medium storing a computer program which, when executed by a processor, implements the above method.

In another aspect, the present invention provides an electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores a computer program executable by the at least one processor, the computer program being executed by the at least one processor to cause the at least one processor to perform the above method.

The present invention provides a cloud-computing-based adaptive storage tiering system and method. It provides a performance optimization scheme for network transfer when the OLAP engine loads pre-computed content, greatly reducing the volume of data transferred over the network between object storage and the OLAP engine; it provides capacity-based horizontal expansion and automatic shrinking of the distributed cache, greatly improving the throughput of the distributed cache system and reducing the user's cost; and it provides a warm-up scheme for the distributed cache system at initialization which, combined with the characteristics of the OLAP query engine, greatly improves query performance.
Brief Description of the Drawings

The drawings forming a part of this application are provided for a further understanding of the application, making its other features, objects, and advantages more apparent. The drawings of the illustrative embodiments of this application and their descriptions are used to explain the application and do not constitute an undue limitation of it. In the drawings:

Fig. 1 is a schematic diagram of a cloud-computing-based adaptive storage tiering system provided by the present invention;

Fig. 2 is a schematic diagram of a cloud-computing-based adaptive storage tiering method provided by the present invention;

Fig. 3 is a schematic flow diagram of the overall solution of a specific embodiment of the present invention.
Detailed Description

To enable those skilled in the art to better understand the solution of this application, the technical solutions in the embodiments of this application will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only a part of the embodiments of this application, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in this application without creative effort shall fall within the scope of protection of this application.

It should be noted that the terms "first", "second", etc. in the specification, claims, and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of this application described here can be implemented. Furthermore, the terms "comprising" and "having", and any variants thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units need not be limited to the steps or units clearly listed, but may include other steps or units not clearly listed or inherent to such processes, methods, products, or devices.

In this application, the orientations or positional relationships indicated by the terms "upper", "lower", "left", "right", "front", "rear", "top", "bottom", "inner", "outer", "middle", "vertical", "horizontal", "transverse", "longitudinal", etc. are based on the orientations or positional relationships shown in the drawings. These terms serve mainly to better describe this application and its embodiments, and are not used to limit the indicated devices, elements, or components to a specific orientation or to construction and operation in a specific orientation.

Moreover, besides indicating orientation or position, some of the above terms may also carry other meanings; for example, "upper" may in some cases indicate a relationship of attachment or connection. For those of ordinary skill in the art, the specific meanings of these terms in this application can be understood according to the specific situation.

In addition, the term "plurality" shall mean two or more.

It should be noted that, in case of no conflict, the embodiments in this application and the features in the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings and in conjunction with the embodiments.
Embodiment 1

Embodiment 1 of the present invention provides a cloud-computing-based adaptive storage tiering system which, as shown in Fig. 1, comprises a data node management module, a metadata management module, an adaptive storage tiering module, and a pre-aggregation query routing module. The data node management module collects the operating status of the storage cluster and expands or shrinks it horizontally according to predefined rules. The metadata management module collects the models hit by queries of the OLAP query engine and the scanned file paths, and aggregates and sorts these data. The adaptive storage tiering module loads files in tiers and preloads them according to the ranked lists of model hit counts and file scan counts maintained by the metadata management module. The pre-aggregation query routing module automatically switches the query storage address according to the caching status of models and indexes in the metadata database.

The storage cluster status data collected by the data node management module includes: the capacity of each node of the storage cluster, the used capacity of each node, and the cached files on each node together with their sizes.

The clusters managed by the data node management module include a storage cluster and a compute cluster. The storage cluster is mainly used to store data, the compute cluster mainly provides computing capability, and both clusters have caching capability.

The storage cluster includes a memory tier (MEM), a solid-state drive tier (SSD), and a hard disk drive tier (HDD).

The expansion and shrink rule of the data node management module is: when the cache capacity within the storage cluster is far smaller than the capacity actually required by the compute cluster, the storage cluster is expanded horizontally; when data in the storage cluster has expired or is no longer used, the data node configuration is optimized and the storage cluster is shrunk.

In a specific implementation, the data node management module first collects information such as node capacity, node used capacity, and the cached files on each node and their sizes. Then, according to the predefined expansion and shrink rules, it expands the storage cluster horizontally when the cluster's cache capacity is far smaller than the capacity actually required by the compute cluster and, when data in the storage cluster has expired or is no longer used, optimizes the data node configuration and shrinks the storage cluster, generating a set of node expansion or shrink plans. Finally, it performs the actual node creation and destruction operations according to those plans. The present invention thus provides capacity-based horizontal expansion and automatic shrinking of the distributed cache, greatly improving the throughput of the distributed cache system and reducing the user's cost.

In a specific implementation, the metadata management module first connects to the logging system of the OLAP query engine, then extracts from the log files the models hit by queries and the information on their scanned files and stores this in the metadata database, and afterwards updates the ranked hit-count lists of the current models and their scanned files, maintaining the current ranked list of OLAP model hit counts and the ranked list of file scan counts in preparation for the adaptive storage tiering module. This provides a performance optimization scheme for network transfer when the OLAP engine loads pre-computed content, greatly reducing the volume of data transferred over the network between object storage and the OLAP engine.
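The log-driven ranking maintained by the metadata management module can be sketched as follows. Aggregating model hits and file scans into descending-count lists comes from the text; the one-line-per-query log format (`model=<name> files=<path1>,<path2>`) and the function name are illustrative assumptions, since the actual log layout is engine-specific:

```python
from collections import Counter

def update_rankings(log_lines, model_hits=None, file_scans=None):
    """Aggregate query-log lines into ranked hit lists.

    Each line is assumed (for illustration) to look like
    "model=<name> files=<path1>,<path2>".  Returns the model-hit
    ranking and the file-scan ranking, sorted by descending count.
    """
    model_hits = model_hits or Counter()
    file_scans = file_scans or Counter()
    for line in log_lines:
        fields = dict(part.split("=", 1) for part in line.split())
        model_hits[fields["model"]] += 1
        for path in fields["files"].split(","):
            file_scans[path] += 1
    return model_hits.most_common(), file_scans.most_common()
```

The two ranked lists are exactly what the adaptive storage tiering module consumes when assigning files to tiers.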
The tiered loading strategy of the adaptive storage tiering module is: the files in the list are divided into three levels, very hot, relatively hot, and hot, corresponding to the memory tier (MEM), SSD tier, and HDD tier on the data nodes, and the three levels of data are loaded into the cache according to the preconfigured percentage of each level and the storage size of each tier in the cluster. The preloading strategy of the adaptive storage tiering module is: after each cluster restart, the very hot level is preloaded into memory by command. The present invention thus provides a warm-up scheme for the distributed cache system at initialization which, combined with the characteristics of the OLAP query engine, greatly improves query performance and solves the performance problems that arise when the queried data volume is large.

Further, the strategy by which the pre-aggregation query routing module automatically switches the query storage address is: when a user query hits a model, the module asks the metadata management module whether the current model is in the cache; if it is cached, the file-load request is redirected to the cache, otherwise the data is loaded directly from the source. This supports dynamically switching file loading among different data sources and ensures that each query responds as fast as possible.
Embodiment 2

Embodiment 2 of the present invention provides a cloud-computing-based adaptive storage tiering method applied in the above cloud-computing-based adaptive storage tiering system. As shown in Fig. 2, it comprises the following steps:

Step 1: a query request submits a distributed computing task through the pre-aggregation query routing module;

Step 2: the pre-aggregation query routing module automatically switches the query storage address according to the caching status of the model and its index in the metadata database;

In a specific implementation, when a user query hits a model, the module first asks the metadata management module whether the current model is in the cache; if it is cached, the file-load request is redirected to the cache, otherwise the data is loaded directly from the source.

Step 3: the metadata management module collects the models hit by queries of the OLAP query engine and the scanned file paths, and aggregates and sorts these data;

In a specific implementation, the module first connects to the logging system of the OLAP query engine, then extracts from its log files the models hit by queries and the information on their scanned files and stores this in the metadata database, and finally updates the ranked hit-count lists of the current models and their scanned files.

Step 4: the adaptive storage tiering module loads files in tiers and preloads them according to the ranked lists of model hit counts and file scan counts maintained by the metadata management module;

For tiered loading, the files in the list are first divided into three levels, very hot, relatively hot, and hot, corresponding to the memory tier (MEM), SSD tier, and HDD tier on the data nodes; the three levels of data are then loaded into the cache according to the preconfigured percentage of each level and the storage size of each tier in the cluster. For preloading, after each cluster restart the very hot level is preloaded into memory by command.

Step 5: the data node management module collects the operating status of the storage cluster and expands or shrinks it horizontally according to predefined rules; step 5 is executed during the execution of steps 2, 3, and 4;

When the cache capacity within the storage cluster is less than 20% of the capacity actually required by the compute cluster, the storage cluster is expanded horizontally; when data in the storage cluster has expired or is no longer used, the data node configuration is optimized and the storage cluster is shrunk.

Step 6: the metadata management module matches the query result with the query request of the pre-aggregation query routing module and submits the query result.
Embodiment 3

A specific embodiment of the present invention provides a cloud-computing-based adaptive storage tiering system whose overall flow is shown in Fig. 3. In a specific implementation, a query request is submitted from the client and a distributed computing task is submitted through the pre-aggregation query routing module. When the user query hits a model, the router first asks the metadata module whether the current model is in the cache. If it is cached, the file-load request is redirected to the cache and the distributed execution engine fetches the data from the distributed file caching system; otherwise the distributed execution engine fetches the data directly from object storage. The data node management module collects the operating status of the storage cluster and manages nodes by expanding or shrinking horizontally according to predefined rules. The metadata management module collects the models hit by queries of the OLAP query engine and the scanned file paths, aggregates and sorts these data, and maintains the current ranked list of OLAP model hit counts and the ranked list of file scan counts in preparation for the adaptive storage tiering module. The adaptive storage tiering module loads files in tiers and preloads them according to those ranked lists. The metadata management module matches the query result with the query request of the pre-aggregation query routing module and submits the query result.
Embodiment 4

The present invention provides a cloud-computing-based adaptive storage tiering method, comprising:

obtaining the operating status of the storage cluster, and expanding or shrinking it horizontally according to predefined rules;

obtaining the models hit by queries of the OLAP query engine and the scanned file paths, and aggregating and sorting the models and the scanned file paths;

loading files in tiers and preloading them based on the ranked lists of model hit counts and file scan counts;

automatically switching the query storage address according to the caching status of models and indexes in the metadata database.

In one embodiment, obtaining the operating status of the storage cluster and expanding or shrinking it horizontally according to predefined rules includes:

obtaining any one or more of the capacity of each node in the storage cluster, the used capacity of each node, the cached files on each node, and the sizes of those cached files;

generating a node adjustment plan based on this information and the preset rules, and creating and/or destroying nodes based on the adjustment plan.

First, information such as node capacity, node used capacity, and the cached files on each node and their sizes is collected; a set of node expansion or shrink plans is then generated according to the predefined rules; finally, the actual node creation and destruction operations are performed according to those plans. The main purpose of this module is to expand the storage cluster horizontally when its cache capacity is far smaller than the capacity actually required by the compute cluster, and to optimize the data node configuration and shrink the storage cluster when data in it has expired or is no longer used.

In one embodiment, obtaining the models hit by queries of the OLAP query engine and the scanned file paths, and aggregating and sorting them, includes:

extracting the log files from the logging system of the OLAP query engine, processing the log files, obtaining the models hit by queries and the information on their scanned files, and storing this in the metadata database;

updating the ranked hit-count lists of the current models and their scanned files based on the models hit at the current moment and the information on their scanned files.

First, the module connects to the logging system of the OLAP query engine; it then extracts from the log files the models hit by queries and the information on their scanned files and stores this in the metadata database; finally, it updates the ranked hit-count lists of the current models and their scanned files. The main purpose of this module is to maintain the current ranked list of OLAP model hit counts and the ranked list of file scan counts in preparation for the adaptive storage tiering module.

In one embodiment, loading files in tiers and preloading them based on the ranked lists of model hit counts and file scan counts includes:

dividing the files in the ranked lists into three levels, very hot, relatively hot, and hot, based on preset rules;

storing the very hot, relatively hot, and hot levels in the memory tier, SSD tier, and HDD tier on the corresponding data nodes;

loading the data stored in the memory, SSD, and HDD tiers into the cache according to the preconfigured percentage of each level and the storage size of each tier in the cluster.

For tiered loading, the files in the list are first divided into three levels, very hot, relatively hot, and hot, corresponding to the memory tier (MEM), SSD tier, and HDD tier on the data nodes; the three levels of data are then loaded into the cache according to the preconfigured percentage of each level and the storage size of each tier in the cluster. For preloading, after each cluster restart the very hot level is preloaded into memory by command. This module mainly solves the performance problems that arise when the queried data volume is large.

In one embodiment, automatically switching the query storage address according to the caching status of models and indexes in the metadata database includes:

when a user query hits a model, determining whether the hit model is currently in the cache;

if it is cached, directing the file-load request to the cache data cluster to obtain the corresponding model;

if it is not cached, loading directly from the source data.

When a user query hits a model, the module first asks the metadata module whether the current model is in the cache; if it is cached, the file-load request is redirected to the cache data cluster, otherwise the data is loaded directly from the source. The main purpose of this module is to support dynamically switching file loading among different data sources.
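The routing decision just described can be sketched as a small function. The cache-check-then-redirect logic comes from the text; the metadata store shape (a mapping from model name to cached flag) and the URI parameters are illustrative assumptions:

```python
def route_query(model, metadata_cache, cache_cluster_uri, source_uri):
    """Pick the storage address for a file-load request.

    If the hit model is recorded as cached in the metadata store,
    the request is redirected to the cache data cluster; otherwise
    it falls back to loading from the source (object) storage.
    """
    if metadata_cache.get(model, False):
        return cache_cluster_uri
    return source_uri
```

A cache hit thus avoids the bandwidth-limited path to object storage entirely.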
Embodiment 5

The present invention provides a cloud-computing-based adaptive storage tiering apparatus, comprising:

a data node management module for obtaining the operating status of the storage cluster and expanding or shrinking it horizontally according to predefined rules;

a metadata management module for obtaining the models hit by queries of the OLAP query engine and the scanned file paths, and aggregating and sorting them;

an adaptive storage tiering module for loading files in tiers and preloading them based on the ranked lists of model hit counts and file scan counts;

a pre-aggregation query routing module for automatically switching the query storage address according to the caching status of models and indexes in the metadata database.

The solution consists mainly of four modules: the data node management module, the metadata management module, the adaptive storage tiering module, and the pre-aggregation query routing module. The four modules are described in detail below.

Data node management module: this module mainly collects the operating status of the data cluster and expands or shrinks it horizontally according to predefined rules. First, information such as node capacity, node used capacity, and the cached files on each node and their sizes is collected; a set of node expansion or shrink plans is then generated according to the predefined rules; finally, the actual node creation and destruction operations are performed according to those plans. Its main purpose is to expand the storage cluster horizontally when the cluster's cache capacity is far smaller than the capacity actually required by the compute cluster, and to optimize the data node configuration and shrink the storage cluster when data in it has expired or is no longer used.

Metadata management module: this module mainly collects the models hit by queries of the OLAP query engine and the scanned file paths, and aggregates and sorts these data. First, it connects to the logging system of the OLAP query engine; it then extracts from the log files the models hit by queries and the information on their scanned files and stores this in the metadata database; finally, it updates the ranked hit-count lists of the current models and their scanned files. Its main purpose is to maintain the current ranked list of OLAP model hit counts and the ranked list of file scan counts in preparation for the adaptive storage tiering module.

Adaptive storage tiering module: this module mainly loads files in tiers and preloads them according to the ranked lists of model hit counts and file scan counts maintained by the metadata management module. For tiered loading, the files in the list are first divided into three levels, very hot, relatively hot, and hot, corresponding to the memory tier (MEM), SSD tier, and HDD tier on the data nodes; the three levels of data are then loaded into the cache according to the preconfigured percentage of each level and the storage size of each tier in the cluster. For preloading, after each cluster restart the very hot level is preloaded into memory by command. This module mainly solves the performance problems that arise when the queried data volume is large.

Pre-aggregation query routing module: this module automatically switches the query storage address according to the caching status of models and indexes in the metadata database, ensuring that every query responds as fast as possible. When a user query hits a model, it first asks the metadata module whether the current model is in the cache; if it is cached, the file-load request is redirected to the cache data cluster, otherwise the data is loaded directly from the source. Its main purpose is to support dynamically switching file loading among different data sources.

In a specific embodiment of the present invention, the query performance of the prior art, using Alluxio as the distributed cache solution in a cloud computing environment, and that of the adaptive storage tiering solution of the present invention were tested on a common reporting system; the comparison shows that the latter is generally 2-5 times faster than the former.

The above are only preferred embodiments of this application and are not intended to limit it; for those skilled in the art, this application may be subject to various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principles of this application shall fall within its scope of protection.

Claims (18)

  1. A cloud-computing-based adaptive storage tiering system, characterized by comprising a data node management module, a metadata management module, an adaptive storage tiering module, and a pre-aggregation query routing module, wherein the data node management module is configured to collect the operating status of the storage cluster and expand or shrink it horizontally according to predefined rules; the metadata management module is configured to collect the models hit by queries of the OLAP query engine and the scanned file paths, and to aggregate and sort these data; the adaptive storage tiering module loads files in tiers and preloads them according to the ranked lists of model hit counts and file scan counts maintained by the metadata management module; and the pre-aggregation query routing module automatically switches the query storage address according to the caching status of models and indexes in the metadata database.
  2. The cloud-computing-based adaptive storage tiering system according to claim 1, characterized in that the storage cluster status data collected by the data node management module includes: the capacity of each node of the storage cluster, the used capacity of each node, and the cached files on each node together with their sizes.
  3. The cloud-computing-based adaptive storage tiering system according to claim 1, characterized in that the clusters managed by the data node management module include a storage cluster and a compute cluster, the storage cluster being mainly used to store data and the compute cluster mainly to provide computing capability, both clusters having caching capability.
  4. The cloud-computing-based adaptive storage tiering system according to claim 3, characterized in that the storage cluster includes a memory tier (MEM), a solid-state drive tier (SSD), and a hard disk drive tier (HDD).
  5. The cloud-computing-based adaptive storage tiering system according to claim 1, characterized in that the expansion and shrink rule of the data node management module is: when the cache capacity within the storage cluster is less than 20% of the capacity actually required by the compute cluster, the storage cluster is expanded horizontally; when data in the storage cluster has expired or is no longer used, the data node configuration is optimized and the storage cluster is shrunk.
  6. The cloud-computing-based adaptive storage tiering system according to claim 1, characterized in that the metadata management module connects to the logging system of the OLAP query engine, extracts from the log files the models hit by queries and the information on their scanned files, stores this in the metadata database, and updates the ranked hit-count lists of the current models and their scanned files.
  7. The cloud-computing-based adaptive storage tiering system according to claim 1, characterized in that the tiered loading strategy of the adaptive storage tiering module is: the files in the list are divided into three levels, very hot, relatively hot, and hot, corresponding to the memory tier (MEM), SSD tier, and HDD tier on the data nodes, and the three levels of data are loaded into the cache according to the preconfigured percentage of each level and the storage size of each tier in the cluster.
  8. The cloud-computing-based adaptive storage tiering system according to claim 1, characterized in that the preloading strategy of the adaptive storage tiering module is: after each cluster restart, the very hot level is preloaded into memory by command.
  9. The cloud-computing-based adaptive storage tiering system according to claim 1, characterized in that the strategy by which the pre-aggregation query routing module automatically switches the query storage address is: when a user query hits a model, the module asks the metadata management module whether the current model is in the cache; if it is cached, the file-load request is redirected to the cache, otherwise the data is loaded directly from the source.
  10. A cloud-computing-based adaptive storage tiering method, characterized in that the method is applied in the cloud-computing-based adaptive storage tiering system according to any one of claims 1-7, and comprises the following steps:
    Step 1: a query request submits a distributed computing task through the pre-aggregation query routing module;
    Step 2: the pre-aggregation query routing module automatically switches the query storage address according to the caching status of the model and its index in the metadata database;
    Step 3: the metadata management module collects the models hit by queries of the OLAP query engine and the scanned file paths, and aggregates and sorts these data;
    Step 4: the adaptive storage tiering module loads files in tiers and preloads them according to the ranked lists of model hit counts and file scan counts maintained by the metadata management module;
    Step 5: the data node management module collects the operating status of the storage cluster and expands or shrinks it horizontally according to predefined rules; step 5 is executed during the execution of steps 2, 3, and 4;
    Step 6: the metadata management module matches the query result with the query request of the pre-aggregation query routing module and submits the query result.
  11. A cloud-computing-based adaptive storage tiering method, characterized by comprising:
    obtaining the operating status of the storage cluster, and expanding or shrinking it horizontally according to predefined rules;
    obtaining the models hit by queries of the OLAP query engine and the scanned file paths, and aggregating and sorting the models and the scanned file paths;
    loading files in tiers and preloading them based on the ranked lists of model hit counts and file scan counts;
    automatically switching the query storage address according to the caching status of models and indexes in the metadata database.
  12. The cloud-computing-based adaptive storage tiering method according to claim 11, characterized in that
    obtaining the operating status of the storage cluster and expanding or shrinking it horizontally according to predefined rules comprises:
    obtaining any one or more of the capacity of each node in the storage cluster, the used capacity of each node, the cached files on each node, and the sizes of those cached files;
    generating a node adjustment plan based on said information and preset rules, and creating and/or destroying nodes based on the adjustment plan.
  13. The cloud-computing-based adaptive storage tiering method according to claim 11, characterized in that
    obtaining the models hit by queries of the OLAP query engine and the scanned file paths, and aggregating and sorting them, comprises:
    extracting the log files from the logging system of the OLAP query engine, processing the log files, obtaining the models hit by queries and the information on their scanned files, and storing this in the metadata database;
    updating the ranked hit-count lists of the current models and their scanned files based on the models hit at the current moment and the information on their scanned files.
  14. The cloud-computing-based adaptive storage tiering method according to claim 11, characterized in that
    loading files in tiers and preloading them based on the ranked lists of model hit counts and file scan counts comprises:
    dividing the files in the ranked lists into three levels, very hot, relatively hot, and hot, based on preset rules;
    storing the very hot, relatively hot, and hot levels in the memory tier, SSD tier, and HDD tier on the corresponding data nodes;
    loading the data stored in the memory, SSD, and HDD tiers into the cache according to the preconfigured percentage of each level and the storage size of each tier in the cluster.
  15. The cloud-computing-based adaptive storage tiering method according to claim 11, characterized in that
    automatically switching the query storage address according to the caching status of models and indexes in the metadata database comprises:
    when a user query hits a model, determining whether the hit model is currently in the cache;
    if it is cached, directing the file-load request to the cache data cluster to obtain the corresponding model;
    if it is not cached, loading directly from the source data.
  16. A cloud-computing-based adaptive storage tiering apparatus, characterized by comprising:
    a data node management module for obtaining the operating status of the storage cluster and expanding or shrinking it horizontally according to predefined rules;
    a metadata management module for obtaining the models hit by queries of the OLAP query engine and the scanned file paths, and aggregating and sorting them;
    an adaptive storage tiering module for loading files in tiers and preloading them based on the ranked lists of model hit counts and file scan counts;
    a pre-aggregation query routing module for automatically switching the query storage address according to the caching status of models and indexes in the metadata database.
  17. A readable storage medium, characterized in that the readable storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 10 to 15.
  18. An electronic device, characterized by comprising: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores a computer program executable by the at least one processor, the computer program being executed by the at least one processor to cause the at least one processor to perform the method according to any one of claims 10 to 15.
PCT/CN2021/074308 2020-12-15 2021-01-29 Cloud-computing-based adaptive storage tiering system and method WO2022126839A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US17/615,551 US20240028604A1 (en) 2020-12-15 2021-01-29 Cloud Computing-Based Adaptive Storage Layering System and Method
EP21801818.2A EP4040303A4 (en) 2020-12-15 2021-01-29 CLOUD-BASED ADAPTIVE STORAGE HIERARCHY SYSTEM AND METHOD

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011483292.0A 2020-12-15 Cloud-computing-based adaptive storage tiering system and method
CN202011483292.0 2020-12-15

Publications (1)

Publication Number Publication Date
WO2022126839A1 true WO2022126839A1 (zh) 2022-06-23

Family

ID=75063168

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/074308 WO2022126839A1 (zh) Cloud-computing-based adaptive storage tiering system and method

Country Status (4)

Country Link
US (1) US20240028604A1 (zh)
EP (1) EP4040303A4 (zh)
CN (1) CN112559459B (zh)
WO (1) WO2022126839A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115599790A (zh) * 2022-11-10 2023-01-13 星环信息科技(上海)股份有限公司 Data storage system, data processing method, electronic device and storage medium
CN116166691A (zh) * 2023-04-21 2023-05-26 中国科学院合肥物质科学研究院 Data archiving system, method, apparatus and device based on data partitioning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106470219A (zh) * 2015-08-17 2017-03-01 阿里巴巴集团控股有限公司 Method and device for scaling computer clusters up and down
US10061775B1 (en) * 2017-06-17 2018-08-28 HGST, Inc. Scalable and persistent L2 adaptive replacement cache
CN108920616A (zh) * 2018-06-28 2018-11-30 郑州云海信息技术有限公司 Metadata access performance optimization method, system, apparatus and storage medium
CN109344092A (zh) * 2018-09-11 2019-02-15 天津易华录信息技术有限公司 Method and system for improving read speed of cold-storage data
CN109947787A (zh) * 2017-10-30 2019-06-28 阿里巴巴集团控股有限公司 Data tiered storage and tiered query method and apparatus
CN110995856A (zh) * 2019-12-16 2020-04-10 上海米哈游天命科技有限公司 Server scaling method, apparatus, device and storage medium

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7779304B2 (en) * 2007-06-15 2010-08-17 International Business Machines Corporation Diagnosing changes in application behavior based on database usage
CN100492274C (zh) * 2007-08-17 2009-05-27 杭州华三通信技术有限公司 Storage control system and processing node thereof
US9418101B2 (en) * 2012-09-12 2016-08-16 International Business Machines Corporation Query optimization
US10585892B2 (en) * 2014-07-10 2020-03-10 Oracle International Corporation Hierarchical dimension analysis in multi-dimensional pivot grids
CN104219088A (zh) * 2014-08-21 2014-12-17 南京邮电大学 Hive-based OLAP method for network alarm information
CN105577806B (zh) * 2015-12-30 2019-11-12 Tcl集团股份有限公司 Distributed caching method and system
US11405423B2 (en) * 2016-03-11 2022-08-02 Netskope, Inc. Metadata-based data loss prevention (DLP) for cloud resources
US20180089288A1 (en) * 2016-09-26 2018-03-29 Splunk Inc. Metrics-aware user interface
CN107133342A (zh) * 2017-05-16 2017-09-05 广州舜飞信息科技有限公司 IndexR real-time data analysis library
US10747723B2 (en) * 2017-11-16 2020-08-18 Verizon Digital Media Services Inc. Caching with dynamic and selective compression of content
US11210009B1 (en) * 2018-03-15 2021-12-28 Pure Storage, Inc. Staging data in a cloud-based storage system
US10812409B2 (en) * 2018-06-06 2020-10-20 Sap Se Network multi-tenancy for cloud based enterprise resource planning solutions
US11157478B2 (en) * 2018-12-28 2021-10-26 Oracle International Corporation Technique of comprehensively support autonomous JSON document object (AJD) cloud service
US10868732B2 (en) * 2019-04-02 2020-12-15 Sap Se Cloud resource scaling using programmable-network traffic statistics
US11514275B2 (en) * 2019-10-21 2022-11-29 Sap Se Database instance tuning in a cloud platform
US11263236B2 (en) * 2019-11-18 2022-03-01 Sap Se Real-time cross-system database replication for hybrid-cloud elastic scaling and high-performance data virtualization
US11790414B2 (en) * 2020-01-31 2023-10-17 Salesforce, Inc. Techniques and architectures for customizable modular line item evaluation
US11392399B2 (en) * 2020-05-13 2022-07-19 Sap Se External binary sidecar for cloud containers
US11513708B2 (en) * 2020-08-25 2022-11-29 Commvault Systems, Inc. Optimized deduplication based on backup frequency in a distributed data storage system


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4040303A4 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115599790A (zh) * 2022-11-10 2023-01-13 星环信息科技(上海)股份有限公司 Data storage system, data processing method, electronic device and storage medium
CN115599790B (zh) * 2022-11-10 2024-03-15 星环信息科技(上海)股份有限公司 Data storage system, data processing method, electronic device and storage medium
CN116166691A (zh) * 2023-04-21 2023-05-26 中国科学院合肥物质科学研究院 Data archiving system, method, apparatus and device based on data partitioning

Also Published As

Publication number Publication date
CN112559459B (zh) 2024-02-13
CN112559459A (zh) 2021-03-26
EP4040303A4 (en) 2023-07-19
EP4040303A1 (en) 2022-08-10
US20240028604A1 (en) 2024-01-25

Similar Documents

Publication Publication Date Title
US9489443B1 (en) Scheduling of splits and moves of database partitions
US9424274B2 (en) Management of intermediate data spills during the shuffle phase of a map-reduce job
Krish et al. hats: A heterogeneity-aware tiered storage for hadoop
US8271455B2 (en) Storing replication requests for objects in a distributed storage system
US8874505B2 (en) Data replication and failure recovery method for distributed key-value store
JP2016027476A (ja) 規模変更可能なデータ記憶サービスを実装するためのシステムおよび方法
WO2022126839A1 (zh) Cloud-computing-based adaptive storage tiering system and method
JP2016509294A (ja) 分散型データベースクエリ・エンジン用のシステムおよび方法
US11755557B2 (en) Flat object storage namespace in an object storage system
US10635650B1 (en) Auto-partitioning secondary index for database tables
CN113377868A (zh) Offline storage system based on a distributed KV database
Moise et al. Terabyte-scale image similarity search: experience and best practice
US11132367B1 (en) Automatic creation of indexes for database tables
CN117056303A (zh) Data storage method and device for military-operation big data
US11886508B2 (en) Adaptive tiering for database data of a replica group
CN110569310A (zh) Management method for relational big data in a cloud computing environment
Cha et al. Effective metadata management in exascale file system
Furfaro et al. Managing multidimensional historical aggregate data in unstructured P2P networks
US11853298B2 (en) Data storage and data retrieval methods and devices
Herodotou Towards a distributed multi-tier file system for cluster computing
Sun et al. A storage computing architecture with multiple NDP devices for accelerating compaction performance in LSM-tree based KV stores
Banino-Rokkones et al. Scheduling techniques for effective system reconfiguration in distributed storage systems
CN116126209A (zh) Data storage method, system, device, storage medium and program product
Shaha DYNAMICALLY SCALABLE APPROACH FOR NAMENODE IN HADOOP DISTRIBUTED FILE SYSTEM USING SECONDARY STORAGE
CN117827968A (zh) Data sharing method for a cloud distributed data warehouse with separated metadata storage

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2021801818

Country of ref document: EP

Effective date: 20211118

WWE Wipo information: entry into national phase

Ref document number: 17615551

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE