CN111475506B

CN111475506B - Method, device, system, equipment and storage medium for data storage and query

Info

Publication number: CN111475506B
Application number: CN202010238641.6A
Authority: CN
Inventors: 陈剑
Original assignee: Guangzhou Huya Technology Co Ltd
Current assignee: Guangzhou Huya Technology Co Ltd
Priority date: 2020-03-30
Filing date: 2020-03-30
Publication date: 2024-03-01
Anticipated expiration: 2040-03-30
Also published as: CN111475506A

Abstract

The application provides a data storage method under a Hadoop storage architecture, wherein the Hadoop storage architecture comprises a preconfigured Hadoop cluster for storing hot data and a Hadoop cluster for storing cold data. The data storage method comprises the following steps: separating cold data from the Hadoop cluster storing the hot data according to a set separation granularity, storing the separated cold data in the Hadoop cluster storing the cold data, and modifying the storage path of the separated cold data to point to the Hadoop cluster storing the cold data. The application also provides a data query method, so that data in the Hadoop cluster storing hot data and data in the Hadoop cluster storing cold data can be queried together in a single query, realizing mixed query of hot and cold data.

Description

Method, device, system, equipment and storage medium for data storage and query

Technical Field

The present disclosure relates to the field of big data, and in particular, to a data storage method, a data query method, a data storage device, a data query device, a system, a device, and a computer readable storage medium based on a Hadoop storage architecture.

Background

In the big data field, Hadoop (a distributed big data processing framework) is the basis of big data storage and analysis. As data accumulates, more and more of it becomes cold data, and because of the way Hdfs (Hadoop Distributed File System, the distributed storage system of Hadoop) stores data, cold data occupies more and more resources even though it is rarely used in actual computation. In practice, when the current machine room can no longer hold more data, many enterprises therefore move cold data to lower-cost storage, for example by moving it to the cloud or by deploying a cheaper cluster in a remote machine room to store it.

When a cluster is deployed in a remote machine room to store cold data, a problem arises: cold data is not entirely useless, so how can cross-cluster mixed queries of cold and hot data be supported after the cold data has been migrated?

Disclosure of Invention

In view of this, the present application provides a data storage method, a data query method, a data storage device, a data query device, a system and a device based on a Hadoop storage architecture.

According to a first aspect of embodiments of the present application, there is provided a data storage method under a Hadoop storage architecture, where the Hadoop storage architecture includes a preconfigured Hadoop cluster storing hot data and a Hadoop cluster storing cold data, the data storage method includes:

separating cold data from the Hadoop clusters storing the hot data according to the set separation granularity, and storing the separated cold data into the Hadoop clusters storing the cold data;

the storage path of the separated cold data is modified to point to the Hadoop cluster storing the cold data.

According to a second aspect of embodiments of the present application, there is provided a data query method under a Hadoop storage architecture, where the Hadoop storage architecture includes a preconfigured Hadoop cluster storing hot data and a Hadoop cluster storing cold data, the data query method including:

analyzing the acquired query command to obtain queried data, and analyzing the queried data;

determining a target Hadoop cluster for executing the query according to the analysis result, generating a query task and submitting the query task to the target Hadoop cluster;

and obtaining the query result from the target Hadoop cluster.

According to a third aspect of embodiments of the present application, there is provided a system based on a Hadoop storage architecture, the system comprising:

the Hadoop storage architecture comprises a Hadoop cluster for storing hot data and a Hadoop cluster for storing cold data;

the data configuration end is used for separating cold data from the Hadoop cluster for storing the hot data according to the set separation granularity and storing the separated cold data into the Hadoop cluster for storing the cold data; modifying the storage path of the separated cold data to point to a Hadoop cluster for storing the cold data;

the Hive client is used for analyzing the acquired query command to obtain queried data and analyzing the queried data; determining a target Hadoop cluster for executing the query according to the analysis result, generating a query task and submitting the query task to the target Hadoop cluster; and obtaining the query result from the target Hadoop cluster.

According to a fourth aspect of embodiments of the present application, there is provided a data storage device based on a Hadoop storage architecture including a preconfigured Hadoop cluster storing hot data and a Hadoop cluster storing cold data, the data storage device comprising:

the data separation module is used for separating cold data from the Hadoop clusters storing the hot data according to the set separation granularity and storing the separated cold data into the Hadoop clusters storing the cold data;

and the path modification module is used for modifying the storage path of the separated cold data to point to the Hadoop cluster storing the cold data.

According to a fifth aspect of embodiments of the present application, a data query device under a Hadoop storage architecture is provided, where the Hadoop storage architecture includes a preconfigured Hadoop cluster storing hot data and a Hadoop cluster storing cold data; the data query device comprises:

the analysis module is used for analyzing the acquired query command to obtain queried data and analyzing the queried data;

the task submitting module is used for determining a target Hadoop cluster for executing the query according to the analysis result, generating a query task and submitting the query task to the target Hadoop cluster;

and a module used for acquiring the query result from the target Hadoop cluster.

According to a sixth aspect of embodiments of the present application, there is provided an apparatus comprising:

a processor;

a memory for storing the processor-executable instructions;

wherein the processor is configured to execute the instructions to implement the method of any of the embodiments described above.

According to a seventh aspect of embodiments of the present application, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as described in any of the embodiments above.

According to the present application, a Hadoop cluster for storing hot data and a Hadoop cluster for storing cold data are preconfigured in the Hadoop storage architecture, cold data separated from the hot data according to the set separation granularity is stored in a different cluster, and the storage path of the separated cold data is updated in time, so that data in the Hadoop cluster storing hot data and data in the Hadoop cluster storing cold data can be queried together, realizing hybrid query of hot and cold data.

Drawings

FIG. 1 is a flow chart illustrating a method of data storage according to an exemplary embodiment of the present application.

FIG. 2 is a flow chart illustrating a method of data querying according to an exemplary embodiment of the present application.

FIG. 3 is a flow chart illustrating a method of submitting a query task to a Hadoop cluster according to an exemplary embodiment of the present application.

FIG. 4 is a system diagram of a Hadoop-based storage architecture according to an exemplary embodiment of the present application.

FIG. 5 is a system diagram of a Hadoop-based storage architecture according to an exemplary embodiment of the present application.

FIG. 6 is a schematic diagram of a data storage device based on a Hadoop storage architecture according to an exemplary embodiment of the present application.

Fig. 7 is a schematic structural diagram of a data query device based on a Hadoop storage architecture according to an exemplary embodiment of the present application.

Fig. 8 is a schematic structural view of an apparatus according to an exemplary embodiment of the present application.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.

The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first message may also be referred to as a second message, and similarly, a second message may also be referred to as a first message, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.

The amount of data grows rapidly over time, and much hot data gradually cools into cold data. Hot data refers to data that computing nodes need to access frequently, that is, data with high daily usage; cold data refers to data that is accessed infrequently or no longer accessed, that is, data with low usage. Because of the limits of the local machine room, enterprises often keep hot data near a high-performance computing cluster in the local machine room and move cold data to lower-cost storage, such as the cloud or a cheaper, ordinary-performance cluster deployed at another site. The problem with this approach is that when users want to access cold data, they must first preheat it, that is, load it onto the computing cluster, because in the big data field storage and analysis are usually performed on Hadoop (a distributed big data processing framework).

Hadoop runs as a cluster formed by a plurality of computers, called a Hadoop cluster in this application. A Hadoop cluster mainly integrates two core components: Hdfs (Hadoop Distributed File System, a distributed storage system), which stores the data in the cluster, and Yarn (Yet Another Resource Negotiator, a distributed task scheduling framework), which schedules and executes tasks in the cluster. Data stored in a Hadoop cluster is queried through Hive, a Hadoop-based data warehouse tool. At present, a Hive query is usually executed within a single Hadoop cluster, and the Hive metadata database of one Hadoop cluster cannot record data information of other Hadoop clusters. As a result, mixed queries over cold data and hot data stored in different Hadoop clusters are not possible, and users have to query each of the two Hadoop clusters separately.

In view of the foregoing problems, the present application proposes a data storage method under a Hadoop storage architecture, where the method may be applied to a device configured for data of different clusters, where the Hadoop storage architecture includes a pre-configured Hadoop cluster storing hot data and a Hadoop cluster storing cold data, as shown in fig. 1, fig. 1 is a flowchart of a data storage method shown in an exemplary embodiment of the present application, where the flowchart includes the following steps:

S101, separating cold data from a Hadoop cluster for storing hot data according to a set separation granularity, and storing the separated cold data into the Hadoop cluster for storing cold data;

S102, modifying the storage path of the separated cold data to point to a Hadoop cluster for storing the cold data.

In S101, the separation granularity may be set by date. The Hadoop cluster storing hot data may include a plurality of date partitions, and the user may choose to separate the data of specified date partitions as cold data. For example, the user may set a date so that data stored in date partitions before that date is identified as cold data and separated, or may designate any date partition whose data is to be separated into the Hadoop cluster storing cold data. In some possible examples, the separation granularity may also be the access frequency of the data: the number of times each piece of data in the Hadoop cluster storing hot data is accessed within a specified time period is counted to obtain its access frequency, partitions for different access frequencies are established in the Hadoop cluster storing hot data, and the data of partitions whose access frequency is lower than a specified value is separated as cold data.
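The two separation granularities described for S101 can be illustrated with a minimal Python sketch. The function names, the partition representation (plain `date` values and a name-to-count mapping), and the strict "before the cutoff" comparison are assumptions made for illustration; the application does not prescribe a data model.

```python
from datetime import date


def select_cold_by_date(partitions, cutoff):
    """Return the date partitions strictly older than `cutoff` (treated as cold)."""
    return sorted(p for p in partitions if p < cutoff)


def select_cold_by_frequency(access_counts, threshold):
    """Return partition names accessed fewer than `threshold` times in the period."""
    return sorted(name for name, count in access_counts.items() if count < threshold)


# Hypothetical date partitions held by the hot cluster.
partitions = [date(2020, 1, 1), date(2020, 2, 1), date(2020, 3, 1)]
cold_dates = select_cold_by_date(partitions, date(2020, 2, 15))

# Hypothetical access counts gathered over a specified time period.
cold_names = select_cold_by_frequency({"p_a": 120, "p_b": 0, "p_c": 3}, threshold=5)
```

Either selection yields the set of partitions whose data would then be copied to the Hadoop cluster storing cold data.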

In one embodiment, one Hadoop cluster generally corresponds to one Hive and its metadata database. The Hive metadata database holds metadata, which mainly describes attributes of the data stored in the Hadoop cluster, such as its storage location and size, so one Hive metadata database usually records only the data information of a single Hadoop cluster. In this embodiment, by contrast, the Hive metadata database holds data information of two or more different clusters at the same time; for example, it may hold metadata describing both the attributes of the cold data stored in the cold data cluster and the attributes of the hot data stored in the hot data cluster. The storage path of the cold data separated in S101 is therefore also recorded in the metadata of the Hive metadata database, and modifying the storage path in S102 means modifying the storage path indicated by the metadata corresponding to the separated cold data so that it points to the cold data cluster. In this way, the embodiment records the data storage information of two Hadoop clusters in one Hive metadata database. When data is migrated between the clusters, the metadata corresponding to the migrated data is modified and the storage information is updated, so that the Hadoop cluster storing cold data and the Hadoop cluster storing hot data can both be queried when a cold and hot data query is performed through a Hive client.
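The application does not state how the metadata is modified. One possible way, shown here only as a sketch, is Hive's `ALTER TABLE ... PARTITION ... SET LOCATION` statement. The cluster names `hot-cluster` and `cold-cluster`, the table name, and the helper functions below are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def repoint_to_cold(location, cold_authority="cold-cluster"):
    """Swap the cluster (URI authority) of a storage path, keeping the file path."""
    parts = urlsplit(location)
    return urlunsplit(
        (parts.scheme, cold_authority, parts.path, parts.query, parts.fragment)
    )


def alter_partition_statement(table, partition_spec, new_location):
    """Build a Hive DDL statement that updates one partition's storage path."""
    spec = ", ".join(f"{key}='{value}'" for key, value in partition_spec.items())
    return f"ALTER TABLE {table} PARTITION ({spec}) SET LOCATION '{new_location}'"


old_path = "hdfs://hot-cluster/warehouse/logs/dt=2020-01-01"
new_path = repoint_to_cold(old_path)
statement = alter_partition_statement("logs", {"dt": "2020-01-01"}, new_path)
```

The sketch only builds the statement text; executing it against a real Hive deployment is outside its scope.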

The application further provides a data query method under a Hadoop storage architecture, which can be applied to a Hive client, wherein the Hadoop storage architecture comprises a preconfigured Hadoop cluster for storing hot data and a Hadoop cluster for storing cold data, as shown in fig. 2, fig. 2 is a flowchart of a data query method according to an exemplary embodiment of the application, and the flowchart includes the following steps:

S201, analyzing the acquired query command to obtain queried data, and analyzing the queried data;

S202, determining a target Hadoop cluster for executing the query according to the analysis result, generating a query task and submitting the query task to the target Hadoop cluster;

S203, obtaining the query result from the target Hadoop cluster.

The query command may be an SQL query statement input by a user; the statement is parsed to obtain the data the user wants to query. To execute the query as close as possible to the Hadoop cluster that stores the queried data, and thus avoid copying large amounts of data between clusters and improve query efficiency, the queried data is analyzed. For example, the storage path of the queried data is looked up in the Hive metadata database to determine whether the queried data is hot data or cold data, which determines the target Hadoop cluster that executes the query; the query task is then generated and submitted to the target Hadoop cluster for execution. However, existing Hive cannot choose between submitting a query task to the Hadoop cluster storing hot data and the Hadoop cluster storing cold data. Existing Hive loads preconfigured information about the cluster to which generated query tasks are submitted and does not support changing that choice dynamically during execution. In other words, the Hadoop cluster to which existing Hive submits query tasks is preconfigured and cannot be modified, so existing Hive supports query task submission to a single Hadoop cluster only and cannot select the submission target.

To address this, the present application modifies the source code of the Hive client to deeply customize it, so that the Hive client can dynamically change the submission target of a query task at submission time. After analyzing the storage path of the queried data and determining whether the queried data is hot data or cold data, the Hive client submits the generated query task to the corresponding Hadoop cluster.

When the queried data consists only of cold data or only of hot data, analyzing its storage path shows that it points to the Hadoop cluster storing cold data or to the Hadoop cluster storing hot data, so the Hadoop cluster that executes the query can be determined directly; a query task is generated and submitted to that cluster, which ensures query efficiency. In one embodiment, the Hadoop cluster is divided into a plurality of data partitions for storing data. Hive parses the acquired query command to obtain the queried data, and the storage paths of the queried data are then acquired to determine the data partitions that the query needs to scan and the Hadoop cluster to which each partition belongs. The sizes of all data partitions to be scanned are obtained from the Hive metadata, and three totals are computed: the total size of all data partitions to be scanned, the total size of those belonging to the Hadoop cluster storing cold data, and the total size of those belonging to the Hadoop cluster storing hot data. From these totals, the Hadoop cluster holding the largest proportion of the data to be scanned is determined as the target Hadoop cluster, and a query task is generated and submitted to it. Because the target Hadoop cluster holds the largest proportion of the data to be scanned, it can read most of that data locally and only needs to copy a small amount from the non-local cluster, which avoids the low query efficiency caused by long data copies.

For example, when the queried data includes both cold data and hot data and the hot data makes up the larger proportion of the data to be scanned, the Hadoop cluster storing hot data is determined as the target Hadoop cluster, and a query task is generated and submitted to it. That cluster reads the hot data, which makes up the larger proportion, locally, copies the smaller proportion of cold data from the Hadoop cluster storing cold data, performs the scan, and feeds the query result back to the Hive client when the scan is finished.
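The proportion-based choice of target cluster described above can be sketched as follows. The input format, a list of `(cluster, size_in_bytes)` pairs for the partitions to be scanned, and the function name are assumptions for illustration. The application does not say how a tie is resolved; in this sketch the cluster seen first wins.

```python
from collections import defaultdict


def choose_target_cluster(scanned_partitions):
    """Pick the cluster holding the largest share of the bytes to be scanned."""
    totals = defaultdict(int)
    for cluster, size in scanned_partitions:
        totals[cluster] += size
    if not totals:
        raise ValueError("the query scans no partitions")
    # max() keeps the first cluster encountered when totals are equal.
    return max(totals, key=totals.get)


# Hypothetical partition sizes obtained from the Hive metadata.
mostly_hot = [("hot", 300), ("cold", 100), ("hot", 50)]
mostly_cold = [("cold", 500), ("hot", 100)]
target_a = choose_target_cluster(mostly_hot)
target_b = choose_target_cluster(mostly_cold)
```

Here `target_a` is the hot cluster (350 of 450 bytes are local to it) and `target_b` is the cold cluster.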

When the queried data includes both hot data and cold data, the target Hadoop cluster that executes the query stores only hot data or only cold data, so it must read data from the non-local Hadoop cluster to complete the query, and copying data from the non-local Hadoop cluster is unavoidable. Reading the non-local Hadoop cluster remotely through the default Hdfs protocol can be too slow. In addition, big data queries tend to touch the same data many times, so the same data may be scanned repeatedly by multiple queries; repeatedly fetching the same data consumes more network resources and time and works against efficient cross-cluster data loading. Therefore, in this embodiment the Hdfs protocol is replaced with the alluxio protocol, which provides a cache layer local to the target Hadoop cluster. When querying data of the non-local Hadoop cluster, the target Hadoop cluster first scans the local cache layer. If the cache layer already holds the data to be scanned, the data does not need to be acquired from the non-local Hadoop cluster; if it does not, the data is acquired from the non-local Hadoop cluster, scanned, and stored in the cache layer so that it can be read directly from the cache layer the next time it is needed.

Therefore, data that must be acquired from the non-local Hadoop cluster and may be scanned repeatedly by multiple queries is stored in the cache layer the first time it is acquired. This avoids reloading the data from the non-local Hadoop cluster to the local Hadoop cluster every time it is scanned, and greatly improves the efficiency of data loading between clusters.
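The cache-first read behaviour described above can be illustrated with a small in-memory sketch. It is not the alluxio API: the class name, the dictionary standing in for the cache layer, and the `fetch_remote` callback standing in for a read from the non-local cluster are all assumptions for illustration.

```python
class ReadThroughCache:
    """In-memory stand-in for a cache layer in front of a non-local cluster."""

    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote  # callable: path -> data
        self._cache = {}
        self.remote_reads = 0

    def read(self, path):
        # Scan the local cache layer first; go remote only on a miss.
        if path not in self._cache:
            self._cache[path] = self._fetch_remote(path)
            self.remote_reads += 1
        return self._cache[path]


# Hypothetical contents of the non-local (cold) cluster.
remote_store = {"/warehouse/logs/dt=2020-01-01": b"cold rows"}
cache = ReadThroughCache(remote_store.__getitem__)

first = cache.read("/warehouse/logs/dt=2020-01-01")   # miss: fetched remotely
second = cache.read("/warehouse/logs/dt=2020-01-01")  # hit: served from cache
```

After both reads, `cache.remote_reads` is 1, which is the saving the embodiment aims for when several queries scan the same cross-cluster data.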

The present application also provides a method flow for submitting a query task to a Hadoop cluster, where the method may be applied to a Hive client, as shown in fig. 3, and fig. 3 is a flowchart of a method for submitting a query task to a Hadoop cluster, where the flowchart includes the following steps:

S301, acquiring SQL query sentences.

S302, analyzing SQL to obtain the storage path of the queried data.

S303, judging whether the queried data only has cold data or only has hot data; if so, executing S306; otherwise, executing S304.

S304, counting the proportions of cold data and hot data to be scanned for the query, and determining the target Hadoop cluster for executing the query.

S305, replacing the Hdfs protocol of the target Hadoop cluster with an alluxio protocol.

S306, constructing a query task.

S307, judging whether the Hadoop cluster for executing the query is the Hadoop cluster storing hot data or the Hadoop cluster storing cold data; if it is the Hadoop cluster storing hot data, executing S309; if it is the Hadoop cluster storing cold data, executing S308.

S308, dynamically modifying Hadoop cluster information submitted by a pre-configured query task, and modifying a target Hadoop cluster submitted by the query task into a Hadoop cluster for storing cold data.

S309, submitting the query task to the target Hadoop cluster.

In the method for submitting the query task to the Hadoop cluster shown in this embodiment, the target Hadoop cluster to which query tasks are submitted is preconfigured as the Hadoop cluster storing hot data. After the SQL query statement input by the user is acquired, the SQL is parsed to acquire the storage path of the queried data, so as to determine what kind of data is queried:

when it is determined that the queried data only has hot data, a query task is built, it is determined that the query is executed by the Hadoop cluster storing hot data, and the query task is submitted to the Hadoop cluster storing hot data;

when it is determined that the queried data only has cold data, a query task is built, it is determined that the query is executed by the Hadoop cluster storing cold data, the preconfigured Hadoop cluster information for query task submission is modified so that the submission target changes from the Hadoop cluster storing hot data to the Hadoop cluster storing cold data, and the query task is submitted to the Hadoop cluster storing cold data;

when it is determined that the queried data includes both hot data and cold data, the proportions of cold data and hot data to be scanned for the query are first counted to determine the target Hadoop cluster for executing the query, the Hdfs protocol of the target Hadoop cluster is replaced with the alluxio protocol, and a query task is built. If the query is to be executed by the Hadoop cluster storing hot data, the query task is submitted to the Hadoop cluster storing hot data; if the query is to be executed by the Hadoop cluster storing cold data, the preconfigured Hadoop cluster information for query task submission is modified so that the submission target becomes the Hadoop cluster storing cold data, and the query task is submitted to the Hadoop cluster storing cold data.
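The three cases above can be summarized in a short decision sketch. It models only the decisions, not Hive itself: the function name, the `(cluster, size_in_bytes)` input, and the returned dictionary keys are hypothetical, and the preconfigured submission target is assumed to be the hot cluster as in this embodiment.

```python
def plan_submission(scanned, preconfigured="hot"):
    """Decide where a query task goes, following the three cases above.

    `scanned` is a list of (cluster, size_in_bytes) pairs for the data the
    query must scan.
    """
    if not scanned:
        raise ValueError("the query scans no data")
    clusters = {cluster for cluster, _ in scanned}
    if len(clusters) == 1:
        # Only hot or only cold data: no proportion count, no protocol swap.
        target = next(iter(clusters))
        use_cache_protocol = False
    else:
        # Mixed data: count proportions, then read through the cache layer.
        totals = {}
        for cluster, size in scanned:
            totals[cluster] = totals.get(cluster, 0) + size
        target = max(totals, key=totals.get)
        use_cache_protocol = True
    return {
        "target": target,
        "use_cache_protocol": use_cache_protocol,
        # The preconfigured target is overridden only when it differs.
        "override_preconfigured": target != preconfigured,
    }


hot_only = plan_submission([("hot", 10)])
cold_only = plan_submission([("cold", 10)])
mixed = plan_submission([("hot", 10), ("cold", 90)])
```

In this sketch `mixed` is routed to the cold cluster with the cache protocol enabled, since 90 of the 100 bytes to be scanned are stored there.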

The present application further provides a system 4 based on a Hadoop storage architecture, as shown in fig. 4, fig. 4 is a schematic diagram of a system based on a Hadoop storage architecture according to an exemplary embodiment of the present application, where the system includes:

the Hadoop storage architecture 401 includes a Hadoop cluster 4011 that stores hot data and a Hadoop cluster 4012 that stores cold data;

the data configuration end 402 is configured to separate cold data from the Hadoop cluster 4011 storing hot data according to a set separation granularity, and store the separated cold data to the Hadoop cluster 4012 storing cold data; modifying the storage path of the separated cold data to point to a Hadoop cluster for storing the cold data;

the Hive client 403, configured to parse the obtained query command to obtain queried data, and analyze the queried data; determining a target Hadoop cluster for executing the query according to the analysis result, generating a query task and submitting the query task to the target Hadoop cluster; and obtaining the query result from the target Hadoop cluster.

In one embodiment, the system further comprises a Hive metadata base for recording the storage path of cold or hot data stored in the Hadoop cluster by metadata.

Fig. 5 is a schematic diagram of a system based on a Hadoop storage architecture according to an exemplary embodiment of the present application. As shown in fig. 5, the system includes a cold and hot data configuration center 501, a dispatch center 502, a hot data room 503, a cold data room 504, a Hive metadata database 505, and a Hive client 506, where the hot data room 503 further includes a hot data Yarn cluster 5031, an alluxio 5032, and a hot data Hdfs cluster 5033, and the cold data room 504 further includes a cold data Yarn cluster 5041, an alluxio 5042, and a cold data Hdfs cluster 5043.

An administrator can configure cold data through the cold and hot data configuration center 501, which generates a cold data configuration task and submits it to the dispatch center 502. According to the configuration task, the dispatch center 502 copies the designated cold data from the hot data Hdfs cluster 5033 in the hot data room 503, stores it in the cold data Hdfs cluster 5043 in the cold data room 504, and deletes the copied cold data from the hot data Hdfs cluster 5033. After the copying is completed, the dispatch center 502 modifies, in the Hive metadata database 505, the partition path of the metadata corresponding to the copied cold data, where the partition path indicates the storage location of the cold data.
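The migration task carried out by the dispatch center can be sketched with in-memory stand-ins. The dictionaries playing the roles of the two Hdfs clusters and the Hive metadata database, the `hdfs://hot` and `hdfs://cold` prefixes, and the function name are assumptions for illustration. The sketch deletes the hot copy only after the partition path has been updated; that ordering is a choice made here, since the application states only that the path is modified after copying completes.

```python
def migrate_partition(hot_fs, cold_fs, metastore, partition,
                      hot_prefix="hdfs://hot", cold_prefix="hdfs://cold"):
    """Move one partition to the cold cluster and repoint its metadata path."""
    old_path = metastore[partition]
    if not old_path.startswith(hot_prefix):
        raise ValueError(f"{partition} is not stored in the hot cluster")
    new_path = cold_prefix + old_path[len(hot_prefix):]
    cold_fs[new_path] = hot_fs[old_path]  # copy the data to the cold cluster
    metastore[partition] = new_path       # update the partition path
    del hot_fs[old_path]                  # remove the copy left in the hot cluster
    return new_path


# Hypothetical state before migration.
hot_fs = {"hdfs://hot/warehouse/logs/dt=2020-01-01": b"rows"}
cold_fs = {}
metastore = {"logs/dt=2020-01-01": "hdfs://hot/warehouse/logs/dt=2020-01-01"}

moved_to = migrate_partition(hot_fs, cold_fs, metastore, "logs/dt=2020-01-01")
```

After the call, the metadata entry points at the cold cluster, so a later query that resolves the partition path is routed there.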

The user inputs an SQL query statement at the Hive client 506. The Hive client parses the SQL, acquires the storage path of the queried data from the Hive metadata database 505 to determine the data that needs to be scanned, and calculates whether most of that data is stored in the hot data Hdfs cluster 5033 or in the cold data Hdfs cluster 5043. If the data to be scanned is stored entirely in the hot data Hdfs cluster 5033 or entirely in the cold data Hdfs cluster 5043, a query task is generated and submitted to the hot data Yarn cluster 5031 or the cold data Yarn cluster 5041 respectively, and that Yarn cluster acquires the data to be scanned from the hot data Hdfs cluster 5033 or the cold data Hdfs cluster 5043 respectively and scans it. If the data to be scanned is stored in both the hot data Hdfs cluster 5033 and the cold data Hdfs cluster 5043, a query task is generated and submitted to the Yarn cluster corresponding to the Hdfs cluster that holds the larger proportion of the data. Meanwhile, the Yarn cluster of each machine room can use alluxio to acquire data from the Hdfs cluster of the non-local machine room and cache it in the cache layer, so that when a Yarn cluster needs scan data from the other machine room, it can first check whether the target data is already cached in the cache layer, which avoids repeated copying. Throughout the process, the user only needs to input the SQL and wait for the Hive client 506 to output the query result; no other operation, such as preheating the cold data, is required, and all query steps are carried out automatically by the system. The system thus provides high-performance mixed query of cold and hot data while remaining almost completely transparent to the user.

The present application further provides a data storage device based on a Hadoop storage architecture, where the Hadoop storage architecture includes a preconfigured Hadoop cluster for storing hot data and a Hadoop cluster for storing cold data. Fig. 6 is a schematic structural diagram of a data storage device 600 according to an exemplary embodiment of the present application. As shown in fig. 6, the data storage device includes:

the data separation module 601 is configured to separate cold data from the Hadoop cluster storing hot data according to a set separation granularity, and store the separated cold data to the Hadoop cluster storing cold data;

the path modification module 602 is configured to modify the storage path of the separated cold data to point to the Hadoop cluster storing the cold data.

The present application further provides a data query device based on a Hadoop storage architecture, where the Hadoop storage architecture includes a preconfigured Hadoop cluster for storing hot data and a Hadoop cluster for storing cold data. Fig. 7 is a schematic structural diagram of a data query device 700 according to an exemplary embodiment of the present application. As shown in fig. 7, the data query device includes:

the parsing module 701 is configured to parse the obtained query command to obtain queried data, and analyze the queried data;

the task submitting module 702 is configured to determine a target Hadoop cluster for executing the query according to the analysis result, generate a query task, and submit the query task to the target Hadoop cluster;
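The analysis performed by the parsing module and the target selection performed by the task submitting module can be illustrated with a minimal Python sketch. It is only one possible reading: the function names `analyze` and `choose_target`, the host names, and the partition sizes are assumptions introduced for illustration, and the cluster is identified here simply by the authority part of each partition's storage path.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def analyze(partitions):
    """Sum the bytes to scan per cluster.

    partitions: iterable of (location URI, size in bytes), as would be looked
    up from the partition metadata (hypothetical input shape).
    """
    totals = defaultdict(int)
    for location, size in partitions:
        totals[urlsplit(location).netloc] += size
    return dict(totals)


def choose_target(totals):
    """Pick the cluster holding the largest share of the data to scan.

    On a tie, the cluster encountered first wins; that tie-break is an
    assumption of this sketch.
    """
    return max(totals, key=totals.get)


partitions = [
    ("hdfs://hot-nn:8020/warehouse/logs/dt=2020-03-01", 40),
    ("hdfs://cold-nn:8020/warehouse/logs/dt=2019-03-01", 100),
    ("hdfs://hot-nn:8020/warehouse/logs/dt=2020-03-02", 50),
]
totals = analyze(partitions)
target = choose_target(totals)
print(totals)  # prints {'hot-nn:8020': 90, 'cold-nn:8020': 100}
print(target)  # prints cold-nn:8020
```

Submitting the task to the cluster that holds most of the data keeps the larger part of the scan local, so only the smaller part has to be read across machine rooms.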

Embodiments of the data storage device and the data query device of the present application may be applied to an electronic device. These embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, the data storage device or data query device, in a logical sense, is formed when the processor of the electronic device in which it is located reads the corresponding computer program instructions from the nonvolatile memory into the memory and runs them. In terms of hardware, fig. 8 is a hardware structure diagram of the electronic device in which the data storage device and the data query device are located. In addition to the processor, the memory, the network interface, and the nonvolatile memory shown in fig. 8, the electronic device generally includes other hardware according to its actual function, which is not described here again.

Wherein the non-volatile memory is for storing the processor-executable instructions, the processor being configured to execute the instructions to implement the method of any of the above embodiments.

The present application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method of data storage and data querying of any of the embodiments described above.

The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.

Since the device embodiments essentially correspond to the method embodiments, reference is made to the description of the method embodiments for the relevant points. The device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present application. Those of ordinary skill in the art can understand and implement the present application without undue burden.

The foregoing description of the preferred embodiments of the present invention is not intended to limit the invention to the precise form disclosed, and any modifications, equivalents, improvements and alternatives falling within the spirit and principles of the present invention are intended to be included within the scope of the present invention.

Claims

1. The data storage method under the Hadoop storage architecture is characterized in that the Hadoop storage architecture comprises a pre-configured Hadoop cluster for storing hot data and a Hadoop cluster for storing cold data; the storage paths of the cold data and the hot data are recorded in metadata of a Hive metadata base; the data storage method comprises the following steps:

and modifying the storage path indicated by the metadata corresponding to the separated cold data into a Hadoop cluster for storing the cold data.

2. The data storage method according to claim 1, wherein the separation granularity is set according to a date, and the Hadoop cluster for storing the hot data comprises a plurality of date partitions;

the separating cold data from the Hadoop cluster storing hot data includes:

the data for the specified date partition is separated as cold data.

3. A data query method under Hadoop storage architecture, for querying data stored by the method of claim 1, the data query method comprising:

and obtaining the query result from the target Hadoop cluster.

4. A method of querying data as in claim 3, wherein the Hadoop cluster comprises a plurality of data partitions storing data, and wherein analyzing the queried data comprises:

acquiring a storage path of the queried data;

determining the data partitions to be scanned and the Hadoop cluster to which each data partition belongs, and acquiring the sizes of all the data partitions to be scanned; determining the Hadoop cluster holding the highest proportion of the data to be scanned;

the determining the target Hadoop cluster for executing the query comprises the following steps:

and determining the Hadoop cluster holding the highest proportion as the target Hadoop cluster.

5. The data query method according to claim 4, wherein a cache layer is provided on the target Hadoop cluster, and is used for storing data acquired by the target Hadoop cluster from a non-local Hadoop cluster;

and the query task indicates the target Hadoop cluster to firstly scan the cache layer when querying the data of the non-local Hadoop cluster, and acquires the data required to be scanned for query from the non-local Hadoop cluster and stores the data in the cache layer when the data required to be scanned for query does not exist in the cache layer.

6. The data query method of claim 5, further comprising:

when the data partition required to be scanned by the query is determined to simultaneously comprise the data partition belonging to the Hadoop cluster for storing hot data and the data partition belonging to the Hadoop cluster for storing cold data, the Hdfs protocol of the determined target Hadoop cluster is replaced by an alluxio protocol.

7. A system based on Hadoop storage architecture, the system comprising:

the Hive metadata base is used for recording the storage paths of the cold data or the hot data stored in the Hadoop cluster through metadata;

the data configuration end is used for separating cold data from the Hadoop cluster for storing the hot data according to the set separation granularity and storing the separated cold data into the Hadoop cluster for storing the cold data; modifying a storage path indicated by metadata corresponding to the separated cold data into a Hadoop cluster pointed to store the cold data;

8. The data storage device based on the Hadoop storage architecture is characterized in that the Hadoop storage architecture comprises a pre-configured Hadoop cluster for storing hot data and a Hadoop cluster for storing cold data; the storage paths of the cold data and the hot data are recorded in metadata of a Hive metadata base; the data storage device includes:

and the path modification module is used for modifying the storage path indicated by the metadata corresponding to the separated cold data into a Hadoop cluster for storing the cold data.

9. A data querying device in a Hadoop storage architecture for querying data stored by the device of claim 8, the data querying device comprising:

10. An electronic device, comprising:

a processor;

a memory for storing the processor-executable instructions;

wherein the processor is configured to execute the instructions to implement the method of any of claims 1-6.

11. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method of any of claims 1-6.