CN111475506B - Method, device, system, equipment and storage medium for data storage and query - Google Patents


Info

Publication number
CN111475506B
Authority
CN
China
Prior art keywords
data
hadoop
hadoop cluster
storing
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010238641.6A
Other languages
Chinese (zh)
Other versions
CN111475506A (en)
Inventor
陈剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Huya Technology Co Ltd
Original Assignee
Guangzhou Huya Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Huya Technology Co Ltd
Priority to CN202010238641.6A
Publication of CN111475506A
Application granted
Publication of CN111475506B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2219 Large Object storage; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries

Abstract

The application provides a data storage method under a Hadoop storage architecture, wherein the Hadoop storage architecture comprises a preconfigured Hadoop cluster for storing hot data and a preconfigured Hadoop cluster for storing cold data. The data storage method comprises: separating cold data from the Hadoop cluster storing hot data according to a set separation granularity, storing the separated cold data in the Hadoop cluster storing cold data, and modifying the storage path of the separated cold data to point to the Hadoop cluster storing cold data. The application also provides a data query method, so that data in the Hadoop cluster storing hot data and data in the Hadoop cluster storing cold data can be queried together in a single query, realizing mixed queries over hot and cold data.

Description

Method, device, system, equipment and storage medium for data storage and query
Technical Field
The present disclosure relates to the field of big data, and in particular, to a data storage method, a data query method, a data storage device, a data query device, a system, a device, and a computer readable storage medium based on a Hadoop storage architecture.
Background
In the big data field, Hadoop (a distributed big data processing framework) is the foundation of big data storage and analysis. As data accumulates, more and more of it becomes cold data. Because of the way Hdfs (Hadoop Distributed File System, the distributed storage system of Hadoop) stores data, cold data occupies an ever larger share of resources, even though it is rarely used in actual computation. In practice, when the limits of the current machine room mean that no more data can be stored there, many enterprises move cold data to a cheaper location, for example by moving it to the cloud or by deploying a cheaper cluster in a different machine room to store it.
When a cluster is deployed in a remote machine room to store cold data, one problem arises: cold data is not entirely useless, so it must remain queryable after migration, and cross-cluster mixed queries over cold and hot data must still be possible.
Disclosure of Invention
In view of this, the present application provides a data storage method, a data query method, a data storage device, a data query device, a system and a device based on a Hadoop storage architecture.
According to a first aspect of embodiments of the present application, there is provided a data storage method under a Hadoop storage architecture, where the Hadoop storage architecture includes a preconfigured Hadoop cluster storing hot data and a Hadoop cluster storing cold data, the data storage method includes:
separating cold data from the Hadoop clusters storing the hot data according to the set separation granularity, and storing the separated cold data into the Hadoop clusters storing the cold data;
the storage path of the separated cold data is modified to point to the Hadoop cluster storing the cold data.
According to a second aspect of embodiments of the present application, there is provided a data query method under a Hadoop storage architecture, where the Hadoop storage architecture includes a preconfigured Hadoop cluster storing hot data and a Hadoop cluster storing cold data, the data query method including:
analyzing the acquired query command to obtain queried data, and analyzing the queried data;
determining a target Hadoop cluster for executing the query according to the analysis result, generating a query task and submitting the query task to the target Hadoop cluster;
and obtaining the query result from the target Hadoop cluster.
According to a third aspect of embodiments of the present application, there is provided a system based on a Hadoop storage architecture, the system comprising:
the Hadoop storage architecture comprises a Hadoop cluster for storing hot data and a Hadoop cluster for storing cold data;
the data configuration end is used for separating cold data from the Hadoop cluster for storing the hot data according to the set separation granularity and storing the separated cold data into the Hadoop cluster for storing the cold data; modifying the storage path of the separated cold data to point to a Hadoop cluster for storing the cold data;
the Hive client is used for analyzing the acquired query command to obtain queried data and analyzing the queried data; determining a target Hadoop cluster for executing the query according to the analysis result, generating a query task and submitting the query task to the target Hadoop cluster; and obtaining the query result from the target Hadoop cluster.
According to a fourth aspect of embodiments of the present application, there is provided a data storage device based on a Hadoop storage architecture including a preconfigured Hadoop cluster storing hot data and a Hadoop cluster storing cold data, the data storage device comprising:
the data separation module is used for separating cold data from the Hadoop clusters storing the hot data according to the set separation granularity and storing the separated cold data into the Hadoop clusters storing the cold data;
and the path modification module is used for modifying the storage path of the separated cold data into a Hadoop cluster for storing the cold data.
According to a fifth aspect of embodiments of the present application, a data query device under a Hadoop storage architecture is provided, where the Hadoop storage architecture includes a preconfigured Hadoop cluster storing hot data and a Hadoop cluster storing cold data; the data query device comprises:
the analysis module is used for analyzing the acquired query command to obtain queried data and analyzing the queried data;
the task submitting module is used for determining a target Hadoop cluster for executing the query according to the analysis result, generating a query task and submitting the query task to the target Hadoop cluster;
and the query task is used for acquiring a query result from the target Hadoop cluster.
According to a sixth aspect of embodiments of the present application, there is provided an apparatus comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of any of the embodiments described above.
According to a seventh aspect of embodiments of the present application, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as described in any of the embodiments above.
According to the method and the device, the Hadoop clusters for storing hot data and the Hadoop clusters for storing cold data are pre-configured in the Hadoop storage architecture, cold data separated from the hot data can be stored in different clusters according to the set separation granularity, and the storage paths of the separated cold data are updated in time, so that the Hadoop clusters for storing the hot data and the data in the Hadoop clusters for storing the cold data can be queried simultaneously when the data are queried, and hybrid query of the hot and cold data is realized.
Drawings
FIG. 1 is a flow chart illustrating a method of data storage according to an exemplary embodiment of the present application.
FIG. 2 is a flow chart illustrating a method of data querying according to an exemplary embodiment of the present application.
FIG. 3 is a flow chart illustrating a method of submitting a query task to a Hadoop cluster according to an exemplary embodiment of the present application.
FIG. 4 is a system diagram of a Hadoop-based storage architecture according to an exemplary embodiment of the present application.
FIG. 5 is a system diagram of a Hadoop-based storage architecture according to an exemplary embodiment of the present application.
FIG. 6 is a schematic diagram of a data storage device based on a Hadoop storage architecture according to an exemplary embodiment of the present application.
Fig. 7 is a schematic structural diagram of a data query device based on a Hadoop storage architecture according to an exemplary embodiment of the present application.
Fig. 8 is a schematic structural view of an apparatus according to an exemplary embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various pieces of information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, a first message may also be referred to as a second message, and similarly, a second message may also be referred to as a first message, without departing from the scope of the present application. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
The amount of data grows rapidly over time, and much hot data gradually cools into cold data. Hot data refers to data that computing nodes need to access frequently, that is, data with high day-to-day usage. Cold data refers to data that is accessed infrequently or is no longer accessed at all, that is, data with low usage. In practice, because of the limits of the local machine room, enterprises often keep hot data close to a high-performance computing cluster in the local machine room and store cold data separately in a cheaper location, such as the cloud or a remote cluster with ordinary, cheaper hardware. The problem with this approach is that when users want to access cold data, they must first preheat it, that is, load it onto the computing cluster, because in the big data field the storage and analysis of data are usually performed on Hadoop (a distributed big data processing framework).
Hadoop runs on a cluster formed by a plurality of computers, referred to in this application as a Hadoop cluster. A Hadoop cluster mainly integrates two core components: Hdfs (Hadoop Distributed File System, a distributed storage system) and Yarn (Yet Another Resource Negotiator, a distributed task scheduling framework). Data storage in the Hadoop cluster is realized through Hdfs, and task scheduling and execution in the Hadoop cluster are realized through Yarn. The data stored in a Hadoop cluster is queried through Hive, a Hadoop-based data warehouse tool. At present, a Hive query is usually executed within a single Hadoop cluster, and the Hive metadata database built on that cluster cannot store data information of other Hadoop clusters. As a result, mixed queries over cold data and hot data stored in different Hadoop clusters cannot be realized, and users have to query the two Hadoop clusters separately.
In view of the foregoing problems, the present application proposes a data storage method under a Hadoop storage architecture. The method may be applied to a device that configures data across different clusters, and the Hadoop storage architecture includes a preconfigured Hadoop cluster storing hot data and a preconfigured Hadoop cluster storing cold data. FIG. 1 is a flowchart of the data storage method according to an exemplary embodiment of the present application; as shown in FIG. 1, the method includes the following steps:
S101, separating cold data from the Hadoop cluster storing hot data according to a set separation granularity, and storing the separated cold data into the Hadoop cluster storing cold data;
S102, modifying the storage path of the separated cold data to point to the Hadoop cluster storing cold data.
In S101, the separation granularity may be set by date. The Hadoop cluster storing hot data may include a plurality of date partitions, and the user may choose to separate the data of specified date partitions as cold data. For example, the user may set a date, so that the data stored in date partitions earlier than that date is identified as cold data and separated, or the user may designate any date partitions whose data is to be separated into the Hadoop cluster storing cold data. In some possible examples, the separation granularity may instead be the access frequency of the data: the number of times each piece of data in the Hadoop cluster storing hot data is accessed within a specified time period is counted to obtain its access frequency over that period, partitions corresponding to different access frequencies are established in the Hadoop cluster storing hot data, and the data of partitions whose access frequency is lower than a specified value is separated as cold data.
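The two separation granularities described for S101 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the partition names, the access counters, and both function names are hypothetical, and a real deployment would read this information from the cluster rather than from in-memory dictionaries.

```python
from datetime import date


def select_cold_by_date(partition_dates, cutoff):
    """Return the partitions whose date is strictly earlier than the cutoff date."""
    return sorted(name for name, day in partition_dates.items() if day < cutoff)


def select_cold_by_access(access_counts, min_accesses):
    """Return the partitions accessed fewer than min_accesses times in the period."""
    return sorted(name for name, count in access_counts.items() if count < min_accesses)


# Hypothetical date partitions of one table on the hot-data cluster.
partition_dates = {
    "dt=2020-01-01": date(2020, 1, 1),
    "dt=2020-02-01": date(2020, 2, 1),
    "dt=2020-03-01": date(2020, 3, 1),
}
# Hypothetical access counts gathered over a specified time period.
access_counts = {"dt=2020-01-01": 0, "dt=2020-02-01": 3, "dt=2020-03-01": 120}

print(select_cold_by_date(partition_dates, date(2020, 2, 15)))
print(select_cold_by_access(access_counts, min_accesses=5))
```

Whether the cutoff comparison is strict and how the access threshold is chosen are assumptions of this sketch; the description only states that partitions before a set date, or below a specified access frequency, are treated as cold.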
In one embodiment, one Hadoop cluster generally corresponds to one Hive and its metadata database. The Hive metadata database contains metadata, which mainly describes attributes of the data stored in the Hadoop cluster, such as its storage location and size. A Hive metadata database therefore usually records only the information of data stored in one Hadoop cluster. In this embodiment, by contrast, the Hive metadata database contains data information of two or more different clusters at the same time; for example, it may contain metadata describing both the attributes of the cold data stored in the cold data cluster and the attributes of the hot data stored in the hot data cluster. The storage path of the cold data separated in S101 is thus also recorded in the metadata of the Hive metadata database, and in S102, modifying the storage path of the separated cold data to point to the Hadoop cluster storing cold data means modifying the storage path indicated by the metadata corresponding to the separated cold data so that it points to the cold data cluster. In this way, the embodiment records the data storage information of two Hadoop clusters in one Hive metadata database. When data is migrated between the clusters, the metadata corresponding to the migrated data in the Hive metadata database is modified and the storage information is updated, so that the Hadoop cluster storing cold data and the Hadoop cluster storing hot data can be queried together when a Hive client queries cold and hot data.
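The path modification of S102 can be illustrated with a small sketch that rewrites a partition location from the hot cluster to the cold cluster. The NameNode addresses, the table name, and both function names are hypothetical. The generated statement uses Hive's `ALTER TABLE ... PARTITION ... SET LOCATION` DDL as one possible way to update the metadata; the description does not say which mechanism is actually used.

```python
from urllib.parse import urlsplit, urlunsplit

HOT_AUTHORITY = "hot-nn:8020"    # hypothetical NameNode address of the hot-data cluster
COLD_AUTHORITY = "cold-nn:8020"  # hypothetical NameNode address of the cold-data cluster


def point_to_cold_cluster(location):
    """Rewrite an hdfs:// location on the hot cluster so it points to the cold cluster."""
    parts = urlsplit(location)
    if parts.scheme != "hdfs" or parts.netloc != HOT_AUTHORITY:
        raise ValueError(f"not a hot-cluster location: {location}")
    # Only the authority changes; the path inside the file system stays the same.
    return urlunsplit(parts._replace(netloc=COLD_AUTHORITY))


def set_location_statement(table, partition_spec, new_location):
    """Build a Hive DDL statement that updates the recorded partition location."""
    return (
        f"ALTER TABLE {table} PARTITION ({partition_spec}) "
        f"SET LOCATION '{new_location}'"
    )


old_location = "hdfs://hot-nn:8020/warehouse/logs/dt=2020-01-01"
new_location = point_to_cold_cluster(old_location)
print(set_location_statement("logs", "dt='2020-01-01'", new_location))
```

Keeping the same path on both clusters is a simplifying assumption; any mapping works as long as the metadata ends up pointing at the place where the cold data was actually stored.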
The application further provides a data query method under a Hadoop storage architecture, which can be applied to a Hive client, wherein the Hadoop storage architecture comprises a preconfigured Hadoop cluster for storing hot data and a preconfigured Hadoop cluster for storing cold data. FIG. 2 is a flowchart of the data query method according to an exemplary embodiment of the application; as shown in FIG. 2, the method includes the following steps:
S201, parsing the acquired query command to obtain the queried data, and analyzing the queried data;
S202, determining a target Hadoop cluster for executing the query according to the analysis result, generating a query task and submitting the query task to the target Hadoop cluster;
S203, obtaining the query result from the target Hadoop cluster.
The query command may be an SQL query statement input by a user; parsing the SQL query statement yields the data that the user wants to query. So that the query runs as close as possible to the Hadoop cluster storing the queried data, avoiding large data copies between clusters and improving query efficiency, the queried data is analyzed. For example, the storage path of the queried data is looked up in the Hive metadata database to determine whether the queried data is hot data or cold data, thereby determining the target Hadoop cluster that executes the query; a query task is then generated and submitted to the target Hadoop cluster for execution. However, choosing whether to submit a query task to the Hadoop cluster storing hot data or to the Hadoop cluster storing cold data cannot be realized with existing Hive. Existing Hive loads, from its preconfiguration, the information of the cluster to which generated query tasks are submitted, and does not support changing that selection dynamically during execution. In other words, the Hadoop cluster to which existing Hive submits query tasks is preconfigured and cannot be modified, so existing Hive only supports submitting query tasks to a single Hadoop cluster and cannot choose the submission target.
To address this, the present application modifies the source code of the Hive client to deeply customize it, so that the Hive client supports dynamically changing the submission target when submitting a query task. After analyzing the storage path of the queried data and determining whether the queried data is hot data or cold data, the Hive client can submit the generated query task to the corresponding Hadoop cluster.
When the queried data contains only cold data or only hot data, once the storage path of the queried data is found to point to the Hadoop cluster storing cold data or to the Hadoop cluster storing hot data, the Hadoop cluster that executes the query can be determined directly; a query task is generated and submitted to that Hadoop cluster, which ensures query efficiency. In one embodiment, a Hadoop cluster is divided into a plurality of data partitions for storing data. Hive parses the acquired query command to obtain the queried data and then analyzes it: the storage paths of the queried data are acquired to determine the data partitions that the query needs to scan and the Hadoop cluster to which each of them belongs, and the sizes of all data partitions to be scanned are acquired from the Hive metadata. Three sums are then computed: the total size of all data partitions to be scanned, the total size of those belonging to the Hadoop cluster storing cold data, and the total size of those belonging to the Hadoop cluster storing hot data. From these, the Hadoop cluster holding the largest proportion of the data to be scanned is identified and determined as the target Hadoop cluster, and a query task is generated and submitted to it. Because the target Hadoop cluster holds the largest proportion of the data to be scanned, it can obtain most of that data locally and only needs to copy a small amount from the non-local cluster, which avoids the low query efficiency caused by lengthy data copying.
For example, when the queried data includes both cold data and hot data and hot data is determined to make up the larger proportion of the data to be scanned, the Hadoop cluster storing hot data is determined as the target Hadoop cluster, and a query task is generated and submitted to it. The Hadoop cluster storing hot data can then read the hot data, which accounts for the larger proportion of the data to be scanned, locally, and obtain the cold data, which accounts for the smaller proportion, from the Hadoop cluster storing cold data. It performs the scan and, after the scan is finished, feeds the query result back to the Hive client.
When the queried data includes both hot data and cold data, the target Hadoop cluster that executes the query is either the Hadoop cluster storing only hot data or the Hadoop cluster storing only cold data, so the target Hadoop cluster must read data of the non-local Hadoop cluster to complete the query, and data inevitably has to be copied from the non-local Hadoop cluster. Reading the non-local Hadoop cluster remotely through the default Hdfs protocol may be too slow. Big data queries also tend to query the same data many times, so the same data may be scanned repeatedly by multiple queries; scanning it remotely each time consumes more network resources and time and hinders efficient cross-cluster data loading. By adopting the Alluxio protocol, this embodiment therefore lets the target Hadoop cluster scan the local cache layer first when it queries data of the non-local Hadoop cluster. If the cache layer already holds the data that the query needs to scan, the data does not need to be fetched from the non-local Hadoop cluster. If the cache layer does not hold that data, the data is fetched from the non-local Hadoop cluster, scanned, and stored in the cache layer, so that it can be read directly from the cache layer the next time it is needed.
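The size-based choice of the target cluster can be sketched in a few lines. The function name and the input shape (pairs of cluster label and partition size in bytes) are hypothetical; in the described system these values would come from the Hive metadata. How a tie is resolved is not specified in the description, so this sketch simply keeps the cluster that was seen first.

```python
from collections import defaultdict


def choose_target_cluster(partitions_to_scan):
    """Pick the cluster that holds the largest share of the bytes to be scanned.

    partitions_to_scan: iterable of (cluster_label, size_in_bytes) pairs.
    Returns (target_cluster, total_bytes_per_cluster).
    """
    totals = defaultdict(int)
    for cluster, size in partitions_to_scan:
        totals[cluster] += size
    if not totals:
        raise ValueError("the query scans no partitions")
    # max() keeps the first cluster encountered on a tie (an assumption of this sketch).
    target = max(totals, key=totals.get)
    return target, dict(totals)


# Hypothetical scan set: two hot partitions and one cold partition.
target, totals = choose_target_cluster([("hot", 300), ("cold", 100), ("hot", 50)])
print(target, totals)
```

With 350 bytes on the hot cluster and 100 on the cold cluster, the query would run on the hot cluster and only the cold partition would have to be read remotely.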
Therefore, for some data which needs to be acquired from the non-local Hadoop cluster and can be repeatedly scanned in multiple queries, the data can be stored in the cache layer when being acquired for the first time, so that the situation that the data need to be loaded from the non-local Hadoop cluster to the local Hadoop cluster again when the data need to be scanned each time is avoided, and the data loading efficiency between clusters is greatly improved.
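The cache-first read described above follows the general read-through caching pattern, sketched below. This is an in-memory stand-in for illustration only: it does not use the Alluxio API, and the class name, the `fetch_remote` callable, and the example path are all hypothetical.

```python
class ReadThroughCache:
    """Serve reads from a local cache and fall back to the remote cluster on a miss."""

    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote  # callable that reads from the non-local cluster
        self._cache = {}
        self.remote_reads = 0  # counts how often the remote cluster was actually contacted

    def read(self, path):
        if path not in self._cache:
            # Cache miss: fetch once from the non-local cluster and keep a local copy.
            self.remote_reads += 1
            self._cache[path] = self._fetch_remote(path)
        return self._cache[path]


def fake_remote_read(path):
    """Hypothetical stand-in for a slow cross-cluster read."""
    return f"contents of {path}"


cache = ReadThroughCache(fake_remote_read)
first = cache.read("/warehouse/logs/dt=2020-01-01")
second = cache.read("/warehouse/logs/dt=2020-01-01")
print(cache.remote_reads)
```

The second read of the same path is served locally, which is the effect the description attributes to the cache layer when the same data is scanned by multiple queries. Eviction and invalidation are left out of this sketch.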
The present application also provides a method for submitting a query task to a Hadoop cluster, which may be applied to a Hive client. FIG. 3 is a flowchart of the method for submitting a query task to a Hadoop cluster; as shown in FIG. 3, the method includes the following steps:
S301, acquiring an SQL query statement.
S302, parsing the SQL to obtain the storage path of the queried data.
S303, judging whether the queried data contains only cold data or only hot data; if so, executing S306; otherwise, executing S304.
S304, counting the proportions of cold data and hot data that the query needs to scan, and determining the target Hadoop cluster that executes the query.
S305, replacing the Hdfs protocol of the target Hadoop cluster with the Alluxio protocol.
S306, constructing a query task.
S307, judging whether the Hadoop cluster that executes the query is the Hadoop cluster storing hot data or the Hadoop cluster storing cold data; if it is the Hadoop cluster storing hot data, executing S309; if it is the Hadoop cluster storing cold data, executing S308.
S308, dynamically modifying the preconfigured Hadoop cluster information for query task submission, so that the target Hadoop cluster to which the query task is submitted becomes the Hadoop cluster storing cold data.
S309, submitting the query task to the target Hadoop cluster.
In the method for submitting a query task to a Hadoop cluster shown in this embodiment, the target Hadoop cluster for query task submission is preconfigured as the Hadoop cluster storing hot data. After the SQL query statement input by the user is acquired, the SQL is parsed to obtain the storage path of the queried data, thereby determining what kind of data is queried:
when the queried data is determined to contain only hot data, a query task is constructed, the query is determined to be executed by the Hadoop cluster storing hot data, and the query task is submitted to the Hadoop cluster storing hot data;
when the queried data is determined to contain only cold data, a query task is constructed, the query is determined to be executed by the Hadoop cluster storing cold data, and the preconfigured Hadoop cluster information for query task submission is modified, so that the query task is submitted to the Hadoop cluster storing cold data instead of the preconfigured Hadoop cluster storing hot data;
when the queried data is determined to include both hot data and cold data, the proportions of cold data and hot data that the query needs to scan are first counted to determine the target Hadoop cluster that executes the query, the Hdfs protocol of the target Hadoop cluster is replaced with the Alluxio protocol, and a query task is constructed. When the query is determined to be executed by the Hadoop cluster storing hot data, the query task is submitted to the Hadoop cluster storing hot data; when the query is determined to be executed by the Hadoop cluster storing cold data, the preconfigured Hadoop cluster information for query task submission is modified so that the target Hadoop cluster becomes the Hadoop cluster storing cold data, and the query task is submitted to the Hadoop cluster storing cold data.
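The three cases above can be condensed into one planning function. This is a sketch under stated assumptions rather than the modified Hive client itself: `QueryPlan`, `plan_query`, and the cluster labels are hypothetical names, the input is assumed to contain only the labels "hot" and "cold", and a tie between the two proportions is resolved in favour of the preconfigured hot cluster, which the description does not specify.

```python
from dataclasses import dataclass

HOT, COLD = "hot", "cold"
DEFAULT_SUBMIT_CLUSTER = HOT  # the description preconfigures the hot-data cluster


@dataclass
class QueryPlan:
    target: str                    # cluster that executes the query
    scheme: str                    # protocol used to read the data
    override_submit_cluster: bool  # True when the preconfigured target must be changed


def plan_query(partitions):
    """partitions: list of (cluster_label, size_in_bytes) the query needs to scan."""
    clusters = {cluster for cluster, _ in partitions}
    if not clusters:
        raise ValueError("the query scans no partitions")
    if len(clusters) == 1:
        # Only hot or only cold data: no proportion counting, default Hdfs protocol.
        target, scheme = next(iter(clusters)), "hdfs"
    else:
        hot_bytes = sum(size for cluster, size in partitions if cluster == HOT)
        cold_bytes = sum(size for cluster, size in partitions if cluster == COLD)
        # Tie goes to the hot cluster (an assumption of this sketch).
        target = HOT if hot_bytes >= cold_bytes else COLD
        scheme = "alluxio"  # mixed data: reads go through the cache layer
    return QueryPlan(target, scheme, target != DEFAULT_SUBMIT_CLUSTER)


print(plan_query([(HOT, 10)]))
print(plan_query([(COLD, 10)]))
print(plan_query([(HOT, 10), (COLD, 90)]))
```

The `override_submit_cluster` flag corresponds to the step in which the preconfigured submission information is modified; it is only set when the query ends up on the cold-data cluster.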
The present application further provides a system 4 based on a Hadoop storage architecture, as shown in fig. 4, fig. 4 is a schematic diagram of a system based on a Hadoop storage architecture according to an exemplary embodiment of the present application, where the system includes:
the Hadoop storage architecture 401 includes a Hadoop cluster 4011 that stores hot data and a Hadoop cluster 4012 that stores cold data;
the data configuration end 402 is configured to separate cold data from the Hadoop cluster 4011 storing hot data according to a set separation granularity, and store the separated cold data to the Hadoop cluster 4012 storing cold data; modifying the storage path of the separated cold data to point to a Hadoop cluster for storing the cold data;
the Hive client 403 is configured to parse the obtained query command to obtain the queried data, and analyze the queried data; determine a target Hadoop cluster for executing the query according to the analysis result, generate a query task and submit the query task to the target Hadoop cluster; and obtain the query result from the target Hadoop cluster.
In one embodiment, the system further comprises a Hive metadata base for recording the storage path of cold or hot data stored in the Hadoop cluster by metadata.
FIG. 5 is a schematic diagram of a system based on a Hadoop storage architecture according to an exemplary embodiment of the present application. As shown in FIG. 5, the system includes a cold-hot data configuration center 501, a dispatch center 502, a hot data room 503, a cold data room 504, a Hive metadata database 505, and a Hive client 506. The hot data room 503 further includes a hot data Yarn cluster 5031, an Alluxio 5032, and a hot data Hdfs cluster 5033, and the cold data room 504 further includes a cold data Yarn cluster 5041, an Alluxio 5042, and a cold data Hdfs cluster 5043.
Wherein, an administrator can perform cold data configuration through the cold and hot data configuration center 501, generate a task of configuring cold data in the cold and hot data configuration center 501 and submit the task to the dispatching center 502, the dispatching center 502 copies the designated cold data from the hot data Hdfs cluster 5033 in the hot data room 503 according to the configuration task and stores the cold data in the cold data Hdfs cluster 5043 in the cold data room 504, and correspondingly deletes the copied cold data in the hot data Hdfs cluster 5033 in the hot data room 503, and after the copying is completed, modifies the partition path of the metadata corresponding to the copied cold data in the Hive metadata database 505, where the partition path indicates the storage location of the cold data.
A user inputs an SQL query statement at the Hive client 506. The Hive client 506 parses the SQL, obtains the storage paths of the queried data from the Hive metadata database 505 to determine the data that needs to be scanned, and calculates whether that data is mostly stored in the hot data Hdfs cluster 5033 or in the cold data Hdfs cluster 5043. If the data to be scanned is stored entirely in the hot data Hdfs cluster 5033 or entirely in the cold data Hdfs cluster 5043, a query task is generated and submitted to the hot data Yarn cluster 5031 or the cold data Yarn cluster 5041 respectively, and that Yarn cluster obtains the data to be scanned from the hot data Hdfs cluster 5033 or the cold data Hdfs cluster 5043 respectively.
If the data to be scanned includes data stored in both the hot data Hdfs cluster 5033 and the cold data Hdfs cluster 5043, a query task is generated and submitted to the Yarn cluster corresponding to the Hdfs cluster that holds the larger proportion of the data. Meanwhile, the Yarn cluster of either machine room can use Alluxio to obtain data from the Hdfs cluster of the non-local machine room and cache it in the cache layer. When a Yarn cluster needs scan data from the other machine room, it can first check whether the target data is already cached in the cache layer, which avoids repeated copying.
In the whole process, the user only needs to input the SQL and wait for the Hive client 506 to output the query result. No other operation, such as preheating the cold data, is needed, and all the query steps are carried out automatically by the system. The system thus provides high-performance hybrid query of cold and hot data while remaining almost completely transparent to the user.
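The routing decision described above can be sketched as follows. This is a hedged illustration under stated assumptions: the function names `choose_target` and `to_alluxio`, the NameNode and Alluxio addresses, and the mapping from NameNode address to cluster are all hypothetical. Rewriting an `hdfs://` path to an `alluxio://` path with the same path component also assumes that the Alluxio namespace mirrors the Hdfs layout, which depends on how the under-storage is mounted in a real deployment.

```python
from collections import defaultdict
from urllib.parse import urlparse


def choose_target(partitions, authority_to_cluster):
    """Pick the cluster holding the largest share of the bytes to scan.

    partitions: iterable of (storage_path, size_in_bytes) pairs taken
        from the partition metadata.
    authority_to_cluster: maps a NameNode "host:port" to a cluster name.
    Returns (target_cluster, mixed), where mixed is True when the scan
    touches more than one cluster.
    """
    totals = defaultdict(int)
    for path, size in partitions:
        cluster = authority_to_cluster[urlparse(path).netloc]
        totals[cluster] += size
    target = max(totals, key=totals.get)
    return target, len(totals) > 1


def to_alluxio(path, alluxio_authority):
    """Replace the hdfs:// scheme and authority with an alluxio:// one."""
    return f"alluxio://{alluxio_authority}{urlparse(path).path}"


if __name__ == "__main__":
    clusters = {"hot-nn:8020": "hot", "cold-nn:8020": "cold"}
    scan = [
        ("hdfs://hot-nn:8020/warehouse/logs/dt=2020-03-01", 300),
        ("hdfs://cold-nn:8020/warehouse/logs/dt=2019-01-01", 100),
    ]
    target, mixed = choose_target(scan, clusters)
    print(target, mixed)
    if mixed:
        # Route reads through the cache layer of the target machine room.
        print([to_alluxio(p, "alluxio-hot:19998") for p, _ in scan])
```

When the scan is not mixed, the paths can be left untouched and the task simply runs on the Yarn cluster next to the data.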
The present application further provides a data storage device based on a Hadoop storage architecture, where the Hadoop storage architecture includes a preconfigured Hadoop cluster for storing hot data and a Hadoop cluster for storing cold data, as shown in fig. 6, fig. 6 is a schematic structural diagram of a data storage device 600 according to an exemplary embodiment of the present application, where the data storage device includes:
the data separation module 601 is configured to separate cold data from the Hadoop cluster storing hot data according to a set separation granularity, and store the separated cold data to the Hadoop cluster storing cold data;
the path modification module 602 is configured to modify the storage path of the separated cold data to point to the Hadoop cluster storing the cold data.
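When the separation granularity is a date partition, the choice of which partitions the data separation module treats as cold can be illustrated with a small sketch. The retention window `hot_days` and the function name `cold_partitions` are assumptions made for illustration; the application only states that the data of specified date partitions is separated as cold data.

```python
from datetime import date, timedelta


def cold_partitions(partition_dates, today, hot_days):
    """Return the date partitions older than the hot retention window.

    A partition dated exactly `hot_days` days before `today` is still
    treated as hot in this sketch.
    """
    cutoff = today - timedelta(days=hot_days)
    return sorted(d for d in partition_dates if d < cutoff)


if __name__ == "__main__":
    partitions = [
        date(2020, 3, 29),
        date(2020, 2, 29),
        date(2020, 2, 28),
        date(2019, 12, 31),
    ]
    for d in cold_partitions(partitions, today=date(2020, 3, 30), hot_days=30):
        print(f"dt={d.isoformat()}")
```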
The present application further provides a data query device based on a Hadoop storage architecture, where the Hadoop storage architecture includes a preconfigured Hadoop cluster for storing hot data and a Hadoop cluster for storing cold data. As shown in fig. 7, fig. 7 is a schematic structural diagram of a data query device 700 according to an exemplary embodiment of the present application, where the data query device includes:
the parsing module 701 is configured to parse the obtained query command to obtain queried data, and analyze the queried data;
the task submitting module 702 is configured to determine a target Hadoop cluster for executing the query according to the analysis result, generate a query task, and submit the query task to the target Hadoop cluster;
and the query task is used for acquiring a query result from the target Hadoop cluster.
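While the query task runs on the target cluster, data that lives in the non-local cluster is read through a cache layer that is checked first. The behaviour can be modelled with a small in-memory class. This is only a model of the check-then-fetch logic, not the Alluxio API; `CacheLayer`, `fetch_remote`, and the paths are hypothetical names.

```python
class CacheLayer:
    """In-memory model of a cache in front of a non-local cluster."""

    def __init__(self, fetch_remote):
        # fetch_remote(path) stands in for a cross-machine-room read.
        self._fetch_remote = fetch_remote
        self._blocks = {}
        self.remote_reads = 0

    def read(self, path):
        """Return the data for path, fetching remotely only on a miss."""
        if path not in self._blocks:
            self._blocks[path] = self._fetch_remote(path)
            self.remote_reads += 1
        return self._blocks[path]


if __name__ == "__main__":
    def slow_cross_room_read(path):
        return f"contents of {path}".encode()

    cache = CacheLayer(slow_cross_room_read)
    path = "hdfs://cold-nn:8020/warehouse/logs/dt=2019-01-01/part-0"
    cache.read(path)
    cache.read(path)  # served from the cache layer, no second remote read
    print(cache.remote_reads)
```

A real cache layer would also bound its size and evict entries; that is left out to keep the check-then-fetch step visible.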
Embodiments of the data storage device and the data query device of the present application may be applied to an electronic device. The device embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, the device in a logical sense is formed by the processor of the electronic device where it is located reading the corresponding computer program instructions from the nonvolatile memory into the memory and running them. In terms of hardware, fig. 8 shows a hardware structure diagram of the electronic device where the data storage device and the data query device are located. In addition to the processor, the memory, the network interface, and the nonvolatile memory shown in fig. 8, the electronic device in the embodiment generally includes other hardware according to its actual function, which is not described herein again.
Wherein the non-volatile memory is for storing the processor-executable instructions, the processor being configured to execute the instructions to implement the method of any of the above embodiments.
The present application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method of data storage and data querying of any of the embodiments described above.
The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
Since the device embodiments essentially correspond to the method embodiments, reference is made to the description of the method embodiments for the relevant points. The device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present application. Those of ordinary skill in the art can understand and implement the present application without creative effort.
The foregoing description of the preferred embodiments of the present invention is not intended to limit the invention to the precise form disclosed, and any modifications, equivalents, improvements and alternatives falling within the spirit and principles of the present invention are intended to be included within the scope of the present invention.

Claims (11)

1. The data storage method under the Hadoop storage architecture is characterized in that the Hadoop storage architecture comprises a pre-configured Hadoop cluster for storing hot data and a Hadoop cluster for storing cold data; the storage paths of the cold data and the hot data are recorded in metadata of a Hive metadata base; the data storage method comprises the following steps:
separating cold data from the Hadoop clusters storing the hot data according to the set separation granularity, and storing the separated cold data into the Hadoop clusters storing the cold data;
and modifying the storage path indicated by the metadata corresponding to the separated cold data into a Hadoop cluster for storing the cold data.
2. The data storage method according to claim 1, wherein the separation granularity is set according to a date, and the Hadoop cluster for storing the hot data comprises a plurality of date partitions;
the separating cold data from the Hadoop cluster storing hot data includes:
the data for the specified date partition is separated as cold data.
3. A data query method under Hadoop storage architecture, for querying data stored by the method of claim 1, the data query method comprising:
analyzing the acquired query command to obtain queried data, and analyzing the queried data;
determining a target Hadoop cluster for executing the query according to the analysis result, generating a query task and submitting the query task to the target Hadoop cluster;
and obtaining the query result from the target Hadoop cluster.
4. A method of querying data as in claim 3, wherein the Hadoop cluster comprises a plurality of data partitions storing data, and wherein analyzing the queried data comprises:
acquiring a storage path of the queried data;
determining the data partitions to be scanned and the Hadoop cluster to which each data partition belongs, and acquiring the sizes of all the data partitions to be scanned; determining the Hadoop cluster that accounts for the highest proportion of the data to be scanned;
the determining the target Hadoop cluster for executing the query comprises the following steps:
and determining the Hadoop cluster with the highest proportion as the target Hadoop cluster.
5. The data query method according to claim 4, wherein a cache layer is provided on the target Hadoop cluster, and is used for storing data acquired by the target Hadoop cluster from a non-local Hadoop cluster;
and the query task indicates the target Hadoop cluster to firstly scan the cache layer when querying the data of the non-local Hadoop cluster, and acquires the data required to be scanned for query from the non-local Hadoop cluster and stores the data in the cache layer when the data required to be scanned for query does not exist in the cache layer.
6. The data query method of claim 5, further comprising:
when the data partition required to be scanned by the query is determined to simultaneously comprise the data partition belonging to the Hadoop cluster for storing hot data and the data partition belonging to the Hadoop cluster for storing cold data, the Hdfs protocol of the determined target Hadoop cluster is replaced by an alluxio protocol.
7. A system based on Hadoop storage architecture, the system comprising:
the Hadoop storage architecture comprises a Hadoop cluster for storing hot data and a Hadoop cluster for storing cold data;
the Hive metadata base is used for recording the storage paths of the cold data or the hot data stored in the Hadoop cluster through metadata;
the data configuration end is used for separating cold data from the Hadoop cluster for storing the hot data according to the set separation granularity and storing the separated cold data into the Hadoop cluster for storing the cold data; modifying the storage path indicated by the metadata corresponding to the separated cold data to point to the Hadoop cluster for storing the cold data;
the Hive client is used for analyzing the acquired query command to obtain queried data and analyzing the queried data; determining a target Hadoop cluster for executing the query according to the analysis result, generating a query task and submitting the query task to the target Hadoop cluster; and obtaining the query result from the target Hadoop cluster.
8. The data storage device based on the Hadoop storage architecture is characterized in that the Hadoop storage architecture comprises a pre-configured Hadoop cluster for storing hot data and a Hadoop cluster for storing cold data; the storage paths of the cold data and the hot data are recorded in metadata of a Hive metadata base; the data storage device includes:
the data separation module is used for separating cold data from the Hadoop clusters storing the hot data according to the set separation granularity and storing the separated cold data into the Hadoop clusters storing the cold data;
and the path modification module is used for modifying the storage path indicated by the metadata corresponding to the separated cold data into a Hadoop cluster for storing the cold data.
9. A data querying device in a Hadoop storage architecture for querying data stored by the device of claim 8, the data querying device comprising:
the analysis module is used for analyzing the acquired query command to obtain queried data and analyzing the queried data;
the task submitting module is used for determining a target Hadoop cluster for executing the query according to the analysis result, generating a query task and submitting the query task to the target Hadoop cluster;
and the query task is used for acquiring a query result from the target Hadoop cluster.
10. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of any of claims 1-6.
11. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method of any of claims 1-6.
CN202010238641.6A 2020-03-30 2020-03-30 Method, device, system, equipment and storage medium for data storage and query Active CN111475506B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010238641.6A CN111475506B (en) 2020-03-30 2020-03-30 Method, device, system, equipment and storage medium for data storage and query

Publications (2)

Publication Number Publication Date
CN111475506A CN111475506A (en) 2020-07-31
CN111475506B true CN111475506B (en) 2024-03-01

Family

ID=71750509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010238641.6A Active CN111475506B (en) 2020-03-30 2020-03-30 Method, device, system, equipment and storage medium for data storage and query

Country Status (1)

Country Link
CN (1) CN111475506B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112463755B (en) * 2020-12-11 2023-08-18 同济大学 System and method for storing and reading big data of heterogeneous Internet of things based on HDFS
CN112650453A (en) * 2020-12-31 2021-04-13 北京千方科技股份有限公司 Method and system for storing and inquiring traffic data
CN113032430B (en) * 2021-03-25 2023-12-19 杭州网易数之帆科技有限公司 Data processing method, device, medium and computing equipment
CN114003180A (en) * 2021-11-11 2022-02-01 中国建设银行股份有限公司 Data processing method and device based on cross-machine-room Hadoop cluster
CN116821138B (en) * 2023-08-24 2023-12-15 腾讯科技(深圳)有限公司 Data processing method and related equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033360A (en) * 2018-07-26 2018-12-18 腾讯科技(深圳)有限公司 A kind of data query method, apparatus, server and storage medium
CN109218366A (en) * 2017-07-04 2019-01-15 北京航天长峰科技工业集团有限公司 Monitor video temperature cloud storage method based on k mean value
CN109726191A (en) * 2018-12-12 2019-05-07 中国联合网络通信集团有限公司 A kind of processing method and system across company-data, storage medium
CN109815219A (en) * 2019-02-18 2019-05-28 国家计算机网络与信息安全管理中心 Support the implementation method of the Data lifecycle management of multiple database engine

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11397744B2 (en) * 2018-07-19 2022-07-26 Bank Of Montreal Systems and methods for data storage and processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant