CN110134738B - Distributed storage system resource estimation method and device

Info

Publication number: CN110134738B
Application number: CN201910425874.4A
Authority: CN
Inventors: 穆纯进; 尹正军; 马骁; 王项男
Original assignee: China United Network Communications Group Co Ltd; Unicom Big Data Co Ltd
Current assignee: China United Network Communications Group Co Ltd; Unicom Big Data Co Ltd
Priority date: 2019-05-21
Filing date: 2019-05-21
Publication date: 2021-09-10
Anticipated expiration: 2039-05-21
Also published as: CN110134738A

Abstract

The application discloses a resource estimation method and device for a distributed storage system. The method comprises the following steps: receiving a resource occupation query request for each cluster in the distributed storage system; acquiring metadata of each cluster in the distributed storage system according to the resource occupation query request; acquiring the currently occupied resource parameters of each cluster according to the metadata of each cluster, wherein the currently occupied resource parameters comprise the number of data files, the data volume, the data block size, the number of data blocks, and the memory required for processing each task; and calculating the storage resource occupation parameters of the distributed storage system according to the currently occupied resource parameters of each cluster. Resource estimation for the distributed storage system is thus completed from the resource parameters currently occupied by each cluster.

Description

Distributed storage system resource estimation method and device

Technical Field

The application belongs to the field of data processing, and particularly relates to a distributed storage system resource pre-estimation method and device.

Background

In the big data era, enterprises generate more and more data, big data clusters keep growing in scale, and enterprise costs rise sharply. If the resources that a job will occupy can be estimated in advance when large-scale jobs are submitted, the estimate is of great help in optimizing the job, which reduces the resource consumption of the clusters, safeguards cluster stability, and lowers enterprise costs.

At present, occupied resources are estimated either by building a test environment for testing or by running the job directly online on a trial basis. Both approaches increase enterprise costs, and online trial operation adversely affects the online cluster.

Disclosure of Invention

The present application provides a resource estimation method and device for a distributed storage system, aimed at the problem that the existing approaches to estimating occupied resources, namely testing in a purpose-built test environment or running the job directly online on a trial basis, increase enterprise costs and adversely affect the online cluster.

The application provides a distributed storage system resource pre-estimation method, which comprises the following steps:

receiving a resource occupation query request aiming at each cluster in the distributed storage system;

acquiring metadata of each cluster in the distributed storage system according to the resource occupation query request;

acquiring currently occupied resource parameters of each cluster according to the metadata of each cluster, wherein the currently occupied resource parameters comprise the number of data files, the data volume, the size of data blocks, the number of data blocks and the memory required for processing each task;

and calculating the storage resource occupation parameters of the distributed storage system according to the currently occupied resource parameters of each cluster.

Optionally, the step of obtaining metadata of each cluster in the distributed storage system according to the resource occupation query request includes:

collecting a binary format metadata file stored in the distributed storage system in a preset period, and converting the metadata file into a text format;

extracting metadata from the metadata file in the text format, wherein the metadata comprises at least one or any combination of the following items: file system directory name, file system directory, access user, user group, authority, file path, file modification time, and file access time.

Optionally, the step of obtaining the resource parameter currently occupied by each cluster according to the metadata of each cluster includes:

analyzing the identifier of the data to be queried from the resource occupation query request, and acquiring the metadata of the data to be queried according to the identifier of the data to be queried;

determining a target cluster to which the data to be queried belongs according to the metadata of the data to be queried;

sending the resource occupation query request to the target cluster;

and receiving the currently occupied resource parameters sent by the target cluster.

Optionally, after the step of determining, according to the metadata of the data to be queried, a target cluster to which the data to be queried belongs, and before the step of sending the resource occupation query request to the target cluster, the method further includes:

judging whether the target cluster is a heterogeneous cluster, and if so, rewriting the resource occupation query request;

the sending the resource occupation query request to the target cluster includes: and sending the rewritten resource occupation query request to the target cluster.

Optionally, the step of calculating the storage resource occupation parameters of the distributed storage system according to the resource parameters currently occupied by each cluster includes:

determining the maximum value of the number of data files and the number of data blocks for each cluster, calculating the sum of the maximum value and the data amount, and calculating the ratio of the sum of the maximum value and the data amount to the size of the data blocks to obtain the task number of the cluster;

determining the task number of the distributed storage system according to the task number of each cluster;

aiming at each cluster, calculating the product of the number of tasks and the memory required for processing each task to obtain the memory occupation amount of the cluster;

and determining the memory occupation amount of the distributed storage system according to the memory occupation amount of each cluster.

The present application further provides a device for pre-estimating resources of a distributed storage system, including:

a receiving module, configured to receive a resource occupation query request for each cluster in the distributed storage system;

the first acquisition module is used for acquiring the metadata of each cluster in the distributed storage system according to the resource occupation query request;

a second obtaining module, configured to obtain, according to the metadata of each cluster, a currently occupied resource parameter of each cluster, where the currently occupied resource parameter includes a number of data files, a data amount, a data block size, a data block number, and a memory required for processing each task;

and the calculation module is used for calculating the storage resource occupation parameters of the distributed storage system according to the currently occupied resource parameters of each cluster.

Optionally, the first obtaining module includes:

the acquisition submodule is used for acquiring the metadata file in the binary format stored in the distributed storage system in a preset period and converting the metadata file into a text format;

an extraction sub-module, configured to extract metadata from the metadata file in the text format, where the metadata includes at least one of the following items or any combination of the following items: file system directory name, file system directory, access user, user group, authority, file path, file modification time, and file access time.

Optionally, the second obtaining module includes:

the acquisition sub-module is used for analyzing the identifier of the data to be queried from the resource occupation query request and acquiring the metadata of the data to be queried according to the identifier of the data to be queried;

the determining submodule is used for determining a target cluster to which the data to be queried belongs according to the metadata of the data to be queried;

the sending sub-module is used for sending the resource occupation query request to the target cluster;

and the receiving submodule is used for receiving the currently occupied resource parameters sent by the target cluster.

Optionally, the second obtaining module further includes:

the judging module is used for judging whether the target cluster is a heterogeneous cluster or not, and if so, rewriting the resource occupation inquiry request;

the sending submodule is specifically configured to: and sending the rewritten resource occupation query request to the target cluster.

Optionally, the calculation module includes:

the first calculation submodule is used for determining the maximum value of the number of data files and the number of data blocks aiming at each cluster, calculating the sum of the maximum value and the data volume, and calculating the ratio of the sum of the maximum value and the data volume to the size of the data blocks to obtain the task number of the cluster;

the second computing submodule is used for determining the task quantity of the distributed storage system according to the task quantity of each cluster;

the third calculation submodule is used for calculating the product of the number of tasks and the memory required by processing each task aiming at each cluster to obtain the memory occupation amount of the cluster;

and the fourth calculation submodule is used for determining the memory occupation amount of the distributed storage system according to the memory occupation amount of each cluster.

The resource estimation method for the distributed storage system builds a full-dimensional profile of the distributed storage system by extracting the metadata of each cluster, parses the submitted resource occupation query request, looks up the currently occupied resource parameters of each cluster according to the metadata, and then estimates the total number of tasks to be run and the total memory occupation amount from those parameters, thereby completing resource estimation for the distributed storage system.

Drawings

Fig. 1 is a flowchart of a resource estimation method for a distributed storage system according to a first embodiment of the present application;

FIG. 2 is an alternative implementation of step S2 in FIG. 1 according to the first embodiment of the present application;

FIG. 3 is an alternative implementation of step S3 in FIG. 1 according to the first embodiment of the present application;

FIG. 4 is an alternative implementation of step S4 in FIG. 1 according to the first embodiment of the present application;

FIG. 5 is a schematic structural diagram of a resource estimation device for a distributed storage system according to a second embodiment of the present application;

FIG. 6 is another schematic structural diagram of a resource estimation device for a distributed storage system according to the second embodiment of the present application;

FIG. 7 is another schematic structural diagram of a resource estimation device for a distributed storage system according to the second embodiment of the present application;

FIG. 8 is another schematic structural diagram of a resource estimation device for a distributed storage system according to the second embodiment of the present application.

Detailed Description

In order to make the technical solutions of the present invention better understood, the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.

The application provides a distributed storage system resource pre-estimation method and device. The following detailed description is made with reference to the drawings of the embodiments provided in the present application, respectively.

A resource estimation method for a distributed storage system according to a first embodiment of the present application is as follows:

FIG. 1 shows a resource estimation method for a distributed storage system according to an embodiment of the present application, which includes the following steps.

Step S1, a resource occupation query request for each cluster in the distributed storage system is received.

In this step, a resource occupation query request, that is, an SQL query request, is received for each cluster in the distributed storage system. SQL is an abbreviation of Structured Query Language. It is a database query and programming language used to access data and to query, update, and manage relational database systems; .sql is also the file extension of database script files.

And step S2, acquiring the metadata of each cluster in the distributed storage system according to the resource occupation query request.

Metadata, also called intermediary data or relay data, is data about data. It mainly describes the attributes (properties) of data and supports functions such as indicating storage location, recording history, resource search, and file recording. Metadata acts as an electronic catalog: to build the catalog, the contents or features of the data must be described and collected, which in turn assists data retrieval. Metadata also covers the organization of data, data fields, and their relationships.

Preferably, as shown in fig. 2, the step S2, obtaining metadata of each cluster in the distributed storage system according to the resource occupation query request includes:

step S201, collecting the metadata file in binary format stored in the distributed storage system in a preset period, and converting the metadata file into a text format.

The distributed storage system stores the detailed information of the directories and files of the whole system in a memory, and in order to prevent the memory data from being lost after downtime, the storage system serializes the data in the memory to a disk in a binary mode at intervals.

In this step, the binary files in the distributed storage system are periodically collected and deserialized into a text format for use in extracting metadata. The preset period is a preset value, and may be specifically set as required, and is not limited herein.

Step S202, extracting metadata from the metadata file in the text format.

In this step, the metadata of each cluster is extracted through distributed computation. The extraction includes steps such as defining custom key-value (KV) pairs, secondary sorting, custom partitioning, custom merging, and custom grouping. The extracted metadata comprises at least one or any combination of the following items: file system directory name, file system directory, access user (the user to which the folder belongs), user group, authority, file path, file modification time, and file access time. Other data may also be included, such as directory capacity (the capacity of the folder), number of directory files (the number of files under the folder), the maximum, minimum, and average file size, and file format.
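The directory-level figures mentioned above (directory capacity, number of directory files, and maximum, minimum, and average file size) can be obtained from the text-format metadata with a simple group-by. The following is a minimal single-process sketch in Python, not the distributed computation described in the embodiment. The tab-delimited layout with the columns path, user, group, authority, size, modification time, and access time is an assumption made for illustration, as are all function and field names.

```python
import posixpath
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DirStats:
    file_count: int = 0   # number of files directly under the directory
    capacity: int = 0     # total size of those files
    max_size: int = 0
    min_size: int = 0

    @property
    def avg_size(self) -> float:
        return self.capacity / self.file_count if self.file_count else 0.0


def aggregate_by_directory(lines):
    """Group file records by parent directory (column layout is assumed)."""
    stats = defaultdict(DirStats)
    for line in lines:
        # Assumed columns: path, user, group, authority, size, mtime, atime
        path, _user, _group, _perm, size, _mtime, _atime = (
            line.rstrip("\n").split("\t")
        )
        size = int(size)
        entry = stats[posixpath.dirname(path)]
        # The first file seen initializes the minimum; later files refine it.
        entry.min_size = size if entry.file_count == 0 else min(entry.min_size, size)
        entry.max_size = max(entry.max_size, size)
        entry.file_count += 1
        entry.capacity += size
    return dict(stats)
```

In a distributed setting the parent directory would play the role of the grouping key, and the per-directory statistics would be combined in the merge step.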

Specifically, an enterprise often has more than one big data cluster; it may have several clusters, or even heterogeneous clusters of different types. When an SQL query is submitted, the computation may need to combine several clusters. Whether a single cluster or several heterogeneous clusters are involved, resources are consumed and resource occupation must be estimated, so the system needs to collect the metadata of each cluster. Each heterogeneous cluster provides an HTTP interface that exposes its own metadata information, and a metadata collector calls this HTTP interface programmatically to acquire the metadata of each heterogeneous cluster. That is, the above steps are performed for each cluster.
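A metadata collector of this kind could be sketched as follows. This is only an illustration under stated assumptions: the source does not specify the URL layout or the response format, so the endpoint URLs, the JSON response body, and the function names are all hypothetical. The fetch function is passed in as a parameter so that the collector can be exercised without a live cluster.

```python
import json
from urllib.request import urlopen


def http_get_json(url, timeout=10):
    """Fetch a URL and decode the body as JSON (standard library only)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def collect_cluster_metadata(endpoints, fetch=http_get_json):
    """Call each cluster's (hypothetical) metadata URL once.

    endpoints maps a cluster name to the URL of its metadata interface.
    Returns a dict mapping each cluster name to the decoded metadata.
    """
    collected = {}
    for cluster_name, url in endpoints.items():
        collected[cluster_name] = fetch(url)
    return collected
```

Running `collect_cluster_metadata` on a schedule would correspond to collecting the metadata at the preset period.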

Step S3, obtaining the resource parameters currently occupied by each cluster according to the metadata of each cluster.

In this step, data required for resource evaluation by each cluster is obtained according to the metadata of each cluster, that is, the resource parameters currently occupied by each cluster include the number of data files, the amount of data, the size of data blocks, the number of data blocks, and the memory required for processing each task. It should be noted here that the currently occupied resource parameter is dynamic data generated during data processing.

Preferably, as shown in fig. 3, the step S3, obtaining the currently occupied resource parameter of each cluster according to the metadata of each cluster, includes:

step S301, analyzing the identifier of the data to be queried from the resource occupation query request, and acquiring the metadata of the data to be queried according to the identifier of the data to be queried.

In this step, when the SQL job is submitted, the logical execution plan and the physical execution plan of the SQL are parsed out. The identifier of the data to be queried is obtained from the parsed logical and physical execution plans, and the metadata corresponding to the data to be queried is then obtained according to that identifier. At this point the data to be queried itself has not been read; only the identifier indicating which data is to be queried has been parsed.

Step S302, according to the metadata of the data to be queried, determining a target cluster to which the data to be queried belongs.

In this step, the cluster to which the file path corresponds is determined from the metadata acquired in the previous step, specifically from the file system directory name, the file system directory, and the file path in the metadata.
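One simple way to map a file path to a cluster is a longest-prefix match against the root directories owned by each cluster. The sketch below assumes that such a mapping from cluster name to root directory is available from the collected metadata; the source does not state how the lookup is implemented, and the function and parameter names are illustrative.

```python
def resolve_target_cluster(file_path, cluster_roots):
    """Return the cluster whose root directory is the longest prefix of file_path.

    cluster_roots maps a cluster name to a root directory (assumed input).
    Returns None when no root directory contains the path.
    """
    best_cluster, best_len = None, -1
    for cluster, root in cluster_roots.items():
        # Compare on directory boundaries so "/warehouse" does not match
        # "/warehouse_x".
        normalized = root.rstrip("/") + "/"
        if (file_path + "/").startswith(normalized) and len(normalized) > best_len:
            best_cluster, best_len = cluster, len(normalized)
    return best_cluster
```

Preferring the longest prefix lets a nested directory that belongs to a different cluster override its parent.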

Step S303, sending the resource occupation query request to the target cluster.

In the step, the cluster to which the data to be queried belongs is judged according to the metadata in the previous step, and then the resource occupation query request is routed to the target cluster.

Preferably, after the step S302 and before the step S303, the method further includes: and judging whether the target cluster is a heterogeneous cluster, and if so, rewriting the resource occupation query request. The step S303, sending the resource occupation query request to the target cluster, includes: and sending the rewritten resource occupation query request to the target cluster.

In this step, because the distributed storage system may contain heterogeneous clusters, the SQL must be rewritten to some degree during routing. For each heterogeneous cluster, the SQL is rewritten into a statement adapted to that cluster; the specific rewriting rules are set as required and are not limited here. Once the target heterogeneous cluster corresponding to the data to be queried has been obtained, the SQL statement is rewritten for that cluster, and the rewritten SQL statement is routed to the corresponding target heterogeneous cluster.

Take a real big data system as an example. The SQL grammars supported by Hive, ES, and HBase clusters differ to some extent. The cluster type can be obtained from the metadata, and the execution grammar is adapted to the differences specific to that cluster type. For example, HBase does not support SQL, so SQL statements need to be converted into calls to the built-in HBase API for calculation. An API is a calling interface that an operating system exposes to application programs; an application program has the operating system execute its commands by calling the API of the operating system.
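The judge-then-rewrite step can be sketched as a lookup from cluster type to a rewriting function. The sketch below is purely illustrative: the source leaves the rewriting rules open, so the registry, the function names, and the dictionary that stands in for an HBase request are all hypothetical, and no real HBase API is called.

```python
def rewrite_for_hbase(sql):
    """Placeholder rewriter for a cluster type without SQL support.

    A real adapter would translate the statement into HBase API calls;
    here a plain dict stands in for that translated request.
    """
    return {"engine": "hbase", "operation": "scan", "source_sql": sql}


# Hypothetical registry: cluster types that need a rewritten request.
REWRITERS = {"hbase": rewrite_for_hbase}


def route_query(sql, cluster_type, rewriters=REWRITERS):
    """Return the request to send: rewritten if a rewriter exists, else unchanged."""
    rewriter = rewriters.get(cluster_type.lower())
    return rewriter(sql) if rewriter else sql
```

Keeping the rewriters in a registry means that supporting another heterogeneous cluster type only requires registering one more function.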

Step S304, receiving the currently occupied resource parameter sent by the target cluster.

In this step, the number of data files, the data volume, the data block size, the number of data blocks, and the memory required for processing each task are queried from each target cluster according to the metadata. The extracted metadata mainly comprises at least one or any combination of the following items: file system directory name, file system directory, access user, user group, authority, file path, file modification time, and file access time. The cluster to which the metadata corresponds is determined from the file system directory name, the file system directory, and the file path; the number of data files, the data volume, the data block size, the number of data blocks, and the memory required for processing each task of each target cluster are obtained according to the access user, the user group, the authority, the file modification time, and the file access time.

It should be noted that after the SQL is rewritten, the scanned data to be queried needs to be analyzed again according to the rewritten SQL, and the number of data files, the data amount, the data block size, the data block number, and the memory required for processing each task of each target heterogeneous cluster are queried from each target heterogeneous cluster according to the metadata corresponding to the data to be queried.

Step S4, calculating the storage resource occupation parameter of the distributed storage system according to the currently occupied resource parameter of each cluster.

In this step, according to the data required for resource evaluation of the distributed storage system obtained in step S3, that is, the currently occupied resource parameters, calculation is performed to obtain the storage resource occupation parameters of the final distributed storage system, including the total task number and the total memory occupation amount, so as to complete resource evaluation.

Preferably, as shown in fig. 4, the storage resource usage parameter includes a task number and a memory usage amount, and the step S4 of calculating the storage resource usage parameter of the distributed storage system according to the currently occupied resource parameter of each cluster includes:

Step S401, for each cluster, determining the maximum value of the number of data files and the number of data blocks, calculating the sum of the maximum value and the data volume, and calculating the ratio of that sum to the data block size to obtain the task number of the cluster. Each task corresponds to one CPU core, so the number of CPU cores equals the task number.

In this step, the task number of one cluster is calculated as: task number = max(number of data files, number of data blocks) + data amount / data block size. The task number is therefore determined mainly by the number of files and the amount of data.
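The task-number calculation can be sketched in Python as follows. The wording of step S401 describes the ratio of the sum (maximum value plus data amount) to the data block size, while the inline formula, read with ordinary operator precedence, adds the maximum value to the quotient of data amount and data block size. The sketch shows both readings side by side and does not claim which one is intended. The function names are illustrative, the data amount and data block size are assumed to use the same unit, and no rounding is applied because the source does not specify any.

```python
def task_count_as_worded(file_count, block_count, data_amount, block_size):
    """(max(files, blocks) + data amount) / block size, as step S401 is worded."""
    return (max(file_count, block_count) + data_amount) / block_size


def task_count_by_precedence(file_count, block_count, data_amount, block_size):
    """max(files, blocks) + data amount / block size, by operator precedence."""
    return max(file_count, block_count) + data_amount / block_size
```

For 100 files, 80 data blocks, a data amount of 1000, and a data block size of 10, the first reading gives 110.0 and the second gives 200.0, so the choice of reading matters in practice.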

In a preferred embodiment, if the average file size is smaller than the data block size set by the system, files are merged during processing. The main adjustment direction is that the minimum number of fragments is greater than or equal to the number of data blocks, with specific tuning performed according to the resource evaluation. For example, suppose the average file size is 1M, the data block size is 10M, and there are 100 files in total to be processed. The average file size is smaller than the data block size, so the files are merged, and the minimum number of fragments is greater than or equal to 100/10 = 10.
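The small-file example above can be reproduced with a short helper. This is a sketch under assumptions: it takes the number of data blocks to be the total data size divided by the data block size, rounded up, and it returns None when the average file size is not smaller than the data block size, because the source only describes merging for the small-file case. The function name is illustrative.

```python
import math


def min_fragment_count(file_count, avg_file_size, block_size):
    """Lower bound on the number of fragments after merging small files.

    Returns None when the average file is not smaller than a data block,
    since merging is only described for the small-file case.
    """
    if avg_file_size >= block_size:
        return None
    total_size = file_count * avg_file_size
    # Number of data blocks needed to hold the merged data (assumed rounding up).
    return math.ceil(total_size / block_size)
```

Calling `min_fragment_count(100, 1, 10)` returns 10, matching the 100-file example in the text.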

Step S402, determining the task quantity of the distributed storage system according to the task quantity of each cluster.

The resource evaluation results of each cluster in the distributed storage system are merged, and the total task number and the total number of CPU cores are calculated with the following formulas:

total task number (total_task) = cluster task number + cluster task number, with one term for each cluster.

total number of CPU cores = total task number.

Step S403, for each cluster, calculating a product of the number of tasks and the memory required for processing each task, to obtain the memory occupancy of the cluster.

In this step, the memory occupied by one cluster is calculated as: cluster memory occupancy = task number × memory required for processing each task.

And S404, determining the memory occupation amount of the distributed storage system according to the memory occupation amount of each cluster.

In this step, the resource evaluation results of each cluster in the distributed storage system are merged, and then the total memory occupancy is calculated, wherein the calculation formula is as follows:

total memory occupancy (total_memory) = cluster memory occupancy + cluster memory occupancy, with one term for each cluster.
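Steps S402 to S404 amount to a per-cluster product followed by sums over all clusters. A minimal sketch follows, assuming each cluster is represented as a dict with the illustrative keys task_count and memory_per_task; these names do not come from the source.

```python
def cluster_memory(task_count, memory_per_task):
    """Memory occupancy of one cluster: task number times memory per task."""
    return task_count * memory_per_task


def system_totals(clusters):
    """Merge per-cluster results into system-wide totals.

    clusters is a list of dicts with the assumed keys
    "task_count" and "memory_per_task".
    """
    total_task = sum(c["task_count"] for c in clusters)
    total_memory = sum(
        cluster_memory(c["task_count"], c["memory_per_task"]) for c in clusters
    )
    return {
        "total_task": total_task,
        # Each task corresponds to one CPU core in the embodiment.
        "total_cpu_cores": total_task,
        "total_memory": total_memory,
    }
```

The memory unit is whatever unit memory_per_task is expressed in; the sketch does not convert units.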

In a preferred implementation, this example also calculates the bottleneck of each cluster in the distributed storage system: bottleneck = required resource / total resource. The required resource is the memory occupancy of a cluster, and the total resource is the total memory of that cluster.
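The bottleneck ratio can be computed per cluster and used to rank clusters by how constrained they are. The sketch below is illustrative: the ranking step and all names are assumptions added for the example, and only the ratio itself comes from the text.

```python
def bottleneck_ratio(required_memory, total_memory):
    """Required resource divided by total resource for one cluster."""
    if total_memory <= 0:
        raise ValueError("total_memory must be positive")
    return required_memory / total_memory


def rank_by_bottleneck(clusters):
    """Sort cluster names from most to least constrained.

    clusters maps a cluster name to a (required_memory, total_memory) pair.
    """
    ratios = {
        name: bottleneck_ratio(required, total)
        for name, (required, total) in clusters.items()
    }
    return sorted(ratios, key=ratios.get, reverse=True)
```

A ratio above 1 would indicate that the estimated memory occupancy exceeds the total memory of the cluster.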

The resource estimation method for the distributed storage system builds a full-dimensional profile of the distributed storage system by extracting the metadata of each cluster, parses the submitted SQL, looks up the currently occupied resource parameters of each cluster according to the generated logical plan, the physical plan, and the metadata, and then estimates the total number of tasks to be run and the total memory occupation amount from those parameters, thereby completing resource estimation for the distributed storage system.

A resource estimation apparatus of a distributed storage system according to a second embodiment of the present application is as follows:

FIG. 5 is a schematic structural diagram of a resource estimation device for a distributed storage system according to an embodiment of the present application, which includes the following modules.

A receiving module 11, configured to receive a resource occupation query request for each cluster in the distributed storage system;

a first obtaining module 12, configured to obtain metadata of each cluster in the distributed storage system according to the resource occupation query request;

a second obtaining module 13, configured to obtain, according to the metadata of each cluster, a resource parameter currently occupied by each cluster, where the resource parameter currently occupied includes the number of data files, the data size, the data block size, the number of data blocks, and a memory required for processing each task;

and the calculating module 14 is configured to calculate a storage resource occupation parameter of the distributed storage system according to the resource parameter currently occupied by each cluster.

Optionally, as shown in fig. 6, the first obtaining module 12 includes:

the acquisition submodule 121 is configured to acquire a metadata file in a binary format stored in the distributed storage system in a preset period, and convert the metadata file into a text format;

an extracting sub-module 122, configured to extract metadata from the metadata file in the text format, where the metadata includes at least one of the following or any combination of the following: file system directory name, file system directory, access user, user group, authority, file path, file modification time, and file access time.

Optionally, as shown in fig. 7, the second obtaining module 13 includes:

the obtaining sub-module 131 is configured to analyze an identifier of the data to be queried from the resource occupation query request, and obtain metadata of the data to be queried according to the identifier of the data to be queried;

a determining submodule 132, configured to determine, according to the metadata of the data to be queried, a target cluster to which the data to be queried belongs;

a sending submodule 133, configured to send the resource occupation query request to the target cluster;

and the receiving submodule 134 is configured to receive the currently occupied resource parameter sent by the target cluster.

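Optionally, the second obtaining module 13 further includes (not shown in the figure):

a judging module, configured to judge whether the target cluster is a heterogeneous cluster and, if so, to rewrite the resource occupation query request;

the sending submodule 133 is specifically configured to send the rewritten resource occupation query request to the target cluster.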

Optionally, as shown in fig. 8, the calculating module 14 includes:

the first calculating submodule 141 is configured to determine, for each cluster, a maximum value of the number of data files and the number of data blocks, calculate a sum of the maximum value and the data amount, and calculate a ratio of the sum of the maximum value and the data amount to the size of the data block, so as to obtain the number of tasks of the cluster;

the second computing submodule 142 is configured to determine the task number of the distributed storage system according to the task number of each cluster;

the third computation submodule 143 is configured to compute, for each cluster, a product of the number of tasks and a memory required for processing each task, so as to obtain a memory occupancy amount of the cluster;

and the fourth calculating submodule 144 is configured to determine the memory occupancy of the distributed storage system according to the memory occupancy of each cluster.

It will be understood that the above embodiments are merely exemplary embodiments taken to illustrate the principles of the present invention, which is not limited thereto. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit and substance of the invention, and these modifications and improvements are also considered to be within the scope of the invention.

Claims

1. A resource estimation method of a distributed storage system is characterized by comprising the following steps:

calculating storage resource occupation parameters of the distributed storage system according to the currently occupied resource parameters of each cluster;

the step of obtaining the currently occupied resource parameters of each cluster according to the metadata of each cluster comprises:

sending the resource occupation query request to the target cluster;

receiving a currently occupied resource parameter sent by the target cluster;

the storage resource occupation parameters comprise the number of tasks and the memory occupation amount, and the step of calculating the storage resource occupation parameters of the distributed storage system according to the currently occupied resource parameters of each cluster comprises the following steps:

determining the maximum value of the number of data files and the number of data blocks for each cluster, calculating the sum of the maximum value and the data amount, and calculating the ratio of the sum of the maximum value and the data amount to the size of the data blocks to obtain the task number of the cluster; if the average file size is smaller than the data block size set by the system, merging is carried out during processing, and the minimum fragment number is larger than or equal to the data block number;

2. The method for resource estimation of a distributed storage system according to claim 1, wherein the step of obtaining metadata of each cluster in the distributed storage system according to the resource occupation query request includes:

3. The method for pre-estimating the resources of the distributed storage system according to claim 1, wherein after the step of determining the target cluster to which the data to be queried belongs according to the metadata of the data to be queried, and before the step of sending the query request for resource occupancy to the target cluster, the method further comprises:

4. A distributed storage system resource pre-estimation device is characterized by comprising:

the calculation module is used for calculating the storage resource occupation parameters of the distributed storage system according to the currently occupied resource parameters of each cluster;

the second obtaining module includes:

the receiving submodule is used for receiving the currently occupied resource parameters sent by the target cluster;

the calculation module comprises:

the first calculation submodule is used for determining the maximum value of the number of data files and the number of data blocks aiming at each cluster, calculating the sum of the maximum value and the data volume, and calculating the ratio of the sum of the maximum value and the data volume to the size of the data blocks to obtain the task number of the cluster; if the average file size is smaller than the data block size set by the system, merging is carried out during processing, and the minimum fragment number is larger than or equal to the data block number;

the second computing submodule is used for determining the task quantity of the distributed storage system according to the task quantity of each cluster; the third calculation submodule is used for calculating the product of the number of tasks and the memory required by processing each task aiming at each cluster to obtain the memory occupation amount of the cluster;

5. The device for resource estimation of a distributed storage system according to claim 4, wherein the first obtaining module includes:

6. The apparatus for resource estimation of a distributed storage system according to claim 4, wherein the second obtaining module further comprises: