CN110134738B - Distributed storage system resource estimation method and device - Google Patents

Info

Publication number
CN110134738B
Authority
CN
China
Prior art keywords
cluster
data
resource
storage system
metadata
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910425874.4A
Other languages
Chinese (zh)
Other versions
CN110134738A (en)
Inventor
穆纯进
尹正军
马骁
王项男
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China United Network Communications Group Co Ltd
Unicom Big Data Co Ltd
Original Assignee
China United Network Communications Group Co Ltd
Unicom Big Data Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China United Network Communications Group Co Ltd and Unicom Big Data Co Ltd
Priority to CN201910425874.4A
Publication of CN110134738A
Application granted
Publication of CN110134738B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a resource estimation method and device for a distributed storage system. The method comprises the following steps: receiving a resource occupation query request for each cluster in the distributed storage system; acquiring metadata of each cluster in the distributed storage system according to the resource occupation query request; acquiring the currently occupied resource parameters of each cluster according to the metadata of each cluster, wherein the currently occupied resource parameters comprise the number of data files, the data amount, the data block size, the number of data blocks, and the memory required for processing each task; and calculating the storage resource occupation parameters of the distributed storage system according to the currently occupied resource parameters of each cluster. Resource estimation of the distributed storage system is thereby completed according to the resource parameters currently occupied by each cluster.

Description

Distributed storage system resource estimation method and device
Technical Field
The application belongs to the field of data processing, and particularly relates to a distributed storage system resource pre-estimation method and device.
Background
In the big data era, enterprises generate more and more data, big data clusters keep growing in scale, and enterprise costs rise sharply. If the resources that a large-scale job will occupy can be estimated before the job is submitted, the job can be optimized accordingly, which reduces cluster resource consumption, helps keep the cluster stable, and lowers enterprise costs.
At present, occupied resources are estimated either by building a test environment for testing or by running trial operations directly online. Both approaches increase enterprise costs and adversely affect the online cluster.
Disclosure of Invention
To address the problems that the existing approaches of estimating occupied resources, namely testing in a purpose-built test environment or running trial operations directly online, increase enterprise costs and adversely affect the online cluster, the present application provides a resource estimation method and device for a distributed storage system.
The application provides a resource estimation method for a distributed storage system, which comprises the following steps:
receiving a resource occupation query request aiming at each cluster in the distributed storage system;
acquiring metadata of each cluster in the distributed storage system according to the resource occupation query request;
acquiring currently occupied resource parameters of each cluster according to the metadata of each cluster, wherein the currently occupied resource parameters comprise the number of data files, the data volume, the size of data blocks, the number of data blocks and the memory required for processing each task;
and calculating the storage resource occupation parameters of the distributed storage system according to the currently occupied resource parameters of each cluster.
Optionally, the step of obtaining metadata of each cluster in the distributed storage system according to the resource occupation query request includes:
collecting a binary format metadata file stored in the distributed storage system in a preset period, and converting the metadata file into a text format;
extracting metadata from the metadata file in the text format, wherein the metadata comprises at least one or any combination of the following items: file system directory name, file system directory, access user, user group, authority, file path, file modification time, and file access time.
Optionally, the step of obtaining the resource parameter currently occupied by each cluster according to the metadata of each cluster includes:
analyzing the identifier of the data to be queried from the resource occupation query request, and acquiring the metadata of the data to be queried according to the identifier of the data to be queried;
determining a target cluster to which the data to be queried belongs according to the metadata of the data to be queried;
sending the resource occupation query request to the target cluster;
and receiving the currently occupied resource parameters sent by the target cluster.
Optionally, after the step of determining, according to the metadata of the data to be queried, a target cluster to which the data to be queried belongs, and before the step of sending the resource occupation query request to the target cluster, the method further includes:
judging whether the target cluster is a heterogeneous cluster, and if so, rewriting the resource occupation query request;
the sending the resource occupation query request to the target cluster includes: and sending the rewritten resource occupation query request to the target cluster.
Optionally, the step of calculating the storage resource occupation parameters of the distributed storage system according to the resource parameters currently occupied by each cluster includes:
determining the maximum value of the number of data files and the number of data blocks for each cluster, calculating the sum of the maximum value and the data amount, and calculating the ratio of the sum of the maximum value and the data amount to the size of the data blocks to obtain the task number of the cluster;
determining the task number of the distributed storage system according to the task number of each cluster;
aiming at each cluster, calculating the product of the number of tasks and the memory required for processing each task to obtain the memory occupation amount of the cluster;
and determining the memory occupation amount of the distributed storage system according to the memory occupation amount of each cluster.
The present application further provides a device for pre-estimating resources of a distributed storage system, including:
a receiving module, configured to receive a resource occupation query request for each cluster in the distributed storage system;
the first acquisition module is used for acquiring the metadata of each cluster in the distributed storage system according to the resource occupation query request;
a second obtaining module, configured to obtain, according to the metadata of each cluster, a currently occupied resource parameter of each cluster, where the currently occupied resource parameter includes a number of data files, a data amount, a data block size, a data block number, and a memory required for processing each task;
and the calculation module is used for calculating the storage resource occupation parameters of the distributed storage system according to the currently occupied resource parameters of each cluster.
Optionally, the first obtaining module includes:
the acquisition submodule is used for acquiring the metadata file in the binary format stored in the distributed storage system in a preset period and converting the metadata file into a text format;
an extraction sub-module, configured to extract metadata from the metadata file in the text format, where the metadata includes at least one of the following items or any combination of the following items: file system directory name, file system directory, access user, user group, authority, file path, file modification time, and file access time.
Optionally, the second obtaining module includes:
the acquisition sub-module is used for analyzing the identifier of the data to be queried from the resource occupation query request and acquiring the metadata of the data to be queried according to the identifier of the data to be queried;
the determining submodule is used for determining a target cluster to which the data to be queried belongs according to the metadata of the data to be queried;
the sending sub-module is used for sending the resource occupation query request to the target cluster;
and the receiving submodule is used for receiving the currently occupied resource parameters sent by the target cluster.
Optionally, the second obtaining module further includes:
the judging module is used for judging whether the target cluster is a heterogeneous cluster or not, and if so, rewriting the resource occupation inquiry request;
the sending submodule is specifically configured to: and sending the rewritten resource occupation query request to the target cluster.
Optionally, the calculation module includes:
the first calculation submodule is used for determining the maximum value of the number of data files and the number of data blocks aiming at each cluster, calculating the sum of the maximum value and the data volume, and calculating the ratio of the sum of the maximum value and the data volume to the size of the data blocks to obtain the task number of the cluster;
the second computing submodule is used for determining the task quantity of the distributed storage system according to the task quantity of each cluster;
the third calculation submodule is used for calculating the product of the number of tasks and the memory required by processing each task aiming at each cluster to obtain the memory occupation amount of the cluster;
and the fourth calculation submodule is used for determining the memory occupation amount of the distributed storage system according to the memory occupation amount of each cluster.
The resource estimation method for the distributed storage system builds a full-dimensional profile of the distributed storage system: it extracts the metadata of each cluster, parses the submitted resource occupation query request, and finds the currently occupied resource parameters of each cluster according to the metadata. From these parameters it then estimates, through the calculation flow, the total number of tasks to be run and the total memory occupancy, thereby completing resource estimation of the distributed storage system.
Drawings
Fig. 1 is a flowchart of a resource estimation method for a distributed storage system according to a first embodiment of the present application;
Fig. 2 is an alternative implementation of step S2 in Fig. 1 according to the first embodiment of the present application;
Fig. 3 is an alternative implementation of step S3 in Fig. 1 according to the first embodiment of the present application;
Fig. 4 is an alternative implementation of step S4 in Fig. 1 according to the first embodiment of the present application;
Fig. 5 is a schematic structural diagram of a resource estimation device for a distributed storage system according to a second embodiment of the present application;
Fig. 6 is another schematic structural diagram of a resource estimation device for a distributed storage system according to the second embodiment of the present application;
Fig. 7 is another schematic structural diagram of a resource estimation device for a distributed storage system according to the second embodiment of the present application;
Fig. 8 is another schematic structural diagram of a resource estimation device for a distributed storage system according to the second embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present invention better understood, the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
The application provides a distributed storage system resource pre-estimation method and device. The following detailed description is made with reference to the drawings of the embodiments provided in the present application, respectively.
A resource estimation method for a distributed storage system according to a first embodiment of the present application is as follows:
Fig. 1 shows a resource estimation method for a distributed storage system according to an embodiment of the present application, which includes the following steps.
Step S1, a resource occupation query request for each cluster in the distributed storage system is received.
In this step, a resource occupation query request, that is, an SQL query request, is received for each cluster in the distributed storage system. SQL is the abbreviation of Structured Query Language, a database query and programming language used to access data and to query, update, and manage relational database systems; .sql is also the file extension of database script files.
And step S2, acquiring the metadata of each cluster in the distributed storage system according to the resource occupation query request.
Metadata, also called intermediary data or relay data, is data about data. It mainly describes the attributes (properties) of data and supports functions such as indicating storage locations, recording history, searching resources, and recording files. Metadata acts as an electronic catalog: to build the catalog, the contents or features of the data must be described and collected, which in turn assists data retrieval. Metadata also covers information about the organization of data, data fields, and their relationships.
Preferably, as shown in fig. 2, the step S2, obtaining metadata of each cluster in the distributed storage system according to the resource occupation query request includes:
step S201, collecting the metadata file in binary format stored in the distributed storage system in a preset period, and converting the metadata file into a text format.
The distributed storage system stores the detailed information of the directories and files of the whole system in a memory, and in order to prevent the memory data from being lost after downtime, the storage system serializes the data in the memory to a disk in a binary mode at intervals.
In this step, the binary files in the distributed storage system are periodically collected and deserialized into a text format for use in extracting metadata. The preset period is a preset value, and may be specifically set as required, and is not limited herein.
Step S202, extracting metadata from the metadata file in the text format.
In this step, the metadata of each cluster is extracted through distributed computation. The extraction involves custom key-value (KV) types, secondary sorting, custom partitioning, custom merging, custom grouping, and similar steps. The extracted metadata comprises at least one or any combination of the following items: file system directory name, file system directory, access user (the user to which a folder belongs), user group, authority, file path, file modification time, and file access time. Other data may also be included, such as directory capacity (the capacity of a folder), the number of directory files (the number of files under a folder), the maximum, minimum, and average file size, and the file format.
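As a minimal sketch of the extraction in step S202, assume the text-format metadata file is tab-delimited with one file per line. The column order, the `FileMeta` record, and the `summarize_by_directory` function are hypothetical and are not specified by the application. A real deployment would run this as a distributed job; the single-process version below only illustrates how the per-directory statistics mentioned above (file count, capacity, maximum, minimum, and average file size) could be grouped.

```python
from collections import defaultdict
from dataclasses import dataclass
import posixpath


@dataclass(frozen=True)
class FileMeta:
    """One row of the text-format metadata file (hypothetical layout)."""
    path: str
    size: int          # file size in bytes
    user: str
    group: str
    permission: str
    mtime: str         # file modification time
    atime: str         # file access time


def parse_line(line: str) -> FileMeta:
    """Parse one tab-delimited line; the column order is an assumption."""
    path, size, user, group, permission, mtime, atime = line.rstrip("\n").split("\t")
    return FileMeta(path, int(size), user, group, permission, mtime, atime)


def summarize_by_directory(lines):
    """Group files by parent directory and compute per-directory statistics."""
    sizes = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        meta = parse_line(line)
        sizes[posixpath.dirname(meta.path)].append(meta.size)
    return {
        directory: {
            "file_count": len(s),
            "capacity": sum(s),
            "max_size": max(s),
            "min_size": min(s),
            "avg_size": sum(s) / len(s),
        }
        for directory, s in sizes.items()
    }
```

Grouping by parent directory plays the role that custom partitioning and grouping would play in the distributed job.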
Specifically, an enterprise often has more than one big data cluster, and may have several clusters or even heterogeneous clusters of different types. A submitted SQL query may need to combine several clusters for computation, and whether it runs on a single cluster or on several heterogeneous clusters, it consumes resources whose occupation must be estimated. The system therefore needs to collect the metadata of each cluster. Each heterogeneous cluster provides an HTTP interface that exposes its own metadata information, and a metadata collector calls that HTTP interface from a program to acquire the metadata of each heterogeneous cluster. That is, the above steps are performed for each cluster.
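The collector described above could be sketched as follows. This is only an illustration under assumptions: the application does not specify the URL scheme or the response format, so the endpoint mapping and the JSON body are hypothetical. The fetch function is injectable so that the loop can be exercised without a network.

```python
import json
from urllib.request import urlopen


def fetch_json(url: str, timeout: float = 10.0):
    """Default fetcher: HTTP GET, then decode a JSON body (format is an assumption)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def collect_metadata(endpoints, fetch=fetch_json):
    """Call each cluster's metadata interface and keep going if one cluster fails.

    endpoints: mapping of cluster name -> metadata URL (hypothetical values).
    Returns (metadata_by_cluster, errors_by_cluster).
    """
    metadata, errors = {}, {}
    for cluster, url in endpoints.items():
        try:
            metadata[cluster] = fetch(url)
        except Exception as exc:  # network or decoding failure for this cluster
            errors[cluster] = str(exc)
    return metadata, errors
```

Collecting errors per cluster instead of aborting is a design choice of this sketch, so that one unreachable cluster does not block the estimate for the others.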
Step S3, obtaining the resource parameters currently occupied by each cluster according to the metadata of each cluster.
In this step, data required for resource evaluation by each cluster is obtained according to the metadata of each cluster, that is, the resource parameters currently occupied by each cluster include the number of data files, the amount of data, the size of data blocks, the number of data blocks, and the memory required for processing each task. It should be noted here that the currently occupied resource parameter is dynamic data generated during data processing.
Preferably, as shown in fig. 3, the step S3, obtaining the currently occupied resource parameter of each cluster according to the metadata of each cluster, includes:
step S301, analyzing the identifier of the data to be queried from the resource occupation query request, and acquiring the metadata of the data to be queried according to the identifier of the data to be queried.
In this step, when the SQL job is submitted, the logical execution plan and the physical execution plan of the SQL statement are parsed out. The identifier of the data to be queried is obtained from the parsed logical and physical execution plans, and the metadata corresponding to the data to be queried is obtained according to that identifier. At this point the data to be queried is not itself read; only the identifier indicating which data is to be queried is parsed.
Step S302, according to the metadata of the data to be queried, determining a target cluster to which the data to be queried belongs.
In this step, the cluster to which the file path corresponds is determined from the file system directory name, the file system directory, and the file path in the metadata acquired in the previous step.
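One possible way to map a file path to its cluster is a longest-prefix match against each cluster's root directory. The application does not say how the match is performed, so both the technique and the `cluster_roots` mapping below are assumptions made for illustration.

```python
def find_target_cluster(file_path, cluster_roots):
    """Return the cluster whose root directory is the longest prefix of file_path.

    cluster_roots: mapping of cluster name -> root directory (hypothetical).
    Returns None when no root matches.
    """
    best_cluster, best_len = None, -1
    for cluster, root in cluster_roots.items():
        # Normalize to a trailing slash so "/warehouse" does not match "/warehouse2".
        normalized = root.rstrip("/") + "/"
        if (file_path + "/").startswith(normalized) and len(normalized) > best_len:
            best_cluster, best_len = cluster, len(normalized)
    return best_cluster
```

For example, with roots `/warehouse` and `/warehouse/archive`, a path under `/warehouse/archive` resolves to the more specific cluster.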
Step S303, sending the resource occupation query request to the target cluster.
In the step, the cluster to which the data to be queried belongs is judged according to the metadata in the previous step, and then the resource occupation query request is routed to the target cluster.
Preferably, after the step S302 and before the step S303, the method further includes: and judging whether the target cluster is a heterogeneous cluster, and if so, rewriting the resource occupation query request. The step S303, sending the resource occupation query request to the target cluster, includes: and sending the rewritten resource occupation query request to the target cluster.
In this step, because the distributed storage system may contain heterogeneous clusters, the SQL statement must be rewritten to some degree during routing. For each heterogeneous cluster, the statement is rewritten into a form adapted to that cluster; the specific rewriting rules are set as needed and are not limited here. After the target heterogeneous cluster corresponding to the data to be queried is determined, the SQL statement is rewritten for that cluster, and the rewritten statement is routed to it.
Take a real big data system as an example: the SQL syntax supported by Hive, ES, and HBase clusters differs to some degree. The cluster type can be obtained from the metadata, and the statement to be executed is adapted to the specifics of that cluster type. For example, HBase does not support SQL, so the SQL statement needs to be converted into calls to HBase's built-in API. An API is a calling interface that an operating system provides to application programs; by calling it, an application causes the operating system to execute the application's commands.
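The type-based adaptation can be pictured as a dispatch table keyed by cluster type. The sketch below is hypothetical throughout: the rewriter functions and the request dictionaries are placeholders, and the HBase entry only marks that a translation into API calls is needed rather than performing it. Other cluster types would be registered in the same table.

```python
def rewrite_for_hive(sql):
    # Assumption: a Hive-type cluster accepts the statement unchanged.
    return {"kind": "sql", "statement": sql}


def rewrite_for_hbase(sql):
    # Placeholder: a real rewriter would translate the statement into
    # HBase client API calls; here we only record that this is required.
    return {"kind": "api_plan", "source_statement": sql}


# Hypothetical registry of rewriters keyed by lower-case cluster type.
REWRITERS = {
    "hive": rewrite_for_hive,
    "hbase": rewrite_for_hbase,
}


def rewrite_request(sql, cluster_type):
    """Pick the rewriter for the target cluster type and apply it."""
    try:
        rewriter = REWRITERS[cluster_type.lower()]
    except KeyError:
        raise ValueError(f"no rewriter registered for cluster type {cluster_type!r}") from None
    return rewriter(sql)
```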
Step S304, receiving the currently occupied resource parameter sent by the target cluster.
In this step, the number of data files, the data amount, the data block size, the number of data blocks, and the memory required for processing each task are queried from each target cluster according to the metadata. The extracted metadata mainly comprises at least one or any combination of the following items: file system directory name, file system directory, access user, user group, authority, file path, file modification time, and file access time. The cluster to which the metadata corresponds is determined from the file system directory name, the file system directory, and the file path; the number of data files, the data amount, the data block size, the number of data blocks, and the memory required for processing each task of each target cluster are obtained according to the access user, user group, authority, file modification time, and file access time.
It should be noted that after the SQL is rewritten, the scanned data to be queried needs to be analyzed again according to the rewritten SQL, and the number of data files, the data amount, the data block size, the data block number, and the memory required for processing each task of each target heterogeneous cluster are queried from each target heterogeneous cluster according to the metadata corresponding to the data to be queried.
Step S4, calculating the storage resource occupation parameter of the distributed storage system according to the currently occupied resource parameter of each cluster.
In this step, according to the data required for resource evaluation of the distributed storage system obtained in step S3, that is, the currently occupied resource parameters, calculation is performed to obtain the storage resource occupation parameters of the final distributed storage system, including the total task number and the total memory occupation amount, so as to complete resource evaluation.
Preferably, as shown in fig. 4, the storage resource usage parameter includes a task number and a memory usage amount, and the step S4 of calculating the storage resource usage parameter of the distributed storage system according to the currently occupied resource parameter of each cluster includes:
Step S401, for each cluster, determining the maximum of the number of data files and the number of data blocks, calculating the sum of that maximum and the data amount, and calculating the ratio of that sum to the data block size to obtain the task number of the cluster. Each task corresponds to one CPU core, so the number of CPU cores equals the task number.
In this step, the task number of one cluster is calculated as: task number = max(number of data files, number of data blocks) + data amount / data block size. The task number is therefore driven mainly by the number of files and the amount of data.
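The wording of step S401 and the formula given for this step can be read in two ways, so the sketch below implements both rather than choosing one. The function names are hypothetical, and the source does not say whether the result is rounded, so no rounding is applied.

```python
def task_count_sum_then_divide(file_count, block_count, data_amount, block_size):
    """Reading of the step S401 wording:
    (max(files, blocks) + data amount) / block size."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return (max(file_count, block_count) + data_amount) / block_size


def task_count_formula_as_written(file_count, block_count, data_amount, block_size):
    """Reading of the formula under the usual operator precedence:
    max(files, blocks) + data amount / block size."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return max(file_count, block_count) + data_amount / block_size
```

With 100 files, 10 blocks, a data amount of 100, and a block size of 10 (same unit), the first reading gives 20.0 and the second gives 110.0, which shows that the two readings are not interchangeable.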
In a preferred embodiment, if the average file size is smaller than the data block size set by the system, merging of small files is recommended. The main adjustment is to keep the minimum number of fragments greater than or equal to the number of data blocks, with further tuning according to the resource evaluation. For example, if the average file size is 1M, the data block size is 10M, and 100 files are to be processed in total, the average file size is smaller than the data block size, so the minimum number of fragments must be at least 100/10 = 10.
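The arithmetic of this example can be checked with a short sketch. The function names are hypothetical, and rounding the block count up to a whole block is an assumption that the text does not state (the example divides evenly).

```python
import math


def needs_merging(average_file_size, block_size):
    """Merging is considered when files are, on average, smaller than a block."""
    return average_file_size < block_size


def minimum_fragments(file_count, average_file_size, block_size):
    """Lower bound on the fragment count: the number of blocks the data fills.

    Rounding up to a whole block is an assumption of this sketch.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    total_size = file_count * average_file_size
    return math.ceil(total_size / block_size)
```

For the figures in the text, `needs_merging(1, 10)` is true and `minimum_fragments(100, 1, 10)` returns 10.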
Step S402, determining the task quantity of the distributed storage system according to the task quantity of each cluster.
The resource evaluation results of the clusters in the distributed storage system are merged, and the total task number and the total number of CPU cores are calculated as follows:
total task number (total_task) = cluster task number + cluster task number + ... (summed over all clusters).
total number of CPU cores = total task number.
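A minimal sketch of this merge step, assuming the per-cluster task numbers are held in a mapping from cluster name to task number (the mapping and function names are hypothetical):

```python
def total_task_number(cluster_tasks):
    """Sum the task numbers of all clusters."""
    return sum(cluster_tasks.values())


def total_cpu_cores(cluster_tasks):
    """Per the text, one CPU core per task, so the totals coincide."""
    return total_task_number(cluster_tasks)
```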
Step S403, for each cluster, calculating a product of the number of tasks and the memory required for processing each task, to obtain the memory occupancy of the cluster.
In this step, the memory occupancy of one cluster is calculated as: cluster memory occupancy = task number × memory required for processing each task.
And S404, determining the memory occupation amount of the distributed storage system according to the memory occupation amount of each cluster.
In this step, the resource evaluation results of each cluster in the distributed storage system are merged, and then the total memory occupancy is calculated, wherein the calculation formula is as follows:
total memory occupancy (total_memory) = cluster memory occupancy + cluster memory occupancy + ... (summed over all clusters).
In a preferred implementation, this embodiment also calculates the bottleneck of each cluster in the distributed storage system: bottleneck = required resource / total resource, where the required resource is the memory occupancy of the cluster and the total resource is the total memory of the cluster.
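Steps S403 and S404 and the bottleneck ratio can be sketched together. The `ClusterEstimate` record and its field names are hypothetical; the only assumption about units is that the memory per task and the total memory of a cluster use the same unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterEstimate:
    """Per-cluster inputs to the memory estimate (hypothetical record)."""
    name: str
    task_count: float
    memory_per_task: float   # same unit as total_memory
    total_memory: float      # total memory of the cluster

    @property
    def memory_occupancy(self):
        # Step S403: task number multiplied by the memory required per task.
        return self.task_count * self.memory_per_task

    @property
    def bottleneck(self):
        # Required resource divided by total resource.
        if self.total_memory <= 0:
            raise ValueError("total_memory must be positive")
        return self.memory_occupancy / self.total_memory


def total_memory_occupancy(estimates):
    """Step S404: sum the memory occupancy of all clusters."""
    return sum(e.memory_occupancy for e in estimates)
```

A bottleneck value close to or above 1 would indicate that the estimated job needs about as much memory as the cluster has; how to act on that value is left to the operator and is not specified in the text.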
The resource estimation method for the distributed storage system builds a full-dimensional profile of the distributed storage system: it extracts the metadata of each cluster, parses the submitted SQL statement, and finds the currently occupied resource parameters of each cluster according to the generated logical plan, the physical plan, and the metadata. From these parameters it then estimates, through the calculation flow, the total number of tasks to be run and the total memory occupancy, thereby completing resource estimation of the distributed storage system.
A resource estimation apparatus of a distributed storage system according to a second embodiment of the present application is as follows:
Fig. 5 shows a schematic structural diagram of a resource estimation device for a distributed storage system according to an embodiment of the present application, which includes the following modules.
A receiving module 11, configured to receive a resource occupation query request for each cluster in the distributed storage system;
a first obtaining module 12, configured to obtain metadata of each cluster in the distributed storage system according to the resource occupation query request;
a second obtaining module 13, configured to obtain, according to the metadata of each cluster, a resource parameter currently occupied by each cluster, where the resource parameter currently occupied includes the number of data files, the data size, the data block size, the number of data blocks, and a memory required for processing each task;
and the calculating module 14 is configured to calculate a storage resource occupation parameter of the distributed storage system according to the resource parameter currently occupied by each cluster.
Optionally, as shown in fig. 6, the first obtaining module 12 includes:
the acquisition submodule 121 is configured to acquire a metadata file in a binary format stored in the distributed storage system in a preset period, and convert the metadata file into a text format;
an extracting sub-module 122, configured to extract metadata from the metadata file in the text format, where the metadata includes at least one of the following or any combination of the following: file system directory name, file system directory, access user, user group, authority, file path, file modification time, and file access time.
Optionally, as shown in fig. 7, the second obtaining module 13 includes:
an obtaining submodule 131, configured to parse an identifier of the data to be queried from the resource occupation query request, and obtain metadata of the data to be queried according to the identifier;
a determining submodule 132, configured to determine, according to the metadata of the data to be queried, a target cluster to which the data to be queried belongs;
a sending submodule 133, configured to send the resource occupation query request to the target cluster;
and a receiving submodule 134, configured to receive the currently occupied resource parameters sent by the target cluster.
Optionally, the second obtaining module 13 further includes (not shown in the figure):
a judging module, configured to determine whether the target cluster is a heterogeneous cluster and, if so, rewrite the resource occupation query request;
in this case the sending submodule 133 is specifically configured to send the rewritten resource occupation query request to the target cluster.
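The flow through the second obtaining module (parse the identifier, look up metadata, determine the target cluster, optionally rewrite, send, receive) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the `Cluster` type, the `catalog` lookup table, the `query` callable standing in for the network transport, and the `rewrite_request` placeholder are all hypothetical, and the patent does not say what a rewrite actually changes.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Cluster:
    name: str
    heterogeneous: bool
    # Hypothetical transport: sends a request, returns the cluster's resource parameters.
    query: Callable[[dict], dict]


def rewrite_request(request: dict, cluster: Cluster) -> dict:
    """Placeholder rewrite: mark the request as adapted for the target cluster."""
    return {**request, "rewritten_for": cluster.name}


def fetch_resource_params(
    request: dict, catalog: Dict[str, dict], clusters: Dict[str, Cluster]
) -> dict:
    data_id = request["data_id"]            # parse the identifier from the request
    metadata = catalog[data_id]             # look up the metadata of the queried data
    target = clusters[metadata["cluster"]]  # determine the target cluster
    if target.heterogeneous:                # rewrite only for heterogeneous clusters
        request = rewrite_request(request, target)
    return target.query(request)            # send the request, receive the parameters


# Stub clusters that echo the request they received alongside a fixed parameter.
clusters = {
    "a": Cluster("a", False, lambda req: {"request": req, "num_files": 1000}),
    "b": Cluster("b", True, lambda req: {"request": req, "num_files": 500}),
}
catalog = {"table_x": {"cluster": "a"}, "table_y": {"cluster": "b"}}

plain = fetch_resource_params({"data_id": "table_x"}, catalog, clusters)
adapted = fetch_resource_params({"data_id": "table_y"}, catalog, clusters)
```

Here `plain` carries the request unchanged, while `adapted` carries the rewritten request because cluster `b` is flagged as heterogeneous.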
Optionally, as shown in fig. 8, the calculating module 14 includes:
a first calculating submodule 141, configured to, for each cluster, determine the maximum of the number of data files and the number of data blocks, calculate the sum of this maximum and the data amount, and calculate the ratio of that sum to the data block size, so as to obtain the number of tasks of the cluster;
a second calculating submodule 142, configured to determine the number of tasks of the distributed storage system according to the number of tasks of each cluster;
a third calculating submodule 143, configured to calculate, for each cluster, the product of the number of tasks and the memory required for processing each task, so as to obtain the memory occupancy of the cluster;
and a fourth calculating submodule 144, configured to determine the memory occupancy of the distributed storage system according to the memory occupancy of each cluster.
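The four calculating submodules can be sketched in Python as below. The per-cluster formula follows the wording above literally: tasks = (max(number of files, number of blocks) + data amount) / data block size. Several details are assumptions not stated in the text: the ratio is rounded up to a whole number of tasks, the system-level totals are obtained by summing the per-cluster values, and the units (MB for data, GB for memory) and sample figures are illustrative only.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class ClusterParams:
    num_files: int
    num_blocks: int
    data_amount: float      # assumed to be in the same unit as block_size (e.g. MB)
    block_size: float
    memory_per_task: float  # assumed unit: GB


def cluster_tasks(p: ClusterParams) -> int:
    """Number of tasks of one cluster, per the literal formula in the text."""
    largest = max(p.num_files, p.num_blocks)
    # Rounding up to a whole task is an assumption; the text only gives the ratio.
    return math.ceil((largest + p.data_amount) / p.block_size)


def cluster_memory(p: ClusterParams) -> float:
    """Memory occupancy of one cluster: tasks times memory per task."""
    return cluster_tasks(p) * p.memory_per_task


def system_totals(params: Iterable[ClusterParams]) -> Tuple[int, float]:
    """System-wide tasks and memory; summing over clusters is an assumption."""
    params = list(params)
    return (
        sum(cluster_tasks(p) for p in params),
        sum(cluster_memory(p) for p in params),
    )


# Illustrative figures only.
c1 = ClusterParams(num_files=1000, num_blocks=1200, data_amount=128000,
                   block_size=128, memory_per_task=2.0)
c2 = ClusterParams(num_files=500, num_blocks=300, data_amount=64000,
                   block_size=128, memory_per_task=1.0)

total_tasks, total_memory = system_totals([c1, c2])
```

Keeping the per-cluster functions separate from `system_totals` mirrors the split between the first/third and second/fourth submodules, so a different aggregation rule could be substituted without touching the per-cluster formula.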
It will be understood that the above embodiments are merely exemplary embodiments taken to illustrate the principles of the present invention, which is not limited thereto. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit and substance of the invention, and these modifications and improvements are also considered to be within the scope of the invention.

Claims (6)

1. A resource estimation method of a distributed storage system is characterized by comprising the following steps:
receiving a resource occupation query request for each cluster in the distributed storage system;
acquiring metadata of each cluster in the distributed storage system according to the resource occupation query request;
acquiring currently occupied resource parameters of each cluster according to the metadata of each cluster, wherein the currently occupied resource parameters comprise the number of data files, the data amount, the data block size, the number of data blocks and the memory required for processing each task;
calculating storage resource occupation parameters of the distributed storage system according to the currently occupied resource parameters of each cluster;
the step of obtaining the currently occupied resource parameters of each cluster according to the metadata of each cluster comprises:
parsing the identifier of the data to be queried from the resource occupation query request, and acquiring the metadata of the data to be queried according to the identifier of the data to be queried;
determining a target cluster to which the data to be queried belongs according to the metadata of the data to be queried;
sending the resource occupation query request to the target cluster;
receiving a currently occupied resource parameter sent by the target cluster;
the storage resource occupation parameters comprise the number of tasks and the memory occupation amount, and the step of calculating the storage resource occupation parameters of the distributed storage system according to the currently occupied resource parameters of each cluster comprises the following steps:
for each cluster, determining the maximum of the number of data files and the number of data blocks, calculating the sum of the maximum and the data amount, and calculating the ratio of that sum to the data block size to obtain the number of tasks of the cluster; wherein, if the average file size is smaller than the data block size set by the system, files are merged during processing, and the minimum number of fragments is greater than or equal to the number of data blocks;
determining the number of tasks of the distributed storage system according to the number of tasks of each cluster;
for each cluster, calculating the product of the number of tasks and the memory required for processing each task to obtain the memory occupation amount of the cluster;
and determining the memory occupation amount of the distributed storage system according to the memory occupation amount of each cluster.
2. The method for resource estimation of a distributed storage system according to claim 1, wherein the step of obtaining metadata of each cluster in the distributed storage system according to the resource occupation query request includes:
collecting, at a preset period, a binary-format metadata file stored in the distributed storage system, and converting the metadata file into a text format;
extracting metadata from the metadata file in the text format, wherein the metadata comprises at least one or any combination of the following items: file system directory name, file system directory, access user, user group, authority, file path, file modification time, and file access time.
3. The method for resource estimation of a distributed storage system according to claim 1, wherein after the step of determining the target cluster to which the data to be queried belongs according to the metadata of the data to be queried, and before the step of sending the resource occupation query request to the target cluster, the method further comprises:
judging whether the target cluster is a heterogeneous cluster, and if so, rewriting the resource occupation query request;
the sending the resource occupation query request to the target cluster includes: and sending the rewritten resource occupation query request to the target cluster.
4. A resource estimation device of a distributed storage system, characterized by comprising:
a receiving module, configured to receive a resource occupation query request for each cluster in the distributed storage system;
a first obtaining module, configured to obtain the metadata of each cluster in the distributed storage system according to the resource occupation query request;
a second obtaining module, configured to obtain, according to the metadata of each cluster, the currently occupied resource parameters of each cluster, where the currently occupied resource parameters include the number of data files, the data amount, the data block size, the number of data blocks, and the memory required for processing each task;
a calculating module, configured to calculate the storage resource occupation parameters of the distributed storage system according to the currently occupied resource parameters of each cluster;
the second obtaining module includes:
the acquisition sub-module is used for parsing the identifier of the data to be queried from the resource occupation query request and acquiring the metadata of the data to be queried according to the identifier of the data to be queried;
the determining submodule is used for determining a target cluster to which the data to be queried belongs according to the metadata of the data to be queried;
the sending sub-module is used for sending the resource occupation query request to the target cluster;
the receiving submodule is used for receiving the currently occupied resource parameters sent by the target cluster;
the calculation module comprises:
the first calculation submodule is used for determining, for each cluster, the maximum of the number of data files and the number of data blocks, calculating the sum of the maximum and the data amount, and calculating the ratio of that sum to the data block size to obtain the number of tasks of the cluster; wherein, if the average file size is smaller than the data block size set by the system, files are merged during processing, and the minimum number of fragments is greater than or equal to the number of data blocks;
the second calculation submodule is used for determining the number of tasks of the distributed storage system according to the number of tasks of each cluster;
the third calculation submodule is used for calculating, for each cluster, the product of the number of tasks and the memory required for processing each task to obtain the memory occupation amount of the cluster;
and the fourth calculation submodule is used for determining the memory occupation amount of the distributed storage system according to the memory occupation amount of each cluster.
5. The device for resource estimation of a distributed storage system according to claim 4, wherein the first obtaining module includes:
the acquisition submodule is used for collecting, at a preset period, the binary-format metadata file stored in the distributed storage system and converting the metadata file into a text format;
an extraction sub-module, configured to extract metadata from the metadata file in the text format, where the metadata includes at least one of the following items or any combination of the following items: file system directory name, file system directory, access user, user group, authority, file path, file modification time, and file access time.
6. The device for resource estimation of a distributed storage system according to claim 4, wherein the second obtaining module further comprises:
a judging module, used for determining whether the target cluster is a heterogeneous cluster and, if so, rewriting the resource occupation query request;
the sending submodule is specifically configured to send the rewritten resource occupation query request to the target cluster.
CN201910425874.4A 2019-05-21 2019-05-21 Distributed storage system resource estimation method and device Active CN110134738B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910425874.4A CN110134738B (en) 2019-05-21 2019-05-21 Distributed storage system resource estimation method and device

Publications (2)

Publication Number Publication Date
CN110134738A CN110134738A (en) 2019-08-16
CN110134738B true CN110134738B (en) 2021-09-10

Family

ID=67572348

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910425874.4A Active CN110134738B (en) 2019-05-21 2019-05-21 Distributed storage system resource estimation method and device

Country Status (1)

Country Link
CN (1) CN110134738B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569317A (en) * 2019-09-12 2019-12-13 北京明略软件系统有限公司 metadata collection method and device for data source
CN111680799B (en) * 2020-04-08 2024-02-20 北京字节跳动网络技术有限公司 Method and device for processing model parameters
CN113553166A (en) * 2020-04-26 2021-10-26 广州汽车集团股份有限公司 Cross-platform high-performance computing integration method and system
CN113111038B (en) * 2021-03-31 2024-01-19 北京达佳互联信息技术有限公司 File storage method, device, server and storage medium
CN115904859A (en) * 2021-09-30 2023-04-04 中兴通讯股份有限公司 Memory occupancy estimation method and device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102542040A (en) * 2011-12-27 2012-07-04 北京奇虎科技有限公司 Capacity acquiring method and system
CN103678563A (en) * 2011-12-27 2014-03-26 北京奇虎科技有限公司 Capacity obtaining method and system
CN104657260A (en) * 2013-11-25 2015-05-27 航天信息股份有限公司 Achievement method for distributed locks controlling distributed inter-node accessed shared resources
CN108694071A (en) * 2017-03-29 2018-10-23 瞻博网络公司 Multi-cluster dashboard for distributed virtualization infrastructure element monitoring and policy control

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140372250A1 (en) * 2011-03-04 2014-12-18 Forbes Media Llc System and method for providing recommended content
US20190051210A1 (en) * 2017-08-09 2019-02-14 Inchstones, LLC Distributed architecture for data synchronization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant