CN113467700B

CN113467700B - Heterogeneous storage-based data distribution method and device

Info

Publication number: CN113467700B
Application number: CN202010241808.4A
Authority: CN
Inventors: 周雁波
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2020-03-31
Filing date: 2020-03-31
Publication date: 2024-04-23
Anticipated expiration: 2040-03-31
Also published as: CN113467700A

Abstract

The embodiments of the specification provide a data distribution method and device based on heterogeneous storage. The method comprises: obtaining data to be distributed; calculating, according to access parameters of a storage cluster and data distribution combinations that distribute the data to be distributed to storage units in at least one type of storage cluster, the access cost corresponding to the data to be distributed under each data distribution combination; determining, according to those access costs, the target data distribution combination corresponding to the optimal access cost; and distributing the data to be distributed according to the target cluster and the target storage unit recorded in the target data distribution combination.

Description

Heterogeneous storage-based data distribution method and device

Technical Field

The embodiment of the specification relates to the technical field of data processing, in particular to a data distribution method based on heterogeneous storage. One or more embodiments of the present specification relate to a heterogeneous storage-based data distribution apparatus, a computing device, and a computer-readable storage medium.

Background

Cloud computing systems involve many forms of distributed storage systems. Such systems generally adopt a heterogeneous storage architecture in which the storage back end is composed of a large number of storage devices of different types, with hard disks serving as the data storage medium. This architecture is widely used in the field of cloud computing.

Different types of hard disks differ in number, performance, and price. When a system uses many hard disks and the read-write pressure on them is high, the amount of data to be read and written increases sharply, and the data processing time is prolonged accordingly.

Disclosure of Invention

In view of this, the embodiments of the present disclosure provide a data allocation method based on heterogeneous storage. One or more embodiments of the present specification are also directed to a heterogeneous storage-based data distribution apparatus, a computing device, and a computer-readable storage medium that address the technical deficiencies of the prior art.

According to a first aspect of embodiments of the present disclosure, there is provided a data allocation method based on heterogeneous storage, including:

obtaining data to be distributed to storage units in at least one type of storage cluster;

according to the access parameters of the storage cluster and the data distribution combination of the data to be distributed to the storage units, calculating the corresponding access cost of the data to be distributed under the data distribution combination;

determining a target data allocation combination corresponding to the optimal access cost according to the access cost corresponding to the data to be allocated;

and distributing the data to be distributed according to the target cluster and the target storage unit recorded in the target data distribution combination.

Optionally, the calculating the access overhead corresponding to the data to be allocated under the data allocation combination according to the access parameter of the storage cluster and the data allocation combination that allocates the data to be allocated to the storage unit includes:

Determining the data distribution combination for distributing the data to be distributed to the storage units in at least one type of storage cluster;

according to the access parameters of the storage cluster, respectively calculating first access overhead corresponding to the data to be distributed under each data distribution combination;

And calculating the total access cost for accessing the data to be allocated according to the first access cost corresponding to the data to be allocated under each data allocation combination, and taking the total access cost as the access cost corresponding to the data to be allocated under the data allocation combination.

Optionally, for any one of the data to be allocated, the first access overhead of the data to be allocated in any one of the data allocation combinations includes:

Under the data allocation combination, the sum of two products: the sum of the write access overhead of the computing node for the data to be allocated and the network overhead generated by the write access, multiplied by the number of write accesses; and the sum of the read access overhead of the computing node for the data to be allocated and the network overhead generated by the read access, multiplied by the number of read accesses.

Optionally, the total access overhead of the data to be allocated in any one of the data allocation combinations includes:

And under the data distribution combination, the sum of the first access overheads of all the computing nodes to the data to be distributed.

Optionally, the determining, according to the access overhead corresponding to the data to be allocated, a target data allocation combination corresponding to the optimal access overhead includes:

determining local optimal access cost of first data to be distributed in the data to be distributed in a first calculation stage according to the access cost corresponding to the data to be distributed;

Determining the local optimal access cost of the next data to be distributed in the next computing stage according to the local optimal access cost of the previous data to be distributed in the previous computing stage in the data to be distributed;

judging whether all the data to be distributed contained in the data to be distributed are calculated;

if yes, taking the local optimal access cost of the last data to be distributed in the last computing stage in the data to be distributed as the optimal access cost, and determining a target data distribution combination corresponding to the optimal access cost;

and if not, returning to execute the step of determining the local optimal access overhead of the data to be distributed in the next computing stage according to the local optimal access overhead of the data to be distributed in the previous computing stage.

Optionally, the local optimal access overhead of the first data to be allocated in the first computing stage is determined by the following manner:

and comparing the access cost of the first data to be distributed in each data distribution combination, and taking the minimum access cost as the local optimal access cost of the first data to be distributed in the first calculation stage.

Optionally, the local optimal access overhead of the next data to be allocated in the next calculation stage is determined by adopting the following manner:

summing the access cost of the next data to be allocated in each data allocation combination with the local optimal access cost of the previous data to be allocated in the previous calculation stage respectively to obtain each system access cost corresponding to the next data to be allocated in the next stage;

And comparing the system access cost corresponding to the next data to be allocated in the next stage, and taking the minimum system access cost as the local optimal access cost of the next data to be allocated in the next calculation stage.

Optionally, the storage cluster includes at least one of:

a storage cluster formed by solid state disks, a storage cluster formed by serial (SATA) mechanical hard disks, and a storage cluster formed by archive mechanical hard disks.

Optionally, the method is applied to a cloud computing platform, and the computing node includes: computing nodes in a computing cluster of a cloud computing platform.

According to a second aspect of embodiments of the present specification, there is provided a heterogeneous storage-based data distribution apparatus, comprising:

the acquisition module is configured to acquire data to be distributed to storage units in at least one type of storage clusters;

The computing module is configured to compute the corresponding access expense of the data to be distributed under the data distribution combination according to the access parameters of the storage cluster and the data distribution combination of the data to be distributed to the storage unit;

the determining module is configured to determine a target data allocation combination corresponding to the optimal access overhead according to the access overhead corresponding to the data to be allocated;

and the data distribution module is configured to distribute the data to be distributed according to the target cluster and the target storage unit recorded in the target data distribution combination.

Optionally, the computing module includes:

A combination determination submodule configured to determine the data allocation combination which allocates the data to be allocated to the storage units in at least one type of storage cluster;

The first access overhead calculation submodule is configured to calculate first access overheads corresponding to the data to be distributed under each data distribution combination according to the access parameters of the storage cluster;

The total access overhead calculation sub-module is configured to calculate the total access overhead for accessing the data to be distributed according to the first access overhead corresponding to the data to be distributed under each data distribution combination, and the total access overhead is used as the access overhead corresponding to the data to be distributed under the data distribution combination.

Optionally, the determining module includes:

The first local optimal access cost determining submodule is configured to determine the local optimal access cost of the first data to be distributed in the data to be distributed in a first calculation stage according to the access cost corresponding to the data to be distributed;

The second local optimal access cost determining submodule is configured to determine the local optimal access cost of the next data to be distributed in the next computing stage according to the local optimal access cost of the previous data to be distributed in the previous computing stage in the data to be distributed;

the judging submodule is configured to judge whether all the data to be distributed contained in the data to be distributed are calculated;

if the execution result of the judging sub-module is yes, the optimal access expense determining sub-module is operated;

If the execution result of the judging sub-module is negative, the second local optimal access cost determining sub-module is operated;

The optimal access cost determining submodule is configured to take the local optimal access cost of the last data to be distributed in the last computing stage in the data to be distributed as the optimal access cost.

Optionally, the first locally optimal access overhead determination submodule is further configured to:

Optionally, the second locally optimal access overhead determination submodule is further configured to:

According to a third aspect of embodiments of the present specification, there is provided a computing device comprising:

a memory and a processor;

the memory is for storing computer-executable instructions, and the processor is for executing the computer-executable instructions:

According to a fourth aspect of embodiments of the present description, there is provided a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the heterogeneous storage based data allocation method.

According to the method, the device and the system, the data to be distributed is obtained, the access cost corresponding to the data to be distributed under the data distribution combination is calculated according to the access parameters of the storage clusters and the data distribution combination for distributing the data to be distributed to storage units in at least one type of storage clusters, the target data distribution combination corresponding to the optimal access cost is determined according to the access cost corresponding to the data to be distributed, and the data to be distributed is distributed according to the target clusters and the target storage units recorded in the target data distribution combination, so that the data distribution problem in a heterogeneous storage system is solved, the overall access performance of the system is optimized, and the space utilization is maximized.

Drawings

FIG. 1 is a process flow diagram of a heterogeneous storage-based data allocation method provided in one embodiment of the present disclosure;

FIG. 2 is a schematic diagram of heterogeneous storage-based data allocation provided in one embodiment of the present disclosure;

FIG. 3 is a flowchart of a process of applying a data allocation method based on heterogeneous storage to an actual scenario according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of a data distribution device based on heterogeneous storage according to one embodiment of the present disclosure;

FIG. 5 is a block diagram of a computing device provided in one embodiment of the present description.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present description. However, this description can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit. This description is therefore not limited by the specific implementations disclosed below.

The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any or all possible combinations of one or more of the associated listed items.

It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first may also be referred to as a second, and similarly, a second may also be referred to as a first, without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "when", "upon", or "in response to a determination", depending on the context.

First, terms related to one or more embodiments of the present specification will be explained.

SSD: solid state disk.

HDD: a mechanical hard disk.

NVMe SSD: solid state disk based on NVMe protocol standard.

SATA HDD: a mechanical hard disk based on the SATA protocol.

IOPS: the number of read/write operations that can be processed per second, generally used to reflect how fast a disk processes read/write requests.

In the present specification, a data distribution method based on heterogeneous storage is provided, and the present specification relates to a data distribution apparatus based on heterogeneous storage, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments.

Fig. 1 shows a process flow diagram of a heterogeneous storage-based data allocation method according to one embodiment of the present disclosure, including steps 102 to 108.

Step 102, obtaining data to be distributed to storage units in at least one type of storage cluster.

In the embodiments of the present disclosure, the storage clusters include a storage cluster formed by a solid state hard disk (NVMe SSD), a storage cluster formed by a serial mechanical hard disk (SATA HDD), and a storage cluster formed by an archive mechanical hard disk.

Different types of hard disks perform differently. For example, an NVMe SSD has higher throughput, higher read IOPS, and lower latency; however, such devices are expensive, fewer in number, and have lower write endurance, so they are suitable for storing data that is read at a high frequency. An ordinary SATA HDD is relatively inexpensive, more plentiful, and has balanced read-write performance, but its performance has a certain bottleneck, so it is suitable for data accessed at a lower frequency. An archive hard disk has the lowest price and the lowest performance, and is suitable for data that is rarely accessed.

Furthermore, in an actual storage application scenario, data items differ in type and in access frequency: some data is often read, some data is often written, and some data is rarely accessed. Therefore, if data is allocated to a storage system at random, the access performance of the storage system and the utilization of the storage space may not be optimal.

Based on this, the data allocation method based on heterogeneous storage provided in the embodiments of the present disclosure is applied to a cloud computing platform. The method acquires data to be allocated, calculates the access overhead corresponding to the data to be allocated under different data allocation combinations, determines the target data allocation combination corresponding to the optimal access overhead, and finally allocates the data according to the target cluster and the target storage unit recorded in the target data allocation combination. The data to be allocated is thereby reasonably allocated to corresponding storage devices in the storage system according to its data type and access frequency, so that the access performance of the whole system is optimized and the storage space utilization of the system is maximized.

Step 104, according to the access parameters of the storage cluster and the data distribution combination of the data to be distributed to the storage unit, calculating the access cost corresponding to the data to be distributed under the data distribution combination.

Specifically, the access parameters of the storage cluster include: the read access overhead of storage cluster m_i; the write access overhead of storage cluster m_i; the number of write accesses by computing node c_x to the data d_q to be allocated; the number of read accesses by computing node c_x to the data d_q to be allocated; and the network overhead generated by computing node c_x accessing storage cluster m_i.

In addition, in the embodiment of the present disclosure, the data to be allocated are d_1, d_2, ..., d_q, where q is the number of data to be allocated; the computing nodes are c_1, c_2, ..., c_x, where x is the number of computing nodes; the storage clusters are m_1, m_2, ..., m_i, where i is the number of storage clusters; the maximum storage capacity of storage cluster m_i is denoted by S_i, and a single storage unit of storage cluster m_i is denoted by s_i.

In practical application, the data to be allocated can be allocated by using a data allocation function F to obtain the data allocation combination. The data allocation function F is defined as F(D) = M, where D is the set of data to be allocated, D = {d_1, d_2, ..., d_q}, and M is the set of storage clusters, M = {m_1, m_2, ..., m_i}. For example, F(d_q) = m_i denotes storing data d_q in storage cluster m_i.
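As an illustration of how data allocation combinations might be enumerated, the following minimal sketch lists every mapping F from data items to clusters that respects cluster capacity. The function name `allocation_combinations`, the string labels, and the assumption that each data item occupies exactly one storage unit are hypothetical and not taken from the specification.

```python
from itertools import product


def allocation_combinations(data, capacity):
    """Yield every mapping F: data item -> cluster that respects capacity.

    `capacity` maps a cluster label to its number of storage units.
    Assumption (illustrative only): each data item occupies one unit.
    """
    clusters = list(capacity)
    for choice in product(clusters, repeat=len(data)):
        # Reject assignments that place more items on a cluster than it holds.
        if all(choice.count(m) <= capacity[m] for m in clusters):
            yield dict(zip(data, choice))


# Two data items, two clusters with one storage unit each.
combos = list(allocation_combinations(["d1", "d2"], {"m1": 1, "m2": 1}))
```

With one unit per cluster, only the two combinations that place the items on different clusters remain. Exhaustive enumeration grows exponentially with the number of data items, so it is practical only for small inputs.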

After the data to be distributed is distributed by utilizing the data distribution function F to obtain a data distribution combination, the corresponding access expense of the data to be distributed under the data distribution combination can be calculated according to the access parameters of the storage cluster and the data distribution combination, and the method can be realized by the following steps:

In specific implementation, for any one of the data to be allocated, the first access overhead of that data under any one of the data allocation combinations is calculated as the sum of two products: the sum of the write access overhead of the computing node for the data and the network overhead generated by the write access, multiplied by the number of write accesses; and the sum of the read access overhead of the computing node for the data and the network overhead generated by the read access, multiplied by the number of read accesses.

Specifically, the computing nodes include computing nodes in a computing cluster of a cloud computing platform.

The first access overhead is the access overhead generated when a given computing node accesses a given data item after that item has been stored in a given storage cluster. It is denoted Cost(d_q, m_i, c_x) and is calculated as shown in formula (1).
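From the verbal definition of the first access overhead given above, formula (1) can be written as the following sketch. The symbol names are assumptions chosen for illustration: W_{m_i} and R_{m_i} stand for the write and read access overhead of storage cluster m_i, N_{c_x}^{m_i} for the network overhead generated by c_x accessing m_i, and w_{c_x}^{d_q} and r_{c_x}^{d_q} for the numbers of write and read accesses by c_x to d_q.

```latex
\mathrm{Cost}(d_q, m_i, c_x)
  = \left(W_{m_i} + N_{c_x}^{m_i}\right) w_{c_x}^{d_q}
  + \left(R_{m_i} + N_{c_x}^{m_i}\right) r_{c_x}^{d_q}
  \tag{1}
```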

In addition, the total access overhead of the data to be allocated under any one of the data allocation combinations is the sum of the first access overheads of all computing nodes for that data under the combination.

Specifically, the total access overhead is the sum of the first access overheads generated by all computing nodes accessing a given data item after that item has been allocated to a storage cluster. It is denoted DataCost(d_q, m_i) and is calculated as shown in formula (2).

Where |C| represents the total number of computing nodes.
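Consistent with the description of the total access overhead, formula (2) can be sketched as a sum over all computing nodes. The summation index x running from 1 to |C| is an assumption about the original layout.

```latex
\mathrm{DataCost}(d_q, m_i) = \sum_{x=1}^{|C|} \mathrm{Cost}(d_q, m_i, c_x)
  \tag{2}
```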

Take any one of the data to be allocated, d_1, as an example. According to the function F(d_1) = m_1, the data d_1 is stored in storage cluster m_1. If the computing cluster includes 2 computing nodes, c_1 and c_2, then according to formula (1) the access overhead when computing node c_1 accesses d_1 in storage cluster m_1 is Cost(d_1, m_1, c_1), and the access overhead when computing node c_2 accesses d_1 in storage cluster m_1 is Cost(d_1, m_1, c_2). The total access overhead, that is, the sum of the first access overheads generated by all computing nodes accessing the data, is DataCost(d_1, m_1) = Cost(d_1, m_1, c_1) + Cost(d_1, m_1, c_2).

Finally, the target storage cluster of the data to be allocated is determined by comparing the access overheads, which helps optimize the overall access performance of the storage system.
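The two-node example can be traced numerically with a short sketch. All numeric values, the `ClusterParams` class, and the function names `cost` and `data_cost` are hypothetical and chosen only to make the arithmetic easy to follow; the specification gives no concrete figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterParams:
    read_overhead: float   # overhead of one read served by the cluster
    write_overhead: float  # overhead of one write served by the cluster


def cost(cluster, network_overhead, writes, reads):
    """First access overhead of one computing node for one data item."""
    return ((cluster.write_overhead + network_overhead) * writes
            + (cluster.read_overhead + network_overhead) * reads)


def data_cost(cluster, per_node):
    """Total access overhead: sum of first access overheads over all nodes.

    `per_node` holds one (network_overhead, writes, reads) tuple per node.
    """
    return sum(cost(cluster, n, w, r) for n, w, r in per_node)


# Hypothetical parameters for cluster m1 and for nodes c1 and c2 accessing d1.
m1 = ClusterParams(read_overhead=1.0, write_overhead=2.0)
nodes = [(0.5, 10, 100), (1.0, 0, 20)]
total = data_cost(m1, nodes)
```

With these numbers, node c1 contributes (2.0 + 0.5) × 10 + (1.0 + 0.5) × 100 = 175 and node c2 contributes (1.0 + 1.0) × 20 = 40, so the total for d1 on m1 is 215.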

Step 106, determining a target data allocation combination corresponding to the optimal access cost according to the access cost corresponding to the data to be allocated.

In the implementation, according to the access cost corresponding to the data to be allocated, determining a target data allocation combination corresponding to the optimal access cost can be realized in the following manner:

Further, the local optimal access overhead of the first data to be distributed in the first computing stage is determined by the following manner:

In addition, the local optimal access cost of the next data to be distributed in the next calculation stage is determined by the following method:

Specifically, the embodiment of the specification determines the target data allocation combination corresponding to the optimal access overhead in the above manner. The core idea is to decompose the problem to be solved into a plurality of sub-problems (sub-stages) and solve them in sequence, with the solution of each sub-problem providing useful information for solving the next. When solving any sub-problem, the various possible partial solutions are listed, the partial solutions that may be optimal are retained through a decision, and the other partial solutions are discarded. The sub-problems are solved in turn, and the solution of the last sub-problem is the solution of the original problem.

Because the final optimal solution is required, the sub-problem of each preceding stage must be computed. A sub-problem is defined as the problem of allocating a single data item to be allocated to a limited number of storage units.

Determining the optimal access overhead means determining the local optimal access overhead of each data to be allocated in each computing stage. The local optimal access overhead is the minimum among the system access overheads corresponding to the data to be allocated in that stage. The system access overhead is the total access overhead generated by all computing nodes accessing the data after the data has been allocated to the corresponding storage clusters; it is denoted SysCost(D, M) and is calculated as shown in formula (3).

Where |D| represents the number of data allocated.
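Given the definitions of DataCost and the data allocation function F, formula (3) can be sketched as the sum of the total access overheads of all allocated data, each evaluated at the cluster that F assigns to it. Writing the assigned cluster as F(d_q) is an assumption about the original notation.

```latex
\mathrm{SysCost}(D, M) = \sum_{q=1}^{|D|} \mathrm{DataCost}\bigl(d_q, F(d_q)\bigr)
  \tag{3}
```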

To determine the optimal access overhead, the problem is decomposed into a plurality of sub-problems. The first sub-problem is to determine the storage cluster whose storage space should receive the first data to be allocated such that the access overhead corresponding to the first data is the local optimal access overhead.

After the local optimal access overhead corresponding to the first data is determined, the second sub-problem is to determine which storage space, among the storage space of the storage clusters other than that occupied by the first data, should receive the second data to be allocated such that the access overhead corresponding to the second data is the local optimal access overhead.

After the local optimal access overhead corresponding to the second data is determined, each remaining sub-problem is solved in the same way until the last sub-problem is solved, and the local optimal access overhead corresponding to the last data to be allocated is taken as the optimal access overhead.

Specifically, in the embodiment of the present disclosure, the optimal access overhead is determined according to the access overhead corresponding to the data to be allocated, and the calculation formula is shown in formula (4).

Where C[q, s_0, s_1, ..., s_i] represents the optimal access overhead.

The input of formula (4) is: the read access overhead of storage cluster m_i; the write access overhead of storage cluster m_i; the number of write accesses by computing node c_x to the data d_q to be allocated; the number of read accesses by computing node c_x to the data d_q to be allocated; the network overhead generated by computing node c_x accessing storage cluster m_i; the set D = {d_1, d_2, ..., d_q} of the data to be allocated; the set M = {m_1, m_2, ..., m_i} of storage clusters and the capacity S_i of each; and the set C = {c_1, c_2, ..., c_x} of computing nodes.

The output is the final allocation result Allocation[q, s_0, s_1, ..., s_i] of the data to be allocated and the current overhead C[q, s_0, s_1, ..., s_i], where Allocation[q, s_0, s_1, ..., s_i] records the allocation of the first q data to be allocated.

When the number of data to be allocated q = 0, the current system overhead C = 0. When q > 0 but the number of storage units s < q, the current system overhead C is infinite. When q > 0 and s > q, the optimal access overhead (the minimum value of the current system overhead C) can be calculated from the local optimal access overhead corresponding to each data to be allocated.
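The three cases just described suggest a recurrence of roughly the following shape for formula (4). This is a sketch under stated assumptions, not the exact formula of the specification: the state is written as a vector of free storage units per cluster, s = (s_1, ..., s_n) with n clusters, |s| denotes the total number of free units, each data item is assumed to occupy one storage unit, and the boundary case |s| = q is treated like |s| > q.

```latex
C[q, \mathbf{s}] =
\begin{cases}
0, & q = 0,\\[4pt]
\infty, & q > 0 \text{ and } |\mathbf{s}| < q,\\[4pt]
\displaystyle \min_{k \,:\, s_k \ge 1}
  \Bigl\{ C[\,q-1,\ s_1, \ldots, s_k - 1, \ldots, s_n\,]
          + \mathrm{DataCost}(d_q, m_k) \Bigr\}, & \text{otherwise,}
\end{cases}
\qquad |\mathbf{s}| = \sum_{k=1}^{n} s_k
\tag{4}
```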

For every combination of data d_q and storage cluster m_i in the sets D and M, DataCost(d_q, m_i) can be calculated according to formulas (1) and (2). Combined with the local optimal access overhead of each computing stage corresponding to each data to be allocated, this yields the optimal access overhead and the target data allocation combination of the data to be allocated.

Take two obtained data to be allocated as an example, namely d_1 and d_2. The storage system has two storage clusters, m_1 and m_2, the capacity of each storage cluster is only 1, and the computing cluster includes 2 computing nodes, c_1 and c_2.

To determine the target data allocation combination corresponding to the optimal access overhead, first determine the local optimal access overhead of the first data to be allocated in the first computing stage. According to formula (2), the access overheads of the first data d_1 in the first computing stage (one per data allocation combination) are DataCost(d_1, m_1) and DataCost(d_1, m_2). Comparison determines that the local optimal solution of d_1 in the first computing stage is DataCost(d_1, m_2) (the smallest access overhead value), so the optimal allocation of d_1 is to storage cluster m_2.

After the local optimal access overhead of the first data in the first computing stage is determined, the local optimal access overhead of the second data in the second computing stage is determined. Because the storage capacity of each storage cluster is 1, the only storage cluster available to the second data d_2 is m_1, and according to formula (2) the access overhead of d_2 in the second computing stage is DataCost(d_2, m_1).

DataCost(d_1, m_2) and DataCost(d_2, m_1) are summed to obtain the system access overhead SysCost corresponding to the data d_2 in the second stage, which is the local optimal access overhead of d_2 in the second computing stage.

Because only two data to be allocated were obtained, all data to be allocated have now been calculated, so the local optimal access overhead of the last (second) data d_2 in the last (second) computing stage is taken as the optimal access overhead.
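The staged computation in this example can be sketched as a memoized search over the remaining capacity of each cluster. This is one possible realization under assumptions, not the implementation of the specification: the function name `best_allocation`, the table layout `data_cost[q][i]`, the one-unit-per-item assumption, and the overhead values are all hypothetical. Unlike the walk-through above, the sketch explores every capacity state rather than keeping a single partial solution per stage.

```python
from functools import lru_cache
import math


def best_allocation(data_cost, capacity):
    """Return (minimum total overhead, cluster index chosen for each item).

    data_cost[q][i] plays the role of DataCost(d_q, m_i); capacity[i] is the
    number of free storage units in cluster m_i (one unit per item assumed).
    """
    n_data = len(data_cost)

    @lru_cache(maxsize=None)
    def solve(q, free):
        # Items 0..q-1 are already placed; `free` holds remaining units.
        if q == n_data:
            return 0.0, ()
        best = (math.inf, ())
        for i, units in enumerate(free):
            if units == 0:
                continue  # cluster m_i is full
            remaining = free[:i] + (units - 1,) + free[i + 1:]
            sub_cost, sub_plan = solve(q + 1, remaining)
            total = data_cost[q][i] + sub_cost
            if total < best[0]:
                best = (total, (i,) + sub_plan)
        return best

    return solve(0, tuple(capacity))


# Hypothetical overheads: rows are d1, d2; columns are m1, m2.
overheads = [[5.0, 3.0], [4.0, 6.0]]
result = best_allocation(overheads, [1, 1])
```

With these values the search places d1 on m2 and d2 on m1 for a total overhead of 7, which matches the allocation reached in the example. When the clusters cannot hold all items, the returned overhead is infinite.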

According to the embodiment of the specification, the optimal solution is obtained by determining the target allocation combination of the data to be allocated, so that the time complexity and the space complexity of solving the optimal solution are reduced.

And step 108, performing data distribution on the data to be distributed according to the target cluster and the target storage unit recorded in the target data distribution combination.

Specifically, after determining a target data allocation combination, performing data allocation on data to be allocated according to a target cluster and a target storage unit recorded in the target data allocation combination.

Continuing the above example, the optimal access overhead is SysCost1 = DataCost(d₁, m₂) + DataCost(d₂, m₁), from which it follows that the target data allocation combination of the data to be allocated d₁ and d₂ is the storage clusters m₂ and m₁; d₁ and d₂ are therefore allocated to the storage clusters m₂ and m₁, respectively.
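The two-stage example can be traced with a short Python sketch. The overhead values below are hypothetical placeholders (the specification gives no numbers), and the function name `allocate_in_stages` is illustrative; only the structure, one unit of capacity per cluster and one computing stage per piece of data, follows the text.

```python
# Hypothetical DataCost values; the specification does not give numbers.
data_cost = {
    ("d1", "m1"): 9.0,
    ("d1", "m2"): 4.0,
    ("d2", "m1"): 6.0,
    ("d2", "m2"): 5.0,
}
capacity = {"m1": 1, "m2": 1}  # each cluster holds one piece of data


def allocate_in_stages(items, data_cost, capacity):
    """Walk the computing stages in order, as in the two-item example."""
    remaining = dict(capacity)
    allocation = {}
    sys_cost = 0.0
    for d in items:
        # Only clusters with free capacity are candidates in this stage.
        candidates = [m for m, free in remaining.items() if free > 0]
        best = min(candidates, key=lambda m: data_cost[(d, m)])
        allocation[d] = best
        remaining[best] -= 1
        sys_cost += data_cost[(d, best)]
    return allocation, sys_cost


allocation, sys_cost = allocate_in_stages(["d1", "d2"], data_cost, capacity)
print(allocation, sys_cost)
```

With these placeholder values, d1 goes to m2 and d2 to the only remaining cluster m1, matching the walk-through. The sketch mirrors only this small example; it does not keep a table of states across stages.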

In practical applications, a schematic diagram of data allocation based on heterogeneous storage is shown in Fig. 2. The storage clusters in Fig. 2 include an NVMe SSD cluster, a SATA HDD cluster, and an archive HDD cluster. After the data to be allocated is obtained, the allocation process needs to combine the data type of the data to be allocated with the performance of the different storage clusters, so that the data to be allocated is reasonably allocated to the corresponding storage cluster.

In addition, since computing nodes and storage clusters transfer data over a network, a computing node incurs a certain network access overhead when accessing a storage cluster, so the overhead generated when different computing nodes access different storage clusters differs. These factors are considered together, and multiple different allocation scenarios are calculated and analyzed to obtain the final optimal solution, thereby optimizing overall space utilization and overall performance.

The data distribution method based on heterogeneous storage provided in the present specification will be further described with reference to fig. 3 by taking an application of the data distribution method based on heterogeneous storage in a practical scenario as an example. Fig. 3 is a flowchart of a process of applying the heterogeneous storage-based data allocation method to an actual scenario according to an embodiment of the present disclosure, where specific steps include steps 302 to 318.

Step 302, obtaining data to be distributed to storage units in at least one type of storage cluster.

The heterogeneous storage-based data distribution method provided by the embodiments of this specification is applied to a cloud computing platform. The method obtains the data to be allocated, calculates the access overhead of the data to be allocated under different data allocation combinations, determines the target data allocation combination corresponding to the optimal access overhead, and finally allocates the data according to the target cluster and the target storage unit recorded in the target data allocation combination. In other words, the data to be allocated is reasonably allocated to the corresponding storage devices in the storage system according to its data type and the frequency with which it is accessed, so that the access performance of the whole system is optimized and the storage space utilization of the system is maximized.

Step 304, determining the data allocation combination for allocating the data to be allocated to the storage units in at least one type of storage cluster.

Specifically, the data to be allocated may be allocated by means of a data allocation function F to obtain the data allocation combinations. The data allocation function is defined as F(D) = M, where D is the set of data to be allocated, D = {d₁, d₂, ..., d_q}, and M is the set of storage clusters, M = {m₁, m₂, ..., m_i}. For example, F(d_q) = m_i denotes storing data d_q in storage cluster m_i.
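As an illustration of what the set of data allocation combinations looks like, the following Python sketch enumerates every mapping F from data to clusters. The function name `allocation_combinations` is hypothetical, and the capacity filter (capacity counted in pieces of data) is an assumption taken from the unit-capacity example elsewhere in this description.

```python
from itertools import product


def allocation_combinations(data, capacity):
    """Yield every mapping F: data -> cluster that respects cluster capacity."""
    clusters = list(capacity)
    for choice in product(clusters, repeat=len(data)):
        # Count how many pieces of data this choice places on each cluster.
        used = {m: choice.count(m) for m in clusters}
        if all(used[m] <= capacity[m] for m in clusters):
            yield dict(zip(data, choice))


combos = list(allocation_combinations(["d1", "d2"], {"m1": 1, "m2": 1}))
print(combos)
```

For two pieces of data and two clusters of capacity 1, only the two combinations that place the data on different clusters survive the filter.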

Step 306, according to the access parameters of the storage cluster, calculating the first access overhead corresponding to the data to be distributed under each data distribution combination.

Specifically, for any piece of data to be allocated and any one of the data allocation combinations, the first access overhead is calculated under that data allocation combination as the sum of two products: the sum of the write access overhead of a computing node to the data and the network overhead generated by the write access, multiplied by the number of write accesses; and the sum of the read access overhead of the computing node to the data and the network overhead generated by the read access, multiplied by the number of read accesses.
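The calculation described in words above can be written out as a small Python sketch. The class `AccessParams` and its field names are hypothetical, chosen only to label the six quantities the text mentions; the numeric values are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessParams:
    """Hypothetical parameters for one (data, cluster, node) triple."""
    write_cost: float      # overhead of one write access on the cluster
    write_net_cost: float  # network overhead generated by one write access
    write_count: int       # number of write accesses by the node
    read_cost: float       # overhead of one read access on the cluster
    read_net_cost: float   # network overhead generated by one read access
    read_count: int        # number of read accesses by the node


def first_access_cost(p: AccessParams) -> float:
    """(write + write network) * writes + (read + read network) * reads."""
    write_part = (p.write_cost + p.write_net_cost) * p.write_count
    read_part = (p.read_cost + p.read_net_cost) * p.read_count
    return write_part + read_part


p = AccessParams(2.0, 0.5, 10, 1.0, 0.25, 40)
print(first_access_cost(p))  # (2.0 + 0.5) * 10 + (1.0 + 0.25) * 40 = 75.0
```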

Step 308, calculating total access overhead for accessing the data to be allocated according to the first access overhead corresponding to the data to be allocated under each data allocation combination, and taking the total access overhead as the access overhead corresponding to the data to be allocated under the data allocation combination.

Specifically, the first access overhead, that is, the access overhead generated when a given computing node accesses a given piece of data to be allocated after that data has been stored in a given storage cluster, is denoted Cost(d_q, m_i, c_x), and Cost(d_q, m_i, c_x) is calculated as shown in formula (1).

The total access overhead, that is, the sum of the first access overheads generated when a given piece of data is allocated to a storage cluster and accessed by all computing nodes, is denoted DataCost(d_q, m_i), and DataCost(d_q, m_i) is calculated as shown in formula (2).
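The relation between the two quantities, DataCost as the sum of Cost over all computing nodes, can be sketched in Python as follows. The per-node values in `cost_table` are hypothetical placeholders, and passing Cost as a callable is a design choice of this sketch, not something the specification prescribes.

```python
def data_cost(d, m, nodes, cost):
    """DataCost(d, m): sum of Cost(d, m, c) over every computing node c."""
    return sum(cost(d, m, c) for c in nodes)


# Hypothetical per-node overheads Cost(d, m, c); the values are placeholders.
cost_table = {
    ("d1", "m1", "c1"): 5.0, ("d1", "m1", "c2"): 4.0,
    ("d1", "m2", "c1"): 1.5, ("d1", "m2", "c2"): 2.5,
}
nodes = ["c1", "c2"]
table = {
    (d, m): data_cost(d, m, nodes, lambda d, m, c: cost_table[(d, m, c)])
    for d, m in [("d1", "m1"), ("d1", "m2")]
}
print(table)
```

The resulting `table` holds one DataCost value per (data, cluster) pair, which is the input the later stages compare.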

Step 310, determining a local optimal access cost of the first data to be allocated in the first calculation stage in the data to be allocated according to the access cost corresponding to the data to be allocated.

Step 312, determining the local optimal access overhead of the next data to be allocated in the next computing stage according to the local optimal access overhead of the previous data to be allocated in the previous computing stage.

Step 314, judging whether all the data to be distributed contained in the data to be distributed are calculated; if yes, go to step 316; if not, go back to step 312.

Step 316, taking the local optimal access overhead of the last data to be allocated in the last computing stage as the optimal access overhead.

For each s_i, C[q, s₀, s₁, …, s_i] is initialized to infinity and Allocation[q, s₀, s₁, …, s_i] is initialized to null, and DataCost(d_q, m_i) is calculated according to formulas (1) and (2) for every combination of data d_q and storage cluster m_i in the sets D and M.

The final allocation result to be output and the current system overhead are then updated according to the data to be allocated under the different data allocation combinations, thereby obtaining the target data allocation combination of the data to be allocated.
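One possible reading of this table-based procedure is a memoized search over states made up of the index of the current data and the remaining capacity of every cluster. The sketch below is written under that assumption: the recurrence, the function name `optimal_allocation`, and the placeholder overhead values are not taken from the specification, which does not spell out the update rule here.

```python
from functools import lru_cache


def optimal_allocation(items, clusters, capacity, data_cost):
    """Return (minimum total overhead, {item: cluster}) by staged search.

    State = (index of the next item, remaining capacity of every cluster),
    which plays the role of the table index [q, s_0, ..., s_i] in the text.
    """

    @lru_cache(maxsize=None)
    def best(q, remaining):
        if q == len(items):
            return 0.0, ()
        result = (float("inf"), None)  # table entries start at infinity / null
        for i, m in enumerate(clusters):
            if remaining[i] == 0:
                continue  # cluster m is full in this state
            nxt = remaining[:i] + (remaining[i] - 1,) + remaining[i + 1:]
            tail_cost, tail_alloc = best(q + 1, nxt)
            if tail_alloc is None:
                continue  # the remaining items cannot be placed
            total = data_cost[(items[q], m)] + tail_cost
            if total < result[0]:
                result = (total, (m,) + tail_alloc)
        return result

    cost, chosen = best(0, tuple(capacity[m] for m in clusters))
    return cost, dict(zip(items, chosen)) if chosen is not None else None


# Hypothetical DataCost values; the specification does not give numbers.
costs = {("d1", "m1"): 9.0, ("d1", "m2"): 4.0,
         ("d2", "m1"): 6.0, ("d2", "m2"): 5.0}
total, target = optimal_allocation(["d1", "d2"], ["m1", "m2"],
                                   {"m1": 1, "m2": 1}, costs)
print(total, target)
```

Memoizing on the state means each combination of data index and remaining capacities is evaluated once, rather than once per full allocation combination.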

Step 318, performing data distribution on the data to be distributed according to the target cluster and the target storage unit recorded in the target data distribution combination.

The embodiments of this specification calculate the optimal storage location for all data. When data is allocated, both the characteristics of the different storage devices and the access characteristics of the different data are considered. The two are considered together, and multiple different allocation scenarios are calculated and analyzed to obtain the final optimal solution, thereby optimizing overall space utilization and overall performance.

Corresponding to the method embodiment, the present disclosure further provides an embodiment of a data distribution device based on heterogeneous storage, and fig. 4 shows a schematic structural diagram of the data distribution device based on heterogeneous storage according to one embodiment of the present disclosure. As shown in fig. 4, the apparatus includes:

an obtaining module 402, configured to obtain data to be allocated to storage units in at least one type of storage cluster;

a calculating module 404, configured to calculate the access overhead corresponding to the data to be allocated under each data allocation combination according to the access parameters of the storage cluster and the data allocation combinations that allocate the data to be allocated to the storage units;

a determining module 406, configured to determine, according to the access overhead corresponding to the data to be allocated, the target data allocation combination corresponding to the optimal access overhead;

a data allocation module 408, configured to allocate the data to be allocated according to the target cluster and the target storage unit recorded in the target data allocation combination.
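How the four modules hand data to one another can be illustrated with a compact Python sketch. The class name, method names, and overhead values are hypothetical. For brevity the calculating step enumerates all combinations exhaustively, which is a simplification of this sketch and not the staged procedure of the method embodiment.

```python
from itertools import product


class HeterogeneousAllocator:
    """Illustrative wiring of modules 402 to 408; all names are hypothetical."""

    def __init__(self, capacity, data_cost):
        self.capacity = capacity    # {cluster: free slots}
        self.data_cost = data_cost  # {(item, cluster): total access overhead}

    def obtain(self, pending):              # obtaining module 402
        return list(pending)

    def calculate(self, items):             # calculating module 404
        clusters = list(self.capacity)
        costs = {}
        for choice in product(clusters, repeat=len(items)):
            if any(choice.count(m) > self.capacity[m] for m in clusters):
                continue  # combination exceeds a cluster's capacity
            costs[choice] = sum(self.data_cost[(d, m)]
                                for d, m in zip(items, choice))
        return costs

    def determine(self, costs):             # determining module 406
        return min(costs, key=costs.get)

    def allocate(self, items, target):      # data allocation module 408
        return dict(zip(items, target))

    def run(self, pending):
        items = self.obtain(pending)
        target = self.determine(self.calculate(items))
        return self.allocate(items, target)


# Hypothetical DataCost values; the specification does not give numbers.
allocator = HeterogeneousAllocator(
    {"m1": 1, "m2": 1},
    {("d1", "m1"): 9.0, ("d1", "m2"): 4.0,
     ("d2", "m1"): 6.0, ("d2", "m2"): 5.0},
)
result = allocator.run(["d1", "d2"])
print(result)
```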

Optionally, the computing module 404 includes:

Optionally, the determining module 406 includes:

Optionally, the storage cluster includes at least one of:

Optionally, the apparatus is applied to a cloud computing platform, and the computing node includes: computing nodes in a computing cluster of a cloud computing platform.

The foregoing is a schematic scheme of a data distribution device based on heterogeneous storage in this embodiment. It should be noted that, the technical solution of the data distribution device based on heterogeneous storage and the technical solution of the data distribution method based on heterogeneous storage belong to the same concept, and details of the technical solution of the data distribution device based on heterogeneous storage, which are not described in detail, can be referred to the description of the technical solution of the data distribution method based on heterogeneous storage.

Fig. 5 illustrates a block diagram of a computing device 500 provided in accordance with one embodiment of the present description. The components of the computing device 500 include, but are not limited to, a memory 510 and a processor 520. The processor 520 is coupled to the memory 510 via a bus 530, and a database 550 is used to store data.

The computing device 500 also includes an access device 540, which enables the computing device 500 to communicate via one or more networks 560. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. The access device 540 may include one or more of any type of wired or wireless network interface (e.g., a Network Interface Card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.

In one embodiment of the present description, the above-described components of computing device 500, as well as other components not shown in FIG. 5, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device shown in FIG. 5 is for exemplary purposes only and is not intended to limit the scope of the present description. Those skilled in the art may add or replace other components as desired.

Computing device 500 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 500 may also be a mobile or stationary server.

Wherein the memory 510 is configured to store computer executable instructions and the processor 520 is configured to execute the following computer executable instructions:

The foregoing is a schematic illustration of a computing device of this embodiment. It should be noted that, the technical solution of the computing device and the technical solution of the data distribution method based on heterogeneous storage belong to the same concept, and details of the technical solution of the computing device, which are not described in detail, can be referred to the description of the technical solution of the data distribution method based on heterogeneous storage.

An embodiment of the present specification also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, perform the steps of the heterogeneous storage-based data allocation method.

The above is an exemplary version of a computer-readable storage medium of the present embodiment. It should be noted that, the technical solution of the storage medium and the technical solution of the data distribution method based on heterogeneous storage belong to the same concept, and details of the technical solution of the storage medium which are not described in detail can be referred to the description of the technical solution of the data distribution method based on heterogeneous storage.

The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer-readable medium may be appropriately added to or reduced according to the requirements of legislation and patent practice in a given jurisdiction; for example, in certain jurisdictions, in accordance with legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.

It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but those skilled in the art should understand that the embodiments are not limited by the described order of actions, because according to the embodiments of this specification some steps may be performed in another order or simultaneously. Further, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily all required by the embodiments of this specification.

In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts that are not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

The preferred embodiments of this specification disclosed above are intended only to help explain this specification. The alternative embodiments do not describe every detail exhaustively, nor do they limit the invention to the specific implementations described. Obviously, many modifications and variations are possible in light of the content of the embodiments of this specification. These embodiments were chosen and described in order to best explain the principles and practical application of the embodiments, so that those skilled in the art can best understand and use this specification. This specification is limited only by the claims and their full scope and equivalents.

Claims

1. A heterogeneous storage-based data allocation method, comprising:

determining a data distribution combination for distributing the data to be distributed to storage units in at least one type of storage clusters;

calculating, according to the first access overhead corresponding to the data to be allocated under each data allocation combination, the total access overhead for accessing the data to be allocated, and taking the total access overhead as the access overhead corresponding to the data to be allocated under the data allocation combination,

wherein the first access overhead of any one piece of the data to be allocated under any one of the data allocation combinations comprises:

under the data allocation combination, calculating the sum of two products: the product of the number of write accesses and the sum of the write access overhead of a computing node to the data to be allocated and the network overhead generated by the write access; and the product of the number of read accesses and the sum of the read access overhead of the computing node to the data to be allocated and the network overhead generated by the read access;

2. The heterogeneous storage-based data allocation method according to claim 1, wherein the total access overhead of the data to be allocated in any one of the data allocation combinations comprises:

3. The heterogeneous storage-based data allocation method according to any one of claims 1 to 2, wherein the determining, according to the access overhead corresponding to the data to be allocated, a target data allocation combination corresponding to the optimal access overhead includes:

4. A heterogeneous storage-based data allocation method according to claim 3, wherein the locally optimal access overhead of the first data to be allocated in the first computing stage is determined by:

5. The heterogeneous storage-based data allocation method according to claim 4, wherein the local optimal access overhead of the next data to be allocated in the next calculation stage is determined by the following manner:

6. The heterogeneous storage-based data allocation method of claim 1, the storage cluster comprising at least one of:

7. The heterogeneous storage-based data distribution method according to claim 1, the method being applied to a cloud computing platform, the computing node comprising: computing nodes in a computing cluster of a cloud computing platform.

8. A heterogeneous storage-based data distribution device, comprising:

The computing module comprises a combination determining sub-module, a first access overhead computing sub-module and a total access overhead computing sub-module, wherein,

The combination determination submodule is configured to determine a data distribution combination for distributing the data to be distributed to storage units in at least one type of storage cluster,

The first access overhead calculation sub-module is configured to calculate the first access overhead corresponding to the data to be distributed under each data distribution combination according to the access parameters of the storage cluster,

The total access overhead calculation sub-module is configured to calculate, according to the first access overhead corresponding to the data to be allocated under each data allocation combination, the total access overhead for accessing the data to be allocated, and to take the total access overhead as the access overhead corresponding to the data to be allocated under the data allocation combination,

9. The heterogeneous storage-based data allocation apparatus of claim 8, the total access overhead of the data to be allocated in any one of the data allocation combinations, comprising:

10. The heterogeneous storage-based data allocation device according to any one of claims 8 to 9, the determining module comprising:

11. The heterogeneous storage-based data allocation device of claim 10, the first locally optimal access overhead determination submodule further configured to:

12. The heterogeneous storage-based data allocation device of claim 11, the second locally optimal access overhead determination submodule further configured to:

13. A computing device, comprising:

a memory and a processor;

14. A computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the heterogeneous storage based data allocation method of any of claims 1 to 7.