WO2017206649A1 - Data distribution method for a decentralized distributed heterogeneous storage system - Google Patents

Data distribution method for a decentralized distributed heterogeneous storage system (WO2017206649A1)

Info

Publication number: WO2017206649A1 (PCT/CN2017/082718)
Authority: WO — WIPO (PCT)
Prior art keywords: data, read, data objects, placement group, ssd
Application number: PCT/CN2017/082718
Other languages: English (en), French (fr)
Inventors: 沙行勉, 诸葛晴凤, 吴林
Original assignee: 重庆大学
Application filed by 重庆大学
Priority to CN201780026690.XA (patent CN109196459B)
Publication of WO2017206649A1

Links

Images

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 — Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from the processing unit to the output unit, e.g. interface arrangements
    • G06F 3/06 — Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 — Interfaces specially adapted for storage systems
    • G06F 3/0602 — Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/0604 — Improving or facilitating administration, e.g. storage management
    • G06F 3/0607 — Improving or facilitating administration by facilitating the process of upgrading existing storage systems, e.g. for improving compatibility between host and storage device
    • G06F 3/0628 — Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0655 — Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F 3/0668 — Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/0671 — In-line storage system
    • G06F 3/0683 — Plurality of storage devices
    • G06F 3/0685 — Hybrid storage combining heterogeneous device types, e.g. hierarchical storage, hybrid arrays
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/23 — Clustering techniques
    • G06F 18/232 — Non-hierarchical techniques
    • G06F 18/2321 — Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 — Non-hierarchical techniques with a fixed number of clusters, e.g. K-means clustering

Definitions

  • The invention belongs to the field of distributed computer storage technologies, and in particular relates to a data distribution method for a decentralized distributed heterogeneous storage system.
  • The data distribution strategy of the storage system must consider the "write endurance" of SSDs and the performance differences among the various types of hard disks, while ensuring system scalability and load balancing: excessive write operations accelerate wear of the SSD storage medium, and placing too much data on archive hard disks degrades the system's read and write performance.
  • Ceph uses the processing capability of the storage devices themselves to design a new data distribution method. The method has two steps. The first step uses a hash algorithm to map data objects to "placement groups"; the input of the hash function is the globally unique identifier of the data object.
  • The second step uses a pseudo-random hash algorithm to distribute each "placement group" to multiple storage devices.
  • This data distribution method does not take the heterogeneity of the storage system into account, which results in intensive write operations on the solid state drives.
  • Some techniques use solid state drives to improve centralized storage performance; this centralized data distribution strategy makes the system non-scalable and unsuitable for very large-scale data applications.
  • The technical problem to be solved by the present invention is to provide a decentralized, distributed data distribution method for heterogeneous storage systems that maintains the performance, load balancing, and scalability of the storage system by analyzing the access patterns of data objects, while reducing write operations to the solid state drives.
  • The technical problem to be solved by the present invention is achieved by the following technical solution. The first method of the present invention comprises the following steps:
  • Step 1: During program execution, count the number of times each data object is read/written and convert the read and write counts into weights that serve as the data's access pattern; classify the data objects according to their access patterns.
  • Step 2: Classify the storage devices according to their capacity and read/write performance.
  • Step 3: Divide the stored data into different "placement group clusters"; a "placement group cluster" contains multiple "placement groups", and each type of storage device corresponds to one type of "placement group cluster".
  • Step 4: According to the load balancing target and the performance indicators of the storage system, calculate the proportion of each type of data object to be stored that should be placed into each type of "placement group cluster".
  • Step 5: Use a hash algorithm to determine which "placement group" within the "placement group cluster" a data object to be stored belongs to.
  • Step 6: Use the data distribution algorithm of the storage system to store the data objects of each "placement group" onto multiple corresponding storage devices; the "placement groups" of the solid state drives are assigned to solid state drives, and the "placement groups" of the mechanical hard disks are assigned to mechanical hard disks.
  • The second method of the present invention comprises the following steps:
  • Step 1: During program execution, count the system's total number of reads and writes and the total number of accessed data objects over a period of time, to determine the system's access pattern during that period.
  • Step 2: Classify the storage devices according to their capacity and read/write performance.
  • Step 3: Divide the data objects into different "placement group clusters"; a "placement group cluster" contains multiple "placement groups", and each type of storage device corresponds to one type of "placement group cluster".
  • Step 4: For newly stored data objects, use a uniform hash algorithm to map each data object to a "placement group cluster" and a "placement group", and add an identifier to each data object indicating which "placement group cluster" it belongs to.
  • Step 5: Use the data distribution algorithm of the storage system to store the data objects of each "placement group" onto multiple corresponding storage devices; the "placement groups" of the solid state drives are assigned to solid state drives, and the "placement groups" of the mechanical hard disks are assigned to mechanical hard disks.
  • Step 6: While the system is running, calculate a migration threshold for data access on each storage device according to the data access pattern, and dynamically migrate data objects to suitable storage devices according to these thresholds, so as to reduce the number of writes to the solid state drives and improve the system's read/write performance.
  • According to the access patterns of data objects, the first method of the present invention distributes different classes of data to different "placement group clusters". This requires calculating the proportions in which the different classes of data objects to be stored are placed into the different "placement group clusters"; these proportions control the load balancing among the "placement group clusters".
  • After the "placement group cluster" of each data object is determined, the hash algorithm is used to calculate the "placement group" corresponding to the data object;
  • the data objects in each "placement group" are then distributed to the storage devices. This distributes the data evenly across the storage devices and eliminates the centralized data storage structure, which not only maintains the performance, load balancing, and scalability of the storage system, but also reduces the number of write operations to the solid state drives and prolongs their life.
  • According to the dynamic changes of data-object access patterns, the second method of the present invention migrates different classes of data to suitable "placement group clusters" while the system is running; the data migration process requires setting different access thresholds to control the load balancing among the "placement group clusters".
  • In step 4 of the second method of the present invention, an identifier is added to each data object.
  • After data is moved in step 6, the "placement group cluster" in which an object was originally stored may change, so the identifier records which "placement group cluster" the data object currently belongs to.
  • In step 6, the access statistics of data objects are collected while the system is running, and a threshold is set for each storage device; data objects exceeding the threshold trigger a dynamic migration operation.
  • Using a dynamic migration strategy reduces writes to the SSDs while making the system more versatile.
  • Figure 1 is a flow chart of the first method's algorithm for calculating the proportion of each class of data objects to be stored placed into each "placement group cluster";
  • Figure 2 is a diagram of the data storage process of the present invention;
  • Figure 3 is a schematic diagram of mapping a read-intensive data object to a "placement group";
  • Figure 4 is a schematic diagram of mapping a write-intensive data object to a "placement group";
  • Figure 5 is a flow chart of the threshold algorithm in step 6 of the second method.
  • The first method of the invention comprises the following steps:
  • Step 1: During program execution, count the number of times each data object is read/written and convert the read and write counts into weights that serve as the data's access pattern. Classify the data objects according to their access patterns, for example into read-intensive, write-intensive, and mixed classes; the classification may use the common K-Means clustering algorithm, and each class of data objects has an attribute value representing the average write count of that class.
  • Step 2: Classify the storage devices according to their capacity and read/write performance, for example into high-speed solid state drives, low-speed solid state drives, high-speed mechanical hard disks, and low-speed mechanical hard disks.
  • Each storage device has its own read/write performance parameters, such as average read/write latency and capacity.
  • Step 3: Divide the stored data into different "placement group clusters".
  • A "placement group cluster" contains multiple "placement groups", and each storage device type corresponds to one type of "placement group cluster".
  • A "placement group cluster" groups together data objects with similar read/write attributes; it is a logical concept used mainly to aggregate data objects.
  • A "placement group cluster" also has capacity and read/write performance attributes: its capacity is the total capacity of all hard disks corresponding to the cluster, and its read/write performance is the average read/write latency of those disks.
  • Step 4: According to the load balancing target and the performance indicators of the storage system, calculate the proportion of each class of data objects to be stored that should be placed into each type of "placement group cluster";
  • this proportion is the number of objects of a class placed into each type of "placement group cluster" divided by the total number of objects of that class.
  • The performance indicators of the storage system are set according to the read/write performance of the storage devices; for example, requiring that over all data objects the average latency of read operations is 0.2 milliseconds and the average latency of write operations is 0.5 milliseconds.
  • The purpose of setting per-class proportions across the different types of "placement group clusters" is to ensure that data is evenly distributed among the clusters. Consider the extreme case in which all data objects are write-intensive. According to the allocation target of the storage devices, write-intensive data objects should go to the mechanical hard disks to reduce writes to the solid state drives; but if every data object is write-intensive, all of them would be assigned to the "placement group clusters" corresponding to the mechanical hard disks, leaving no data on the solid state drives. To avoid this, objects of the same class must be allocated to different "placement group clusters", and the proportions are used to control the load balancing among the clusters.
  • Step 5: Use a hash algorithm to determine which "placement group" within the "placement group cluster" a data object to be stored belongs to, since one "placement group cluster" contains multiple "placement groups".
  • Step 6: Use the data distribution algorithm of the storage system to store the data objects of each "placement group" onto multiple corresponding storage devices; the "placement groups" in the "placement group clusters" corresponding to solid state drives are assigned to solid state drives, and the "placement groups" in the "placement group clusters" of mechanical hard disks are assigned to mechanical hard disks.
  • The reason a "placement group" is stored on multiple storage devices is to keep multiple backups of the same data.
  • The number of backups is set at system initialization. Because one "placement group" corresponds to multiple storage devices, a mapping algorithm is needed to determine which storage devices each "placement group" should be placed on. In Ceph's storage strategy, a pseudo-random hash algorithm is used to create multiple backups of the data in each "placement group" and store them on different storage devices.
  • In step 4 above, the flow chart of the algorithm for calculating the proportion of each class of data objects to be stored placed into each "placement group cluster" is shown in Figure 1:
  • the flow begins at step 801, and then:
  • in step 802, the total number of all data objects to be stored, i.e. the sum over the different object classes, is calculated;
  • in step 803, the total number of existing data objects is calculated, i.e. the number of data objects all storage devices have already stored in the initial state;
  • in step 804, the maximum number of data objects each "placement group cluster" can store is calculated according to the load balancing condition; that is, the capacity of each "placement group cluster" is determined.
  • Load balancing is a configuration parameter of the system; for example, relative to a completely even distribution of all data objects, an increase or decrease of 5% of each storage device's capacity may still be considered load balanced.
  • In step 805, all classes of data objects to be stored are arranged in ascending order of average write count, the average write count being an attribute of each class of data objects;
  • in step 806, all "placement group clusters" are arranged in descending order of performance.
  • The performance of a "placement group cluster" is the read/write performance of its corresponding storage devices; the read/write performance of solid state drives is better than that of mechanical hard disks.
  • In step 809, the data objects of the i-th class to be stored are assigned to the j-th "placement group cluster";
  • following the orders established in steps 805 and 806, the number of objects of each class to be stored is filled in sequence according to the "placement group cluster" capacities calculated in step 804.
  • In step 810, the number of class-i data objects stored in "placement group cluster" j is recorded, to be used in calculating the storage proportion of each class of data objects;
  • the total number of data objects to be stored in each class is known; the number of objects of each class placed into each "placement group cluster" is recorded, and dividing that number by the class's total gives the proportion.
  • In step 811, it is determined whether "placement group cluster" j has reached its maximum storage count; if so, step 812 is executed, otherwise step 813;
  • in step 813, it is determined whether all data objects to be stored have been processed; if so, step 816 is executed, otherwise step 814;
  • in step 814, the pointer i scanning the array of object classes is moved to the next position, i.e. the next class of data objects to be stored is processed, and step 809 is executed;
  • in step 812, the pointer j scanning the array of "placement group clusters" is moved to the next position, i.e. the next "placement group cluster" is processed;
  • in step 815, it is determined whether all "placement group clusters" have been processed; if so, step 816 is executed, otherwise step 809;
  • in step 816, from the counts recorded in step 810 of each class of data objects stored in each "placement group cluster", the proportion of each class of data objects allocated to each "placement group cluster" is calculated;
  • in step 817, the algorithm for allocating each class of data to the various "placement group clusters" ends.
  • The data storage process of steps 5 and 6 above is shown in Figure 2: the "placement groups" of the storage system are divided into different "placement group clusters", and each "placement group cluster" contains multiple "placement groups".
  • When a data object is stored, its "placement group cluster" is first determined from its class and from that class's allocation proportions across the "placement group clusters", which are computed by the flow of Figure 1 to control the load balancing among the clusters; the hash algorithm then determines which "placement group" within that "placement group cluster" the data object belongs to.
  • Step 6 maps each "placement group" to different storage devices using a pseudo-random hash algorithm (CRUSH).
  • Each type of storage device corresponds to one "placement group cluster", so a system with five device types has five "placement group clusters". All placement group clusters are sorted by performance from high to low (corresponding to step 806), as shown in Table 1.
  • The maximum capacities RMAX of the five clusters under a completely even data distribution are assumed to be 255, 383, 511, 638, and 766.
  • In step 807, i is initialized to 0 and is used to scan the three classes A, B, and C of data objects to be stored.
  • j is initialized to 0 to scan "placement group clusters" 1, 2, 3, 4, and 5.
  • Class A, with the fewest average writes and relatively many reads, is preferentially assigned to OSD.1, which has small write latency and small read latency.
  • Placement group cluster 1 becomes full.
  • Placement group cluster 2 becomes full.
  • Class A is then fully allocated, with 0 remaining;
  • class B is allocated next, preferentially to placement group cluster 3, whose read and write latencies are relatively small;
  • class C is likewise allocated preferentially to placement group cluster 3.
  • Placement group cluster 3 becomes full.
  • Placement group cluster 4 becomes full.
  • Placement group cluster 5 itself carries a load of 700; its calculated maximum capacity is 766, so its remaining capacity is 766−700=66;
  • the 63 remaining data objects of class C are assigned to placement group cluster 5, and class C is fully allocated, with 0 remaining.
  • How step 5 of the present invention maps different types of data objects to different "placement groups" is explained below.
  • As in Figure 3, a read-intensive data object is mapped to "placement group" 13.
  • "Placement groups" 1-20 form the first "placement group cluster", and 60% of the read-intensive data belongs to the first cluster;
  • "placement groups" 21-50 form the second "placement group cluster", and 20% of the read-intensive data belongs to the second cluster;
  • "placement groups" 51-100 form the third "placement group cluster", and 20% of the read-intensive data belongs to the third cluster.
  • The hash of this read-intensive object's identifier is 50, which falls within the range of the first cluster; the hash algorithm is then used to calculate the object's target "placement group", which is 13.
  • As in Figure 4, a write-intensive data object is mapped to "placement group" 62.
  • The distribution proportions of write-intensive objects over the three "placement group clusters" are 1:3:6. The hash of this object's identifier is also 50, but 50 now belongs to the third "placement group cluster", because read-intensive and write-intensive data are placed into the placement group clusters in different proportions.
  • The middle hash-value column of Figure 4 lists the placement proportions of the three "placement group clusters": data objects with hash values 1-10 go to the first cluster, 11-40 to the second, and 41-100 to the third.
  • Step 4 of the first method above determines, by calculating proportions, the "placement group cluster" into which a data object should be placed, and once this result is determined the subsequent operations no longer change it. The shortcoming of the first method is therefore that it applies only to static storage of data, i.e. only offline data can be classified and stored. In a storage system, the characteristics of data objects may change as the system runs; this trend invalidates the static storage classification, so the goal of reducing the number of writes to the solid state drives ultimately cannot be achieved effectively. For this reason, the present invention also provides a second method.
  • The second method of the present invention comprises the following steps:
  • Step 1: During program execution, count the system's total number of reads and writes and the total number of accessed data objects over a period of time, to determine the system's access pattern during that period. For example, within one day there are M data objects with a read count of 1, N data objects with a read count of 2, K data objects with a write count of 1, and so on; from these, the day's total read count, total write count, and total number of accessed data objects are obtained.
  • Step 2: Classify the storage devices according to their capacity and read/write performance, for example into solid state drives, mechanical hard disks, and archive hard disks.
  • Each storage device has its own read/write performance parameters, such as average read/write latency and capacity.
  • Step 3: Divide the data into different "placement group clusters"; a "placement group cluster" contains multiple "placement groups", and each type of storage device corresponds to one type of "placement group cluster".
  • A "placement group cluster" groups together data objects with similar read/write attributes; it is a logical concept used mainly to aggregate data objects.
  • Step 4: For newly stored data objects, use a uniform hash algorithm to map each data object to a "placement group cluster" and a "placement group", and add an identifier to each data object indicating which "placement group cluster" it belongs to.
  • For example, with 100 "placement groups" divided into 5 "placement group clusters" of 20 groups each, and 1000 new data objects to store, the uniform hash algorithm essentially guarantees that each "placement group" receives 10 data objects.
  • Step 5: Use the data distribution algorithm of the storage system to store the data objects of each "placement group" onto multiple corresponding storage devices.
  • In Ceph's storage strategy, the CRUSH algorithm is used to create multiple backups of the data in each "placement group" and store them on different storage devices.
  • Step 6: While the system is running, calculate a migration threshold for data access on each storage device according to the data access pattern, and dynamically migrate data objects to suitable storage devices according to these thresholds, so as to reduce the number of writes to the solid state drives and improve the system's read/write performance.
  • The SSD has a write-count threshold: when the write count of a data object stored on the SSD exceeds it,
  • the data object must be migrated to a mechanical hard disk to reduce the number of writes to the SSD.
  • The mechanical hard disk has a read-count threshold: if the read count of a data object stored on it exceeds the threshold,
  • the data object is migrated from the mechanical hard disk to the solid state drive to improve the system's read performance;
  • the archive hard disk has two thresholds, a read threshold and a write threshold: when the write count of data stored on the archive hard disk exceeds the write threshold, the data objects are migrated to the mechanical hard disks to improve write performance;
  • when the read count of the data objects exceeds the read threshold, the data objects are migrated to the SSDs and the mechanical hard disks.
  • The specific data object migration process is as follows. In the read/write path, the process running on the storage device (OSD) updates the access counts of the data objects involved after each read or write operation completes, and compares the updated access count with the calculated migration threshold. If the migration threshold is reached, the pseudo-random hash algorithm CRUSH is used to calculate which new storage devices the data object should be stored on; the process on the OSD migrates the data object and all its backups to the new devices and then notifies the upper-level read/write flow that it has finished. In other words, data migration is completed within the read/write path, and the migration is transparent to the upper-level application.
  • Determining the migration thresholds is the key point of this solution. If a threshold is set too small, data objects migrate frequently, producing large migration overhead; if it is set too large, many writes go to the solid state drive. Setting the threshold conditions therefore requires jointly considering system performance and load balancing.
  • The invention therefore also provides a threshold algorithm.
  • Table 3 lists the meaning of each letter or letter combination.
  • The flow chart of the threshold algorithm is shown in Figure 5.
  • The input parameters of the program are: the load-balancing limit value α that the data objects must satisfy, the performance improvement ratio β, the initial performance P0 under an even distribution, the read-operation information record table, and the write-operation information record table (the input parameters are treated as known quantities).
  • The program outputs four thresholds.
  • The flow begins at step 000, and then:
  • in step 001, the input parameters are obtained: the initial performance P0, the performance improvement ratio β, and the load-balancing limit value α;
  • V_ssd denotes the number of data objects that may be moved out of the SSD under the load balancing condition;
  • the write count corresponding to row j is W(j).
  • The initial value of j is 0, so on the first execution of this loop j is 1 and the first row of the write-operation record table is read.
  • In step 006, supposing the data objects in the SSD whose write count is greater than j−1 are moved to the HDD, it is determined whether the number of moved data objects is greater than k·V_ssd; if so, step 005 is executed; otherwise the threshold WS is assigned W(j), the write-count value of row j, and step 007 is executed.
  • The initial value of i is 0, so on the first execution of the loop i is 1 and the first row of the read-operation record table is read, the row corresponding to data with read count R(i)=0;
  • in step 009, supposing the data objects in the HDD whose read count is greater than i−1 are moved to the SSD, it is determined whether the number of moved data objects is greater than or equal to k·V_ssd; if so, step 008 is executed; otherwise the threshold RH is assigned R(i), and step 010 is executed;
  • the read count corresponding to row i is R(i);
  • in step 012, to avoid insufficient SSD storage space, the data objects moved off the Archive HDD are split between the SSD and the HDD in the unit-count ratio C_s:C_h. It is determined whether the number of Archive HDD data objects with read count greater than R(i) moved to the SSD and HDD exceeds C_s/C_h·V_ssd; if so, step 011 is executed; otherwise the threshold RA is assigned R(i), the loop ends, and step 013 is executed;
  • the write count corresponding to row j is W(j);
  • in step 015, supposing the data objects on the Archive HDD whose write count is greater than j−1 are moved to the HDD, it is determined whether the number moved is greater than (C_h−C_s)/C_h·V_ssd; if so, step 014 is executed; otherwise the threshold WA is assigned W(j), the loop ends, and step 016 is executed;
  • in step 017, the thresholds WS, RH, RA, and WA are output.
  • The four loops in the threshold algorithm above are independent but ordered, and this ordering is the design idea of the algorithm.
  • From the input parameter α and the SSD capacity, the fluctuation range V_ssd of the data stored on the SSD can be calculated; for example, if 100 objects can be stored under a completely even distribution and a 5% fluctuation is allowed, then V_ssd=5.
  • Data objects are moved in multiples of V_ssd, i.e. the k in the program: 5, 10, 15, ... objects can be moved off the SSD, and after each movement the resulting performance change must be calculated.
  • The role of the third and fourth loops is to calculate the thresholds for moving data off the Archive HDD.
  • The third loop considers the read threshold:
  • read-intensive data on the Archive HDD is moved to the SSD and HDD, but the moved data cannot exceed the SSD's maximum allowed fluctuation capacity V_ssd;
  • the read-intensive data objects on the Archive HDD are moved to the SSD and HDD according to the capacity ratio of the SSD and the HDD.
  • The fourth loop considers the write threshold: write-intensive data objects move only to the HDD, taking into account the maximum number that can be moved off the Archive HDD and the HDD's capacity limit.
  • Table 4 is the read/write latency table; Table 5 is the read-operation data input table; Table 7 is the read-operation formula symbol definition table.
  • ③ PG_h→s = R_h · (NR_i · (L^r_h − L^r_s) − NO_i · L_h→s)
  • ⑥ PG^r_a→s = R_a · C_s/(C_s + C_h) · C_s/C_h · ((L^r_a − L^r_s) · NR_i − NO_i · L_a→s)
  • HDD data capacity accounts for 3/9 of the total system data capacity,
  • and Archive HDD data capacity accounts for 5/9 of the total system data capacity;
  • with an SSD-to-HDD capacity ratio of 1:3, 2/3 of the data moved off the Archive HDD goes to the HDD in order to improve write performance.
  • Table 8 is the read-operation data record table.
  • ③ PL_s→h = R_s · (NW_j · (L^w_h − L^w_s) − NO_j · L_s→h)
  • ④ PG^w_a→h = R_a · (C_h − C_s)/C_h · ((L^w_a − L^w_h) · NW_j − NO_j · L_a→h)
  • Formula ③ above gives the write-performance change of moving data from the SSD to the HDD, and formula ④ gives the write-performance change of moving data from the Archive HDD to the HDD. All values of the write-operation record table are calculated with these formulas; see Table 10.


Abstract

The invention discloses a data distribution method for a decentralized distributed heterogeneous storage system, comprising the following steps: 1. classify the data objects; 2. classify the storage devices; 3. divide the stored data into different "placement group clusters", each type of storage device corresponding to one type of "placement group cluster"; 4. calculate the proportion of each type of data object to be stored that should be placed into each type of "placement group cluster"; 5. use a hash algorithm to determine which "placement group" within the "placement group cluster" a data object to be stored belongs to; 6. use the data distribution algorithm of the storage system to store the data objects of each "placement group" onto multiple corresponding storage devices; 7. while the system is running, calculate migration thresholds from the access characteristics of the data objects and migrate data objects dynamically. The advantages of the invention are that the performance, load balancing, and scalability of the storage system are maintained while the number of write operations to the solid state drives is reduced.

Description

Data distribution method for a decentralized distributed heterogeneous storage system
Technical Field
The invention belongs to the field of distributed computer storage technology, and specifically relates to a data distribution method for a decentralized distributed heterogeneous storage system.
Background Art
In big data applications, scientific computing, and cloud computing platforms, a reliable and scalable storage system plays a crucial role in system performance. As data volumes grow to the petabyte level, the data distribution strategy of the storage system must guarantee performance and scalability. Decentralized data distribution strategies, such as Ceph, use the processing capability of the storage devices themselves to provide a reliable object storage system. Solid state drives (SSDs) outperform traditional mechanical hard disk drives (HDDs) in read/write performance and are increasingly used in storage systems, forming large-scale distributed heterogeneous storage systems. In addition, new archive hard disks (Archive HDDs) are increasingly used in data centers; these disks have larger capacities suited to big data storage, but their read/write speed is slower than that of traditional mechanical hard disks. The data distribution strategy of the storage system must therefore consider the "write endurance" of the SSDs and the performance differences among the various types of hard disks, while guaranteeing the scalability and load balancing of the system, because excessive write operations accelerate wear of the SSD storage medium and placing too much data on the archive hard disks degrades the system's read/write performance.
At present, much research is devoted to data distribution and task scheduling in workflow systems. In scientific computing, for example, a "workflow management system" assigns computing tasks according to the storage resources and computing power of the execution sites. From the task dependencies in the workflow model, the amount of data required by the tasks can be determined, and the computing tasks of different stages are then assigned to different computing sites; the assignment scheme mainly aims to reduce the remote-access transfer overhead between sites. Ceph uses the communication capability of the storage devices themselves to design a new data distribution method in two steps: the first step uses a hash algorithm to map data objects to "placement groups", where the input of the hash function is the globally unique identifier of the data object and data objects with the same hash output are placed into the same "placement group"; the second step uses a pseudo-random hash algorithm to distribute each "placement group" to multiple storage devices. This data distribution method does not consider the heterogeneity of the storage system and therefore produces intensive write operations on the solid state drives. Other techniques use solid state drives to improve centralized storage performance; such centralized data distribution strategies make the system non-scalable and unsuitable for ultra-large-scale data applications.
Summary of the Invention
In view of the deficiencies of the prior art, the technical problem to be solved by the present invention is to provide a decentralized data distribution method for distributed heterogeneous storage systems that maintains the performance, load balancing, and scalability of the storage system by analyzing the access patterns of data objects, while reducing write operations to the solid state drives.
The technical problem to be solved by the present invention is achieved by the following technical solution. The first method of the present invention comprises the following steps:
Step 1: During program execution, count the number of times each data object is read/written and convert the read and write counts into weights that serve as the data's access pattern; classify the data objects according to their access patterns.
Step 2: Classify the storage devices according to their capacity and read/write performance.
Step 3: Divide the stored data into different "placement group clusters"; a "placement group cluster" contains multiple "placement groups", and each type of storage device corresponds to one type of "placement group cluster".
Step 4: According to the load balancing target and the performance indicators of the storage system, calculate the proportion of each type of data object to be stored that should be placed into each type of "placement group cluster".
Step 5: Use a hash algorithm to determine which "placement group" within the "placement group cluster" a data object to be stored belongs to.
Step 6: Use the data distribution algorithm of the storage system to store the data objects of each "placement group" onto multiple corresponding storage devices; the "placement groups" of the solid state drives are assigned to solid state drives, and the "placement groups" of the mechanical hard disks are assigned to mechanical hard disks.
After the initial distribution of stored data through the above steps, in order to migrate data whose access characteristics change to suitable devices, maintain the performance, load balancing, and scalability of the storage system, and reduce write operations to the solid state drives by moving data between different storage devices, the above method is improved as follows.
The second method of the present invention comprises the following steps:
Step 1: During program execution, count the system's total number of reads and writes and the total number of accessed data objects over a period of time, to determine the system's access pattern during that period.
Step 2: Classify the storage devices according to their capacity and read/write performance.
Step 3: Divide the data objects into different "placement group clusters"; a "placement group cluster" contains multiple "placement groups", and each type of storage device corresponds to one type of "placement group cluster".
Step 4: For newly stored data objects, use a uniform hash algorithm to map each data object to a "placement group cluster" and a "placement group", and add an identifier to each data object indicating which "placement group cluster" it belongs to.
Step 5: Use the data distribution algorithm of the storage system to store the data objects of each "placement group" onto multiple corresponding storage devices; the "placement groups" of the solid state drives are assigned to solid state drives, and the "placement groups" of the mechanical hard disks are assigned to mechanical hard disks.
Step 6: While the system is running, calculate a migration threshold for data access on each storage device according to the data access pattern, and dynamically migrate data objects to suitable storage devices according to these thresholds, so as to reduce the number of writes to the solid state drives and improve the system's read/write performance.
Technical effects of the invention:
According to the access patterns of data objects, the first method of the invention distributes different classes of data to different "placement group clusters". This requires calculating the proportions in which the different classes of data objects to be stored are placed into the different "placement group clusters", which control the load balancing among the clusters. After the "placement group cluster" of each data object is determined, the hash algorithm is used to calculate the object's corresponding "placement group", and the data objects in each "placement group" are then distributed to the storage devices. In this way the data is distributed evenly across the storage devices and the centralized data storage structure is eliminated, which maintains the performance, load balancing, and scalability of the storage system while reducing the number of write operations to the solid state drives and prolonging their life.
According to the dynamic changes of data-object access patterns, the second method of the invention migrates different classes of data to suitable "placement group clusters" while the system is running; the data migration process requires setting different access thresholds to control the load balancing among the "placement group clusters".
In step 4 of the second method an identifier is added to each data object; after data is moved in step 6, the "placement group cluster" in which an object was originally stored may change, and the identifier records which "placement group cluster" the data object currently belongs to. In step 6, the access statistics of data objects are collected while the system is running and a threshold is set for each storage device; data objects exceeding a threshold trigger a dynamic migration operation. The dynamic migration strategy reduces writes to the solid state drives while making the system more versatile.
Brief Description of the Drawings
The drawings of the invention are described as follows:
Figure 1 is a flow chart of the first method's algorithm for calculating the proportion of each class of data objects to be stored placed into each "placement group cluster";
Figure 2 is a diagram of the data storage process of the invention;
Figure 3 is a schematic diagram of mapping a read-intensive data object to a "placement group";
Figure 4 is a schematic diagram of mapping a write-intensive data object to a "placement group";
Figure 5 is a flow chart of the threshold algorithm in step 6 of the second method.
Detailed Description of the Embodiments
The invention is further described below with reference to the drawings and embodiments:
1. The first method of the invention comprises the following steps:
Step 1: During program execution, count the number of times each data object is read/written and convert the read and write counts into weights that serve as the data's access pattern. Classify the data objects according to their access patterns, for example into read-intensive, write-intensive, and mixed classes. The classification may use the common K-Means clustering algorithm, and each class of data objects has an attribute value representing the average write count of that class (a sketch of this classification step follows this step list).
Step 2: Classify the storage devices according to their capacity and read/write performance, for example into high-speed solid state drives, low-speed solid state drives, high-speed mechanical hard disks, and low-speed mechanical hard disks; each storage device has its own read/write performance parameters, such as average read/write latency and capacity.
Step 3: Divide the stored data into different "placement group clusters"; a "placement group cluster" contains multiple "placement groups", and each storage device type corresponds to one type of "placement group cluster". A "placement group cluster" groups together data objects with similar read/write attributes; it is a logical concept used mainly to aggregate data objects. A "placement group cluster" also has capacity and read/write performance attributes: its capacity is the total capacity of all hard disks corresponding to the cluster, and its read/write performance is the average read/write latency of those disks.
Step 4: According to the load balancing target and the performance indicators of the storage system, calculate the proportion of each class of data objects to be stored that should be placed into each type of "placement group cluster".
For example, suppose the system has three "placement group clusters" and that, for read-intensive data, 20% is placed into the first cluster, 30% into the second, and 50% into the third; this proportion is the number of objects of a class placed into each cluster divided by the total number of objects of that class.
The performance indicators of the storage system are set according to the read/write performance of the storage devices; for example, requiring that over all data objects the average latency of read operations is 0.2 milliseconds and the average latency of write operations is 0.5 milliseconds. The purpose of setting per-class proportions across the different types of "placement group clusters" is to ensure that data is evenly distributed among the clusters. In the extreme case in which all data objects are write-intensive, the device allocation target says write-intensive data objects should go to the mechanical hard disks to reduce writes to the solid state drives; but if every object is write-intensive, all of them would be assigned to the "placement group clusters" of the mechanical hard disks, leaving no data on the solid state drives. To avoid this, objects of the same class must be allocated to different "placement group clusters", and the proportions are used to control the load balancing among the clusters.
Step 5: Use a hash algorithm to determine which "placement group" within the "placement group cluster" a data object to be stored belongs to, since one "placement group cluster" contains multiple "placement groups".
Step 6: Use the data distribution algorithm of the storage system to store the data objects of each "placement group" onto multiple corresponding storage devices; the "placement groups" in the "placement group clusters" corresponding to solid state drives are assigned to solid state drives, and those in the clusters of mechanical hard disks are assigned to mechanical hard disks.
A "placement group" is stored on multiple storage devices in order to keep multiple backups of the same data. The number of backups is set at system initialization. Because one "placement group" corresponds to multiple storage devices, a mapping algorithm is needed to determine which storage devices each "placement group" is placed on. In Ceph's storage strategy, a pseudo-random hash algorithm creates multiple backups of the data in each "placement group" and stores them on different storage devices.
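As an illustration of the classification in step 1, the following is a minimal sketch in Python. The per-object read/write counters and the use of scikit-learn's K-Means are illustrative assumptions; the patent prescribes only "a common K-Means clustering algorithm", not a particular library or feature encoding.

    # Sketch of step 1: classify data objects by access pattern with K-Means.
    # The counters and the choice of scikit-learn are illustrative assumptions.
    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical access statistics: one (read_count, write_count) row per object.
    access_counts = np.array([
        [120, 3], [200, 8], [15, 90], [10, 75], [60, 55], [70, 40],
    ])

    # Three classes: read-intensive, write-intensive, mixed.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(access_counts)

    # Each class carries one attribute value, the average write count of its
    # members, used later to order the classes in the Figure 1 algorithm.
    for c in range(3):
        members = access_counts[kmeans.labels_ == c]
        print(f"class {c}: average writes = {members[:, 1].mean():.1f}")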
In step 4 above, the flow of the algorithm that calculates the proportion of each class of data objects to be stored placed into each type of "placement group cluster" is shown in Figure 1:
The flow begins at step 801, and then:
In step 802, the total number of data objects to be stored is calculated, i.e. the sum over the different object classes;
in step 803, the total number of existing data objects is calculated, i.e. the number of data objects all storage devices have already stored in the initial state;
in step 804, the maximum number of data objects each "placement group cluster" can store is calculated according to the load balancing condition, i.e. the capacity of each "placement group cluster" is determined.
Load balancing is a configuration parameter of the system: for example, relative to a completely even distribution of all data objects, an increase or decrease of 5% of each storage device's capacity is still considered load balanced. For instance, if a "placement group cluster" can store 100 data objects in the completely even state and the balancing condition allows a 5% fluctuation, that cluster can store at most 100 + 100 × 0.05 = 105 data objects.
In step 805, all data objects to be stored are arranged in ascending order of average write count, the average write count being an attribute of each class of data objects.
Suppose the data objects to be stored are divided into three classes, read-intensive, write-intensive, and mixed, where the average write count of the read-intensive class is 10, that of the write-intensive class is 80, and that of the mixed class is 50.
In step 806, all "placement group clusters" are arranged in descending order of performance, where the performance of a "placement group cluster" is the read/write performance of its corresponding storage devices; the read/write performance of solid state drives is better than that of mechanical hard disks.
In step 807, the variable i is initialized to 0; it scans the classes of data objects to be stored.
If the data objects to be stored are divided into 3 classes, i takes the values 1, 2, 3 in this flow; this is an iterative loop that scans each class of data objects to be stored.
In step 808, the variable j is initialized to 0; it scans the "placement group cluster" classes.
If the "placement group clusters" are divided into 4 classes, j takes the values 1, 2, 3, 4 in this flow.
In step 809, the data objects of the i-th class to be stored are assigned to the j-th "placement group cluster".
This step fills in the number of objects of each class to be stored in sequence, following the orders established in steps 805 and 806 and the "placement group cluster" capacities calculated in step 804.
In step 810, the number of class-i data objects stored in "placement group cluster" j is recorded, for calculating the storage proportion of each class of data objects.
The total number of data objects in each class is known; recording the number of objects of each class placed into each "placement group cluster" and dividing it by the class's total gives the proportion.
In step 811, it is determined whether "placement group cluster" j has reached its maximum storage count; if so, step 812 is executed, otherwise step 813.
In step 813, it is determined whether all data objects to be stored have been processed; if so, step 816 is executed, otherwise step 814.
In step 814, the pointer i scanning the array of object classes is moved to the next position, i.e. the next class of data objects to be stored is processed, and step 809 is executed.
In step 812, the pointer j scanning the array of "placement group clusters" is moved to the next position, i.e. the next "placement group cluster" is processed.
In step 815, it is determined whether all "placement group clusters" have been processed; if so, step 816 is executed, otherwise step 809.
In step 816, from the per-cluster counts of each class recorded in step 810, the proportion of each class of data objects allocated to each "placement group cluster" is calculated.
In step 817, the algorithm for allocating each class of data to the various "placement group clusters" ends.
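The greedy flow of steps 801-817 can be condensed into a few lines. The sketch below assumes the classes are already sorted by average write count (step 805) and the clusters by performance (step 806); the data layout and function names are illustrative, and the example input reproduces Tables 1 and 2 of the embodiment that follows.

    # Sketch of the Figure 1 allocation (steps 801-817): fill the
    # best-performing clusters with the least write-intensive classes,
    # then derive per-class placement proportions.
    def allocate(classes, clusters, e=0.001):
        # classes: (name, count) pairs, ascending by average write count (805)
        # clusters: (capacity, existing_load) pairs, descending by performance (806)
        total_new = sum(c for _, c in classes)                    # step 802
        total_old = sum(load for _, load in clusters)             # step 803
        total_cap = sum(cap for cap, _ in clusters)
        rmax = [round((1 + e) * cap * (total_new + total_old) / total_cap)
                for cap, _ in clusters]                           # step 804
        free = [m - load for m, (_, load) in zip(rmax, clusters)]
        placed = {name: [0] * len(clusters) for name, _ in classes}
        j = 0
        for name, remaining in classes:                           # steps 809/810
            while remaining > 0 and j < len(clusters):
                take = min(remaining, free[j])
                placed[name][j] += take
                free[j] -= take
                remaining -= take
                if free[j] == 0:                                  # steps 811/812
                    j += 1
        # step 816: proportion = objects of the class in cluster j / class total
        return {name: [n / total for n in placed[name]] for name, total in classes}

    classes = [("A", 350), ("B", 150), ("C", 200)]
    clusters = [(1000, 60), (1500, 260), (2000, 300), (2500, 530), (3000, 700)]
    for name, ratios in allocate(classes, clusters).items():
        print(name, [round(r, 3) for r in ratios])
    # A: 0.557/0.351/0.091 over clusters 1-3; B: all to cluster 3;
    # C: 0.145/0.54/0.315 over clusters 3-5 (matching the embodiment below).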
The data storage process of steps 5 and 6 above is shown in Figure 2. The "placement groups" of the storage system are divided into different "placement group clusters", each containing multiple "placement groups". When a data object is stored, its "placement group cluster" is first determined from its class and from that class's allocation proportions across the clusters; these proportions are computed by the flow of Figure 1 and control the load balancing among the "placement group clusters". The hash algorithm then determines which "placement group" within that cluster the object belongs to. Step 6 maps each "placement group" to different storage devices using a pseudo-random hash algorithm (CRUSH).
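Ceph's actual CRUSH algorithm is considerably more elaborate; as a stand-in for the pseudo-random mapping of step 6, the following sketch uses rendezvous (highest-random-weight) hashing to pick the replica devices of a placement group deterministically. This is an assumed simplification for illustration, not the patent's or Ceph's exact algorithm.

    # Sketch: deterministically map one placement group to several devices
    # for replication, as a simplified stand-in for CRUSH.
    import hashlib

    def score(pg_id: int, device: str) -> int:
        h = hashlib.sha256(f"{pg_id}:{device}".encode()).hexdigest()
        return int(h, 16)

    def place_pg(pg_id: int, devices: list[str], replicas: int = 3) -> list[str]:
        # The 'replicas' highest-scoring devices hold the copies of this PG;
        # the same inputs always yield the same placement, so no central
        # lookup table is needed.
        return sorted(devices, key=lambda d: score(pg_id, d), reverse=True)[:replicas]

    ssd_osds = ["osd.1", "osd.2", "osd.3", "osd.4"]
    print(place_pg(13, ssd_osds))   # three SSD-backed OSDs for placement group 13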
(1) Embodiment of the flow chart shown in Figure 1
Suppose the storage system has five types of storage devices, each type corresponding to one "placement group cluster", so the system has five "placement group clusters". All placement group clusters are already sorted by performance from high to low (corresponding to step 806), as shown in Table 1.
Table 1 Properties of the system's storage devices
[Table 1 is an image not reproduced in this text; the cluster capacities are 1000, 1500, 2000, 2500, and 3000, with existing loads of 60, 260, 300, 530, and 700 respectively.]
The total capacity of the storage system is 1000+1500+2000+2500+3000=10000.
Suppose the data objects to be stored are divided into three classes; the average read/write counts and the quantity of each class are shown in Table 2. The classes are already sorted by write count (corresponding to step 805).
Table 2 Properties of all data objects to be stored
[Table 2 is an image not reproduced in this text; classes A, B, and C contain 350, 150, and 200 objects respectively.]
Following the flow shown in Figure 1, the algorithm runs as follows:
In step 802, the total number of data objects to be stored is 350+150+200=700;
in step 803, the total number of existing data objects is 60+260+300+530+700=1850;
the total number of data objects is therefore 700+1850=2500.
In step 804, assuming the system's load balancing factor e=0.001, the maximum number RMAX each "placement group cluster" can hold is calculated as follows:
"placement group cluster" 1: RMAX.1 = (1+0.001)×(1000×(700+1850))/10000 = 255;
"placement group cluster" 2: RMAX.2 = (1+0.001)×(1500×(700+1850))/10000 = 383;
"placement group cluster" 3: RMAX.3 = (1+0.001)×(2000×(700+1850))/10000 = 511;
"placement group cluster" 4: RMAX.4 = (1+0.001)×(2500×(700+1850))/10000 = 638;
"placement group cluster" 5: RMAX.5 = (1+0.001)×(3000×(700+1850))/10000 = 766.
Therefore, for the five "placement group clusters", the maximum capacities RMAX under a completely even data distribution are 255, 383, 511, 638, and 766 respectively.
In step 807, i is initialized to 0, to scan the three classes A, B, and C of data objects to be stored.
In step 808, j is initialized to 0, to scan "placement group clusters" 1, 2, 3, 4, and 5.
The allocation of step 809 and the recording of step 810 proceed as follows:
When the three classes of data objects are assigned, class A, which has the fewest average writes and relatively many reads, is preferentially allocated to OSD.1, which has small write latency and small read latency.
1. Placement group cluster 1 carries a load of 60; its calculated maximum is 255, so it can accept 255−60=195 objects.
195 data objects of class A are therefore allocated to placement group cluster 1, leaving 350−195=155 of class A.
Placement group cluster 1 is full.
2. Placement group cluster 2 carries a load of 260; its calculated maximum is 383, so it can accept 383−260=123 objects.
123 more data objects of class A are allocated to placement group cluster 2, leaving 155−123=32 of class A.
Placement group cluster 2 is full.
3. Placement group cluster 3 carries a load of 300; its calculated maximum is 511, so it can accept 511−300=211 objects.
The remaining 32 data objects of class A are allocated to placement group cluster 3; class A is fully allocated, with 0 remaining.
The remaining capacity of placement group cluster 3 is 211−32=179.
Class B is allocated next, preferentially to placement group cluster 3, whose read and write latencies are relatively small;
all 150 data objects of class B are allocated to placement group cluster 3, leaving a remaining capacity of 179−150=29.
Class C is allocated next, still preferentially to placement group cluster 3;
29 data objects of class C are allocated to placement group cluster 3, leaving 200−29=171 of class C.
Placement group cluster 3 is full.
4. Placement group cluster 4 carries a load of 530; its calculated maximum is 638, so it can accept 638−530=108 objects.
108 more data objects of class C are allocated to placement group cluster 4, leaving 171−108=63 of class C.
Placement group cluster 4 is full.
5. Placement group cluster 5 carries a load of 700; its calculated maximum is 766, so it can accept 766−700=66 objects.
The remaining 63 data objects of class C are allocated to placement group cluster 5; class C is fully allocated, with 0 remaining.
The remaining capacity of placement group cluster 5 is 66−63=3.
In step 816, the proportion of each class of data objects allocated to each "placement group cluster" is calculated from the final result (the original table is an image not reproduced here; the proportions follow from the counts above):
class A: 195/350 = 55.7% to cluster 1, 123/350 = 35.1% to cluster 2, 32/350 = 9.1% to cluster 3;
class B: 150/150 = 100% to cluster 3;
class C: 29/200 = 14.5% to cluster 3, 108/200 = 54% to cluster 4, 63/200 = 31.5% to cluster 5.
(2) How step 5 of the invention maps different types of data objects to different "placement groups" is explained below.
In this embodiment, suppose the system has 100 "placement groups", numbered 1 to 100. According to the storage device types, these "placement groups" are divided into three "placement group clusters": numbers 1-20 form the first cluster, numbers 21-50 the second, and numbers 51-100 the third.
As shown in Figure 3, a read-intensive data object is mapped to "placement group" 13. Suppose the flow of Figure 1 yields a distribution ratio of 6:2:2 for read-intensive data objects over the three "placement group clusters"; that is, "placement groups" 1-20 form the first cluster and hold 60% of the read-intensive data, "placement groups" 21-50 form the second cluster and hold 20%, and "placement groups" 51-100 form the third cluster and hold 20%. The hash of this read-intensive object's identifier is 50, which falls within the range of the first "placement group cluster"; applying the hash algorithm again yields the object's target "placement group", 13.
As shown in Figure 4, a write-intensive data object is mapped to "placement group" 62. Suppose the distribution ratio of write-intensive data objects over the three "placement group clusters" is 1:3:6. The hash of this object's identifier is also 50, but 50 now belongs to the third "placement group cluster", because read-intensive and write-intensive data are placed into the placement group clusters in different proportions: the middle hash-value column in Figure 4 lists the placement proportions of the three clusters, with data objects of hash values 1-10 placed into the first cluster, 11-40 into the second, and 41-100 into the third. The object is therefore finally mapped into "placement group" 62.
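A minimal sketch of this two-level mapping follows. Encoding the class proportions as ranges over hash buckets 1-100 mirrors Figures 3 and 4; the hash functions and object identifiers are illustrative assumptions.

    # Sketch of step 5 (Figures 3-4): pick a cluster from the class's
    # placement proportions, then a placement group inside that cluster.
    import hashlib

    CLUSTERS = [range(1, 21), range(21, 51), range(51, 101)]  # PGs 1-20, 21-50, 51-100
    RATIOS = {"read": [60, 20, 20], "write": [10, 30, 60]}    # percentages per class

    def h(key: str, mod: int) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % mod + 1

    def map_object(obj_id: str, obj_class: str) -> int:
        bucket = h(obj_id, 100)        # 1..100; the embodiment's example is 50
        acc = cluster = 0
        for idx, share in enumerate(RATIOS[obj_class]):
            acc += share               # read: 1-60 -> c1, 61-80 -> c2, 81-100 -> c3
            if bucket <= acc:
                cluster = idx
                break
        pgs = CLUSTERS[cluster]
        return pgs[h(obj_id + ":pg", len(pgs)) - 1]  # PG within the chosen cluster

    print(map_object("object-42", "read"))   # a placement group number in 1..100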
Because step 4 of the first method above determines, by calculating proportions, which "placement group cluster" a data object should be placed into, and because once this result is determined the subsequent operations no longer change it, the shortcoming of the first method is that it applies only to static storage of data, i.e. only offline data can be classified and stored. In a storage system, the characteristics of data objects may change as the system runs, and this trend invalidates the static storage classification, so that the goal of reducing the number of writes to the solid state drives ultimately cannot be achieved effectively. For this reason, the present invention also provides a second method.
2. The second method of the invention comprises the following steps:
Step 1: During program execution, count the system's total number of reads and writes and the total number of accessed data objects over a period of time, to determine the system's access pattern during that period. For example, within one day there are M data objects with a read count of 1, N with a read count of 2, K with a write count of 1, and so on; from these, the day's total read count, total write count, and total number of accessed data objects are obtained.
Step 2: Classify the storage devices according to their capacity and read/write performance, for example into solid state drives, mechanical hard disks, and archive hard disks; each storage device has its own read/write performance parameters, such as average read/write latency and capacity.
Step 3: Divide the data into different "placement group clusters"; a "placement group cluster" contains multiple "placement groups", and each storage device type corresponds to one type of "placement group cluster". A "placement group cluster" groups together data objects with similar read/write attributes; it is a logical concept used mainly to aggregate data objects.
Step 4: For newly stored data objects, use a uniform hash algorithm to map each data object to a "placement group cluster" and a "placement group", and add an identifier to each data object indicating which "placement group cluster" it belongs to. For example, if the system has 100 "placement groups" divided into 5 "placement group clusters" of 20 groups each, and 1000 new data objects need to be stored, the uniform hash algorithm essentially guarantees that each "placement group" receives 10 data objects.
Step 5: Use the data distribution algorithm of the storage system to store the data objects of each "placement group" onto multiple corresponding storage devices. In Ceph's storage strategy, the CRUSH algorithm creates multiple backups of the data in each "placement group" and stores them on different storage devices.
Step 6: While the system is running, calculate a migration threshold for data access on each storage device according to the data access pattern, and dynamically migrate data objects to suitable storage devices according to these thresholds, so as to reduce the number of writes to the solid state drives and improve the system's read/write performance.
For example, suppose the system has three types of storage devices: solid state drives, mechanical hard disks, and archive hard disks. The solid state drive has a write-count threshold: when the write count of a data object stored on it exceeds this threshold, the object must be migrated to a mechanical hard disk to reduce writes to the SSD. The mechanical hard disk has a read-count threshold: if the read count of an object stored on it exceeds the threshold, the object is migrated from the mechanical hard disk to the solid state drive to improve read performance. The archive hard disk has two thresholds, a read threshold and a write threshold: when the write count of data stored on the archive hard disk exceeds the write threshold, the objects are migrated to mechanical hard disks to improve write performance, and when the read count exceeds the read threshold, the objects are migrated to the solid state drives and mechanical hard disks.
The specific data object migration process is as follows. In the read/write path, the process running on the storage device (OSD) updates the access counts of the data objects involved after each read or write operation completes, and compares the updated count with the calculated migration threshold. If the migration threshold is reached, the pseudo-random hash algorithm CRUSH is used to calculate which new storage devices the data object should be stored on; the process on the OSD migrates the object and all its backups to the new devices and then notifies the upper-level read/write flow that it has finished. In other words, data migration is completed within the read/write path, and the migration is transparent to the upper-level application.
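The following sketch illustrates this post-I/O migration hook on an OSD. The counter store, the threshold values, and the crush_place stand-in are illustrative assumptions, not Ceph's actual interfaces; in particular, the archive tier's read case (migration to both SSD and HDD) is collapsed to a single target here.

    # Sketch of the step 6 migration path: after each I/O completes, the
    # OSD-side process updates the object's access counters and, when a
    # threshold is crossed, recomputes placement and moves all replicas.
    from collections import defaultdict

    THRESHOLDS = {"ssd": {"write": 6}, "hdd": {"read": 5},
                  "archive": {"read": 8, "write": 8}}   # e.g. the WS/RH/RA/WA outputs
    counters = defaultdict(lambda: {"read": 0, "write": 0})

    def crush_place(obj_id: str, target_tier: str) -> list[str]:
        # Stand-in for the pseudo-random (CRUSH-style) device selection.
        return [f"{target_tier}.osd.{hash(obj_id) % 4}"]

    def on_io_complete(obj_id: str, op: str, tier: str):
        counters[obj_id][op] += 1
        limit = THRESHOLDS.get(tier, {}).get(op)
        if limit is not None and counters[obj_id][op] > limit:
            # Write-hot SSD objects move to HDD; read-hot HDD objects move to
            # SSD; archive objects move up when either threshold is crossed.
            target = {"ssd": "hdd", "hdd": "ssd", "archive": "hdd"}[tier]
            new_devices = crush_place(obj_id, target)
            print(f"migrate {obj_id} and all its replicas -> {new_devices}")

    on_io_complete("obj-7", "write", "ssd")   # transparent to the upper layer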
As the second method of the invention shows, determining the migration thresholds is the key point of the solution: if a threshold is set too small, data objects migrate frequently and produce large migration overhead; if it is set too large, many writes go to the solid state drive. Setting the threshold conditions therefore requires jointly considering system performance and load balancing. The invention further provides a threshold algorithm.
Table 3 lists the meaning of each letter or letter combination.
Table 3: Definitions of the letter symbols
[Table 3 is an image not reproduced in this text.]
In Table 3, R_s = C_s/(C_s+C_h+C_a), R_h = C_h/(C_s+C_h+C_a), and R_a = C_a/(C_s+C_h+C_a).
The flow chart of the threshold algorithm is shown in Figure 5. The input parameters of the program are: the load-balancing limit value α that the data objects must satisfy, the performance improvement ratio β, the initial performance P0 under an even distribution, the read-operation information record table, and the write-operation information record table (the input parameters are treated as known quantities). The program outputs four thresholds.
The flow begins at step 000, and then:
In step 001, the input parameters are obtained: the initial performance P0, the performance improvement ratio β, and the load-balancing limit value α that the data objects must satisfy.
In step 002, the following variables are defined: the number of data-object units k=0; the performance gain from moving data objects from the HDD to the SSD, PG_h→s=0; the performance loss from moving data objects from the SSD to the HDD, PL_s→h=0; the performance gain from moving data objects from the Archive HDD to the HDD, PG^w_a→h=0; the performance gain from moving data objects from the Archive HDD to the SSD and HDD, PG^r_a=0; the row number of the read-operation data record table, i=0; and the row number of the write-operation data record table, j=0.
In step 003, it is determined whether the performance improvement satisfies PG_h→s + PG^w_a→h + PG^r_a − PL_s→h >= P0·β; if so, step 017 is executed, otherwise step 004.
In step 004, the data-object unit count k is incremented by 1. At initialization, movement is set to units of k·V_ssd data objects; whenever the performance improvement requirement is not met, this step performs k=k+1 and movement proceeds with k·V_ssd data objects, where V_ssd denotes the number of data objects that may be moved out of the SSD under the load balancing condition.
In step 005, j=j+1 is assigned and row j of the write-operation data record table is read, i.e. the number of data objects whose write count is one higher than in the previous iteration is found; the write count corresponding to row j is W(j). The initial value of j is 0, so on the first execution of this loop j is 1 and the first row of the write-operation record table is read.
In step 006, supposing the data objects in the SSD whose write count is greater than j−1 are moved to the HDD, it is determined whether the number of moved objects is greater than k·V_ssd; if so, step 005 is executed; otherwise the threshold WS is assigned W(j), the write-count value of row j, and step 007 is executed.
In step 007, the threshold WS and the performance loss PL_s→h are recorded, j=0 is assigned, and step 008 is executed.
In step 008, i=i+1 is assigned and row i of the read-operation data record table is read, i.e. the number of data objects whose read count is one higher than in the previous iteration is found; the read count corresponding to row i is R(i). The initial value of i is 0, so on the first execution of this loop i is 1 and the first row of the read-operation record table is read, the row corresponding to data with read count R(i)=0.
In step 009, supposing the data objects in the HDD whose read count is greater than i−1 are moved to the SSD, it is determined whether the number of moved objects is greater than or equal to k·V_ssd; if so, step 008 is executed; otherwise the threshold RH is assigned R(i) and step 010 is executed.
In step 010, the threshold RH and the performance gain PG_h→s are recorded, i=0 is assigned, and step 011 is executed.
In step 011, i=i+1 is assigned and row i of the read-operation data record table is read, i.e. the number of data objects whose read count is one higher than in the previous iteration is found; the read count corresponding to row i is R(i).
In step 012, to avoid insufficient SSD storage space, the data objects moved off the Archive HDD are split between the SSD and the HDD in the unit-count ratio C_s:C_h. It is determined whether the number of Archive HDD data objects with read count greater than R(i) moved to the SSD and HDD exceeds C_s/C_h·V_ssd; if so, step 011 is executed; otherwise the threshold RA is assigned R(i), this loop ends, and step 013 is executed.
In step 013, the threshold RA and the performance gain PG^r_a are recorded, i=0 is assigned, and step 014 is executed.
In step 014, j=j+1 is assigned and row j of the write-operation data record table is read, i.e. the number of data objects whose write count is one higher than in the previous iteration is found; the write count corresponding to row j is W(j).
In step 015, supposing the data objects on the Archive HDD whose write count is greater than j−1 are moved to the HDD, it is determined whether the number moved is greater than (C_h−C_s)/C_h·V_ssd; if so, step 014 is executed; otherwise the threshold WA is assigned W(j), this loop ends, and step 016 is executed.
In step 016, the threshold WA and the performance gain PG^w_a→h are recorded, j=0 is assigned, and step 003 is executed.
In step 017, the thresholds WS, RH, RA, and WA are output.
In step 018, the program ends.
The four loops in the above threshold algorithm are independent but ordered, and this ordering is the design idea of the algorithm: first consider moving some data objects from the SSD to the HDD (the first loop); to achieve load balance, the same number of data objects must then be moved from the HDD to the SSD (the second loop). During this movement, one must consider how many data objects to move and how performance changes afterwards.
From the input parameter α and the SSD's capacity, the fluctuation range V_ssd of the data stored on the SSD can be calculated. For example, if 100 data objects can be stored under a completely even distribution and the load balancing condition allows a 5% fluctuation, then α is 5%, the SSD stores at most 105 and at least 95 data objects, and V_ssd=5. Data objects are moved in multiples of V_ssd, which is the k in the program; that is, 5, 10, 15, ... objects can be moved off the SSD, and after each movement the resulting performance change must be calculated.
The third and fourth loops calculate the thresholds for moving data off the Archive HDD. The third loop considers the read threshold: read-intensive data on the Archive HDD is moved to the SSD and HDD, but the moved data cannot exceed the SSD's maximum allowed fluctuation capacity V_ssd, and the read-intensive data objects on the Archive HDD are moved to the SSD and HDD according to the capacity ratio of the SSD and the HDD. The fourth loop considers the write threshold: write-intensive data objects move only to the HDD, taking into account the maximum number that can be moved off the Archive HDD and the HDD's capacity limit.
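A condensed sketch of this search follows. It assumes each record table has been pre-aggregated into (count value, number of objects at or above that count) pairs in ascending count order, and it abstracts the performance formulas into a caller-supplied gain_of function; the names and the max_k guard are illustrative, not the patent's reference implementation.

    # Condensed sketch of the Figure 5 threshold search: each loop walks a
    # sorted record table and picks the smallest count value whose "objects
    # at or above this count" fits the movement quota for that device.
    def find_threshold(table, quota):
        # table: (count_value, objects_with_count_at_or_above) pairs, ascending
        for value, moved in table:
            if moved <= quota:
                return value
        return table[-1][0]

    def thresholds(write_ssd, read_hdd, read_arc, write_arc,
                   v_ssd, cs, ch, p0, beta, gain_of, max_k=100):
        for k in range(1, max_k + 1):                      # step 004: enlarge quota
            ws = find_threshold(write_ssd, k * v_ssd)               # loop 1 (WS)
            rh = find_threshold(read_hdd, k * v_ssd)                # loop 2 (RH)
            ra = find_threshold(read_arc, cs / ch * k * v_ssd)      # loop 3 (RA)
            wa = find_threshold(write_arc, (ch - cs) / ch * k * v_ssd)  # loop 4 (WA)
            # step 003: stop once the net gain reaches the improvement target
            if gain_of(ws, rh, ra, wa) >= p0 * beta:
                return ws, rh, ra, wa
        return ws, rh, ra, wa   # best effort if the target is unreachable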
Embodiment of the threshold algorithm
An example of the threshold algorithm, with the following assumptions (i.e. the program's input information):
① The ratio of SSD data capacity to HDD data capacity to Archive HDD data capacity is 1:3:5;
② the migration latency of a data object between any two storage media is 10 milliseconds;
③ α=20%, β=10%;
④ the read/write latencies of the storage devices are as in Table 4:
Table 4 Read/write latency table
[Table 4 is an image not reproduced in this text.]
In Table 4, the latency of each storage device is normalized from its read/write performance specifications.
⑤ Suppose the read-count record table of a set of stored data is Table 5 and the write-count record table is Table 6:
Table 5 Read-operation data input table
[Table 5 is an image not reproduced in this text.]
Table 6 Write-operation data input table
[Table 6 is an image not reproduced in this text.]
Table 7 Read-operation formula symbol definition table
[Table 7 is an image not reproduced in this text.]
In Table 7, the values are computed with the following formulas:
① NO_i = NO_{i−1} + N^r_i
② NR_i = NR_{i−1} + F^r_i · N^r_i
③ PG_h→s = R_h · (NR_i · (L^r_h − L^r_s) − NO_i · L_h→s)
④ PG^r_a = PG^r_a→h + PG^r_a→s
⑤ PG^r_a→h = R_a · C_h/(C_s + C_h) · C_s/C_h · ((L^r_a − L^r_h) · NR_i − NO_i · L_a→h)
⑥ PG^r_a→s = R_a · C_s/(C_s + C_h) · C_s/C_h · ((L^r_a − L^r_s) · NR_i − NO_i · L_a→s)
Table 8 is obtained using the above formulas.
In Table 8, for read counts, the "number of data objects with ≥ R(i) reads" is computed with formula ①; for example, in the first row, the number of data objects with read count ≥ R(1) is 3400+1600=5000. The "total read count of data objects with ≥ R(i) reads" is computed with formula ②; for example, in the first row the total read count of data objects with read count ≥ R(1) is 5940+1600×0=5940.
Since the ratio of SSD, HDD, and Archive HDD data capacities is 1:3:5, i.e. C_s:C_h:C_a = 1:3:5, the SSD data capacity accounts for 1/9 of the total system data capacity, the HDD for 3/9, and the Archive HDD for 5/9. The ratio of Archive HDD data objects migrated out for reads to those migrated out for writes is (C_s/C_h·V_ssd):((C_h−C_s)/C_h·V_ssd) = 1:2; so, to improve read performance, 1/3 of the data moved off the Archive HDD goes to the SSD and HDD, allocated according to the SSD:HDD capacity ratio of 1:3, and, to improve write performance, 2/3 of the data moved off the Archive HDD goes to the HDD.
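These proportions can be checked in a few lines (the capacity values are the embodiment's assumptions):

    # Check of the Archive HDD migration split for C_s:C_h:C_a = 1:3:5.
    cs, ch, ca = 1, 3, 5
    read_quota = cs / ch             # fraction of V_ssd moved out for reads: 1/3
    write_quota = (ch - cs) / ch     # fraction moved out for writes: 2/3
    print(read_quota, write_quota)                   # 0.333..., 0.666...
    print(read_quota / (read_quota + write_quota))   # read share of all moves: 1/3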
In Table 8, the read-performance change of moving data from the HDD to the SSD is computed with formula ③, the read-performance change of moving data from the Archive HDD to the HDD with formula ⑤, and the read-performance change of moving data from the Archive HDD to the SSD with formula ⑥. All values of the read-operation record table computed with these formulas are given in Table 8:
Table 8 Read-operation data record table
[Table 8 is an image not reproduced in this text.]
Table 9 Write-operation formula symbol definition table
[Table 9 is an image not reproduced in this text.]
In Table 9, the values are computed with the following formulas:
① NO_j = NO_{j−1} + N^w_j
② NW_j = NW_{j−1} + F^w_j · N^w_j
③ PL_s→h = R_s · (NW_j · (L^w_h − L^w_s) − NO_j · L_s→h)
④ PG^w_a→h = R_a · (C_h − C_s)/C_h · ((L^w_a − L^w_h) · NW_j − NO_j · L_a→h)
Table 10 is obtained using the above formulas.
In Table 10, for write counts, the "number of data objects with ≥ W(j) writes" is computed with formula ①; for example, the number of data objects with write count ≥ W(1) is 2400+2600=5000. The "total write count of data objects with ≥ W(j) writes" is computed with formula ②; for example, the total write count of data objects with write count ≥ 0 is 6100+6100×0=6100. The write-performance change of moving data from the SSD to the HDD is computed with formula ③, and the write-performance change of moving data from the Archive HDD to the HDD with formula ④. All values of the write-operation record table computed with these formulas are given in Table 10:
Table 10 Write-operation data record table
[Table 10 is an image not reproduced in this text.]
Suppose the initial performance P0 is 10000 milliseconds, the total number of data objects is 5000, and data may move across storage media within α=20% of the SSD data capacity while the data objects remain evenly distributed. From step 004, the total number of movable data objects is k·V_ssd; for k=1 this is 1×5000×1/9×20% = 111.11. With a performance improvement ratio β of 10%, the target value is P0·β = 10000×10% = 1000 milliseconds. By the table data, at most 111 data objects are moved; to satisfy the balanced-distribution condition, the Archive HDD moves 111×1/3 = 37 objects to improve read performance and 111×2/3 = 74 objects to improve write performance.
The loop of steps 005 and 006 satisfies its condition when j has increased from 0 to 7; then WS=6 and PL_s→h = 738.222 milliseconds.
The loop of steps 008 and 009 satisfies its condition when i has increased from 0 to 6; then RH=5 and PG_h→s = 947.3333 milliseconds.
The loop of steps 011 and 012 satisfies its condition when i has increased from 0 to 9; then RA=8 and PG^r_a = 222.2222 + 136.5741 = 358.7963 milliseconds, the two terms being computed with the read-operation formulas ⑤ and ⑥.
The loop of steps 014 and 015 satisfies its condition when j has increased from 0 to 9; then WA=8 and PG^w_a→h = 857.7778 milliseconds.
The overall performance improvement is PG_h→s + PG^w_a→h + PG^r_a − PL_s→h = 1425.685 milliseconds > 1000 milliseconds, which satisfies the performance improvement requirement, so the thresholds obtained are WS=6, RH=5, RA=8, and WA=8.
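The closing check can be reproduced directly from the values above:

    # Verify the embodiment's termination condition: net gain >= P0 * beta.
    pg_h_to_s, pg_w_a, pg_r_a, pl_s_to_h = 947.3333, 857.7778, 358.7963, 738.222
    p0, beta = 10000, 0.10
    net_gain = pg_h_to_s + pg_w_a + pg_r_a - pl_s_to_h
    print(net_gain, net_gain >= p0 * beta)   # ~1425.685 True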

Claims (6)

  1. A data distribution method for a decentralized distributed heterogeneous storage system, characterized in that it comprises the following steps:
    Step 1: during program execution, counting the number of times each data object is read/written and converting the read and write counts into weights that serve as the data's access pattern; classifying the data objects according to their access patterns;
    Step 2: classifying the storage devices according to their capacity and read/write performance;
    Step 3: dividing the stored data into different "placement group clusters", a "placement group cluster" containing multiple "placement groups", with each type of storage device corresponding to one type of "placement group cluster";
    Step 4: according to the load balancing target and the performance indicators of the storage system, calculating the proportion of each type of data object to be stored that should be placed into each type of "placement group cluster";
    Step 5: using a hash algorithm to determine which "placement group" within the "placement group cluster" a data object to be stored belongs to;
    Step 6: using the data distribution algorithm of the storage system to store the data objects of each "placement group" onto multiple corresponding storage devices.
  2. The data distribution method for a decentralized distributed heterogeneous storage system according to claim 1, characterized in that in step 4 the calculation of the proportion of each type of data object to be stored placed into each type of "placement group cluster" comprises:
    step 802, calculating the total number of data objects to be stored;
    step 803, calculating the total number of existing data objects;
    step 804, calculating, according to the load balancing condition, the maximum number of data objects each "placement group cluster" can store;
    step 805, arranging all data objects to be stored in ascending order of average write count;
    step 806, arranging all "placement group clusters" in descending order of performance;
    step 807, initializing a variable i=0 for scanning the classes of data objects to be stored;
    step 808, initializing a variable j=0 for scanning the "placement group cluster" classes;
    step 809, assigning the data objects of the i-th class to be stored to the j-th "placement group cluster";
    step 810, recording the number of class-i data objects to be stored that are stored in "placement group cluster" j;
    step 811, determining whether "placement group cluster" j has reached its maximum storage count; if so, executing step 812, otherwise executing step 813;
    step 813, determining whether all data objects to be stored have been processed; if so, executing step 816, otherwise executing step 814;
    step 814, processing the next class of data objects to be stored and executing step 809;
    step 812, processing the next "placement group cluster";
    step 815, determining whether all "placement group clusters" have been processed; if so, executing step 816, otherwise executing step 809;
    step 816, calculating, from the per-cluster counts of each class recorded in step 810, the proportion of each class of data objects to be stored allocated to each type of "placement group cluster".
  3. The data distribution method for a decentralized distributed heterogeneous storage system according to claim 2, characterized in that in step 809 the data objects of the i-th class to be stored are assigned to the j-th "placement group cluster" by filling in the number of objects of each class in sequence, following the orders established in steps 805 and 806 and the "placement group cluster" capacities calculated in step 804.
  4. The data distribution method for a decentralized distributed heterogeneous storage system according to claim 1, characterized in that in step 6 a pseudo-random hash algorithm maps each "placement group" to different storage devices.
  5. A data distribution method for a decentralized distributed heterogeneous storage system, characterized in that it comprises the following steps:
    Step 1: during program execution, counting the system's total number of reads and writes and the total number of accessed data objects over a period of time, to determine the access pattern of the data objects in the system during that period;
    Step 2: classifying the storage devices according to their capacity and read/write performance;
    Step 3: dividing the data objects into different "placement group clusters", a "placement group cluster" containing multiple "placement groups", with each type of storage device corresponding to one type of "placement group cluster";
    Step 4: for newly stored data objects, using a uniform hash algorithm to map each data object to a "placement group cluster" and a "placement group", and adding an identifier to each data object indicating which "placement group cluster" it belongs to;
    Step 5: using the data distribution algorithm of the storage system to store the data objects of each "placement group" onto multiple corresponding storage devices;
    Step 6: while the system is running, calculating a migration threshold for data access on each storage device according to the data access pattern, and dynamically migrating data objects to suitable storage devices according to these thresholds, so as to reduce the number of writes to the solid state drives and improve the system's read/write performance.
  6. The data distribution method for a decentralized distributed heterogeneous storage system according to claim 5, characterized in that in step 6 the calculation of the migration threshold for data access on each storage device comprises:
    in step 001, obtaining the input parameters: the initial performance P0, the performance improvement ratio β, and the load-balancing limit value α that the data objects must satisfy;
    in step 002, defining the following variables: the number of data-object units k=0; the performance gain from moving data objects from the HDD to the SSD, PG_h→s=0; the performance loss from moving data objects from the SSD to the HDD, PL_s→h=0; the performance gain from moving data objects from the Archive HDD to the HDD, PG^w_a→h=0; the performance gain from moving data objects from the Archive HDD to the SSD and HDD, PG^r_a=0; the row number of the read-operation data record table, i=0; and the row number of the write-operation data record table, j=0;
    in step 003, determining whether the performance improvement satisfies PG_h→s + PG^w_a→h + PG^r_a − PL_s→h >= P0·β; if so, executing step 017, otherwise executing step 004;
    in step 004, incrementing the data-object unit count k by 1; at initialization, movement is set to k·V_ssd data objects, and whenever the performance improvement requirement is not met this step performs k=k+1 so that movement proceeds with k·V_ssd data objects, where V_ssd denotes the number of data objects that may be moved out of the SSD under the load balancing condition;
    in step 005, assigning j=j+1 and reading row j of the write-operation data record table, i.e. finding the number of data objects whose write count is one higher than in the previous iteration, the write count corresponding to row j being W(j); the initial value of j is 0, so on the first execution of this loop j is 1 and the first row of the write-operation record table is read;
    in step 006, supposing the data objects in the SSD whose write count is greater than j−1 are moved to the HDD, determining whether the number of moved data objects is greater than k·V_ssd; if so, executing step 005; otherwise assigning the threshold WS the value W(j), the write-count value of row j, and executing step 007;
    in step 007, recording the threshold WS and the performance loss PL_s→h, performing j=0, and then executing step 008;
    in step 008, assigning i=i+1 and reading row i of the read-operation data record table, i.e. finding the number of data objects whose read count is one higher than in the previous iteration, the read count corresponding to row i being R(i); the initial value of i is 0, so on the first execution of this loop i is 1 and the first row of the read-operation record table is read, the row corresponding to data with read count R(i)=0;
    in step 009, supposing the data objects in the HDD whose read count is greater than i−1 are moved to the SSD, determining whether the number of moved data objects is greater than or equal to k·V_ssd; if so, executing step 008; otherwise assigning the threshold RH the value R(i) and executing step 010;
    in step 010, recording the threshold RH and the performance gain PG_h→s, performing i=0, and then executing step 011;
    in step 011, assigning i=i+1 and reading row i of the read-operation data record table, i.e. finding the number of data objects whose read count is one higher than in the previous iteration, the read count corresponding to row i being R(i);
    in step 012, to avoid insufficient SSD storage space, splitting the data objects moved off the Archive HDD between the SSD and the HDD in the unit-count ratio C_s:C_h; determining whether the number of Archive HDD data objects with read count greater than R(i) moved to the SSD and HDD exceeds C_s/C_h·V_ssd; if so, executing step 011; otherwise assigning the threshold RA the value R(i), ending this loop, and executing step 013;
    in step 013, recording the threshold RA and the performance gain PG^r_a, performing i=0, and then executing step 014;
    in step 014, assigning j=j+1 and reading row j of the write-operation data record table, i.e. finding the number of data objects whose write count is one higher than in the previous iteration, the write count corresponding to row j being W(j);
    in step 015, supposing the data objects on the Archive HDD whose write count is greater than j−1 are moved to the HDD, determining whether the number moved is greater than (C_h−C_s)/C_h·V_ssd; if so, executing step 014; otherwise assigning the threshold WA the value W(j), ending this loop, and executing step 016;
    in step 016, recording the threshold WA and the performance gain PG^w_a→h, performing j=0, and then executing step 003;
    in step 017, outputting the thresholds WS, RH, RA, and WA.
PCT/CN2017/082718 2016-05-31 2017-05-02 Data distribution method for a decentralized distributed heterogeneous storage system WO2017206649A1 (zh)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201780026690.XA | 2016-05-31 | 2017-05-02 | Data distribution method for a decentralized distributed heterogeneous storage system

Applications Claiming Priority (2)

Application Number | Priority Date | Filing Date | Title
CN201610376033.5 | 2016-05-31 | |
CN201610376033.5A | 2016-05-31 | 2016-05-31 | Data distribution method for a decentralized distributed heterogeneous storage system

Publications (1)

Publication Number
WO2017206649A1 (zh)

Family

ID=57171584

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
PCT/CN2017/082718 | Data distribution method for a decentralized distributed heterogeneous storage system | 2016-05-31 | 2017-05-02

Country Status (2)

Country | Link
CN (2) | CN106055277A (zh)
WO (1) | WO2017206649A1 (zh)


Also Published As

Publication number Publication date
CN109196459A (zh) 2019-01-11
CN109196459B (zh) 2020-12-08
CN106055277A (zh) 2016-10-26


Legal Events

Date | Code | Title | Description

121 | Ep: the epo has been informed by wipo that ep was designated in this application
      Ref document number: 17805594; Country of ref document: EP; Kind code of ref document: A1

NENP | Non-entry into the national phase
      Ref country code: DE

122 | Ep: pct application non-entry in european phase
      Ref document number: 17805594; Country of ref document: EP; Kind code of ref document: A1