WO2024109066A1 - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
WO2024109066A1
Authority
WO
WIPO (PCT)
Prior art keywords
compressed
data
target data
index information
data set
Prior art date
Application number
PCT/CN2023/104582
Other languages
English (en)
French (fr)
Inventor
解为斌
王道辉
丁萌
门勇
王阳
Original Assignee
华为云计算技术有限公司
Application filed by 华为云计算技术有限公司
Publication of WO2024109066A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers

Definitions

  • the present application relates to the field of storage technology, and in particular to a data processing method and device.
  • Data deduplication and compression technology is the most effective and direct way to reduce storage costs.
  • Deduplication and compression technology based on similar data clustering can cluster similar data distributed in time and space in the system together, which is conducive to improving the data reduction rate of the system.
  • Data deduplication and compression technology can use feature value sampling to cluster similar data that is highly discrete and unevenly distributed in time and space, and then compress the data through layered data reduction technology.
  • Layered data reduction technology, on the one hand, finds identical data within the clustered similar data for deduplication and, on the other hand, compresses the clustered similar data through a differential compression scheme to obtain differential blocks, then compresses the differential blocks through a deep compression algorithm to further improve the reduction rate.
  • the present application provides a data processing method and device.
  • the present application simplifies the data compression process, effectively reduces the delay of data compression, and improves the data compression efficiency.
  • the technical solutions provided by the present application are as follows:
  • the present application provides a data processing method, which includes: obtaining multiple data blocks to be compressed; merging the multiple data blocks to be compressed; and compressing the merged multiple data blocks to be compressed to obtain a merged and compressed data set.
  • The multiple data blocks to be compressed are first merged, and the merged data blocks are then compressed together (merge compression) to obtain a merged and compressed data set.
  • As a result, the data compression process is simplified, the data compression delay is effectively reduced, the data compression efficiency is improved, and the consumption of compression resources (such as CPU) is reduced.
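The merge-then-compress idea above can be sketched as follows. This is a minimal illustration, not the patent's algorithm: it assumes Python's standard `zlib` (DEFLATE) as a stand-in compressor, and the helper names `merge_compress`, `compress_separately` and `make_similar_blocks` are hypothetical.

```python
import random
import zlib


def merge_compress(blocks: list[bytes]) -> bytes:
    """Merge the blocks in order, then compress the merged buffer in one call."""
    return zlib.compress(b"".join(blocks))


def compress_separately(blocks: list[bytes]) -> list[bytes]:
    """Baseline for comparison: one compression call per block."""
    return [zlib.compress(block) for block in blocks]


def make_similar_blocks(count: int = 4, size: int = 4096) -> list[bytes]:
    """Build test blocks that differ from a shared base in a single byte each."""
    rng = random.Random(0)
    base = bytes(rng.randrange(256) for _ in range(size))
    blocks = []
    for i in range(count):
        block = bytearray(base)
        block[i * 16] = (block[i * 16] + 1) % 256  # one-byte difference
        blocks.append(bytes(block))
    return blocks


blocks = make_similar_blocks()
merged = merge_compress(blocks)
separate = compress_separately(blocks)

# Compare the merged-and-compressed size with the per-block total.
print(len(merged), sum(len(c) for c in separate))
```

In this sketch the similar blocks sit within DEFLATE's 32 KiB window, so the single merged stream can reuse earlier blocks as redundancy sources, which per-block compression cannot do.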
  • The multiple data blocks to be compressed are arranged in order, and the coding partition of any given data block to be compressed includes: that data block, and all data blocks to be compressed that are arranged before it.
  • This expands the data volume of the coding partition. Because the coding partition that is searched for redundant data when a single data block is compressed is larger, it helps improve the temporal and spatial utilization of the samples searched for redundant data, which helps further improve the reduction rate of the compression process and thereby further reduce the storage cost.
  • Compressing the merged multiple data blocks to be compressed includes: compressing the merged multiple data blocks to be compressed based on forward coding.
  • With forward coding, the compression order of the compression algorithm matches the coding partition of each data block, which helps further improve the reduction rate while keeping compression and decompression overhead low.
  • The compression ratio of the compression algorithm can also be taken into account to further ensure the compression ratio of the compressed data, and thereby control the storage cost of the storage system.
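The coding partition described above (a block plus every block arranged before it) can be illustrated with a preset dictionary. This is a sketch under stated assumptions: `zlib` with its `zdict` parameter is used only as an analogy for searching earlier blocks for redundancy, not as the forward-coding scheme of the application, and all function names are hypothetical.

```python
import random
import zlib


def coding_partition(blocks: list[bytes], i: int) -> list[bytes]:
    """Return block i together with every block arranged before it."""
    return blocks[: i + 1]


def preceding_bytes(blocks: list[bytes], i: int) -> bytes:
    """Concatenate the blocks of the coding partition that come before block i."""
    return b"".join(coding_partition(blocks, i)[:-1])


def compress_with_partition(blocks: list[bytes], i: int) -> bytes:
    """Compress block i while allowing references into the preceding blocks."""
    preceding = preceding_bytes(blocks, i)
    # Assumption: a DEFLATE preset dictionary models the earlier blocks.
    # DEFLATE only looks back 32 KiB, so this sketch suits small partitions.
    comp = zlib.compressobj(zdict=preceding) if preceding else zlib.compressobj()
    return comp.compress(blocks[i]) + comp.flush()


def decompress_with_partition(preceding: bytes, payload: bytes) -> bytes:
    """Reverse compress_with_partition given the same preceding bytes."""
    decomp = zlib.decompressobj(zdict=preceding) if preceding else zlib.decompressobj()
    return decomp.decompress(payload) + decomp.flush()


def make_blocks() -> list[bytes]:
    """Three 1 KiB blocks that share a random base and differ in one byte each."""
    rng = random.Random(1)
    base = bytes(rng.randrange(256) for _ in range(1024))
    return [base[:i] + b"#" + base[i + 1:] for i in (10, 500, 900)]


blocks = make_blocks()
payload = compress_with_partition(blocks, 2)
alone = zlib.compress(blocks[2])
print(len(payload), len(alone))
```

Because block 2 can reference blocks 0 and 1, its compressed form is much smaller here than when it is compressed in isolation.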
  • the data blocks to be compressed can be obtained based on the data writing instruction. For example, in the storage field, multiple data blocks to be written can be obtained based on the data writing instruction; and multiple data blocks to be compressed can be obtained based on the multiple data blocks to be written.
  • After obtaining the merged and compressed data set, the method also includes: storing the data set on a storage medium.
  • The index information of the storage medium can also be updated based on the position of the data set on the storage medium, so that when data in the data set needs to be read, the corresponding data block can be located according to the index information.
  • the index information of the storage medium is used to record the index information of all data stored in the storage system.
  • the index information of the data block to be compressed may indicate not only the physical address of the compressed data, but also the logical index number (logic-index) of the compressed data block.
  • the logical index number may be allocated to the multiple data blocks to be compressed before the multiple data blocks to be compressed are compressed.
  • The index information further indicates that the data set is obtained through merge compression. Since data stored in the storage system may or may not have been merge-compressed, this indication distinguishes merge-compressed data from data that was not merge-compressed.
  • the index information of the to-be-screened data blocks retained after deduplication processing can also indicate that the retained to-be-screened data blocks are obtained after deduplication processing. In this way, the deduplication-processed data and the non-deduplication-processed data in the storage system can be distinguished.
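One possible shape for such index information is sketched below. The field names, the dictionary-based index and the `index_merged_set` helper are assumptions made for illustration; the text only states that the index can carry a physical address, a logical index number, and indications of merge compression and deduplication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    physical_address: int       # where the compressed data set is stored
    logical_index: int          # position of the block inside the merged set
    merged_compressed: bool     # True if the set came from merge compression
    deduplicated: bool = False  # True if the block was retained after dedup


def index_merged_set(index: dict[str, IndexEntry],
                     block_keys: list[str],
                     physical_address: int) -> None:
    """Record one entry per block of a merged and compressed data set.

    Logical index numbers are allocated in block order, which mirrors the
    option of assigning them before the blocks are compressed.
    """
    for logical_index, key in enumerate(block_keys):
        index[key] = IndexEntry(physical_address, logical_index, True)


index: dict[str, IndexEntry] = {}
index_merged_set(index, ["blk-a", "blk-b", "blk-c"], physical_address=4096)
print(index["blk-b"])
```

All blocks of one data set share the same physical address in this sketch, and the logical index number is what tells them apart after decompression.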
  • the multiple data blocks to be compressed may be data blocks containing similar data. Based on the multiple data blocks to be written, multiple data blocks to be compressed are obtained, including: based on the multiple data blocks to be written, multiple data blocks to be screened are obtained; based on the multiple data blocks to be screened that have similar data in the multiple data blocks to be screened, multiple data blocks to be compressed are obtained.
  • the data to be written may include data that needs to be compressed and data that does not need to be compressed.
  • multiple data blocks to be screened are obtained, including: screening out data blocks that need to be compressed from the multiple data blocks to be written, to obtain multiple data blocks to be screened.
  • Obtaining multiple data blocks to be compressed based on the data blocks to be screened that have similar data may include: after selecting, from the multiple data blocks to be screened, the data blocks to be screened that have similar data, performing deduplication processing on them to obtain multiple data blocks to be compressed.
  • the index information of the to-be-screened data block retained after the deduplication process also indicates that the retained to-be-screened data block is obtained after the deduplication process.
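The screening and deduplication steps above can be sketched as a small pipeline. This is only an illustration: the `needs_compression` predicate is hypothetical, exact duplicates are detected with a SHA-256 fingerprint as an assumed technique, and the similarity-clustering step is deliberately left out.

```python
import hashlib
from typing import Callable


def screen_blocks(blocks: list[bytes],
                  needs_compression: Callable[[bytes], bool]) -> list[bytes]:
    """Keep only the blocks to be written that need to be compressed."""
    return [block for block in blocks if needs_compression(block)]


def deduplicate(blocks: list[bytes]) -> list[bytes]:
    """Drop exact duplicates, keeping the first occurrence and the order."""
    seen: set[bytes] = set()
    retained: list[bytes] = []
    for block in blocks:
        fingerprint = hashlib.sha256(block).digest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            retained.append(block)
    return retained


to_write = [b"alpha-1", b"", b"alpha-2", b"alpha-1", b"alpha-3"]
# Hypothetical screening rule: empty blocks do not need compression.
to_screen = screen_blocks(to_write, needs_compression=lambda b: len(b) > 0)
to_compress = deduplicate(to_screen)
print(to_compress)
```

The retained blocks are what a later merge compression step would receive; marking them as deduplicated in the index is a separate bookkeeping step.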
  • the present application provides a data processing method.
  • the method includes: obtaining index information of a target data set to be decompressed, where the target data set includes one or more data blocks and is obtained based on merge compression processing; obtaining the target data set based on its index information; and performing de-merge compression processing on the target data set to obtain a de-merge-compressed target data set.
  • The data processing method is used to decompress data. After the index information of the target data set to be decompressed is obtained, the target data set is obtained based on that index information, and de-merge compression processing is then performed on it to obtain the de-merge-compressed target data set.
  • the target data set includes one or more data blocks, and the target data set is obtained based on the merge compression process.
  • the index information of the target data set also indicates the processing type of the data in the target data set.
  • Performing de-merge compression processing on the target data set to obtain the de-merge-compressed target data set includes: when the index information of the target data set indicates that the target data set is obtained through merge compression, performing de-merge compression processing on the target data set to obtain the de-merge-compressed target data set.
  • A decompression method that matches the merge compression of the target data set can be used to decompress the target data set. For example, when the target data set is obtained through merge compression based on forward coding, a decompression method that matches forward coding can be used to decompress the target data set.
  • When the coding partition of any data block to be compressed includes that data block and all data blocks to be compressed that are arranged before it, only one decompression is required to decompress the target data set.
  • the target data set is stored on a storage medium
  • the main index area and the online index area of the storage medium both record index information
  • the main index area records the first index information of all data stored in the storage medium
  • the online index area records the second index information
  • the first index information is updated based on the second index information
  • the index information of the target data set to be decompressed is obtained, including: preferentially obtaining the index information of the target data set from the first index information recorded in the main index area.
  • When obtaining the index information of the target data set, the index information is preferentially obtained from the first index information recorded in the main index area. If the target data set cannot be obtained based on the first index information, the index information of the target data set can then be obtained from the second index information recorded in the online index area, and the target data set can be obtained based on that index information.
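The lookup order described above can be sketched with two dictionaries. The names `lookup_index` and `apply_online_index`, and the use of plain integers as index information, are assumptions for illustration; how and when the main index is actually updated from the online index is not specified here.

```python
from typing import Optional


def lookup_index(key: str,
                 main_index: dict[str, int],
                 online_index: dict[str, int]) -> Optional[int]:
    """Look in the main index area first, then fall back to the online area."""
    entry = main_index.get(key)
    if entry is None:
        entry = online_index.get(key)
    return entry


def apply_online_index(main_index: dict[str, int],
                       online_index: dict[str, int]) -> None:
    """Update the first (main) index information from the second (online) one."""
    main_index.update(online_index)
    online_index.clear()


main_index = {"set-1": 4096}
online_index = {"set-2": 8192}

found_in_main = lookup_index("set-1", main_index, online_index)
found_in_online = lookup_index("set-2", main_index, online_index)
missing = lookup_index("set-3", main_index, online_index)

apply_online_index(main_index, online_index)
print(found_in_main, found_in_online, missing, main_index)
```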
  • the target data set can be obtained based on a data read instruction.
  • the target data set is obtained based on a data read instruction.
  • the data reading instruction instructs to read the target data block in the target data set
  • the method also includes: obtaining index information of the target data block; obtaining the target data block from the de-merge-compressed target data set based on the index information of the target data block; and returning the target data block in response to the data read instruction.
  • the index information of the target data block indicates the logical index number of the target data block, and obtaining the target data block from the de-merge-compressed target data set based on the index information of the target data block includes: obtaining the target data block from the de-merge-compressed target data set based on the logical index number.
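The read path above (one decompression of the whole data set, then selection of a block by logical index number) might look like the following sketch. It assumes `zlib` as the compressor and assumes that block lengths are stored next to the payload so block boundaries can be recovered; neither detail comes from the application, and the names are hypothetical.

```python
import zlib
from dataclasses import dataclass


@dataclass
class MergedDataSet:
    payload: bytes            # merged and compressed bytes
    block_lengths: list[int]  # assumed metadata: length of each original block


def write_merged(blocks: list[bytes]) -> MergedDataSet:
    """Merge the blocks in order and compress them in a single call."""
    return MergedDataSet(zlib.compress(b"".join(blocks)),
                         [len(block) for block in blocks])


def read_block(data_set: MergedDataSet, logical_index: int) -> bytes:
    """Decompress the whole set once, then slice out the requested block."""
    merged = zlib.decompress(data_set.payload)
    start = sum(data_set.block_lengths[:logical_index])
    end = start + data_set.block_lengths[logical_index]
    return merged[start:end]


stored = write_merged([b"first block", b"second", b"third block!"])
print(read_block(stored, 1))
```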
  • the present application provides a data processing device, which includes: an acquisition module for acquiring multiple data blocks to be compressed; a merging module for merging the multiple data blocks to be compressed; and a compression module for compressing the merged multiple data blocks to be compressed to obtain a merged and compressed data set.
  • The multiple data blocks to be compressed are arranged in order, and the coding partition of any given data block to be compressed includes: that data block, and all data blocks to be compressed that are arranged before it.
  • the compression module is specifically used to: compress the merged multiple data blocks to be compressed based on forward coding.
  • the acquisition module is specifically used to: acquire multiple data blocks to be written based on the data writing instruction; acquire multiple data blocks to be compressed based on the multiple data blocks to be written.
  • the compression module is also used to: store the data set on a storage medium.
  • the compression module is further used to: update index information of the storage medium based on the location of the data set on the storage medium.
  • the index information also indicates that the data set is obtained through merging and compression processing.
  • the acquisition module is specifically used to: acquire multiple data blocks to be screened; and obtain multiple data blocks to be compressed based on the data blocks to be screened that have similar data in the multiple data blocks to be screened.
  • the acquisition module is specifically used to: perform deduplication processing on the data blocks to be screened that have similar data, so as to obtain multiple data blocks to be compressed.
  • the index information of the data block to be screened that is retained after the deduplication process further indicates that the retained data block to be screened is obtained after the deduplication process.
  • the index information of the data block to be compressed further indicates logical index numbers of multiple data blocks to be compressed.
  • the present application provides a data processing device.
  • the device includes: a first acquisition module, used to acquire index information of a target data set to be decompressed, where the target data set includes one or more data blocks and is obtained based on merge compression processing; a second acquisition module, used to acquire the target data set based on the index information of the target data set; and a decompression module, used to perform de-merge compression processing on the target data set to obtain a de-merge-compressed target data set.
  • the decompression module is specifically used to: when the index information of the target data set indicates that the target data set is obtained through merge compression processing, perform de-merge compression processing on the target data set to obtain the de-merge-compressed target data set.
  • the target data set is stored on a storage medium, and the main index area and online index area of the storage medium both record index information, the main index area records first index information of all data stored in the storage medium, and the online index area records second index information, and the first index information is updated based on the second index information.
  • the first acquisition module is specifically used to: preferentially obtain the index information of the target data set from the first index information recorded in the main index area.
  • the first acquisition module is further configured to: acquire index information of the target data set from second index information recorded in the online index area.
  • the target data set is obtained based on a data read instruction.
  • the data read instruction indicates to read the target data block in the target data set.
  • the first acquisition module is further used to acquire the index information of the target data block;
  • the second acquisition module is further used to acquire the target data block from the de-merge-compressed target data set based on the index information of the target data block;
  • the second acquisition module is further used to return the target data block in response to the data read instruction.
  • the index information of the target data block indicates a logical index number of the target data block
  • the second acquisition module is specifically used to: acquire the target data block from the de-merge-compressed target data set based on the logical index number.
  • the present application provides a computing device including a memory and a processor, wherein the memory stores program instructions, and the processor runs the program instructions to execute the method provided in the first and second aspects of the present application and any possible implementation thereof.
  • the present application provides a computer cluster, comprising multiple computing devices, the multiple computing devices comprising multiple processors and multiple memories, the multiple memories storing program instructions, and the multiple processors executing the program instructions, so that the computer cluster executes the methods provided in the first and second aspects of the present application and any possible implementation thereof.
  • the present application provides a computer-readable storage medium, which is a non-volatile computer-readable storage medium, and the computer-readable storage medium includes program instructions.
  • When the program instructions are executed on a computing device, the computing device executes the method provided in the first and second aspects of the present application and any possible implementation thereof.
  • the present application provides a computer program product comprising instructions, which, when executed on a computer, enables the computer to execute the methods provided in the first and second aspects of the present application and any possible implementation thereof.
  • FIG. 1 is a schematic diagram of an implementation environment involved in a data processing method provided in an embodiment of the present application.
  • FIG. 2 is a schematic diagram of an implementation environment involved in another data processing method provided in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the architecture of a storage system provided in an embodiment of the present application.
  • FIG. 4 is a flow chart of a data processing method for implementing data writing provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a coding partition of a data block provided in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of the structure of a storage system provided in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a data writing process provided by an embodiment of the present application.
  • FIG. 8 is a flow chart of a data processing method for implementing data reading provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a data reading process provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a data processing device provided in an embodiment of the present application.
  • FIG. 11 is a schematic diagram of another data processing device provided in an embodiment of the present application.
  • FIG. 12 is a schematic diagram of the structure of a computing device provided in an embodiment of the present application.
  • FIG. 13 is a schematic diagram of the structure of a computing device cluster provided in an embodiment of the present application.
  • Data deduplication and compression technology is an effective and direct way to reduce storage costs.
  • Data deduplication and data compression can reduce redundant data in the storage system, play a role in reducing data, and can significantly reduce the storage cost of the entire storage system.
  • the data that the storage system needs to process includes a lot of similar data, and the distribution of these similar data in time and space is very discrete and uneven.
  • data deduplication and compression technology can cluster similar data that are very discrete and unevenly distributed in time and space together through the method of feature value sampling, and then compress the data through hierarchical data reduction technology.
  • Hierarchical data reduction technology, on the one hand, searches the clustered similar data for identical data for deduplication and, on the other hand, compresses the clustered similar data through a differential compression scheme to obtain differential blocks, then compresses the differential blocks through a deep compression algorithm to further improve the reduction rate.
  • the embodiment of the present application provides a data processing method.
  • In the data compression stage, after the computing device obtains multiple data blocks to be compressed, it merges them and then performs compression processing (also called merge compression processing) on the merged data blocks to obtain a merged and compressed data set.
  • In the data decompression stage, after the computing device obtains the index information of the target data set to be decompressed, it obtains the target data set based on that index information and then performs de-merge compression processing on the target data set to obtain a de-merge-compressed target data set.
  • the target data set includes one or more data blocks, and the target data set is obtained based on the merge compression processing.
  • When compressing data, the data is merged and compressed, which simplifies the data compression process, effectively reduces the delay of data compression, improves the data compression efficiency, and reduces the consumption of compression resources (such as the central processing unit (CPU)).
  • When decompressing data, de-merge compression processing is performed on the data, which effectively reduces the delay of data decompression, improves the data decompression efficiency, and reduces the consumption of decompression resources.
  • FIG. 1 is a schematic diagram of an implementation environment involved in a data processing method provided in an embodiment of the present application.
  • the implementation environment includes: a computing device 10.
  • the computing device 10 can obtain multiple data blocks to be compressed, and execute the data processing method provided in an embodiment of the present application on the multiple data blocks to be compressed, and compress the multiple data blocks to be compressed.
  • the computing device 10 can obtain a target data set to be decompressed, and execute the data processing method provided in an embodiment of the present application on the target data set, and decompress the target data set.
  • the data processing method provided in the embodiment of the present application can be implemented by running an executable program on the computing device 10.
  • the executable program of the data processing method can be presented in the form of an application installation package. After the application installation package is installed in the computing device 10, the data processing method can be implemented by running the executable program.
  • the computing device 10 can be a terminal.
  • the terminal can be a computer, a personal computer, a portable mobile terminal, a multimedia player, an e-book reader, or a wearable device, etc.
  • FIG. 2 is a schematic diagram of an implementation environment involved in another data processing method provided in an embodiment of the present application.
  • the implementation environment may include: a client 01 and a storage system 02.
  • the storage system 02 is used to store data and execute the data processing method provided in an embodiment of the present application.
  • the client 01 can establish a communication connection with the storage system 02.
  • a communication connection can be established between the client 01 and the storage system 02 via a network.
  • the network can be a local area network, the Internet, or other networks, which are not limited in the embodiment of the present application.
  • the client 01 is used for the user to interact with the storage system 02.
  • the client 01 is used to send instructions to the storage system 02 according to the user's instructions.
  • the client 01 is used to send a data write instruction to the storage system 02 according to the user's instructions to instruct to write data to the storage system.
  • the client 01 is used to send a data read instruction to the storage system 02 according to the user's instructions to instruct to read data from the storage system.
  • the client 01 may be a desktop computer, a laptop computer, a mobile phone, a smart phone, a tablet computer, a multimedia player, a smart home appliance, an artificial intelligence device, a smart wearable device, an e-reader, a smart vehicle-mounted device, or an Internet of Things device, etc.
  • the storage system 02 is used to receive instructions sent by the client 01 and perform the operations indicated by the instructions.
  • the storage system 02 provided in the embodiment of the present application needs to compress the data before storing the data, and then store the compressed data on the storage medium.
  • When reading data, the storage system 02 first obtains the target data set from the storage medium, then decompresses the target data set, obtains the data block to be read from the decompressed target data set, and returns the data block.
  • the storage system 02 is used to receive a data write instruction and, according to the data processing method provided in the embodiment of the present application, merge the multiple data blocks to be compressed that are to be written based on the data write instruction, compress the merged data blocks together to obtain a merged and compressed data set, and then store the merged and compressed data set on the storage medium.
  • the storage system 02 is used to receive a data read instruction and, according to the data processing method provided in the embodiment of the present application, obtain the index information of the target data set to be decompressed based on the data read instruction, obtain the target data set based on the index information, perform de-merge compression processing on the target data set to obtain the de-merge-compressed target data set, obtain the target data block from the de-merge-compressed target data set, and return the target data block in response to the data read instruction.
  • the target data set includes one or more data blocks, and the target data set is obtained based on the merge compression process.
  • the storage system 02 can be implemented by a computing device.
  • the computing device can be a server (such as a cloud server).
  • the storage system 02 can be implemented by a server cluster consisting of several servers, or by a cloud computing service center.
  • a large number of basic resources owned by a cloud service provider are deployed in the cloud computing service center.
  • computing resources, storage resources, and network resources are deployed in the cloud computing service center.
  • the cloud computing service center can use this large number of basic resources to implement the data processing method provided in the embodiment of the present application.
  • When the storage system 02 is implemented through a cloud computing service center, users can access the cloud platform through the client 01 and use the storage service provided by the storage system through the cloud platform.
  • the functions implemented by the data processing method provided in the embodiment of the present application can be abstracted into a storage cloud service by the cloud service provider on the cloud platform, and the cloud platform can use the resources in the cloud computing center to provide the storage cloud service to users.
  • the storage cloud service can store the data that needs to be written for the user, or provide the stored data to the user.
  • the cloud platform can be a cloud platform of a central cloud, a cloud platform of an edge cloud, or a cloud platform including a central cloud and an edge cloud, and the embodiment of the present application does not specifically limit it.
  • the storage system 02 can also be implemented by resource platforms other than the cloud platform, which is not specifically limited in the embodiment of the present application. In this case, the storage system 02 can be implemented by resources in those other resource platforms and provide relevant storage services to users.
  • FIG. 3 is a schematic diagram of the architecture of a storage system provided in an embodiment of the present application.
  • the storage system may be a distributed storage system.
  • the storage system includes: a service layer, an index layer, and a persistence layer.
  • the service layer is used to provide users with unified interface protocol services.
  • the services provided by the service layer may include: elastic volume service (EVS, also known as cloud hard disk), object storage service (OBS), scalable file service (SFS), data lake insight (DLI) service, and data warehouse service (DWS).
  • the service layer is also configured with a cache layer.
  • the index layer is used to provide metadata management services for the distributed storage system.
  • the index layer can run the database (DB) and perform deduplication and compression processing, and interact with the service layer through an object index and a file index.
  • the database can be a key-value database (key-value data base, KVDB).
  • the persistence layer is used to provide persistent storage services for the distributed storage system.
  • the persistence layer can achieve write optimization and read optimization through the ishard mode and the PLOG mode.
  • ishard mode and PLOG mode can share a storage pool.
  • the storage pool can be a data function virtualization (DFV) storage pool.
  • the storage medium of the storage pool can be non-volatile memory (NVM), solid state drive (SSD), hard disk drive (HDD) and optical storage media.
  • the above content is an exemplary description of the application scenario of the data processing method provided in the embodiment of the present application, and does not constitute a limitation on the application scenario of the data processing method. It is known to those of ordinary skill in the art that as business needs change, its application scenario can be adjusted according to application needs.
  • the data processing method provided in the embodiment of the present application can also be applied to the field of network transmission: similar data in the data to be transmitted is clustered, and the clustered similar data is then merged and compressed, which reduces the amount of data transmitted over the network and improves the network transmission rate.
  • the data processing method provided in the embodiment of the present application can also be applied to a variety of compression fields, such as the field of image compression, the field of video compression, and the field of special database compression, so as to reduce the delay of data compression and decompression, improve the efficiency of data compression and decompression, and reduce the resource consumption of compression and decompression through the merge compression and decompression functions provided in the embodiment of the present application.
  • the data access process of the storage system includes: a data writing process and a data reading process.
  • the data writing process is first described below, and then the data reading process is described.
  • the data writing process may include the following steps:
  • Step 401 Based on a data writing instruction, obtain a plurality of data blocks to be written.
  • a data write instruction can be sent to the storage system through a client.
  • the data write instruction carries the data to be written to the storage system, or the data write instruction can indicate the data to be written to the storage system, so the storage system can obtain the data to be written based on the data write instruction.
  • the data blocks to be written can be obtained by segmenting the data that the user indicates is to be written to the storage system. For example, the storage system can segment the data to be written according to a fixed length or a non-fixed length to obtain multiple data blocks to be written.
  • Before the storage system stores the data to be written, the data can be deduplicated and compressed, and the storage system usually processes multiple data blocks to be written in one deduplication and compression process.
  • Multiple data blocks to be written compressed in a deduplication and compression process can be regarded as data blocks in the same data set.
  • the storage system can have multiple data sets that need to be deduplicated and compressed, each data set includes multiple data blocks to be written, and multiple data blocks in the same data set can be data blocks with similar data.
  • multiple data blocks to be written in the same data set can come from the same user or from different users, and the embodiments of the present application do not specifically limit it.
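  • As an illustration of step 401, the following minimal sketch segments the data to be written into fixed-length blocks. The function name `split_fixed` and the 8192-byte block size are assumptions made for this example only; the embodiment does not prescribe a block size, and non-fixed-length segmentation is equally possible.

```python
def split_fixed(data: bytes, block_size: int = 8192) -> list:
    """Split data into consecutive blocks of at most block_size bytes.

    Hypothetical helper: the block size is an illustrative assumption.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


# 25,600 bytes of sample data to be written.
payload = bytes(range(256)) * 100
blocks = split_fixed(payload, block_size=8192)

# Three full blocks plus one shorter tail block; joining them restores the input.
print(len(blocks), len(blocks[-1]))
```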
  • Step 402 Based on the multiple data blocks to be written, obtain multiple data blocks to be compressed.
  • the multiple data blocks to be compressed may be data blocks containing similar data. Obtaining multiple data blocks to be compressed based on the multiple data blocks to be written includes: obtaining multiple data blocks to be screened based on the multiple data blocks to be written; and obtaining multiple data blocks to be compressed based on those data blocks to be screened that have similar data.
  • the data to be written may include data that needs to be compressed and data that does not need to be compressed.
  • Obtaining multiple data blocks to be screened includes: screening out, from the multiple data blocks to be written, the data blocks that need to be compressed, to obtain the multiple data blocks to be screened.
  • Obtaining multiple data blocks to be compressed based on the data blocks to be screened that have similar data may include: after the data blocks to be screened that have similar data are identified among the multiple data blocks to be screened, performing deduplication processing on them to obtain the multiple data blocks to be compressed.
  • Performing deduplication processing on the data blocks to be screened that have similar data means: finding identical data blocks among them (for example, by comparing fingerprints), retaining one of the identical data blocks, and deleting the rest.
  • the fingerprint of a data block may be a full-value hash value of the data block.
  • the fingerprint of a data block may be obtained according to a strong hash algorithm.
  • the storage address of the reserved data block in the multiple identical data blocks to be screened can be used to update the index information of the remaining data blocks to be screened in the multiple identical data blocks to be screened.
  • the physical address (PA) of the reserved data block in the multiple identical data blocks to be screened can be used to update the physical address recorded in the key-value pair of the remaining data blocks to be screened in the multiple identical data blocks to be screened.
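  • The deduplication described above can be sketched as follows. SHA-256 stands in for the unspecified strong hash algorithm, and the names `fingerprint` and `deduplicate`, the block identifiers, and the integer physical addresses are all hypothetical; this is a sketch under those assumptions, not the embodiment's implementation.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    """Full-content hash of a block (SHA-256 is an assumed choice)."""
    return hashlib.sha256(block).hexdigest()


def deduplicate(blocks: dict, physical_addr: dict):
    """Keep one copy of each identical block.

    blocks: block id -> content; physical_addr: block id -> physical address.
    Returns (retained blocks, index mapping every block id to the physical
    address of the retained copy).
    """
    retained = {}
    first_seen = {}  # fingerprint -> id of the retained block
    index = {}
    for block_id, content in blocks.items():
        fp = fingerprint(content)
        if fp not in first_seen:
            first_seen[fp] = block_id
            retained[block_id] = content
        # Duplicates point at the physical address of the retained block.
        index[block_id] = physical_addr[first_seen[fp]]
    return retained, index


blocks = {"a": b"x" * 10, "b": b"y" * 10, "c": b"x" * 10}
addresses = {"a": 100, "b": 200, "c": 300}
retained, index = deduplicate(blocks, addresses)
print(sorted(retained), index)
```

  • In this sketch, block "c" is identical to block "a", so only "a" is retained and the index entry of "c" is updated to the physical address of "a".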
  • the data blocks to be screened that have similar data among the multiple data blocks to be screened can be obtained by clustering the multiple data blocks to be screened.
  • the multiple data blocks to be screened can be clustered according to the feature values of the multiple data blocks to be screened, and the data blocks with the same feature values are clustered into one class, and the multiple data blocks in each class obtained by clustering are the data blocks to be screened that have similar data.
  • the feature values of the data blocks can be assembled in a specific manner based on the hash values of multiple groups of fragments of the data blocks.
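  • Clustering by feature value can be sketched as follows. The embodiment only states that the feature value is assembled from hash values of several groups of fragments; the particular scheme here (hashing a short fragment sampled at the start of each of four regions, using SHA-1) is an assumption chosen to keep the example deterministic, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def feature_value(block: bytes, groups: int = 4, fragment_len: int = 8) -> str:
    """Assemble a feature value from hashes of sampled fragments (assumed scheme)."""
    region = max(1, len(block) // groups)
    parts = []
    for g in range(groups):
        fragment = block[g * region: g * region + fragment_len]
        parts.append(hashlib.sha1(fragment).hexdigest()[:8])
    return "".join(parts)


def cluster_by_feature(blocks: dict) -> list:
    """Group block ids whose feature values are equal."""
    clusters = defaultdict(list)
    for block_id, content in blocks.items():
        clusters[feature_value(content)].append(block_id)
    return list(clusters.values())


base = bytes(range(64))
variant = bytearray(base)
variant[12] = 255  # a small change outside the sampled fragments
blocks = {"blk1": base, "blk2": bytes(variant), "blk3": bytes(64)}
clusters = cluster_by_feature(blocks)
print(clusters)
```

  • Here blk1 and blk2 differ in a single byte but share a feature value, so they fall into one class, while blk3 forms its own class.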
  • Step 403 merge multiple data blocks to be compressed.
  • Merging multiple data blocks to be compressed includes: packaging the multiple data blocks to be compressed in a data packet.
  • When the storage system is implemented by multiple computing devices deployed in a distributed manner, and obtaining the data blocks to be compressed and performing the compression processing are carried out on different computing devices, the computing device that obtains the multiple data blocks to be compressed with similar data can package them in one data packet and provide the data packet to the computing device that performs the compression processing. That computing device then uses the data packet directly as the input of the compression algorithm to perform combine compression (CC) processing on the multiple data blocks to be compressed.
  • Step 404 compress the merged multiple data blocks to be compressed to obtain a merged and compressed data set.
  • the multiple data blocks to be compressed can be compressed to obtain a merged and compressed data set.
  • the compression process can be deep compression of the multiple data blocks to be compressed.
  • the coding partition of any data block to be compressed may include: any data block to be compressed, and all the data blocks to be compressed that are arranged before any data block to be compressed in the multiple data blocks to be compressed.
  • the coding partition is used to indicate the range of sample data used to find redundant data of the current data block.
  • the multiple data blocks are data blocks blk1 to data block blk8, and the coding partition of data block blk4 is: data block blk1, data block blk2, data block blk3 and data block blk4.
  • compressing the data block blk4 includes searching for redundant data of the data block blk4 among the data blocks blk1, blk2, blk3 and blk4, and then deeply compressing the data block blk4 according to the redundant data.
  • When the coding partition of a data block is larger, the amount of sample data that can be searched for redundant data of the data block is also larger. Therefore, for the same data block, the larger the coding partition, the greater the reduction rate.
  • Because merge compression enlarges the coding partition that is searched for redundant data when a single data block is compressed, the samples available for finding redundant data are better utilized in both time and space, which helps to further improve the reduction rate of the compression process and thereby further reduce the storage cost.
  • the combined multiple data blocks to be compressed can be combined and compressed based on forward coding.
  • the compression order of the compression algorithm can perfectly match the coding partition of each data block, which is more conducive to further improving the reduction rate while maintaining a low compression/decompression overhead.
  • the compression ratio of the compression algorithm can also be taken into account, to further ensure the compression ratio of the compressed data and thereby keep the storage cost of the storage system under control.
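  • The effect of the enlarged coding partition can be illustrated with a small experiment. In this sketch, Python's `zlib` (DEFLATE) stands in for the unspecified deep compression algorithm and is not the forward coding of the embodiment; the eight blocks are synthetic, nearly identical 2048-byte blocks generated for the example. When the blocks are merged before compression, each block can reference all blocks before it in the stream, which mirrors the coding partition of blk4 covering blk1 to blk4.

```python
import random
import zlib

rng = random.Random(0)
base = bytes(rng.getrandbits(8) for _ in range(2048))


def make_similar(source: bytes, i: int) -> bytes:
    """Return a copy of source with one byte changed (synthetic similar block)."""
    block = bytearray(source)
    block[i * 100] = (block[i * 100] + 1) % 256
    return bytes(block)


blocks = [make_similar(base, i) for i in range(8)]  # blk1 .. blk8

# Separate compression: each block can only find redundancy within itself.
separate_size = sum(len(zlib.compress(b, 9)) for b in blocks)

# Merge compression: one stream, so later blocks can reference earlier ones.
merged = zlib.compress(b"".join(blocks), 9)
merged_size = len(merged)

print(separate_size, merged_size)
```

  • Because the blocks are individually almost incompressible but highly similar to one another, the merged stream is much smaller than the sum of the separately compressed blocks, and a single call to `zlib.decompress` restores all of them.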
  • Step 405 Store the combined and compressed data set on a storage medium.
  • the merged and compressed data set can be saved to disk to make the data set persistent.
  • metadata information of the data set can be added to the data header of the data set, and then the data set with the added metadata information can be stored on the storage medium.
  • Step 406 Update the index information of the storage medium based on the location of the combined and compressed data set on the storage medium.
  • the index information of the storage medium can also be updated based on the position, so that when the data in the data set needs to be read, the corresponding data block can be indexed according to the index information.
  • the index information of the storage medium is used to record the index information of all data stored in the storage system.
  • the index information of the data block to be compressed can indicate not only the physical address of the compressed data, but also the logical index number (logic-index) of the compressed data block.
  • the logical index number can be allocated to the multiple data blocks to be compressed before the compression process is performed on the multiple data blocks to be compressed. For example, after deduplication processing is performed on multiple data blocks to be screened, logical index numbers can be allocated to the data blocks to be screened that remain after the deduplication process.
  • a part of the fingerprint of any data block to be compressed can be intercepted; the key of the index information of that data block is obtained from the intercepted part of the fingerprint, and the value of the index information of that data block is obtained from its physical address and logical index number.
  • the index information further indicates that the data set is obtained through merging and compression. Since the data stored in the storage system may be merged and compressed or not, the index information indicates that the data is obtained through merging and compression, which can distinguish the merged and compressed data from the unmerged and compressed data.
  • the index information of the to-be-screened data blocks retained after deduplication processing can also indicate that the retained to-be-screened data blocks are obtained after deduplication processing. In this way, the deduplication-processed data and the non-deduplication-processed data in the storage system can be distinguished.
  • step 406 may include: based on the index information of the merged and compressed data set recorded by the computing device performing the compression process, updating the index information of the storage medium.
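  • The index information described in step 406 can be sketched as a key-value record. The layout below is an assumption for illustration: the key is the first 8 bytes of a SHA-256 fingerprint, and the value holds the physical address of the data set, the logical index number of the block, and two flags for merge compression and deduplication. The names `IndexValue`, `index_key`, and `build_index` are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexValue:
    physical_address: int    # location of the merged and compressed data set
    logical_index: int       # position of the block inside the data set
    merged_compressed: bool  # data set was obtained through merge compression
    deduplicated: bool       # block was retained after deduplication


def index_key(block: bytes, key_len: int = 8) -> bytes:
    """Key = intercepted prefix of the block fingerprint (assumed length)."""
    return hashlib.sha256(block).digest()[:key_len]


def build_index(blocks: list, set_address: int) -> dict:
    """Build index entries for every block of one merged data set."""
    return {
        index_key(block): IndexValue(set_address, i, True, False)
        for i, block in enumerate(blocks)
    }


index = build_index([b"alpha", b"beta"], set_address=4096)
entry = index[index_key(b"beta")]
print(entry)
```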
  • FIG6 is a schematic diagram of the structure of a storage system provided by an embodiment of the present application.
  • the storage system is a distributed storage system.
  • the storage system may include a metadata management module, a compression module and a persistence module, and the metadata management module, the compression module and the persistence module are implemented by different computing devices.
  • the metadata management module is used to receive the user's input/output (I/O) request and perform the metadata management process based on the I/O request.
  • the compression module is used to merge and compress data and/or to decompress merged and compressed data.
  • the persistence module is used to store the received data and realize the persistence of the data.
  • the metadata management module includes a metadata management unit (such as LunMap) and a metadata processing unit (such as DusClient).
  • the metadata management unit is used to record the metadata of all data stored on the storage medium of the storage system, and the metadata includes the index information of the storage medium.
  • the metadata processing unit is used to dock with the metadata management unit, the compression module and the persistence module, and is responsible for calculating the fingerprint and feature value of each data block, storing the obtained fingerprint and feature value in the persistence module, and is responsible for combining the obtained fingerprint and feature value with other metadata information of the data block and sending it to the compression module.
  • the compression module includes an aggregation unit (such as OpTable), an analysis unit (such as a post data analysis (PDA) unit), a task allocation unit (such as task mag), a compression unit (such as a post data reduction (PDR) unit) and an index information storage unit (such as FPTable).
  • the aggregation unit is used to receive the fingerprint and feature value sent by the metadata processing unit in the metadata management module, and is responsible for aggregating data with the same feature value according to the feature value, obtaining aggregated data with similarity, providing the aggregated data with similarity to the compression unit, and providing the physical address and logical index number of the data with similarity to the compression unit.
  • the analysis unit is used to regularly analyze the aggregated data with similarity provided by the aggregation unit, and assemble the data with similarity scattered throughout the storage system into a similarity data chain.
  • the task allocation unit is used to receive the similar data chain sent by the analysis unit, compose different compression tasks according to the characteristics of the feature value dispersion in the storage system, and allocate the compression task to the compression unit.
  • the compression unit is used to perform deduplication processing and merge compression processing according to the assigned compression task, persist the merged and compressed data set to the persistence module, generate index information of each data block in the merged and compressed data set, and provide the index information to the index information storage unit.
  • Before the compression unit performs the merge compression process on the data, it can also first find identical data on the similar data chain through fingerprint comparison, retain only one copy of the identical data, and then perform the merge compression process on the remaining data.
  • the index information storage unit is used to receive the index information of each data block in the data set sent by the compression unit, persist the index information, and update the index information recorded in the metadata management unit based on it.
  • In the data compression stage, after the computing device obtains multiple data blocks to be compressed, it merges the multiple data blocks to be compressed and then compresses the merged data blocks to obtain a merged and compressed data set.
  • Because the data is compressed by merge compression, the data compression process is simplified, which effectively reduces the delay of data compression, improves data compression efficiency, and reduces the consumption of compression resources (such as CPU).
  • the data reading process is described below. As shown in FIG8 , the data reading process may include the following steps:
  • Step 801 Obtain index information of a target data set to be decompressed based on a data read instruction.
  • the target data set includes one or more data blocks.
  • the target data set is obtained based on a merge compression process.
  • the user can send a data read instruction to the storage system through the client.
  • the data read instruction carries the indication information of the target data block to be read.
  • the storage system can obtain the index information of the target data block and the target data set to which it belongs based on the indication information.
  • the index information of the target data block is used to indicate the storage location of the target data block.
  • the index information of the target data set is used to indicate the storage location of the target data set.
  • Both the main index area and the online index area of the storage medium record index information: the main index area records the first index information of all data stored in the storage medium, the online index area records the second index information, and the first index information is updated based on the second index information.
  • the main index area can be maintained by the metadata management unit, and the online index area can be maintained by the index information storage unit.
  • obtaining the index information of the target data set to be read can include: preferentially obtaining the index information of the target data set from the first index information recorded in the main index area.
  • Since the main index area records the first index information of all data stored in the storage medium, whereas the second index information recorded in the online index area may indicate only part of the data stored in the storage medium, preferentially obtaining the index information of the target data set from the first index information recorded in the main index area ensures that the index information of the target data set can be obtained.
  • When the metadata management module, the compression module and the persistence module are implemented through different computing devices, the metadata management module is the module that receives data reading instructions. Preferentially obtaining the index information of the target data set from the first index information recorded in the main index area therefore avoids accessing the compression module through the metadata management module to obtain the index information, which saves one hop of access, reduces the network overhead of searching for index information, and further reduces the latency of reading stored data.
  • Step 802 Acquire the target data set based on the index information of the target data set.
  • the storage location of the target data set on the storage medium can be obtained according to the index information, and then the target data set can be read from the storage location.
  • For example, the index information can indicate the logical number (plogId) of the storage address of the target data set and the offset (offset) of the target data set within the storage unit indicated by the logical number. The storage location of the target data set on the storage medium can then be determined according to the logical number and the offset, and the target data set can be read from that storage location.
  • If the index information of the target data set is first obtained from the first index information recorded in the main index area, and the target data set cannot be obtained based on the first index information, the index information of the target data set can then be obtained from the second index information recorded in the online index area, and the target data set can be obtained based on that index information. The target data set may be unobtainable based on the first index information because the latest index information of the target data set has not yet been updated from the online index area to the main index area.
  • When the data in the target data set is a data block that has been deduplicated, this situation may occur because the reference block of the deduplicated data block has changed, which changes the physical address of the data corresponding to the deduplicated data block, while the new physical address has not yet been updated to the main index area.
  • When the data in the target data set is a data block that has been merged and compressed, this situation may occur because the merged data block has participated in a new merge compression. Since the storage system saves the result again by append-writing to disk, the physical addresses of all the data change, while the new physical addresses have not yet been updated to the main index area.
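  • The lookup order of steps 801 and 802 can be sketched as follows: the main index area is consulted first, and the online index area is used only when the location from the main index area no longer yields the data set. The dictionaries standing in for the two index areas and the storage medium, the `(plogId, offset)` tuples, and the name `read_data_set` are assumptions for this example.

```python
from typing import Dict, Optional, Tuple

Location = Tuple[int, int]  # (plogId, offset)


def read_data_set(
    set_id: str,
    main_index: Dict[str, Location],
    online_index: Dict[str, Location],
    storage: Dict[Location, bytes],
) -> Optional[bytes]:
    """Try the main index area first, then fall back to the online index area."""
    for index in (main_index, online_index):
        location = index.get(set_id)
        if location is not None and location in storage:
            return storage[location]
    return None


# The data set was rewritten by a new merge compression, so the main index
# area still holds a stale location while the online index area is current.
storage = {(7, 512): b"merged-and-compressed-set"}
main_index = {"set-A": (3, 0)}
online_index = {"set-A": (7, 512)}

result = read_data_set("set-A", main_index, online_index, storage)
print(result)
```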
  • Step 803 Based on the index information of the target data set, obtain the processing type of the data in the target data set.
  • the processing type of the data in the target data set can be obtained based on the index information of the target data set, and then the implementation method of obtaining the target data block from the target data set can be determined according to the processing type.
  • When the processing type of the data in the target data set is merged and compressed, step 804 is executed to perform merge decompression (that is, decompression that reverses the merge compression) on the target data set and obtain the target data block from it. It should be noted that when the data in the target data set has not been deduplicated and/or has not been merged and compressed, the target data block can be obtained directly from the target data set according to the index information of the target data block.
  • the index information can indicate that the data set is obtained through the merge compression process, and whether the data in the target data set is obtained through the merge compression process can be determined based on the index information.
  • For example, the index information includes a field that indicates whether the data is obtained through the merge compression process, with different values of the field indicating the two cases. Whether the data in the target data set has been merged and compressed is then determined from the value of this field.
  • the index information of the to-be-screened data blocks retained after deduplication processing can also indicate that the retained to-be-screened data blocks are obtained after deduplication processing, and whether the data in the target data set has been deduplicated can be determined based on the index information.
  • different values of the field in the index information used to indicate whether the data has been deduplicated are used to indicate whether the data has been deduplicated, and whether the data in the target data set has been deduplicated can be determined based on the value of the field in the index information used to indicate whether the data has been deduplicated.
  • Step 804 When the index information of the target data set indicates that the target data set is obtained through merge compression processing, perform merge decompression on the target data set to obtain a merge-decompressed target data set.
  • the target data set can be merge-decompressed to obtain the merge-decompressed target data set.
  • a decompression method that matches the merge compression of the target data set can be used to perform the merge decompression.
  • For example, when the target data set was compressed based on forward coding, a decompression method that matches the forward coding can be used to perform the merge decompression.
  • Step 805 Based on the index information of the target data block, obtain the target data block from the merge-decompressed target data set, and return the target data block in response to the data read instruction.
  • the target data block can be obtained from the target data set according to the index information and then returned in response to the data read instruction.
  • the index information of the target data block indicates the logical index number of the target data block, and obtaining the target data block from the merge-decompressed target data set based on the index information of the target data block includes: obtaining the target data block from the merge-decompressed target data set based on the logical index number.
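  • Steps 804 and 805 can be sketched as a single decompression followed by a lookup by logical index number. The on-disk layout here is an assumption for illustration: a data header that stores the block count and block lengths as little-endian 32-bit integers, followed by a `zlib` stream that stands in for the unspecified merge compression algorithm. The names `pack_data_set` and `read_block` are hypothetical.

```python
import struct
import zlib


def pack_data_set(blocks: list) -> bytes:
    """Merge and compress blocks; the header records count and block lengths."""
    header = struct.pack(
        f"<I{len(blocks)}I", len(blocks), *(len(b) for b in blocks)
    )
    return header + zlib.compress(b"".join(blocks))


def read_block(data_set: bytes, logical_index: int) -> bytes:
    """Decompress the whole data set once, then slice out one block."""
    (count,) = struct.unpack_from("<I", data_set, 0)
    lengths = struct.unpack_from(f"<{count}I", data_set, 4)
    merged = zlib.decompress(data_set[4 + 4 * count:])
    start = sum(lengths[:logical_index])
    return merged[start:start + lengths[logical_index]]


data_set = pack_data_set([b"aaaa", b"bb", b"cccccc"])
target = read_block(data_set, logical_index=1)
print(target)
```

  • The logical index number alone is enough to locate the target block once the data set has been decompressed, because the header records the length of every block in order.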
  • the metadata management module can obtain the index information of the target data block indicated by the data read request and the target data set to which it belongs from the metadata management unit, and provide the index information to the compression module.
  • the compression module can obtain the target data set from the storage medium according to the index information, perform merge decompression on the target data set, obtain the target data block from the merge-decompressed target data set, and return the target data block to the metadata management module, thereby completing the reading process of the target data block.
  • the metadata management module can obtain the new index information of the target data block and the target data set to which it belongs from the index information storage unit, and then obtain the target data block according to the new index information.
  • In the data decompression stage, after obtaining the index information of the target data set to be decompressed, the computing device obtains the target data set based on the index information of the target data set, and then performs merge decompression on the target data set to obtain the merge-decompressed target data set.
  • the target data set includes one or more data blocks, and the target data set is obtained based on the merging and compression processing.
  • Compared with a decompression process that requires 3N-2 decompression cycles (including the deep decompression of the reference block, the deep decompression of the similar data blocks, and the differential decompression of the similar data), the data processing method provided in the embodiment of the present application can simplify the decompression process to a single decompression. Therefore, the decompression process of the data is effectively simplified.
  • FIG. 10 is a schematic diagram of the structure of a data processing device provided by the embodiment of the present application. Based on the following multiple modules shown in Figure 10, the data processing device shown in Figure 10 can perform all or part of the operations shown in Figure 4 above. It should be understood that the device may include more additional modules than the modules shown or omit some of the modules shown therein, and the embodiment of the present application does not limit this.
  • the data processing device can be configured on a cloud platform. As shown in Figure 10, the data processing device 1000 includes:
  • the acquisition module 1001 is used to acquire multiple data blocks to be compressed.
  • the merging module 1002 is used to merge multiple data blocks to be compressed.
  • the compression module 1003 is used to compress the merged multiple data blocks to be compressed to obtain a merged and compressed data set.
  • multiple data blocks to be compressed are arranged in order, and the encoding partition of any data block to be compressed among the multiple data blocks to be compressed includes: any data block to be compressed, and all data blocks to be compressed that are arranged before any data block to be compressed among the multiple data blocks to be compressed.
  • the compression module 1003 is specifically used to: compress the merged multiple data blocks to be compressed based on forward coding.
  • the acquisition module 1001 is specifically used to: acquire multiple data blocks to be written based on the data writing instruction; and acquire multiple data blocks to be compressed based on the multiple data blocks to be written.
  • the compression module 1003 is further used to: store the data set on a storage medium.
  • the compression module 1003 is further used to: update index information of the storage medium based on the location of the data set on the storage medium.
  • the index information also indicates that the data set is obtained through merging and compression processing.
  • the acquisition module 1001 is specifically used to: acquire a plurality of to-be-screened data blocks; and obtain a plurality of to-be-compressed data blocks based on to-be-screened data blocks having similar data among the plurality of to-be-screened data blocks.
  • the acquisition module 1001 is specifically used to: perform deduplication processing on the data blocks to be screened that have similar data, to obtain multiple data blocks to be compressed.
  • the index information of the data block to be screened that is retained after the deduplication process further indicates that the retained data block to be screened is obtained after the deduplication process.
  • the index information of the data block to be compressed further indicates logical index numbers of multiple data blocks to be compressed.
  • In the data compression stage, after the computing device obtains multiple data blocks to be compressed, it merges the multiple data blocks to be compressed and then compresses the merged data blocks to obtain a merged and compressed data set.
  • Because merge compression is adopted when compressing the data, the data compression process is simplified, the delay of data compression is effectively reduced, the data compression efficiency is improved, and the consumption of compression resources (such as CPU) is reduced.
  • FIG. 11 is a schematic diagram of the structure of a data processing device provided by an embodiment of the present application. Based on the following multiple modules shown in FIG. 11, the data processing device shown in FIG. 11 can perform all or part of the operations shown in FIG. 8 above. It should be understood that the device may include more additional modules than the modules shown or omit some of the modules shown therein, and the embodiment of the present application does not limit this.
  • the data processing device may be configured on a cloud platform. As shown in FIG. 11, the data processing device 1100 includes:
  • the first acquisition module 1101 is used to acquire index information of a target data set to be decompressed.
  • the target data set includes one or more data blocks.
  • the target data set is obtained based on a merge compression process.
  • the second acquisition module 1102 is used to acquire the target data set based on the index information of the target data set.
  • the decompression module 1103 is used to perform merge decompression (that is, decompression of merge-compressed data) on the target data set to obtain a merge-decompressed target data set.
  • the decompression module 1103 is specifically configured to: when the index information of the target data set indicates that the target data set is obtained through merge compression, perform merge decompression on the target data set to obtain the merge-decompressed target data set.
  • both the main index area and the online index area of the storage medium record index information
  • the main index area records first index information of all data stored in the storage medium
  • the online index area records second index information
  • the first index information is updated based on the second index information
  • the first acquisition module 1101 is specifically used to: preferentially obtain the index information of the target data set from the first index information recorded in the main index area.
  • when the target data set cannot be obtained based on the first index information, the first acquisition module 1101 is further configured to: acquire index information of the target data set from the second index information recorded in the online index area.
  • the target data set is obtained based on a data read instruction.
  • the data read instruction indicates to read the target data block in the target data set.
  • the first acquisition module 1101 is further used to acquire the index information of the target data block
  • the second acquisition module 1102 is further used to acquire the target data block from the merge-decompressed target data set based on the index information of the target data block
  • the second acquisition module 1102 is further used to return the target data block in response to the data read instruction.
  • the index information of the target data block indicates a logical index number of the target data block.
  • the second acquisition module 1102 is specifically configured to: obtain the target data block from the merge-decompressed target data set based on the logical index number.
  • in the data decompression stage, after obtaining the index information of the target data set to be decompressed, the computing device obtains the target data set based on that index information and then performs merge decompression on it to obtain the merge-decompressed target data set.
  • the target data set includes one or more data blocks, and the target data set is obtained through merge compression.
  • for example, for a set of N similar data blocks, decompression under the existing similarity-based deduplication and compression scheme requires 3N-2 decompression cycles (including deep decompression of the reference block, deep decompression of the similar data blocks, and delta decompression of the similar data blocks), whereas the data processing device provided in the embodiment of the present application can reduce the decompression process to a single decompression. The decompression process of the data is therefore effectively simplified.
  • the acquisition module 1001, the merging module 1002, the compression module 1003, the first acquisition module 1101, the second acquisition module 1102 and the decompression module 1103 can all be implemented by software, or can be implemented by hardware. Exemplarily, the following takes the acquisition module 1001 as an example to introduce the implementation of the acquisition module 1001. Similarly, the implementation of the merging module 1002, the compression module 1003, the first acquisition module 1101, the second acquisition module 1102 and the decompression module 1103 can refer to the implementation of the acquisition module 1001.
  • the acquisition module 1001 may include code running on a computing instance.
  • the computing instance may include at least one of a physical host (computing device), a virtual machine, and a container. Further, the above-mentioned computing instance may be one or more.
  • the acquisition module 1001 may include code running on multiple hosts/virtual machines/containers. It should be noted that the multiple hosts/virtual machines/containers used to run the code may be distributed in the same region (region) or in different regions.
  • the multiple hosts/virtual machines/containers used to run the code may be distributed in the same availability zone (AZ) or in different AZs, each AZ including one data center or multiple data centers with close geographical locations. A region usually includes multiple AZs.
  • multiple hosts/virtual machines/containers used to run the code can be distributed in the same virtual private cloud (VPC) or in multiple VPCs.
  • a VPC is set up in a region.
  • a communication gateway needs to be set up in each VPC to achieve interconnection between VPCs through the communication gateway.
  • the acquisition module 1001 may include at least one computing device, such as a server, etc.
  • the acquisition module 1001 may also be a device implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD).
  • the PLD may be a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL) or any combination thereof.
  • the multiple computing devices included in the acquisition module 1001 can be distributed in the same region or in different regions.
  • the multiple computing devices included in the acquisition module 1001 can be distributed in the same AZ or in different AZs.
  • the multiple computing devices included in the acquisition module 1001 can be distributed in the same VPC or in multiple VPCs.
  • the multiple computing devices can be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.
  • any of the acquisition module 1001, the merging module 1002, the compression module 1003, the first acquisition module 1101, the second acquisition module 1102 and the decompression module 1103 can be used to execute any step in the data processing method.
  • the steps that the acquisition module 1001, the merging module 1002, the compression module 1003, the first acquisition module 1101, the second acquisition module 1102 and the decompression module 1103 are responsible for implementing can be specified as needed, and the acquisition module 1001, the merging module 1002, the compression module 1003, the first acquisition module 1101, the second acquisition module 1102 and the decompression module 1103 respectively implement different steps in the data processing method to realize the full functions of the data processing device.
  • FIG. 12 is a schematic diagram of the structure of a computing device provided in the embodiment of the present application.
  • the computing device 1200 includes a processor 1201, a memory 1202, a communication interface 1203 and a bus 1204. Among them, the processor 1201, the memory 1202, and the communication interface 1203 are connected to each other through the bus 1204.
  • Processor 1201 may include a general processor and/or a dedicated hardware chip.
  • a general processor may include: a central processing unit (CPU), a microprocessor or a graphics processing unit (GPU).
  • the CPU is, for example, a single-core processor (single-CPU) or a multi-core processor (multi-CPU).
  • a dedicated hardware chip is a hardware module for high-performance processing.
  • a dedicated hardware chip includes at least one of a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or a network processor (NP).
  • Processor 1201 may also be an integrated circuit chip with signal processing capabilities. In the implementation process, some or all of the functions of the data processing method of the present application may be completed by hardware integrated logic circuits in processor 1201 or instructions in software form.
  • the memory 1202 is used to store computer programs, and the computer programs include an operating system 1202a and executable codes (i.e., program instructions) 1202b.
  • the memory 1202 is, for example, a read-only memory or another type of static storage device that can store static information and instructions, a random access memory or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory, a compact disc read-only memory or other optical disc storage (including a compact disc, a laser disc, an optical disc, a digital versatile disc, a Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can carry or store the desired executable code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto.
  • the memory 1202 is used to store an outbound port queue, etc.
  • the memory 1202 is, for example, independent and connected to the processor 1201 via a bus 1204. Or the memory 1202 and the processor 1201 are integrated together.
  • the memory 1202 can store executable code. When the executable code stored in the memory 1202 is executed by the processor 1201, the processor 1201 is used to perform part or all of the functions of the data processing method provided in the embodiment of the present application. For the implementation of the processor 1201 to perform the process, please refer to the relevant description in the aforementioned embodiment.
  • the memory 1202 may also include software modules and data required for other running processes such as an operating system.
  • the communication interface 1203 uses a transceiver module such as, but not limited to, a transceiver to achieve communication with other devices or communication networks.
  • the communication interface 1203 can be any one or any combination of the following devices: a network interface (such as an Ethernet interface), a wireless network card, and other devices with network access functions.
  • the bus 1204 is any type of communication bus for interconnecting the internal devices of the computing device (e.g., the memory 1202, the processor 1201, and the communication interface 1203).
  • the embodiment of the present application takes the interconnection of the above-mentioned devices inside the computing device through the bus 1204 as an example.
  • the above-mentioned devices inside the computing device 1200 can also communicate with each other through connection methods other than the bus 1204.
  • the above-mentioned devices inside the computing device 1200 are interconnected through an internal logical interface.
  • the above-mentioned multiple devices can be respectively arranged on independent chips, or at least partially or completely arranged on the same chip. Whether each device is independently arranged on different chips, or integrated on one or more chips, often depends on the needs of product design.
  • the embodiments of the present application do not limit the specific implementation form of the above-mentioned devices.
  • the descriptions of the processes corresponding to the above-mentioned figures have different focuses. For the parts not described in detail in a certain process, please refer to the relevant descriptions of other processes.
  • all or part of the embodiments may be implemented by software, hardware, firmware, or any combination thereof.
  • all or part of the embodiments may be implemented in the form of a computer program product.
  • the computer program product providing a program development platform includes one or more computer instructions, and when these computer program instructions are loaded and executed on a computing device, all or part of the functions of the data processing method provided in the embodiments of the present application are implemented.
  • computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless (e.g., infrared, wireless, microwave, etc.) means.
  • the computer-readable storage medium stores computer program instructions that provide a program development platform.
  • the embodiment of the present application also provides a computing device cluster.
  • the computing device cluster includes at least one computing device.
  • the computing device can be a server, such as a central server, an edge server, or a local server in a local data center.
  • the computing device can also be a terminal device such as a desktop computer, a laptop computer, or a smart phone.
  • the structure of at least one computing device included in the computing device cluster may refer to the computing device 1200 shown in Fig. 12.
  • the memory 1202 in one or more computing devices 1200 in the computing device cluster may store the same instructions for executing the data processing method.
  • the memory 1202 of one or more computing devices 1200 in the computing device cluster may also store partial instructions for executing the data processing method.
  • the combination of one or more computing devices 1200 may jointly execute instructions for executing the data processing method.
  • the memory 1202 in different computing devices 1200 in the computing device cluster may store different instructions, each used to execute part of the functions of the data processing device. That is, the instructions stored in the memory 1202 in different computing devices 1200 can implement the functions of one or more modules among the acquisition module 1001, the merging module 1002, the compression module 1003, the first acquisition module 1101, the second acquisition module 1102 and the decompression module 1103.
  • one or more computing devices in the computing device cluster can be connected via a network.
  • the network can be a wide area network or a local area network, etc.
  • Figure 13 shows a possible implementation.
  • two computing devices 1300A and 1300B are connected via a network.
  • each computing device connects to the network through its communication interface.
  • computing devices 1300A and 1300B include a bus 1302, a processor 1304, a memory 1306 and a communication interface 1308.
  • the memory 1306 in the computing device 1300A stores instructions for executing the functions of the acquisition module 1001, the merging module 1002 and the compression module 1003.
  • the memory 1306 in the computing device 1300B stores instructions for executing the functions of the first acquisition module 1101, the second acquisition module 1102 and the decompression module 1103.
  • the function of the computing device 1300A shown in FIG. 13 may also be completed by multiple computing devices 1300.
  • the function of the computing device 1300B may also be completed by multiple computing devices 1300.
  • the deployment mode of the modules used to implement the data processing method in the computing device may also be adjusted according to the application requirements.
  • An embodiment of the present application also provides a computer-readable storage medium, which is a non-volatile computer-readable storage medium.
  • the computer-readable storage medium includes program instructions. When the program instructions are executed on a computing device, the computing device implements a data processing method as provided in the embodiment of the present application.
  • the embodiment of the present application also provides a computer program product including instructions.
  • when the computer program product is executed on a computer, the computer implements the data processing method provided by the embodiment of the present application.
  • the information (including but not limited to user device information and user personal information), data (including but not limited to data used for analysis, stored data and displayed data) and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
  • the original data and executable code involved in this application are all obtained with full authorization.
  • the terms “first”, “second” and “third” are used for descriptive purposes only and should not be understood as indicating or implying relative importance.
  • the term “at least one” means one or more, and the term “plurality” means two or more, unless otherwise expressly defined.
  • the term "and/or" describes three possible relationships; for example, "A and/or B" can represent: A exists alone, A and B exist at the same time, and B exists alone.
  • the character "/" in this article generally indicates that the associated objects before and after are in an "or" relationship.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

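This application discloses a data processing method and apparatus, belonging to the field of storage technology. The method includes: obtaining multiple data blocks to be compressed; merging the multiple data blocks to be compressed; and compressing the merged data blocks to obtain a merge-compressed data set. By merging data before compressing it, this application simplifies the data compression process, effectively reduces data compression latency, and improves data compression efficiency.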

Description

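Data processing method and apparatus
This application claims priority to Chinese patent application No. 202211462945.6, filed on November 21, 2022 and entitled "Data compression and decompression method and apparatus", and to Chinese patent application No. 202310212957.1, filed on March 7, 2023 and entitled "Data processing method and apparatus", both of which are incorporated herein by reference in their entirety.
Technical Field
This application relates to the field of storage technology, and in particular to a data processing method and apparatus.
Background
Data deduplication and compression is the most effective and most direct way to reduce storage cost. Deduplication and compression based on similar-data clustering gathers similar data that is distributed across time and space in the system, which helps improve the system's data reduction ratio.
Currently, data deduplication and compression can use feature-value sampling to cluster similar data whose distribution in time and space is highly scattered and uneven, and then compress the data with a layered data reduction technique. On the one hand, the layered technique looks for identical data within the clustered similar data and deduplicates it; on the other hand, it applies a delta compression scheme to the clustered similar data to obtain delta blocks, and then compresses the delta blocks with a deep compression algorithm to further improve the reduction ratio.
However, the flow of this deduplication and compression technique is redundant, which leads to high compression and decompression latency.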
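Summary
This application provides a data processing method and apparatus that simplify the data compression process, effectively reduce data compression latency, and improve data compression efficiency. The technical solutions provided by this application are as follows:
In a first aspect, this application provides a data processing method. The method includes: obtaining multiple data blocks to be compressed; merging the multiple data blocks to be compressed; and compressing the merged data blocks to obtain a merge-compressed data set.
When data is compressed with this method, the multiple data blocks to be compressed are obtained and merged, and the merged data blocks are then compressed together (merge compression) to obtain a merge-compressed data set. Because the data is merged before it is compressed, the compression process is simplified, compression latency is effectively reduced, compression efficiency is improved, and the consumption of compression resources (such as CPU) is lowered.
In one implementation, the multiple data blocks to be compressed are arranged in order, and the coding partition of any one of them includes that data block and all data blocks to be compressed that are ordered before it. This enlarges the amount of data in the coding partition within which redundant data is sought when a single data block is compressed. The samples used to find redundancy are therefore better utilized across time and space, which helps further improve the reduction ratio of the compression and further lower storage cost.
Optionally, compressing the merged data blocks includes: compressing the merged data blocks based on forward coding. In this way, the order in which the compression algorithm processes the data closely matches the coding partition of each data block, which helps further improve the reduction ratio while keeping compression and decompression overhead low. In addition, the compression ratio of a candidate compression algorithm can be taken into account when the algorithm is selected, to further safeguard the compression ratio of the compressed data and thus the storage cost of the storage system.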
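The data blocks to be compressed may be obtained based on a data write instruction. For example, in the storage field, multiple data blocks to be written may be obtained based on a data write instruction, and the multiple data blocks to be compressed may be obtained based on the data blocks to be written. In that case, after the merge-compressed data set is obtained, the method further includes: storing the data set on a storage medium.
After the position on the storage medium of each data block in the merge-compressed data set is obtained, the index information of the storage medium may be updated based on that position, so that the corresponding data block can be located through the index information when data in the data set needs to be read. The index information of the storage medium records the index information of all data stored in the storage system.
In one implementation, the index information of a data block to be compressed may indicate not only the physical address of the compressed data but also the logical index number (logic-index) of the compressed data block. The logical index numbers may be assigned to the multiple data blocks to be compressed before those blocks are compressed.
Optionally, the index information further indicates that the data set was obtained through merge compression. Data stored in the storage system may or may not have undergone merge compression; having the index information record that data was merge-compressed makes it possible to distinguish the two.
Similarly, the index information of a data block to be screened that is retained after deduplication may further indicate that the retained block was obtained through deduplication. This makes it possible to distinguish data in the storage system that has undergone deduplication from data that has not.
In one implementation, the multiple data blocks to be compressed may be data blocks that contain similar data. In that case, obtaining the multiple data blocks to be compressed based on the multiple data blocks to be written includes: obtaining multiple data blocks to be screened based on the data blocks to be written; and obtaining the multiple data blocks to be compressed based on those data blocks to be screened that contain similar data.
In a storage scenario, the data to be written may include data that needs compression and data that does not. In that case, obtaining multiple data blocks to be screened based on the data blocks to be written includes: selecting, from the data blocks to be written, the data blocks that need compression, to obtain the multiple data blocks to be screened.
Optionally, obtaining the multiple data blocks to be compressed based on the data blocks to be screened that contain similar data may include: after the data blocks to be screened that contain similar data are selected from the multiple data blocks to be screened, performing deduplication on them to obtain the multiple data blocks to be compressed.
In this case, the index information of a data block to be screened that is retained after deduplication further indicates that the retained block was obtained through deduplication.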
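In a second aspect, this application provides a data processing method. The method includes: obtaining index information of a target data set to be decompressed, where the target data set includes one or more data blocks and was obtained through merge compression; obtaining the target data set based on its index information; and performing merge decompression (that is, decompression of merge-compressed data) on the target data set to obtain a merge-decompressed target data set.
When data is decompressed with this method, the index information of the target data set to be decompressed is obtained first, the target data set is then obtained based on that index information, and merge decompression is performed on the target data set to obtain the merge-decompressed target data set. The target data set includes one or more data blocks and was obtained through merge compression. Because the data is decompressed by merge decompression, decompression latency is effectively reduced, decompression efficiency is improved, and the resource consumption of decompression is lowered.
Optionally, the index information of the target data set further indicates the processing type of the data in the target data set. In that case, performing merge decompression on the target data set includes: when the index information indicates that the target data set was obtained through merge compression, performing merge decompression on it to obtain the merge-decompressed target data set. In one implementation, the target data set may be merge-decompressed with the decompression method that matches the method used to merge-compress it. For example, when the target data set was merge-compressed based on forward coding, a decompression method matching forward coding may be used. Moreover, if, when any one of the ordered data blocks to be compressed was compressed, its coding partition included that data block and all data blocks to be compressed ordered before it, then a single decompression pass is enough to decompress the whole target data set.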
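In one implementation, the target data set is stored on a storage medium. Both a main index area and an online index area of the storage medium record index information: the main index area records first index information of all data stored on the storage medium, the online index area records second index information, and the first index information is updated based on the second index information. Obtaining the index information of the target data set to be decompressed includes: preferentially obtaining it from the first index information recorded in the main index area. Doing so avoids having the metadata management module access the compression module to obtain the index information, which saves one access hop, reduces the network overhead of index lookups, and thus further lowers the latency of reading stored data.
If the index information of the target data set is preferentially sought in the first index information recorded in the main index area and the target data set cannot be obtained based on the first index information, the index information may then be obtained from the second index information recorded in the online index area, and the target data set obtained according to it.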
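The lookup order described above can be sketched in a few lines of Python. This is a minimal illustration rather than the applicant's implementation: the two index areas are modeled as plain dictionaries, and the names `lookup_index_info` and `sync_main_index` are hypothetical.

```python
def lookup_index_info(key, main_index, online_index):
    """Look in the main index area first; fall back to the online index area.

    Both areas are modeled as dicts (an assumption made for illustration).
    Returns (index_info, source), or (None, None) when neither area has the key.
    """
    info = main_index.get(key)
    if info is not None:
        return info, "main"
    info = online_index.get(key)
    if info is not None:
        return info, "online"
    return None, None


def sync_main_index(main_index, online_index):
    """Update the first index information from the second index information."""
    main_index.update(online_index)


main_area = {"set-1": 4096}
online_area = {"set-2": 8192}
print(lookup_index_info("set-1", main_area, online_area))
print(lookup_index_info("set-2", main_area, online_area))
```

Once `sync_main_index` has run, a later lookup of the same key is served from the main area, which is the case that saves the extra access hop.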
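The target data set may be obtained based on a data read instruction, as is the case, for example, in the storage field.
Correspondingly, the data read instruction indicates that a target data block in the target data set is to be read, and the method further includes: obtaining the index information of the target data block; obtaining the target data block from the merge-decompressed target data set based on that index information; and returning the target data block in response to the data read instruction.
In one implementation, the index information of the target data block indicates the logical index number of the target data block, and obtaining the target data block from the merge-decompressed target data set based on its index information includes: obtaining the target data block from the merge-decompressed target data set based on the logical index number.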
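A minimal sketch of this read path follows, under stated assumptions: zlib stands in for the unspecified compression algorithm, the per-block lengths are kept alongside the compressed stream, and the function names are hypothetical. It shows one decompression call for the whole set followed by a slice selected by logical index number.

```python
import zlib
from itertools import accumulate


def merge_compress(blocks):
    """Concatenate the blocks and compress them as one stream.

    Returns the compressed stream and the per-block lengths needed to
    locate a block again after decompression.
    """
    return zlib.compress(b"".join(blocks)), [len(b) for b in blocks]


def read_block(compressed, lengths, logic_index):
    """Decompress the whole set once, then slice out one block."""
    if not 0 <= logic_index < len(lengths):
        raise IndexError("logical index number out of range")
    data = zlib.decompress(compressed)      # single decompression pass
    offsets = [0, *accumulate(lengths)]     # start offset of every block
    return data[offsets[logic_index]:offsets[logic_index + 1]]


stream, block_lengths = merge_compress([b"alpha", b"be", b"gamma"])
print(read_block(stream, block_lengths, 1))
```

Storing lengths rather than offsets is only one possible layout; any metadata that maps a logical index number to a byte range would serve the same purpose.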
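In a third aspect, this application provides a data processing apparatus. The apparatus includes: an acquisition module, used to obtain multiple data blocks to be compressed; a merging module, used to merge the multiple data blocks to be compressed; and a compression module, used to compress the merged data blocks to obtain a merge-compressed data set.
Optionally, the multiple data blocks to be compressed are arranged in order, and the coding partition of any one of them includes that data block and all data blocks to be compressed that are ordered before it.
Optionally, the compression module is specifically used to: compress the merged data blocks based on forward coding.
Optionally, the acquisition module is specifically used to: obtain multiple data blocks to be written based on a data write instruction; and obtain the multiple data blocks to be compressed based on the data blocks to be written.
Optionally, the compression module is further used to: store the data set on a storage medium.
Optionally, the compression module is further used to: update the index information of the storage medium based on the position of the data set on the storage medium.
Optionally, the index information further indicates that the data set was obtained through merge compression.
Optionally, the acquisition module is specifically used to: obtain multiple data blocks to be screened; and obtain the multiple data blocks to be compressed based on those data blocks to be screened that contain similar data.
Optionally, the acquisition module is specifically used to: perform deduplication on the data blocks to be screened that contain similar data, to obtain the multiple data blocks to be compressed.
Optionally, the index information of a data block to be screened that is retained after deduplication further indicates that the retained block was obtained through deduplication.
Optionally, the index information of the data blocks to be compressed further indicates the logical index numbers of the multiple data blocks to be compressed.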
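In a fourth aspect, this application provides a data processing apparatus. The apparatus includes: a first acquisition module, used to obtain index information of a target data set to be decompressed, where the target data set includes one or more data blocks and was obtained through merge compression; a second acquisition module, used to obtain the target data set based on its index information; and a decompression module, used to perform merge decompression on the target data set to obtain a merge-decompressed target data set.
Optionally, the decompression module is specifically used to: when the index information of the target data set indicates that the target data set was obtained through merge compression, perform merge decompression on the target data set to obtain the merge-decompressed target data set.
Optionally, the target data set is stored on a storage medium; both a main index area and an online index area of the storage medium record index information; the main index area records first index information of all data stored on the storage medium; the online index area records second index information; the first index information is updated based on the second index information; and the first acquisition module is specifically used to: preferentially obtain the index information of the target data set from the first index information recorded in the main index area.
Optionally, when the target data set cannot be obtained based on the first index information, the first acquisition module is further used to: obtain the index information of the target data set from the second index information recorded in the online index area.
Optionally, the target data set is obtained based on a data read instruction.
Optionally, the data read instruction indicates that a target data block in the target data set is to be read. In that case, the first acquisition module is further used to obtain the index information of the target data block; the second acquisition module is further used to obtain the target data block from the merge-decompressed target data set based on the index information of the target data block; and the second acquisition module is further used to return the target data block in response to the data read instruction.
Optionally, the index information of the target data block indicates the logical index number of the target data block, and the second acquisition module is specifically used to: obtain the target data block from the merge-decompressed target data set based on the logical index number.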
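In a fifth aspect, this application provides a computing device including a memory and a processor. The memory stores program instructions, and the processor runs the program instructions to perform the method provided in the first or second aspect of this application or any possible implementation thereof.
In a sixth aspect, this application provides a computer cluster including multiple computing devices. The multiple computing devices include multiple processors and multiple memories, the memories store program instructions, and the processors run the program instructions so that the computer cluster performs the method provided in the first or second aspect of this application or any possible implementation thereof.
In a seventh aspect, this application provides a computer-readable storage medium, which is a non-volatile computer-readable storage medium. The computer-readable storage medium includes program instructions that, when run on a computing device, cause the computing device to perform the method provided in the first or second aspect of this application or any possible implementation thereof.
In an eighth aspect, this application provides a computer program product containing instructions that, when run on a computer, cause the computer to perform the method provided in the first or second aspect of this application or any possible implementation thereof.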
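Brief Description of the Drawings
FIG. 1 is a schematic diagram of an implementation environment involved in a data processing method provided by an embodiment of this application;
FIG. 2 is a schematic diagram of an implementation environment involved in another data processing method provided by an embodiment of this application;
FIG. 3 is a schematic diagram of the architecture of a storage system provided by an embodiment of this application;
FIG. 4 is a flowchart of data writing implemented by a data processing method provided by an embodiment of this application;
FIG. 5 is a schematic diagram of the coding partition of a data block provided by an embodiment of this application;
FIG. 6 is a schematic diagram of the structure of a storage system provided by an embodiment of this application;
FIG. 7 is a schematic diagram of a data writing process provided by an embodiment of this application;
FIG. 8 is a flowchart of data reading implemented by a data processing method provided by an embodiment of this application;
FIG. 9 is a schematic diagram of a data reading process provided by an embodiment of this application;
FIG. 10 is a schematic diagram of a data processing apparatus provided by an embodiment of this application;
FIG. 11 is a schematic diagram of another data processing apparatus provided by an embodiment of this application;
FIG. 12 is a schematic diagram of the structure of a computing device provided by an embodiment of this application;
FIG. 13 is a schematic diagram of the structure of a computing device cluster provided by an embodiment of this application.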
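Detailed Description
To make the objectives, technical solutions and advantages of this application clearer, the implementations of this application are described in further detail below with reference to the accompanying drawings.
As the scale of storage systems (such as cloud storage) keeps growing, and as some new applications and services place higher demands on performance, storage systems are changing markedly in two respects. First, storage capacity is expanding rapidly, so storage cost must be reduced to keep the storage system competitive. Second, applications and services with higher performance requirements need better-performing storage devices, which multiplies storage cost. Both changes call for effective ways to reduce the storage cost of the storage system and thereby strengthen its core competitiveness.
Data deduplication and compression is a relatively effective and direct way to reduce storage cost. It has two key stages: data deduplication and data compression. Together they reduce the redundant data in the storage system, shrink the data, and can substantially lower the storage cost of the whole storage system. In some scenarios, the data the storage system has to process contains a great deal of similar data, and that similar data is distributed in a highly scattered and uneven way across time and space. With this in mind, data deduplication and compression can use feature-value sampling to cluster such similar data and then compress it with a layered data reduction technique. On the one hand, the layered technique looks for identical data within the clustered similar data and deduplicates it; on the other hand, it applies a delta compression scheme to the clustered similar data to obtain delta blocks, and then compresses the delta blocks with a deep compression algorithm to further improve the reduction ratio.
However, the overall design of this deduplication and compression technique has many shortcomings and considerable room for improvement. For example, during compression, only the reference block used for delta compression goes through a single compression pass; every other data block goes through two passes, delta compression followed by deep compression, which makes the compression flow complex and redundant. Decompression is even more involved: apart from the reference block, which needs only one decompression pass, every data block requires three decompression passes, one to decompress the reference block and two to decompress the similar data block itself. This redundancy increases compression and decompression latency and leaves considerable room to improve the performance of the compression and decompression system.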
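The per-block counts above can be added up for a cluster of N similar data blocks, one of which serves as the delta-compression reference block. The symbols C_old and D_old are introduced here only for illustration, and the decompression total assumes that every block in the cluster is read back.

```latex
\begin{aligned}
C_{\text{old}} &= \underbrace{1}_{\text{reference block}}
                + \underbrace{2\,(N-1)}_{\text{delta + deep, each remaining block}}
                = 2N - 1,\\
D_{\text{old}} &= \underbrace{1}_{\text{reference block}}
                + \underbrace{3\,(N-1)}_{\text{each remaining block}}
                = 3N - 2.
\end{aligned}
```

For N = 8, for instance, this gives 15 compression passes and 22 decompression passes, compared with a single pass each under merge compression and merge decompression.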
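An embodiment of this application provides a data processing method. In the data compression stage, after the computing device obtains multiple data blocks to be compressed, it merges them and then compresses the merged data blocks (also called merge compression) to obtain a merge-compressed data set. In the data decompression stage, after the computing device obtains the index information of a target data set to be decompressed, it obtains the target data set based on that index information and then performs merge decompression on the target data set to obtain the merge-decompressed target data set. The target data set includes one or more data blocks and was obtained through merge compression.
Because data is merged before it is compressed, the compression process is simplified, compression latency is effectively reduced, compression efficiency is improved, and the consumption of compression resources (such as the central processing unit (CPU)) is lowered. Similarly, because data is decompressed by merge decompression, decompression latency is effectively reduced, decompression efficiency is improved, and the resource consumption of decompression is lowered.
The implementation environment involved in a data processing method provided by an embodiment of this application is described first.
FIG. 1 is a schematic diagram of an implementation environment involved in a data processing method provided by an embodiment of this application. As shown in FIG. 1, the implementation environment includes a computing device 10. The computing device 10 may obtain multiple data blocks to be compressed and apply the data processing method provided by the embodiment of this application to compress them. Alternatively, the computing device 10 may obtain a target data set to be decompressed and apply the data processing method provided by the embodiment of this application to decompress it.
In one implementation, the data processing method provided by the embodiment of this application may be implemented by the computing device 10 running an executable program. For example, the executable program may be delivered as an application installation package; after the package is installed on the computing device 10, the data processing method can be carried out by running the executable program. In this case, the computing device 10 may be a terminal, such as a computer, a personal computer, a portable mobile terminal, a multimedia player, an e-book reader or a wearable device.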
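FIG. 2 is a schematic diagram of an implementation environment involved in another data processing method provided by an embodiment of this application. As shown in FIG. 2, the implementation environment may include a client 01 and a storage system 02. The storage system 02 is used to store data and to perform the data processing method provided by the embodiment of this application. The client 01 can establish a communication connection with the storage system 02, for example over a network. Optionally, the network may be a local area network, the Internet, or another network, which is not limited in the embodiments of this application.
The client 01 is used by a user to interact with the storage system 02. In one implementation, the client 01 sends instructions to the storage system 02 as directed by the user. For example, the client 01 sends a data write instruction to the storage system 02 to indicate that data is to be written to the storage system. As another example, the client 01 sends a data read instruction to the storage system 02 to indicate that data is to be read from the storage system.
In one implementation, the client 01 may be a desktop computer, a laptop computer, a mobile phone, a smartphone, a tablet computer, a multimedia player, a smart home appliance, an artificial intelligence device, a smart wearable device, an e-reader, a smart in-vehicle device, an Internet of Things device, or the like.
The storage system 02 is used to receive instructions sent by the client 01 and perform the operations they indicate. In one implementation, before storing data, the storage system 02 first compresses the data and then stores the compressed data on a storage medium. Correspondingly, before reading data, the storage system 02 first obtains the target data set from the storage medium, decompresses it, obtains the data block to be read from the decompressed target data set, and returns that data block. For example, the storage system 02 receives a data write instruction and, following the data processing method provided by the embodiment of this application, merges the multiple data blocks to be compressed that are to be written, compresses the merged data blocks to obtain a merge-compressed data set, and stores the merge-compressed data set on the storage medium. As another example, the storage system 02 receives a data read instruction and, following the data processing method provided by the embodiment of this application, obtains the index information of the target data set to be decompressed based on the data read instruction, obtains the target data set based on the index information, performs merge decompression on it to obtain the merge-decompressed target data set, obtains the target data block from the merge-decompressed target data set, and returns the target data block in response to the data read instruction. The target data set includes one or more data blocks and was obtained through merge compression.
In one implementation, the storage system 02 may be implemented by a computing device, and the computing device may be a server (such as a cloud server). Typically, the storage system 02 may be implemented by a server cluster made up of several servers or by a cloud computing service center. A cloud computing service center hosts a large pool of basic resources owned by the cloud service provider, such as computing resources, storage resources and network resources, and can use these resources to implement the data processing method provided by the embodiment of this application.
When the storage system 02 is implemented by a cloud computing service center, the user can access the cloud platform through the client 01 and use the storage service provided by the storage system through the cloud platform. In this case, the functions implemented by the data processing method provided by the embodiment of this application can be abstracted by the cloud service provider into a storage cloud service on the cloud platform, and the cloud platform can use the resources of the cloud computing center to offer that service to users. After purchasing the storage cloud service on the cloud platform, the user can have the service store the data that needs to be written or provide the data that has already been stored. Optionally, the cloud platform may be a central-cloud platform, an edge-cloud platform, or a platform that includes both a central cloud and an edge cloud, which is not specifically limited in the embodiments of this application.
It should be noted that, in the implementation environment shown in FIG. 2, the storage system 02 may also be implemented on resource platforms other than a cloud platform, which is not specifically limited in the embodiments of this application. In that case, the storage system 02 can be implemented with the resources of the other resource platform and provide the related storage services to users.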
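In one implementation, FIG. 3 is a schematic diagram of the architecture of a storage system provided by an embodiment of this application. Optionally, the storage system may be a distributed storage system. As shown in FIG. 3, the storage system includes a service layer, an index layer and a persistence layer.
The service layer provides users with unified interface protocol services. The services it provides may include: elastic volume service (EVS, also called cloud disk), object storage service (OBS), scalable file service (SFS), data lake insight (DLI) service, and data warehouse service (DWS). To safeguard service performance, the service layer is also configured with a cache layer.
The index layer provides metadata management services for the distributed storage system. The index layer can run a database (DB) and perform deduplication and compression, and it interacts with the service layer through an object index and a file index. The database may be a key-value database (KVDB).
The persistence layer provides persistent storage services for the distributed storage system. The persistence layer can achieve write optimization (write-optimized) and read optimization (read-optimized) through an ishard mode and a PLOG mode. The ishard mode and the PLOG mode may share a storage pool. The storage pool may be a data function virtualisation (DFV) storage pool, and its storage media may be non-volatile memory (NVM), solid state drives (SSD), hard disk drives (HDD), optical storage media, and the like.
It should be understood that the above is an exemplary description of application scenarios of the data processing method provided by the embodiment of this application and does not limit those scenarios. A person of ordinary skill in the art will appreciate that, as business needs change, the application scenario can be adjusted accordingly. For example, the data processing method can also be applied to network transmission: similar data can be clustered within the data to be transmitted and the clustered similar data merge-compressed, so as to reduce the amount of data transmitted over the network and increase the transmission rate. The data processing method can also be applied to various compression fields, such as image compression, video compression and special-purpose database compression, so that the merge compression and merge decompression functions provided by the embodiment of this application reduce compression and decompression latency, improve compression and decompression efficiency, and lower the resource consumption of compression and decompression.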
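The implementation of the data processing method provided by the embodiment of this application is described below, taking its application to the scenario shown in FIG. 2 as an example. The data access process of the storage system includes a data write process and a data read process. The data write process is described first, followed by the data read process. As shown in FIG. 4, the data write process may include the following steps:
Step 401: Obtain multiple data blocks to be written based on a data write instruction.
When a user needs to write data to the storage system, the user can send a data write instruction to the storage system through the client. The data write instruction either carries the data to be written to the storage system or indicates which data needs to be written, so the storage system can obtain the data to be written based on the data write instruction. The data blocks to be written may be obtained by the storage system splitting the data that the user has asked to write. For example, the storage system may split the data to be written into fixed-length or variable-length pieces to obtain multiple data blocks to be written. Before storing the data to be written, the storage system may perform deduplication and compression on it, and it usually processes multiple data blocks to be written in a single deduplication and compression pass. The data blocks to be written that are compressed in one such pass can be regarded as belonging to the same data set. The storage system may have multiple data sets that need deduplication and compression; each data set includes multiple data blocks to be written, and the data blocks in the same data set may be blocks that contain similar data. In addition, the data blocks to be written in the same data set may come from the same user or from different users, which is not specifically limited in the embodiments of this application.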
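The fixed-length case of the splitting step can be sketched as follows. This is a minimal illustration in Python; the function name `split_into_blocks` and the default block size are assumptions, not values taken from the application.

```python
from __future__ import annotations


def split_into_blocks(data: bytes, block_size: int = 8192) -> list[bytes]:
    """Cut the data to be written into fixed-length blocks.

    The last block may be shorter than block_size. The default size is an
    arbitrary value chosen for illustration.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


blocks = split_into_blocks(b"a" * 10, block_size=4)
print([len(b) for b in blocks])
```

Variable-length splitting would replace the fixed stride with content-defined boundaries; it is not shown here because the application does not describe how those boundaries are chosen.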
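Step 402: Obtain multiple data blocks to be compressed based on the multiple data blocks to be written.
In one implementation, the multiple data blocks to be compressed may be data blocks that contain similar data. In that case, obtaining the multiple data blocks to be compressed based on the multiple data blocks to be written includes: obtaining multiple data blocks to be screened based on the data blocks to be written; and obtaining the multiple data blocks to be compressed based on those data blocks to be screened that contain similar data.
In a storage scenario, the data to be written may include data that needs compression and data that does not. In that case, obtaining multiple data blocks to be screened based on the data blocks to be written includes: selecting, from the data blocks to be written, the data blocks that need compression, to obtain the multiple data blocks to be screened.
Optionally, obtaining the multiple data blocks to be compressed based on the data blocks to be screened that contain similar data may include: after the data blocks to be screened that contain similar data are selected from the multiple data blocks to be screened, performing deduplication on them to obtain the multiple data blocks to be compressed. Performing deduplication on the data blocks to be screened that contain similar data means: among those blocks, finding the blocks that are completely identical, retaining one of the identical blocks, and deleting the rest. When multiple data blocks to be screened have the same fingerprint, they can be determined to be completely identical. The fingerprint of a data block may be a hash of the block's full content and may be computed with a strong hash algorithm. Moreover, every data block processed by the storage system is first stored in the storage system's database, so data blocks to be written, data blocks to be screened and data blocks to be compressed are all stored there in advance, and the storage system records the index information of each block. After the remaining identical blocks are deleted, the storage address of the retained block can be used to update the index information of the deleted blocks. For example, when the storage system records the index information of data blocks as key-value pairs, the physical address (PA) of the retained block can be used to update the physical address recorded in the value of the key-value pair of each deleted block.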
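The deduplication step above can be sketched as follows, under stated assumptions: SHA-256 is used as the strong hash (the application does not name one), physical addresses are plain integers, and the function names are hypothetical. Blocks with equal fingerprints are collapsed to one retained copy, and every original position is pointed at the retained copy's physical address.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    """Full-content hash of a block; SHA-256 is an assumed choice."""
    return hashlib.sha256(block).hexdigest()


def deduplicate(blocks, physical_addresses):
    """Keep one copy of each identical block and redirect the index.

    Returns (kept_blocks, index), where index maps each original position
    to the physical address of the retained copy.
    """
    retained = {}   # fingerprint -> physical address of the retained block
    kept = []
    index = {}
    for position, (block, pa) in enumerate(zip(blocks, physical_addresses)):
        fp = fingerprint(block)
        if fp not in retained:
            retained[fp] = pa
            kept.append(block)
        index[position] = retained[fp]
    return kept, index


kept, index = deduplicate([b"x", b"y", b"x"], [100, 200, 300])
print(kept, index)
```

In the example, the third block is a duplicate of the first, so its index entry ends up holding the first block's address and only two blocks go on to compression.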
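The data blocks to be screened that contain similar data can be found by clustering the multiple data blocks to be screened. For example, the blocks can be clustered according to their feature values: data blocks with the same feature value fall into the same class, and the data blocks in each resulting class are the data blocks to be screened that contain similar data. Optionally, the feature value of a data block may be assembled in a specific way from the hash values of several groups of fragments of the block.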
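A minimal sketch of feature-based clustering follows. The application does not specify how fragment hashes are assembled into a feature value, so the rule used here (take the smallest SHA-256 digest over fixed-size fragments, a min-hash style sample) is purely an assumption, as are the function names and the fragment size.

```python
import hashlib
from collections import defaultdict


def feature_value(block: bytes, fragment_size: int = 16) -> bytes:
    """Assumed feature: the smallest fragment digest in the block.

    Blocks that share the fragment with the smallest digest get the same
    feature value even if the rest of their content differs.
    """
    fragments = [block[i:i + fragment_size]
                 for i in range(0, len(block), fragment_size)] or [b""]
    return min(hashlib.sha256(f).digest() for f in fragments)


def cluster_by_feature(blocks):
    """Group block positions by feature value, in first-seen order."""
    clusters = defaultdict(list)
    for position, block in enumerate(blocks):
        clusters[feature_value(block)].append(position)
    return list(clusters.values())


# Blocks 0 and 2 differ in length but consist of the same 16-byte fragment.
print(cluster_by_feature([b"A" * 32, b"B" * 32, b"A" * 16]))
```

A production feature would normally combine several sampled fragments so that a single changed fragment rarely moves a block to another cluster; the single-minimum rule is kept here only for brevity.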
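Step 403: Merge the multiple data blocks to be compressed.
Merging the multiple data blocks to be compressed includes packing them into a single data packet. In one implementation, when the storage system is implemented by multiple computing devices deployed in a distributed manner, and obtaining the data blocks to be compressed and performing the compression run on different computing devices, the computing device that obtains the data blocks to be compressed can, after obtaining multiple blocks that contain similar data, pack them into one data packet and provide the packet to the computing device that performs the compression. That computing device then uses the packet directly as the input of the compression algorithm, so that the multiple data blocks undergo merge compression (combine compression, CC).
Step 404: Compress the merged data blocks to obtain a merge-compressed data set.
Once the merged data blocks are obtained, they can be compressed to obtain the merge-compressed data set. The compression may be deep compression of the multiple data blocks. In one implementation, because the data blocks to be compressed are usually arranged in order, when any one of them is compressed, its coding partition may include that data block and all data blocks to be compressed that are ordered before it. The coding partition indicates the range of sample data used to look for redundancy in the current data block. For example, as shown in FIG. 5, the data blocks are blk1 to blk8, and the coding partition of blk4 consists of blk1, blk2, blk3 and blk4. Compressing blk4 then includes: searching blk1, blk2, blk3 and blk4 for data that is redundant with blk4, and deep-compressing blk4 based on that redundant data.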
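The effect of the enlarged coding partition can be illustrated with a small experiment. This is a sketch under stated assumptions: zlib's DEFLATE is swapped in as a stand-in compressor (the application does not name an algorithm), the test data is synthetic, and the function names are hypothetical. Compressing the concatenation as one stream means that when block k is encoded, blocks 1 to k-1 are already available as match history, which mirrors the coding partition described above as long as the merged data fits in the compressor's window.

```python
import random
import zlib


def compress_separately(blocks):
    """Baseline: one compression call per block, no shared history."""
    return [zlib.compress(b) for b in blocks]


def merge_compress(blocks):
    """One compression call over the concatenation of all blocks."""
    return zlib.compress(b"".join(blocks))


def make_similar_blocks(count=8, size=1024, seed=0):
    """Synthetic similar blocks: random base data with one byte flipped each."""
    rng = random.Random(seed)
    base = bytes(rng.randrange(256) for _ in range(size))
    blocks = []
    for k in range(count):
        block = bytearray(base)
        block[k] ^= 0xFF    # each block differs from the base in one byte
        blocks.append(bytes(block))
    return blocks


blocks = make_similar_blocks()
separate_size = sum(len(c) for c in compress_separately(blocks))
merged_size = len(merge_compress(blocks))
print(separate_size, merged_size)
```

Each block on its own is random and essentially incompressible, so the separate total stays close to the raw size, while the merged stream only pays full price for the first block.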
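In general, the larger the coding partition of a data block, the larger the amount of sample data available for finding redundancy in that block. For a given data block, therefore, a larger coding partition yields a higher reduction ratio. In the embodiments of this application, the coding partition within which redundancy is sought when a single data block is compressed is enlarged. The samples used to find redundancy are therefore better utilized across time and space, which helps further improve the reduction ratio of the compression and further lower storage cost.
In one implementation, the merged data blocks may be merge-compressed based on forward coding. In this way, the order in which the compression algorithm processes the data closely matches the coding partition of each data block, which helps further improve the reduction ratio while keeping compression and decompression overhead low. In addition, the compression ratio of a candidate compression algorithm can be taken into account when the algorithm is selected, to further safeguard the compression ratio of the compressed data and thus the storage cost of the storage system.
Step 405: Store the merge-compressed data set on a storage medium.
After the storage system has merge-compressed the data set, it can write the merge-compressed data set to disk to persist it. In one implementation, after the merge-compressed data set is obtained, the metadata information of the data set can be added at the head of the data set, and the data set with the metadata information added is then stored on the storage medium.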
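One way to picture a data set with metadata at its head is shown below. The layout is hypothetical: the application does not define the header, so the magic tag `CCS1`, the little-endian block count and the per-block lengths are all assumptions, and zlib again stands in for the compressor.

```python
import struct
import zlib

MAGIC = b"CCS1"   # hypothetical 4-byte tag marking a merge-compressed set


def pack_data_set(blocks):
    """Prefix the merge-compressed payload with a small metadata header.

    Header layout (assumed): MAGIC, block count, then one length per block,
    all as little-endian unsigned 32-bit integers.
    """
    lengths = [len(b) for b in blocks]
    header = (MAGIC
              + struct.pack("<I", len(blocks))
              + struct.pack(f"<{len(blocks)}I", *lengths))
    return header + zlib.compress(b"".join(blocks))


def unpack_header(stored):
    """Return (block_lengths, payload_offset) parsed from the header."""
    if stored[:4] != MAGIC:
        raise ValueError("not a merge-compressed data set")
    (count,) = struct.unpack_from("<I", stored, 4)
    lengths = list(struct.unpack_from(f"<{count}I", stored, 8))
    return lengths, 8 + 4 * count


stored = pack_data_set([b"abc", b"defgh"])
lengths, offset = unpack_header(stored)
print(lengths, offset)
```

Keeping the block lengths in the header lets a reader locate any block in the decompressed data without consulting another structure.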
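Step 406: Update the index information of the storage medium based on the position of the merge-compressed data set on the storage medium.
After the position on the storage medium of each data block in the merge-compressed data set is obtained, the index information of the storage medium may be updated based on that position, so that the corresponding data block can be located through the index information when data in the data set needs to be read. The index information of the storage medium records the index information of all data stored in the storage system.
In one implementation, the index information of a data block to be compressed may indicate not only the physical address of the compressed data but also the logical index number (logic-index) of the compressed data block. The logical index numbers may be assigned to the multiple data blocks to be compressed before those blocks are compressed. For example, after deduplication is performed on the data blocks to be screened, logical index numbers can be assigned to the blocks that remain. As another example, when the storage system records the index information of the merge-compressed data set as key-value pairs, then for any data block to be compressed in the data set, a portion of the block's fingerprint can be extracted and used to derive the key of the block's index information, and the block's physical address and logical index number can be used to derive the value.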
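A key-value index entry of the kind described above might look like the following sketch. The details are assumptions made for illustration: SHA-256 as the fingerprint, an 8-byte truncated key, one shared physical address per data set, and the names `IndexValue`, `index_key` and `build_index`. The `merge_compressed` flag reflects the optional indication that the data was obtained through merge compression.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexValue:
    physical_address: int          # where the merge-compressed set is stored
    logic_index: int               # position of the block inside the set
    merge_compressed: bool = True  # processing-type flag


def index_key(block: bytes, key_bytes: int = 8) -> bytes:
    """Key derived from a truncated fingerprint (length is an assumption)."""
    return hashlib.sha256(block).digest()[:key_bytes]


def build_index(blocks, physical_address):
    """One index entry per block; logical index numbers follow block order."""
    return {index_key(block): IndexValue(physical_address, i)
            for i, block in enumerate(blocks)}


index = build_index([b"first block", b"second block"], physical_address=4096)
print(index[index_key(b"second block")])
```

Truncating the fingerprint shortens the key at the cost of a small collision probability; how the original system handles such collisions is not stated in the application.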
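Optionally, the index information also indicates that the data set was obtained through merged compression. Data stored in the storage system may or may not have undergone merged compression, so having the index information indicate that data was obtained through merged compression makes it possible to distinguish data that has been merge-compressed from data that has not.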
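Similarly, the index information of a data block to be screened that was retained by deduplication may also indicate that the retained block was obtained through deduplication. Data in the storage system that has been deduplicated can thus be distinguished from data that has not.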
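In one implementation, the index information of the storage medium records the index information of all data stored in the storage system, the storage system is implemented by multiple computing devices deployed in a distributed manner, and the computing device that records the index information of the storage medium differs from the computing device that performs compression. After the compressing device merge-compresses the data, it can store the merged-compressed data set on the storage medium and record the index information of that data set locally. Step 406 may then include: updating the index information of the storage medium based on the index information of the merged-compressed data set recorded by the computing device that performs compression.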
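For example, Figure 6 is a schematic structural diagram of a storage system provided by an embodiment of this application. As shown in Figure 6, the storage system is a distributed storage system. It may include a metadata management module, a compression module, and a persistence module, each implemented by a different computing device. The metadata management module receives input/output (I/O) requests from users and manages metadata based on those requests. The compression module performs merged compression and/or merged decompression of data. The persistence module stores the data it receives, thereby persisting it. The metadata management module includes a metadata management unit (such as LunMap) and a metadata processing unit (such as DusClient). The metadata management unit records the metadata of all data stored on the storage medium of the storage system, and this metadata includes the index information of the storage medium. The metadata processing unit interfaces with the metadata management unit, the compression module, and the persistence module; it computes the fingerprint and feature value of each data block, stores them in the persistence module, and sends them to the compression module together with the other metadata information of the data block. The compression module includes an aggregation unit (such as OpTable), an analysis unit (such as a post data analysis (PDA) unit), a task allocation unit (such as task mag), a compression unit (such as a post data reduce (PDR) unit), and an index information storage unit (such as FPTable).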
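As shown in Figures 6 and 7, when data is written to the storage system, the aggregation unit receives the fingerprints, feature values, and other information sent by the metadata processing unit in the metadata management module. Based on the feature values, it aggregates data with the same feature value to obtain aggregated similar data, and it provides the compression unit with that data together with its physical addresses and logical index numbers. The analysis unit periodically analyzes the aggregated similar data provided by the aggregation unit and assembles similar data scattered across the storage system into similar-data chains. The task allocation unit receives the similar-data chains from the analysis unit, forms compression tasks according to how the feature values are distributed across the storage system, and assigns the tasks to the compression unit. The compression unit carries out deduplication and merged compression according to the assigned tasks, persists the merged-compressed data set to the persistence module, generates index information for each data block in the data set, and provides that index information to the index information storage unit. Optionally, before merge-compressing the data, the compression unit can first compare fingerprints to find completely identical data on a similar-data chain, keep only one copy of each, and then merge-compress the remaining data. The index information storage unit receives the index information of each data block in the data set from the compression unit, persists it, and uses it to update the index information recorded in the metadata management unit.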
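In summary, in the data processing method provided by the embodiments of this application, during the data compression stage the computing device obtains multiple data blocks to be compressed, merges them, and then merge-compresses the merged blocks to obtain a merged-compressed data set. Because data is compressed in merged form, the compression flow is simplified, compression latency is reduced, compression efficiency is improved, and the resources (such as CPU) consumed by compression are lowered. For example, for a set of N similar data blocks, the current similarity-based deduplication and compression scheme requires 2N-1 compression rounds (including delta compression and deep compression), whereas the data processing method provided by the embodiments of this application reduces the process to a single compression. The data compression flow is therefore simplified effectively.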
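The data reading process is described below. As shown in Figure 8, it may include the following steps:
Step 801: Obtain, based on a data read instruction, the index information of the target data set to be decompressed. The target data set includes one or more data blocks and was obtained through merged compression.
After data has been stored in the storage system, a user who needs to read it can send a data read instruction to the storage system through a client. The data read instruction carries indication information for the target data block to be read. Based on this indication information, the storage system can obtain the index information of the target data block and of the target data set that contains it. The index information of the target data block indicates where the target data block is stored, and the index information of the target data set indicates where the target data set is stored.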
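Both the primary index area and the online index area of the storage medium record index information. The primary index area records first index information for all data stored on the storage medium, the online index area records second index information, and the first index information is updated from the second index information. For example, as shown in Figure 6, the primary index area can be maintained by the metadata management unit and the online index area by the index information storage unit. Obtaining the index information of the target data set based on the data read instruction may then include: preferentially obtaining it from the first index information recorded in the primary index area. Because the first index information covers all data stored on the storage medium, whereas the second index information may cover only part of it, looking in the primary index area first ensures that the index information of the target data set can be found. Moreover, as shown in Figure 6, if the metadata management module, the compression module, and the persistence module are implemented by different computing devices, the metadata management module is the one that receives the data read instruction. Reading from the primary index area first means the metadata management module does not need to access the compression module to obtain the index information, which saves one hop, reduces the network overhead of the index lookup, and further lowers the latency of reading stored data.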
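Step 802: Obtain the target data set based on its index information.
After the index information of the target data set is obtained, the storage location of the target data set on the storage medium can first be determined from it, and the target data set can then be read from that location. In one implementation, the index information indicates the logical number of the on-disk address of the target data set (plogId) and the offset of the target data set within the storage unit indicated by that logical number. The storage location on the storage medium is determined from the logical number and the offset, and the target data set is then read from that location.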
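Locating a data set by logical number and offset can be sketched with an in-memory model. The `storage` dictionary, the `append` and `read_at` helpers, and the explicit `length` argument are assumptions made for illustration; the source names only plogId and offset.

```python
def append(storage, plog_id, data):
    """Append data to the storage unit plog_id and return the offset it was written at."""
    unit = storage.setdefault(plog_id, bytearray())
    offset = len(unit)
    unit.extend(data)
    return offset


def read_at(storage, plog_id, offset, length):
    """Read length bytes from the storage unit plog_id starting at offset."""
    unit = storage[plog_id]
    if offset < 0 or offset + length > len(unit):
        raise ValueError("range outside storage unit")
    return bytes(unit[offset:offset + length])


storage = {}  # hypothetical: plogId -> contents of one append-only storage unit
first = append(storage, 7, b"set-one")
second = append(storage, 7, b"set-two!")
```

The pair `(7, second)` plays the role of the index information for the second data set, and `read_at(storage, 7, second, 8)` returns its bytes.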
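Note that if the index information of the target data set is first obtained from the first index information recorded in the primary index area and the target data set cannot be obtained based on it, the index information can then be obtained from the second index information recorded in the online index area, and the target data set obtained with that. A failure to obtain the target data set through the first index information may occur because the latest index information of the target data set has not yet been propagated from the online index area to the primary index area. For example, when the data in the target data set consists of deduplicated data blocks, this can happen because the reference block of a deduplicated block has changed, which changes the physical address of the data that the deduplicated block refers to, and the new physical address has not yet been written to the primary index area. As another example, when the data in the target data set consists of merge-compressed data blocks, this can happen because the merged blocks took part in a new merged compression: the storage system writes the result to disk again using append-only writes, which changes the physical addresses of all the data, and the new physical addresses have not yet been written to the primary index area.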
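The lookup order (primary index area first, online index area as fallback) can be sketched as below. The dictionaries and the `locate` helper are hypothetical, and a stale entry is modeled simply as an address that no longer exists in `storage`.

```python
def locate(key, primary_index, online_index, storage):
    """Return (address, source), trying the primary index before the online index."""
    address = primary_index.get(key)
    if address is not None and address in storage:
        return address, "primary"
    # The primary entry is missing or stale: fall back to the online index.
    address = online_index.get(key)
    if address is not None and address in storage:
        return address, "online"
    raise KeyError(key)


# The set was rewritten by append-only storage, so it moved from 0x1000 to 0x2000,
# and only the online index has been updated so far.
storage = {0x2000: b"merged-compressed set", 0x3000: b"another set"}
primary_index = {"set-1": 0x1000, "set-2": 0x3000}
online_index = {"set-1": 0x2000}

stale_hit = locate("set-1", primary_index, online_index, storage)
fresh_hit = locate("set-2", primary_index, online_index, storage)
```

In the common case the primary lookup succeeds and the extra hop to the online index is avoided; the fallback is taken only when the primary entry cannot be resolved.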
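Step 803: Obtain the processing type of the data in the target data set based on the index information of the target data set.
After the target data set is obtained, the processing type of its data can be obtained from the index information of the target data set, and the way the target data block is retrieved from the target data set is then chosen according to that type. When the processing type is "merge-compressed", step 804 is performed to apply merged decompression to the target data set and obtain the target data block from it. Note that when the processing type is "not fully deduplicated" and/or "not merge-compressed", the target data block can be obtained from the target data set directly according to the index information of the target data block.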
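As described above, the index information can indicate that a data set was obtained through merged compression, so the index information can be used to determine whether the data in the target data set was obtained through merged compression. In one implementation, the index information contains a field whose value indicates whether the data was obtained through merged compression, and the value of that field is used to make this determination.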
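Similarly, the index information of a data block to be screened that was retained by deduplication can indicate that the retained block was obtained through deduplication, so the index information can be used to determine whether the data in the target data set has been deduplicated. In one implementation, the index information contains a field whose value indicates whether the data was obtained through deduplication, and the value of that field is used to make this determination.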
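The processing-type fields can be modeled as bit flags, as in the sketch below. The flag names, their bit values, and the `read_plan` helper are assumptions; the source says only that field values in the index information indicate whether data was deduplicated or merge-compressed.

```python
from enum import IntFlag


class Processing(IntFlag):
    """Hypothetical encoding of the processing-type fields in the index information."""
    NONE = 0
    DEDUPLICATED = 1
    MERGE_COMPRESSED = 2


def read_plan(flags: Processing) -> str:
    """Choose how to retrieve the target block based on the processing type."""
    if flags & Processing.MERGE_COMPRESSED:
        return "decompress-set-then-pick-block"
    return "read-block-directly"


both = read_plan(Processing.DEDUPLICATED | Processing.MERGE_COMPRESSED)
dedup_only = read_plan(Processing.DEDUPLICATED)
plain = read_plan(Processing.NONE)
```

Only sets flagged as merge-compressed take the decompression path of step 804; everything else is read directly through the block's own index information.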
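Step 804: When the index information of the target data set indicates that the target data set was obtained through merged compression, perform merged decompression on the target data set to obtain the decompressed target data set.
When the index information of the target data set indicates that the target data set was obtained through merged compression, merged decompression can be applied to it to obtain the decompressed target data set. In one implementation, the target data set is decompressed with the decompression method that matches the merged compression applied to it. For example, when the target data set was merge-compressed based on forward coding, the decompression method matching forward coding is used. Moreover, if, when each of the ordered data blocks was compressed, its coding partition included the block itself and all data blocks preceding it in the order, then a single decompression is enough to decompress the whole target data set.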
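Step 805: Based on the index information of the target data block, obtain the target data block from the decompressed target data set and return it in response to the data read instruction.
Once the index information of the target data block is obtained, the target data block can be retrieved from the target data set according to it and returned in response to the data read instruction. In one implementation, the index information of the target data block indicates its logical index number, and obtaining the target data block from the decompressed target data set based on its index information includes: obtaining the target data block from the decompressed target data set based on the logical index number.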
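Steps 804 and 805 together can be sketched as a single decompression followed by a slice. zlib is an assumed stand-in for the compression algorithm, and keeping a list of block lengths alongside the compressed set is a hypothetical way of mapping a logical index number to a byte range.

```python
import zlib
from itertools import accumulate


def merge_compress(blocks):
    """Compress the concatenated blocks once and remember each block's length."""
    lengths = [len(b) for b in blocks]
    return zlib.compress(b"".join(blocks)), lengths


def read_block(compressed, lengths, logical_index):
    """Decompress the whole set in one pass, then slice out the requested block."""
    data = zlib.decompress(compressed)   # single decompression for the whole set
    starts = [0, *accumulate(lengths)]   # starts[i] is where block i begins
    return data[starts[logical_index]:starts[logical_index + 1]]


blocks = [b"one", b"twotwo", b"three"]
compressed, lengths = merge_compress(blocks)
second_block = read_block(compressed, lengths, 1)
```

The logical index number selects the block after decompression, so no per-block decompression or delta reconstruction is needed.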
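As shown in Figures 6 and 9, when data is read from the storage system, the metadata management module, after receiving a data read request, can obtain from the metadata management unit the index information of the target data block indicated by the request and of the target data set that contains it, and provide this index information to the compression module. The compression module obtains the target data set from the storage medium according to the index information, performs merged decompression on it, obtains the target data block from the decompressed target data set, and returns the target data block to the metadata management module, which completes the read. However, when the target data set cannot be obtained with the index information from the metadata management unit, then, as shown in Figure 9, the metadata management module can obtain new index information for the target data block and its target data set from the index information storage unit and obtain the target data block according to that new index information.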
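In summary, in the data processing method provided by the embodiments of this application, during the data decompression stage, after obtaining the index information of the target data set to be decompressed, the computing device obtains the target data set based on that index information and then performs merged decompression on it to obtain the decompressed target data set. The target data set includes one or more data blocks and was obtained through merged compression. Because data is decompressed in merged form, decompression latency is reduced, decompression efficiency is improved, and the resources consumed by decompression are lowered. For example, for a merged-compressed data set containing N similar data blocks, the current similarity-based deduplication and decompression scheme requires 3N-2 decompression rounds (including deep decompression of the reference block, deep decompression of the similar data blocks, and delta decompression of the similar data), whereas the data processing method provided by the embodiments of this application reduces the process to a single decompression. The data decompression flow is therefore simplified effectively.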
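As a rough worked comparison using only the round counts stated in the two method summaries (the choice N = 8 is an illustrative assumption, not a figure from the source):

```latex
\begin{aligned}
\text{compression rounds:}\quad   & 2N - 1 = 2\cdot 8 - 1 = 15 \;\longrightarrow\; 1,\\
\text{decompression rounds:}\quad & 3N - 2 = 3\cdot 8 - 2 = 22 \;\longrightarrow\; 1.
\end{aligned}
```

Both baseline counts grow linearly in N, while the merged scheme stays at one round regardless of N.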
需要说明的是,本申请实施例提供的数据处理方法的步骤先后顺序可以进行适当调整,步骤也可以根据情况进行相应增减。任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化的方法,都应涵盖在本申请的保护范围之内,因此不再赘述。
以上介绍了本申请实施例的数据处理方法,与上述方法对应,本申请实施例还提供了一种数据处理装置。图10是本申请实施例提供的一种数据处理装置的结构示意图。基于图10所示的如下多个模块,该图10所示的数据处理装置能够执行上述图4所示的全部或部分操作。应理解到,该装置可以包括比所示模块更多的附加模块或者省略其中所示的一部分模块,本申请实施例对此并不进行限制。可选的,该数据处理装置可配置于云平台。如图10所示,该数据处理装置1000包括:
获取模块1001,用于获取多个待压缩数据块。
合并模块1002,用于将多个待压缩数据块合并。
压缩模块1003,用于对合并后的多个待压缩数据块进行压缩处理,得到经过合并压缩的数据集合。
可选的,多个待压缩数据块按序排列,多个待压缩数据块中任一待压缩数据块的编码分区包括:任一待压缩数据块,及多个待压缩数据块中排序在任一待压缩数据块之前的所有待压缩数据块。
可选的,压缩模块1003,具体用于:基于前向编码,对合并后的多个待压缩数据块进行压缩处理。
可选的,获取模块1001,具体用于:基于数据写入指令,获取多个待写入数据块;基于多个待写入数据块,获取多个待压缩数据块。
可选的,压缩模块1003,还用于:将数据集合存储在存储介质上。
可选的,压缩模块1003,还用于:基于数据集合在存储介质上的位置,更新存储介质的索引信息。
可选的,索引信息还指示数据集合经过合并压缩处理得到。
可选的,获取模块1001,具体用于:获取多个待筛选数据块;基于多个待筛选数据块中存在相似数据的待筛选数据块,得到多个待压缩数据块。
可选的,获取模块1001,具体用于:对存在相似数据的待筛选数据块执行重删处理,得到多个待压缩数据块。
可选的,经过重删处理被保留的待筛选数据块的索引信息还指示被保留的待筛选数据块经过重删处理得到。
可选的,待压缩数据块的索引信息还指示多个待压缩数据块的逻辑索引号。
综上所述,在本申请实施例提供的数据处理装置中,在数据压缩阶段,计算设备获取多个待压缩数据块后,会将多个待压缩数据块合并,然后对合并后的多个待压缩数据块进行合并压缩处理,以得到经过合并压缩的数据集合。这样一来,由于在对数据进行压缩时,采取对数据进行合并压缩的方式,简化了数据的压缩流程,有效地减小了数据压缩的时延,提升了数据压缩效率,降低了压缩的资源(如CPU)消耗。例如,对于一个拥有N个相似数据块的集合,在采用目前的相似性重删压缩方案进行压缩时,需要执行2N-1次的压缩循环(包括差量压缩过程和深度压缩过程),而采用本申请实施例提供的数据处理装置,可将压缩过程简化为执行一次压缩。因此,有效地简化了数据的压缩流程。
本申请实施例还提供了另一种数据处理装置。图11是本申请实施例提供的一种数据处理装置的结构示意图。基于图11所示的如下多个模块,该图11所示的数据处理装置能够执行上述图8所示的全部或部分操作。应理解到,该装置可以包括比所示模块更多的附加模块或者省略其中所示的一部分模块,本申请实施例对此并不进行限制。可选的,该数据处理装置可配置于云平台。如图11所示,该数据处理装置1100包括:
第一获取模块1101,用于获取待解压缩的目标数据集合的索引信息,目标数据集合包括一个或多个数据块,目标数据集合基于合并压缩处理得到。
第二获取模块1102,用于基于目标数据集合的索引信息,获取目标数据集合。
解压缩模块1103,用于对目标数据集合进行解合并压缩处理,得到经过解合并压缩的目标数据集合。
可选的,解压缩模块1103,具体用于:当目标数据集合的索引信息指示目标数据集合经过合并压缩处理得到时,对目标数据集合进行解合并压缩处理,得到经过解合并压缩的目标数据集合。
可选的,存储介质的主索引区域和在线索引区域均记载有索引信息,主索引区域记载有存储介质中存储的所有数据的第一索引信息,在线索引区域记载有第二索引信息,第一索引信息基于第二索引信息更新得到,第一获取模块1101,具体用于:优先从主索引区域记载的第一索引信息中,获取目标数据集合的索引信息。
可选的,当无法基于第一索引信息获取目标数据集合时,第一获取模块1101,具体还用于:从在线索引区域记载的第二索引信息中,获取目标数据集合的索引信息。
可选的,目标数据集合基于数据读取指令得到。
可选的,数据读取指令指示读取目标数据集合中的目标数据块。则第一获取模块1101,还用于获取目标数据块的索引信息;第二获取模块1102,还用于基于目标数据块的索引信息,从经过解合并压缩的目标数据集合中,获取目标数据块;第二获取模块1102,还用于向数据读取指令反馈目标数据块。
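Optionally, the index information of the target data block indicates the logical index number of the target data block, and the second obtaining module 1102 is specifically configured to: obtain the target data block from the decompressed target data set based on the logical index number.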
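In summary, in the data processing apparatus provided by the embodiments of this application, during the data decompression stage, after obtaining the index information of the target data set to be decompressed, the computing device obtains the target data set based on that index information and then performs merged decompression on it to obtain the decompressed target data set. The target data set includes one or more data blocks and was obtained through merged compression. Because data is decompressed in merged form, decompression latency is reduced, decompression efficiency is improved, and the resources consumed by decompression are lowered. For example, for a merged-compressed data set containing N similar data blocks, the current similarity-based deduplication and decompression scheme requires 3N-2 decompression rounds (including deep decompression of the reference block, deep decompression of the similar data blocks, and delta decompression of the similar data), whereas the data processing apparatus provided by the embodiments of this application reduces the process to a single decompression. The data decompression flow is therefore simplified effectively.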
其中,获取模块1001、合并模块1002、压缩模块1003、第一获取模块1101、第二获取模块1102和解压缩模块1103均可以通过软件实现,或者可以通过硬件实现。示例性地,接下来以获取模块1001为例,介绍获取模块1001的实现方式。类似的,合并模块1002、压缩模块1003、第一获取模块1101、第二获取模块1102和解压缩模块1103的实现方式可以参考获取模块1001的实现方式。
模块作为软件功能单元的一种举例,获取模块1001可以包括运行在计算实例上的代码。其中,计算实例可以包括物理主机(计算设备)、虚拟机、容器中的至少一种。进一步地,上述计算实例可以是一台或者多台。例如,获取模块1001可以包括运行在多个主机/虚拟机/容器上的代码。需要说明的是,用于运行该代码的多个主机/虚拟机/容器可以分布在相同的区域(region)中,也可以分布在不同的region中。进一步地,用于运行该代码的多个主机/虚拟机/容器可以分布在相同的可用区(availability zone,AZ)中,也可以分布在不同的AZ中,每个AZ包括一个数据中心或多个地理位置相近的数据中心。其中,通常一个region可以包括多个AZ。
同样,用于运行该代码的多个主机/虚拟机/容器可以分布在同一个虚拟私有云(virtual private cloud,VPC)中,也可以分布在多个VPC中。其中,通常一个VPC设置在一个region内,同一region内两个VPC之间,以及不同region的VPC之间跨区通信需在每个VPC内设置通信网关,经通信网关实现VPC之间的互连。
模块作为硬件功能单元的一种举例,获取模块1001可以包括至少一个计算设备,如服务器等。或者,获取模块1001也可以是利用专用集成电路(application-specific integrated circuit,ASIC)实现、或可编程逻辑器件(programmable logic device,PLD)实现的设备等。其中,上述PLD可以是复杂程序逻辑器件(complex programmable logical device,CPLD)、现场可编程门阵列(field-programmable gate array,FPGA)、通用阵列逻辑(generic array logic,GAL)或其任意组合实现。
获取模块1001包括的多个计算设备可以分布在相同的region中,也可以分布在不同的region中。获取模块1001包括的多个计算设备可以分布在相同的AZ中,也可以分布在不同的AZ中。同样,获取模块1001包括的多个计算设备可以分布在同一个VPC中,也可以分布在多个VPC中。其中,所述多个计算设备可以是服务器、ASIC、PLD、CPLD、FPGA和GAL等计算设备的任意组合。
需要说明的是,在其他实施例中,获取模块1001、合并模块1002、压缩模块1003、第一获取模块1101、第二获取模块1102和解压缩模块1103中任一模块可以用于执行数据处理方法中的任意步骤。获取模块1001、合并模块1002、压缩模块1003、第一获取模块1101、第二获取模块1102和解压缩模块1103负责实现的步骤可根据需要指定,通过获取模块1001、合并模块1002、压缩模块1003、第一获取模块1101、第二获取模块1102和解压缩模块1103分别实现数据处理方法中不同的步骤来实现数据处理装置的全部功能。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的装置和模块的具体工作过程,可以参考前述方法实施例中的对应内容,在此不再赘述。
本申请实施例提供了一种计算设备。该计算设备用于实现本申请实施例提供的数据处理方法中的部分或全部功能。图12是本申请实施例提供的一种计算设备的结构示意图。如图12所示,该计算设备1200包括处理器1201、存储器1202、通信接口1203和总线1204。其中,处理器1201、存储器1202、通信接口1203通过总线1204实现彼此之间的通信连接。
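The processor 1201 may include a general-purpose processor and/or a dedicated hardware chip. The general-purpose processor may include a central processing unit (CPU), a microprocessor, or a graphics processing unit (GPU). The CPU is, for example, a single-core processor (single-CPU) or a multi-core processor (multi-CPU). The dedicated hardware chip is a hardware module for high-performance processing and includes at least one of a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a network processor (NP). The processor 1201 may also be an integrated circuit chip with signal processing capability. In implementation, some or all functions of the data processing method of this application can be completed by integrated logic circuits of hardware in the processor 1201 or by instructions in the form of software.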
存储器1202用于存储计算机程序,计算机程序包括操作系统1202a和可执行代码(即程序指令)1202b。存储器1202例如是只读存储器或可存储静态信息和指令的其它类型的静态存储设备,又如是随机存取存储器或者可存储信息和指令的其它类型的动态存储设备,又如是电可擦可编程只读存储器、只读光盘或其它光盘存储、光碟存储(包括压缩光碟、激光碟、光碟、数字通用光碟、蓝光光碟等)、磁盘存储介质或者其它磁存储设备,或者是能够用于携带或存储具有指令或数据结构形式的期望的可执行代码并能够由计算机存取的任何其它介质,但不限于此。例如存储器1202用于存放出端口队列等。存储器1202例如是独立存在,并通过总线1204与处理器1201相连接。或者存储器1202和处理器1201集成在一起。存储器1202可以存储可执行代码,当存储器1202中存储的可执行代码被处理器1201执行时,处理器1201用于执行本申请实施例提供的数据处理方法的部分或全部功能。处理器1201执行该过程的实现方式请相应参考前述实施例中的相关描述。存储器1202中还可以包括操作系统等其他运行进程所需的软件模块和数据等。
通信接口1203使用例如但不限于收发器一类的收发模块,来实现与其他设备或通信网络之间的通信。例如,通信接口1203可以是以下器件的任一种或任一种组合:网络接口(如以太网接口)、无线网卡等具有网络接入功能的器件。
总线1204是任何类型的,用于实现计算设备的内部器件(例如,存储器1202、处理器1201、通信接口1203)互连的通信总线。例如系统总线。本申请实施例以计算设备内部的上述器件通过总线1204互连为例说明,可选地,计算设备1200内部的上述器件还可以采用除了总线1204之外的其他连接方式彼此通信连接。例如,计算设备1200内部的上述器件通过内部的逻辑接口互连。
需要说明的是,上述多个器件可以分别设置在彼此独立的芯片上,也可以至少部分的或者全部的设置在同一块芯片上。将各个器件独立设置在不同的芯片上,还是整合设置在一个或者多个芯片上,往往取决于产品设计的需要。本申请实施例对上述器件的具体实现形式不做限定。且上述各个附图对应的流程的描述各有侧重,某个流程中没有详述的部分,可以参见其他流程的相关描述。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。提供程序开发平台的计算机程序产品包括一个或多个计算机指令,在计算设备上加载和执行这些计算机程序指令时,全部或部分地实现本申请实施例提供的数据处理方法的功能。
并且,计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线)或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。计算机可读存储介质存储有提供程序开发平台的计算机程序指令。
本申请实施例还提供了一种计算设备集群。该计算设备集群包括至少一台计算设备。该计算设备可以是服务器,例如是中心服务器、边缘服务器,或者是本地数据中心中的本地服务器。在一些实施例中,计算设备也可以是台式机、笔记本电脑或者智能手机等终端设备。
可选地,计算设备集群包括的至少一个计算设备的结构可参见图12示出的计算设备1200。计算设备集群中的一个或多个计算设备1200中的存储器1202中可以存有相同的用于执行数据处理方法的指令。
在一些可能的实现方式中,该计算设备集群中的一个或多个计算设备1200的存储器1202中也可以分别存有用于执行数据处理方法的部分指令。换言之,一个或多个计算设备1200的组合可以共同执行用于执行数据处理方法的指令。
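Note that the memories 1202 of different computing devices 1200 in the computing device cluster may store different instructions, each used to perform part of the functions of the data processing apparatus. That is, the instructions stored in the memories 1202 of different computing devices 1200 may implement the functions of one or more of the obtaining module 1001, the merging module 1002, the compression module 1003, the first obtaining module 1101, the second obtaining module 1102, and the decompression module 1103.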
在一些可能的实现方式中,计算设备集群中的一个或多个计算设备可以通过网络连接。其中,所述网络可以是广域网或局域网等等。图13示出了一种可能的实现方式。如图13所示,两个计算设备1300A和1300B之间通过网络进行连接。具体地,通过各个计算设备中的通信接口与所述网络进行连接。在这一类可能的实现方式中,计算设备1300A和1300B包括总线1302、处理器1304、存储器1306和通信接口1308。计算设备1300A中的存储器1306中存有执行获取模块1001、合并模块1002和压缩模块1003的功能的指令。同时,计算设备1300B中的存储器1306中存有执行第一获取模块1101、第二获取模块1102和解压缩模块1103的功能的指令。
应理解,图13中示出的计算设备1300A的功能也可以由多个计算设备1300完成。同样,计算设备1300B的功能也可以由多个计算设备1300完成。且用于实现数据处理方法的模块在计算设备中的部署方式也可以根据应用需求进行调整。
本申请实施例还提供了一种计算机可读存储介质,该计算机可读存储介质为非易失性计算机可读存储介质,该计算机可读存储介质包括程序指令,当程序指令在计算设备上运行时,使得计算设备实现如本申请实施例提供的数据处理方法。
本申请实施例还提供了一种包含指令的计算机程序产品,当计算机程序产品在计算机上运行时,使得计算机实现本申请实施例提供的数据处理方法。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
需要说明的是,本申请所涉及的信息(包括但不限于用户设备信息、用户个人信息等)、数据(包括但不限于用于分析的数据、存储的数据、展示的数据等)以及信号,均为经用户授权或者经过各方充分授权的,且相关数据的收集、使用和处理需要遵守相关国家和地区的相关法律法规和标准。例如,本申请中涉及到的原始数据和可执行代码等都是在充分授权的情况下获取的。
在本申请实施例中,术语“第一”、“第二”和“第三”仅用于描述目的,而不能理解为指示或暗示相对重要性。术语“至少一个”是指一个或多个,术语“多个”指两个或两个以上,除非另有明确的限定。
本申请中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系。
以上所述仅为本申请的可选实施例,并不用以限制本申请,凡在本申请的构思和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (39)

  1. 一种数据处理方法,其特征在于,所述方法包括:
    获取多个待压缩数据块;
    将所述多个待压缩数据块合并;
    对合并后的多个待压缩数据块进行压缩处理,得到经过合并压缩的数据集合。
  2. 如权利要求1所述的方法,其特征在于,所述多个待压缩数据块按序排列,所述多个待压缩数据块中任一待压缩数据块的编码分区包括:所述任一待压缩数据块,及所述多个待压缩数据块中排序在所述任一待压缩数据块之前的所有待压缩数据块。
  3. 如权利要求1或2所述的方法,其特征在于,所述对合并后的多个待压缩数据块进行压缩处理,包括:
    基于前向编码,对合并后的多个待压缩数据块进行压缩处理。
  4. 如权利要求1至3任一所述的方法,其特征在于,所述获取多个待压缩数据块,包括:
    基于数据写入指令,获取多个待写入数据块;
    基于所述多个待写入数据块,获取所述多个待压缩数据块。
  5. 如权利要求4所述的方法,其特征在于,所述方法,还包括:
    将所述数据集合存储在存储介质上。
  6. 如权利要求5所述的方法,其特征在于,所述方法还包括:
    基于所述数据集合在所述存储介质上的位置,更新所述存储介质的索引信息。
  7. 如权利要求6所述的方法,其特征在于,所述索引信息还指示所述数据集合经过合并压缩处理得到。
  8. 如权利要求1至7任一所述的方法,其特征在于,所述获取多个待压缩数据块,包括:
    获取多个待筛选数据块;
    基于所述多个待筛选数据块中存在相似数据的待筛选数据块,得到所述多个待压缩数据块。
  9. 如权利要求8所述的方法,其特征在于,所述基于所述多个待筛选数据块中存在相似数据的待筛选数据块,得到所述多个待压缩数据块,包括:
    对存在相似数据的待筛选数据块执行重删处理,得到所述多个待压缩数据块。
  10. 如权利要求9所述的方法,其特征在于,经过重删处理被保留的待筛选数据块的索引信息还指示所述被保留的待筛选数据块经过重删处理得到。
  11. 如权利要求1至10任一所述的方法,其特征在于,所述待压缩数据块的索引信息还指示所述多个待压缩数据块的逻辑索引号。
  12. 一种数据处理方法,其特征在于,所述方法包括:
    获取待解压缩的目标数据集合的索引信息,所述目标数据集合包括一个或多个数据块,所述目标数据集合基于合并压缩处理得到;
    基于所述目标数据集合的索引信息,获取所述目标数据集合;
    对所述目标数据集合进行解合并压缩处理,得到经过解合并压缩的目标数据集合。
  13. 如权利要求12所述的方法,其特征在于,所述目标数据集合的索引信息还指示所述目标数据集合中数据的处理类型,所述对所述目标数据集合进行解合并压缩处理,得到经过解合并压缩的目标数据集合,包括:
    当所述目标数据集合的索引信息指示所述目标数据集合经过合并压缩处理得到时,对所述目标数据集合进行解合并压缩处理,得到经过解合并压缩的目标数据集合。
  14. 如权利要求12或13所述的方法,其特征在于,所述目标数据集合存储在存储介质上,所述存储介质的主索引区域和在线索引区域均记载有索引信息,所述主索引区域记载有所述存储介质中存储的所有数据的第一索引信息,所述在线索引区域记载有第二索引信息,所述第一索引信息基于所述第二索引信息更新得到,所述获取待解压缩的目标数据集合的索引信息,包括:
    优先从所述主索引区域记载的第一索引信息中,获取所述目标数据集合的索引信息。
  15. 如权利要求14所述的方法,其特征在于,所述方法还包括:
    当无法基于所述第一索引信息获取所述目标数据集合时,从所述在线索引区域记载的第二索引信息中,获取所述目标数据集合的索引信息。
  16. 如权利要求12至15任一所述的方法,其特征在于,所述目标数据集合基于数据读取指令得到。
  17. 如权利要求16所述的方法,其特征在于,所述数据读取指令指示读取所述目标数据集合中的目标数据块,所述方法还包括:
    获取所述目标数据块的索引信息;
    基于所述目标数据块的索引信息,从所述经过解合并压缩的目标数据集合中,获取所述目标数据块;
    向所述数据读取指令反馈所述目标数据块。
  18. 如权利要求17所述的方法,其特征在于,所述目标数据块的索引信息指示所述目标数据块的逻辑索引号,所述基于所述目标数据块的索引信息,从所述经过解合并压缩的目标数据集合中,获取所述目标数据块,包括:
    基于所述逻辑索引号,从所述经过解合并压缩的目标数据集合中获取所述目标数据块。
  19. 一种数据处理装置,其特征在于,所述装置包括:
    获取模块,用于获取多个待压缩数据块;
    合并模块,用于将所述多个待压缩数据块合并;
    压缩模块,用于对合并后的多个待压缩数据块进行压缩处理,得到经过合并压缩的数据集合。
  20. 如权利要求19所述的装置,其特征在于,所述多个待压缩数据块按序排列,所述多个待压缩数据块中任一待压缩数据块的编码分区包括:所述任一待压缩数据块,及所述多个待压缩数据块中排序在所述任一待压缩数据块之前的所有待压缩数据块。
  21. 如权利要求19或20所述的装置,其特征在于,所述压缩模块,具体用于:基于前向编码,对合并后的多个待压缩数据块进行压缩处理。
  22. 如权利要求19至21任一所述的装置,其特征在于,所述获取模块,具体用于:
    基于数据写入指令,获取多个待写入数据块;
    基于所述多个待写入数据块,获取所述多个待压缩数据块。
  23. The apparatus according to claim 22, wherein the compression module is further configured to store the data set on a storage medium.
  24. 如权利要求23所述的装置,其特征在于,所述压缩模块,还用于:基于所述数据集合在所述存储介质上的位置,更新所述存储介质的索引信息。
  25. 如权利要求24所述的装置,其特征在于,所述索引信息还指示所述数据集合经过合并压缩处理得到。
  26. 如权利要求19至25任一所述的装置,其特征在于,所述获取模块,具体用于:
    获取多个待筛选数据块;
    基于所述多个待筛选数据块中存在相似数据的待筛选数据块,得到所述多个待压缩数据块。
  27. 如权利要求26所述的装置,其特征在于,所述获取模块,具体用于:对存在相似数据的待筛选数据块执行重删处理,得到所述多个待压缩数据块。
  28. 如权利要求27所述的装置,其特征在于,经过重删处理被保留的待筛选数据块的索引信息还指示所述被保留的待筛选数据块经过重删处理得到。
  29. 如权利要求19至28任一所述的装置,其特征在于,所述待压缩数据块的索引信息还指示所述多个待压缩数据块的逻辑索引号。
  30. 一种数据处理装置,其特征在于,所述装置包括:
    第一获取模块,用于获取待解压缩的目标数据集合的索引信息,所述目标数据集合包括一个或多个数据块,所述目标数据集合基于合并压缩处理得到;
    第二获取模块,用于基于所述目标数据集合的索引信息,获取所述目标数据集合;
    解压缩模块,用于对所述目标数据集合进行解合并压缩处理,得到经过解合并压缩的目标数据集合。
  31. 如权利要求30所述的装置,其特征在于,所述解压缩模块,具体用于:当所述目标数据集合的索引信息指示所述目标数据集合经过合并压缩处理得到时,对所述目标数据集合进行解合并压缩处理,得到经过解合并压缩的目标数据集合。
  32. 如权利要求30或31所述的装置,其特征在于,所述目标数据集合存储在存储介质上,所述存储介质的主索引区域和在线索引区域均记载有索引信息,所述主索引区域记载有所述存储介质中存储的所有数据的第一索引信息,所述在线索引区域记载有第二索引信息,所述第一索引信息基于所述第二索引信息更新得到,所述第一获取模块,具体用于:优先从所述主索引区域记载的第一索引信息中,获取所述目标数据集合的索引信息。
  33. 如权利要求32所述的装置,其特征在于,当无法基于所述第一索引信息获取所述目标数据集合时,所述第一获取模块,具体还用于:从所述在线索引区域记载的第二索引信息中,获取所述目标数据集合的索引信息。
  34. 如权利要求30至33任一所述的装置,其特征在于,所述目标数据集合基于数据读取指令得到。
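  35. The apparatus according to claim 34, wherein the data read instruction instructs reading a target data block in the target data set;
    the first obtaining module is further configured to obtain index information of the target data block;
    the second obtaining module is further configured to obtain the target data block from the decompressed target data set based on the index information of the target data block;
    the second obtaining module is further configured to return the target data block in response to the data read instruction.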
  36. 如权利要求35所述的装置,其特征在于,所述目标数据块的索引信息指示所述目标数据块的逻辑索引号,所述第二获取模块,具体用于:基于所述逻辑索引号,从所述经过解合并压缩的目标数据集合中获取所述目标数据块。
  37. 一种计算设备集群,其特征在于,包括至少一个计算设备,每个计算设备包括处理器和存储器,所述至少一个计算设备的处理器用于执行所述至少一个计算设备的存储器中存储的指令,以使得所述计算设备集群执行如权利要求1至18任一项所述的方法。
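  38. A computer program product containing instructions, wherein, when the instructions are run by a computing device cluster, the computing device cluster is caused to perform the method according to any one of claims 1 to 18.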
  39. 一种计算机可读存储介质,其特征在于,包括计算机程序指令,当所述计算机程序指令由计算设备集群执行时,所述计算设备集群执行如权利要求1至18任一项所述的方法。
PCT/CN2023/104582 2022-11-21 2023-06-30 数据处理方法及装置 WO2024109066A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202211462945 2022-11-21
CN202211462945.6 2022-11-21
CN202310212957.1A CN118093532A (zh) 2022-11-21 2023-03-07 数据处理方法及装置
CN202310212957.1 2023-03-07

Publications (1)

Publication Number Publication Date
WO2024109066A1 true WO2024109066A1 (zh) 2024-05-30

Family

ID=91158039

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/104582 WO2024109066A1 (zh) 2022-11-21 2023-06-30 数据处理方法及装置

Country Status (2)

Country Link
CN (1) CN118093532A (zh)
WO (1) WO2024109066A1 (zh)

Also Published As

Publication number Publication date
CN118093532A (zh) 2024-05-28
