WO2023024459A1 - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
WO2023024459A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
compression
compression algorithm
compressed
storage
Prior art date
Application number
PCT/CN2022/077413
Other languages
English (en)
French (fr)
Inventor
王志兵
刘珍宝
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP22859818.1A (EP4354310A4)
Publication of WO2023024459A1
Priority to US18/412,995 (US20240154623A1)

Classifications

    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/60General implementation details not specific to a particular type of compression
    • H03M7/6064Selection of Compressor
    • H03M7/6082Selection strategies
    • H03M7/6088Selection strategies according to the data type
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/60General implementation details not specific to a particular type of compression
    • H03M7/6064Selection of Compressor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608Saving storage space on storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/0643Management of files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/0647Migration mechanisms
    • G06F3/0649Lifecycle management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0661Format or protocol conversion arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0683Plurality of storage devices
    • G06F3/0685Hybrid storage combining heterogeneous device types, e.g. hierarchical storage, hybrid arrays
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3002Conversion to or from differential modulation
    • H03M7/3004Digital delta-sigma modulation
    • H03M7/3015Structural details of digital delta-sigma modulators
    • H03M7/302Structural details of digital delta-sigma modulators characterised by the number of quantisers and their type and resolution
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3066Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction by means of a mask or a bit-map
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/60General implementation details not specific to a particular type of compression
    • H03M7/6017Methods or arrangements to increase the throughput
    • H03M7/6029Pipelining

Definitions

  • The present application relates to the field of data storage, and more specifically, to a data processing method and device.
  • Unstructured data refers to data whose structure is irregular or incomplete, that has no predefined data model, and that is inconvenient to represent with databases or two-dimensional logical tables.
  • The massive volumes of office documents, hypertext markup language (HTML) pages, images, audio, and video generated by current computer information systems are all unstructured data, which accounts for about 80% of the total data volume. Moreover, the amount of unstructured data continues to grow, approximately doubling every year.
  • Existing recompression schemes for unstructured data are generally limited to a few specific algorithms and require user participation, which in turn requires users to understand data compression. More importantly, current recompression schemes cannot fully satisfy users' compression requirements for the many types of unstructured data.
  • The present application provides a data processing method and device that can meet the recompression requirements of different types of unstructured data without requiring users to participate in the recompression process, so as to better save storage space.
  • In a first aspect, a data processing method is provided. The method is applied to a storage system that includes a storage device and a processing device, and the method is executed by the processing device.
  • The method includes: obtaining hierarchical storage features and data features of first data, where the hierarchical storage features include at least one of importance, access frequency, and retention time, and the data features include at least one of data type, data dimension, data size, and data content features; determining a first compression algorithm according to the hierarchical storage features and the data features; and compressing the first data according to the first compression algorithm to obtain compressed data.
  • In other words, a compression algorithm suitable for the first data is selected according to these features, and the first data is compressed using that algorithm.
  • This method can reasonably compress any type of data without user participation and can meet the recompression requirements of many types of unstructured data, so that data in the storage system is recompressed (or compressed) more efficiently.
  • In a possible implementation, determining the first compression algorithm according to the hierarchical storage features and the data features includes: determining compression algorithm performance parameters according to the hierarchical storage features; and determining the first compression algorithm according to the compression algorithm performance parameters and the data features.
  • In this way, the present application can better screen for a first compression algorithm suited to the first data and thereby meet the recompression requirements of many types of unstructured data, avoiding the shortcoming of existing recompression schemes that are tightly bound to specific compression algorithms.
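  • The two-step selection described above (hierarchical storage features to performance parameter, then performance parameter plus data features to algorithm) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the application: the class and function names, the thresholds, the "text"/"binary" labels, and the choice of the standard-library codecs zlib, bz2, and lzma as candidate algorithms. Only two of the listed hierarchical storage features are used.

```python
import bz2
import lzma
import zlib
from dataclasses import dataclass


@dataclass
class TierFeatures:
    """Hierarchical storage features (units are assumptions)."""
    accesses_per_day: float
    retention_days: int


@dataclass
class DataFeatures:
    """Data features; only type and size are used in this sketch."""
    data_type: str  # hypothetical labels: "text" or "binary"
    size: int       # bytes


def performance_target(tier: TierFeatures) -> str:
    """Step 1: map hierarchical storage features to a performance parameter."""
    # Assumed rule: rarely accessed, long-lived data can afford a slow codec,
    # so it is optimised for compression ratio; everything else for throughput.
    if tier.accesses_per_day < 1.0 and tier.retention_days > 180:
        return "ratio"
    return "throughput"


def select_algorithm(target: str, data: DataFeatures):
    """Step 2: pick a codec from the performance parameter and data features."""
    # Assumed rule table; returns a label and a compress function.
    if target == "throughput" or data.size < 4096:
        return "zlib-1", lambda raw: zlib.compress(raw, 1)
    if data.data_type == "text":
        return "bz2-9", lambda raw: bz2.compress(raw, 9)
    return "lzma", lzma.compress
```

  • A caller would chain the two steps, for example `select_algorithm(performance_target(tier), features)`, and then apply the returned function to the first data. A real system would presumably replace the fixed rule table with measured ratio and throughput figures for each candidate algorithm.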
  • In a possible implementation, before determining the first compression algorithm, the method further includes: determining, according to the hierarchical storage features, that the first data needs to be compressed.
  • The first data is thus compressed only when compression is needed, which avoids compressing data that requires no compression. The user does not have to decide whether to compress the first data, because the storage system makes that determination, which avoids user participation.
  • In a possible implementation, the first data is compressed data, and compressing the first data according to the first compression algorithm includes: decompressing the first data to obtain intermediate data, and storing a first operation parameter used during decompression; and compressing the intermediate data according to the first compression algorithm to obtain the compressed data.
  • In this way, the present application recompresses the first data without user participation and saves the corresponding operation parameter, so that when the user needs the first data, the compressed data can be restored to the first data, which avoids damage to the first data.
  • the method further includes: decompressing the compressed data to obtain intermediate data; and compressing the intermediate data according to the first operation parameter to obtain the first data.
  • the present application supports restoring the compressed data to the first data, thereby avoiding damage to the first data.
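  • The recompress-and-restore round trip described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the application's implementation: the first data is assumed to be a zlib stream, the "first operation parameter" is modelled as the zlib compression level and is passed in by the caller (a real system would have to recover or record it during decompression), and LZMA stands in for the first compression algorithm. The function names are hypothetical.

```python
import lzma
import zlib


def recompress(first_data: bytes, level: int):
    """Decompress zlib data, keep the operation parameter, recompress with LZMA."""
    intermediate = zlib.decompress(first_data)
    # Saved so that the original compressed bytes can be rebuilt later.
    op_param = {"codec": "zlib", "level": level}
    compressed = lzma.compress(intermediate)
    return compressed, op_param


def restore(compressed: bytes, op_param: dict) -> bytes:
    """Reverse recompress(): LZMA-decode, then re-encode with the saved parameter."""
    intermediate = lzma.decompress(compressed)
    return zlib.compress(intermediate, op_param["level"])
```

  • Restoring the exact original bytes relies on the original encoder being deterministic for the saved parameters. That holds here because the same zlib build produced both streams; for other formats, more parameters than a single level may need to be saved.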
  • the method further includes: decompressing the compressed data to obtain the first data.
  • the method further includes: storing the compressed data in a storage device.
  • the present application can reduce the occupied space of data storage.
  • the performance parameters of the compression algorithm include at least one of the following: compression ratio and throughput.
  • In a second aspect, a data processing device is provided, which includes a file processing module, a compression module, an algorithm selection module, and a storage tiering module. The storage tiering module is used to obtain the hierarchical storage features of first data, where the hierarchical storage features include at least one of importance, access frequency, and retention time. The file processing module is used to obtain the data features of the first data, where the data features include at least one of data type, data dimension, data size, and data content features. The algorithm selection module is used to determine a first compression algorithm according to the hierarchical storage features and the data features. The compression module is used to compress the first data according to the first compression algorithm to obtain compressed data.
  • The algorithm selection module is configured to determine compression algorithm performance parameters according to the hierarchical storage features, and to determine the first compression algorithm according to the compression algorithm performance parameters and the data features.
  • The storage tiering module is further configured to determine, according to the hierarchical storage features, that the first data needs to be compressed.
  • The first data is compressed data, and the compression module is used to decompress the first data to obtain intermediate data, save the first operation parameter used during decompression, and compress the intermediate data according to the first compression algorithm to obtain the compressed data.
  • The compression module is also used to decompress the compressed data to obtain the intermediate data, and to compress the intermediate data according to the first operation parameter to obtain the first data.
  • the compression module is further configured to decompress the compressed data to obtain the first data.
  • the storage device is used to store compressed data.
  • The compression algorithm performance parameters include at least one of the following: compression ratio and throughput.
  • In a third aspect, a computer-readable storage medium is provided, which stores instructions. When the instructions are run on a computer, the computer executes the data processing method described in the first aspect or any possible implementation of the first aspect.
  • In a fourth aspect, a computer program product is provided. When the computer program product runs on a computer, the computer executes the data processing method described in the first aspect or any possible implementation of the first aspect.
  • In a fifth aspect, a computer device is provided, which includes a processor and a memory. The memory is used to store computer program instructions, and the processor invokes and executes the computer program instructions in the memory to perform the data processing method described in the first aspect or any possible implementation of the first aspect.
  • FIG. 1 is a schematic diagram of an application scenario provided by this application.
  • FIG. 2 is a schematic diagram of an existing data processing method.
  • FIG. 3 is a schematic diagram of a data processing method provided by the present application.
  • FIG. 4 is a schematic diagram of a data processing device provided by the present application.
  • FIG. 5 is a schematic diagram of another data processing device provided by the present application.
  • Fig. 1 shows a schematic diagram of an application scenario provided by this application.
  • The application server 100 may be a physical machine or a virtual machine. Physical application servers include, but are not limited to, desktops, servers, laptops, and mobile devices.
  • The application server accesses data in the storage system 120 through the optical fiber switch 110.
  • The switch 110 is optional, however, and the application server 100 can also communicate with the storage system 120 directly through the network.
  • The optical fiber switch 110 can also be replaced with an Ethernet switch, an InfiniBand switch, an RDMA over Converged Ethernet (RoCE) switch, or the like.
  • the storage system 120 shown in FIG. 1 is a centralized storage system.
  • A centralized storage system is characterized by a unified entry point through which all data from external devices must pass; this entry point is the engine 121 of the centralized storage system.
  • The engine 121 is the core component of the centralized storage system, where many advanced functions of the storage system are implemented.
  • There are one or more controllers in the engine 121.
  • FIG. 1 uses an engine with two controllers as an example.
  • When controller 0 fails, controller 1 can take over the services of controller 0, and when controller 1 fails, controller 0 can take over the services of controller 1, which prevents a hardware failure from making the entire storage system 120 unavailable.
  • When four controllers are deployed in the engine 121, there is a mirroring channel between any two controllers, so any two controllers back each other up.
  • the engine 121 also includes a front-end interface 125 and a back-end interface 126 , wherein the front-end interface 125 is used to communicate with the application server 100 to provide storage services for the application server 100 .
  • the back-end interface 126 is used to communicate with the hard disk 134 to expand the capacity of the storage system. Through the back-end interface 126, the engine 121 can be connected with more hard disks 134, thereby forming a very large storage resource pool.
  • the controller 0 includes at least a processor 123 and a memory 124 .
  • The processor 123 is a central processing unit (CPU), used for processing data access requests from outside the storage system (from a server or other storage systems), and also for processing requests generated inside the storage system.
  • When the processor 123 receives a write data request sent by the application server 100 through the front-end interface 125, it temporarily saves the data in the write data request in the memory 124.
  • The processor 123 then sends the data stored in the memory 124 to the hard disk 134 for persistent storage through the back-end interface 126.
  • The memory 124 is internal memory that exchanges data directly with the processor. It can be read and written at any time at high speed, and serves as temporary data storage for the operating system or other running programs.
  • The memory includes at least two kinds of memory; for example, it can be either random access memory or read-only memory (ROM).
  • The random access memory is, for example, dynamic random access memory (DRAM) or storage class memory (SCM).
  • DRAM is a semiconductor memory that, like most random access memory (RAM), is a volatile memory device.
  • SCM is a composite storage technology that combines the characteristics of traditional storage devices and memory. It provides faster read and write speeds than hard disks, but is slower than DRAM and also cheaper than DRAM.
  • DRAM and SCM are only illustrative examples in this embodiment; the memory may also include other random access memories, such as static random access memory (SRAM).
  • The read-only memory may be, for example, programmable read-only memory (PROM) or erasable programmable read-only memory (EPROM).
  • The memory 124 can also be a dual in-line memory module (DIMM), that is, a module composed of dynamic random access memory (DRAM), or a solid-state disk (SSD).
  • Multiple memories 124, and memories 124 of different types, may be configured in controller 0.
  • This embodiment does not limit the quantity or type of the memory 124.
  • The memory 124 can be configured to have a power-protection function.
  • The power-protection function means that the data stored in the memory 124 is not lost when the system is powered off and then powered on again.
  • Memory with a power-protection function is called non-volatile memory.
  • a software program is stored in the internal memory 124, and the processor 123 runs the software program in the internal memory 124 to manage the hard disk.
  • hard disks are abstracted into storage resource pools, and then divided into LUNs for use by servers.
  • the LUN here is actually the hard disk seen on the server.
  • some centralized storage systems are also file servers themselves, which can provide shared file services for servers.
  • The hardware components and software structure of controller 1 (and other controllers not shown in FIG. 1) are similar to those of controller 0 and are not repeated here.
  • FIG. 1 shows a centralized storage system with a disk-control separation architecture.
  • In this system, the engine 121 may have no hard disk slots; the hard disks 134 are placed in the hard disk enclosure 130, and the back-end interface 126 communicates with the hard disk enclosure 130.
  • The back-end interface 126 exists in the engine 121 in the form of an adapter card, and one engine 121 can use two or more back-end interfaces 126 to connect multiple hard disk enclosures at the same time.
  • Alternatively, the adapter card can be integrated on the motherboard, in which case it can communicate with the processor 123 through the PCIe bus.
  • the storage system may include two or more engines 121 , and redundancy or load balancing is performed among the multiple engines 121 .
  • the hard disk enclosure 130 includes a control unit 131 and several hard disks 134 .
  • the control unit 131 can have various forms.
  • In one case, the hard disk enclosure 130 is a smart disk enclosure and, as shown in FIG. 1, the control unit 131 includes a CPU and a memory.
  • The CPU is used to perform operations such as address translation and reading and writing data.
  • The memory is used for temporarily storing data to be written to the hard disk 134, or data read from the hard disk 134 to be sent to the controller.
  • In another case, the control unit 131 is a programmable electronic component, such as a data processing unit (DPU).
  • A DPU has the generality and programmability of a CPU but is more specialized, operating efficiently on network packets, storage requests, or analytics requests.
  • DPUs are distinguished from CPUs by a greater degree of parallelism (the need to process a large number of requests).
  • The DPU here can also be replaced with a graphics processing unit (GPU), an embedded neural-network processing unit (NPU), or another processing chip.
  • The number of control units 131 can be one, or two or more.
  • When the hard disk enclosure 130 includes at least two control units 131, there may be an ownership relationship between the hard disks 134 and the control units 131.
  • If there is such an ownership relationship, each control unit can only access the hard disks that belong to it, which often involves forwarding read/write requests between the control units 131 and results in a relatively long data access path.
  • In addition, if storage space is insufficient and a new hard disk 134 is added to the hard disk enclosure 130, the ownership relationship between the hard disk 134 and the control unit 131 must be re-bound; the operation is complicated, and the storage space expands poorly. Therefore, in another implementation, the functions of the control unit 131 can be offloaded to the network card 104.
  • In other words, in this implementation the hard disk enclosure 130 has no control unit 131 inside; instead, the network card 104 performs data reading and writing, address translation, and other computing functions.
  • In this case, the network card 104 is an intelligent network card, which can contain a CPU and memory. In some application scenarios, the network card 104 may also have a persistent storage medium, such as persistent memory (PM), non-volatile random access memory (NVRAM), or phase change memory (PCM).
  • The CPU is used to perform operations such as address translation and reading and writing data.
  • The memory is used for temporarily storing data to be written to the hard disk 134, or data read from the hard disk 134 to be sent to the controller.
  • A DPU has the generality and programmability of a CPU but is more specialized, operating efficiently on network packets, storage requests, or analytics requests. DPUs are distinguished from CPUs by a greater degree of parallelism (the need to process a large number of requests).
  • The DPU here can also be replaced with a graphics processing unit (GPU), an embedded neural-network processing unit (NPU), or another processing chip.
  • the disk enclosure 130 may be a SAS disk enclosure, or an NVMe disk enclosure, an IP disk enclosure, or other types of disk enclosures.
  • the SAS hard disk enclosure adopts the SAS3.0 protocol, and each enclosure supports 25 SAS hard disks.
  • the engine 121 is connected to the hard disk enclosure 130 through an onboard SAS interface or a SAS interface module.
  • the NVMe disk enclosure is more like a complete computer system, and the NVMe disk is inserted into the NVMe disk enclosure. The NVMe disk enclosure is then connected to the engine 121 through the RDMA port.
  • FIG. 1 should be understood only as an example. The technical solution of the embodiments of the present application can also be applied to other types of storage systems, for example, a storage system with a disk-control integrated architecture, a storage system with a disk-control separation architecture, or a distributed storage system.
  • unstructured data refers to data whose data structure is irregular or incomplete, without a predefined data model, and inconvenient to use databases, two-dimensional logical tables, etc. to represent data.
  • hierarchical storage means that the storage system uses different storage methods to store the stored data in different performance storage devices according to the importance of the stored data, access frequency, retention time, capacity and other indicators.
  • The compression ratio refers to the ratio between the data sizes before and after compression, and is one of the main indicators used to evaluate the quality of a compression algorithm.
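  • As a small worked example of this definition, the Python sketch below computes the ratio for a sample buffer. It assumes the ratio is oriented as size before compression divided by size after compression, so larger values mean stronger compression; the function name and the use of zlib are illustrative choices, not taken from the application.

```python
import zlib


def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Size before compression divided by size after (assumed orientation)."""
    if compressed_size <= 0:
        raise ValueError("compressed size must be positive")
    return original_size / compressed_size


# Highly repetitive input, so the ratio is well above 1.
sample = b"abcabcabc" * 1000
ratio = compression_ratio(len(sample), len(zlib.compress(sample)))
```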
  • recompression refers to further compression of data stored in a fixed format. Generally speaking, it is necessary to decode and restore the data stored in a fixed format, and then use a new algorithm to recompress the data, thereby reducing the storage space.
  • Data compression refers to a technique that reduces the amount of data, without losing useful information, in order to improve transmission, storage, and processing efficiency, or that reorganizes data according to a certain method to reduce redundancy and storage space.
  • Different types of unstructured data usually adopt different storage formats (or encapsulation formats).
  • For example, office document data is usually stored as PDF or ZIP and compressed with methods such as DEFLATE and LZW; image data is usually stored as TIFF, PNG, or JPEG and compressed with methods such as DEFLATE, JPEG, or WEBP; and audio and video data is usually stored as MPEG, AVI, or RMVB and compressed with methods such as MPEG4, H264, and H265.
  • Hot data refers to online data that needs to be frequently accessed by computing nodes. For example, it can be data within half a year, and users often query it.
  • Cold data refers to offline data that is infrequently accessed, such as backups used for disaster recovery or data that must be retained for a period of time to comply with legal regulations, for example enterprise backup data, business and operation log data, bills, and statistical data.
  • Hot data and cold data are stored in different ways in the storage system.
  • Current storage systems widely use hierarchical storage technology to store data in tiers, which not only reduces the space occupied by non-important data on the first-level local disk but also improves the storage performance of the entire system.
  • The storage system further subdivides the data, storing data with a higher access frequency in high-performance storage devices and data with a lower access frequency in relatively low-performance storage devices, thereby reducing storage costs.
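  • The tier placement just described can be sketched as a simple threshold rule in Python. The class name, the 30-day access counter, the threshold value, and the tier labels are all assumptions made for illustration; an actual storage system would typically also weigh importance and retention time.

```python
from dataclasses import dataclass


@dataclass
class StoredObject:
    name: str
    accesses_last_30_days: int


def assign_tier(obj: StoredObject, hot_threshold: int = 30) -> str:
    """Return the tier for an object; the threshold is an assumed policy knob."""
    if obj.accesses_last_30_days >= hot_threshold:
        return "high-performance"
    return "low-cost"


objects = [StoredObject("orders.db", 900), StoredObject("backup-2021.tar", 0)]
placement = {o.name: assign_tier(o) for o in objects}
```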
  • most of the unstructured data belongs to the type of data with low access frequency.
  • the technical solution for re-compression of compressed data generally adopts a certain degree of restoration operation on the compressed data, and on this basis, re-compression is performed on the restored data.
  • FIG. 2 shows an existing data recompression method.
  • Figure 2(a) shows the flow of the re-compression method for JPEG image data, the re-compression method mainly realizes the re-compression of the compressed data by replacing the original encoding method, that is, as shown in Figure 2(a), the data Stream 1 is restored to the level of encoding method 1, then select encoding method 2 to re-encode the data, and then output data stream 2.
  • Figure 2(b) shows the flow of the recompression method for DEFLATE data, that is, as shown in Figure 2(b), data stream 1 is restored to the extent of the original input data, and then a new compression algorithm is selected for the original input data Do heavy compression, then output stream 2.
  • Figure 2(c) shows a conventional data recompression process. Specifically, after the user obtains the original data (uncompressed data), he will first select a storage format, such as TIFF, PDF, JPEG, MPEG, etc. Then select or specify a compression algorithm by the storage format to compress and store the data. If users want to further compress the data, they need to choose a lossy or lossless compression algorithm according to the importance of the data, and choose a matching storage format, and then store the data after recompression.
  • The present application provides a data processing method and device.
  • The present application can meet users' recompression requirements for different types of unstructured data without requiring users to take part in the recompression process, thereby helping users save storage space easily.
  • FIG. 3 shows a data processing method provided by the present application.
  • the method is applied to a storage system, and the storage system includes: a storage device and a processing device, and the method is executed by the processing device.
  • the specific content is shown in Figure 3.
  • The first stored data includes first data and first metadata.
  • The first metadata is information that describes the first data.
  • The first data is user data.
  • The first stored data is the form of data obtained after the storage system processes and stores the first data.
  • The first data may be compressed or uncompressed data. This is stated once here and is not repeated below.
  • the hierarchical storage feature includes at least one of the following features: importance, access frequency, and retention time.
  • the data feature includes at least one of the following features: data type, data dimension, data size, and data content feature.
  • the first data is used to represent user data with substantial content.
  • the storage system will determine the hierarchical storage characteristics of the first data based on information such as data importance, access frequency, and retention time.
  • the hierarchical storage feature can be used to screen a compression algorithm suitable for the first data.
  • the data dimension refers to information such as length, width, and height of the first data.
  • the data type refers to the type of the first data, including information such as integer type, floating point number type, or character type.
  • the data content feature refers to feature information possessed by the first data, and the data content feature is related to the feature analysis method used.
  • The method for analyzing the data content features of the first data includes at least one of the following: statistical analysis, principal component analysis (PCA), cluster analysis, hypothesis testing, and the like.
  • The data content features of the first data differ depending on the analysis method used.
  • With statistical analysis, the data content features of the first data mainly refer to the fitted data distribution, mean, variance, and so on.
  • With a method that measures correlation, the data content features of the first data mainly refer to the degree of correlation between the data.
  • With principal component analysis, the data content features of the first data mainly refer to the several most important dimensions of the data; other methods include hypothesis testing, cluster analysis, and so on.
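The statistical branch of this analysis can be illustrated with a short sketch. The mean and variance come from the text above; the Shannon entropy field and the function name `content_features` are assumptions added here only to show how a redundancy indicator might sit next to them.

```python
import math
import statistics
from collections import Counter


def content_features(values: list[float]) -> dict[str, float]:
    """Summarise a numeric payload with simple statistics (illustrative only)."""
    counts = Counter(values)
    total = len(values)
    # Shannon entropy in bits per symbol; a low value hints at high redundancy.
    # This metric is an assumption of the sketch, not named in the source.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {
        "mean": statistics.fmean(values),
        "variance": statistics.pvariance(values),
        "entropy_bits": entropy,
    }


features = content_features([1, 1, 1, 1, 2, 2, 3, 4])
print(features)
```

A selector could compare such features against the kinds of input each candidate compression algorithm handles well.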
  • By interfacing with the storage system (or hierarchical storage system), the device obtains the hierarchical storage feature that the storage system has configured for the first data, and can then use that feature to screen for a compression algorithm suitable for the first data.
  • The hierarchical storage feature configured by the storage system for the first data can be used not only to determine whether to recompress the first data but also to determine the degree of recompression. In other words, it can be used to determine the performance requirements that the recompression algorithm must meet, for example throughput and compression rate.
  • The device can weight the multiple features that make up the hierarchical storage feature of the first data to obtain a comprehensive score for the first data, or determine a comprehensive performance requirement parameter. Based on that score or parameter, the device judges whether the first data needs to be recompressed and determines the degree of recompression.
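A minimal sketch of such a weighted evaluation follows. The normalisation of each feature to the range 0 to 1, the weights, the threshold, and the mapping of the score onto levels 1 to 9 are all hypothetical choices made for illustration; the application does not specify them.

```python
from dataclasses import dataclass


@dataclass
class TierFeatures:
    importance: float        # 0.0 (unimportant) .. 1.0 (critical)
    access_frequency: float  # 0.0 (cold) .. 1.0 (hot)
    retention: float         # 0.0 (short-lived) .. 1.0 (kept for years)


# Hypothetical weights; a real system would tune or configure these.
WEIGHTS = {"importance": 0.2, "access_frequency": 0.5, "retention": 0.3}


def recompression_score(f: TierFeatures) -> float:
    """Higher score means a better candidate for aggressive recompression."""
    return (WEIGHTS["importance"] * (1.0 - f.importance)
            + WEIGHTS["access_frequency"] * (1.0 - f.access_frequency)
            + WEIGHTS["retention"] * f.retention)


def recompression_level(f: TierFeatures, threshold: float = 0.5) -> int:
    """Return 0 to skip recompression, otherwise an effort level from 1 to 9."""
    score = recompression_score(f)
    if score < threshold:
        return 0
    return max(1, min(9, round(score * 9)))


cold = TierFeatures(importance=0.1, access_frequency=0.0, retention=1.0)
hot = TierFeatures(importance=0.9, access_frequency=1.0, retention=0.1)
print(recompression_level(cold), recompression_level(hot))
```

Cold, long-retained data scores high and is recompressed hard; hot data falls below the threshold and is left alone.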
  • The device obtains the first metadata by restoring the first stored data to a certain extent, and obtains the first data based on the first metadata. Part of the data feature information of the first data comes from the first metadata, and the data content feature comes from analyzing the content of the first data. The device therefore obtains the data feature information of the first data from both the first metadata and the first data.
  • the apparatus obtains the first metadata by parsing the storage format of the first stored data, and obtains the first data based on the first metadata.
  • The first data generally exists in memory in the form of a one-dimensional 8-byte array. The first metadata is therefore needed to describe the first data, including information such as its data type, size, and dimensions, and the compression algorithm used.
  • The storage format of the first stored data can be understood as storing the first metadata and the first data at specified locations in a prescribed manner to obtain the first stored data. In other words, the storage format is the way in which the first metadata and the first data are stored.
  • The tagged image file format (TIFF) mainly includes three parts: the image file header (IFH), the image file directory (IFD), and the directory entry (DE).
  • The IFH records version information, the byte order, and the offset of the first IFD.
  • The IFD records the number of DEs and the offset address of each DE. Each DE records the related first metadata information and information such as a pointer to the specific first data.
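Format parsing of this kind can be sketched for the TIFF case. The example below follows the classic TIFF layout (an 8-byte header with a byte-order mark, the magic number 42, and the offset of the first IFD, followed by 12-byte directory entries); the function name `parse_tiff_header` and the returned dictionary shape are assumptions of this sketch, and it reads only the first IFD.

```python
import struct


def parse_tiff_header(buf: bytes) -> dict:
    """Parse the TIFF image file header and the entries of the first IFD."""
    order = {b"II": "<", b"MM": ">"}.get(buf[:2])
    if order is None:
        raise ValueError("missing TIFF byte-order mark")
    magic, ifd_offset = struct.unpack(order + "HI", buf[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    (count,) = struct.unpack_from(order + "H", buf, ifd_offset)
    entries = []
    for i in range(count):
        # Each directory entry is 12 bytes: tag, field type, value count,
        # then either the value itself or an offset to it (kept raw here).
        tag, ftype, n, value = struct.unpack_from(
            order + "HHII", buf, ifd_offset + 2 + 12 * i)
        entries.append({"tag": tag, "type": ftype, "count": n,
                        "value_or_offset": value})
    return {"byte_order": order, "first_ifd": ifd_offset, "entries": entries}


# A tiny hand-built little-endian TIFF fragment with one entry
# (tag 256 = ImageWidth, type 4 = LONG, value 640).
sample = (b"II" + struct.pack("<HI", 42, 8) + struct.pack("<H", 1)
          + struct.pack("<HHII", 256, 4, 1, 640) + struct.pack("<I", 0))
print(parse_tiff_header(sample))
```

The directory entries are where a parser would find metadata such as dimensions and the compression scheme, plus the offsets of the image data.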
  • the device determines a first compression algorithm based on hierarchical storage features and data features, including:
  • the compression algorithm performance parameters include at least one of the following: compression rate and throughput.
  • The device uses the hierarchical storage feature fed back by the storage system to determine the compression algorithm performance parameters, for example the compression throughput requirement, and in turn identifies at least one compression algorithm that satisfies those parameters.
  • the device determines a first compression algorithm according to the compression algorithm performance parameters and data characteristics, and the first compression algorithm needs to satisfy the compression algorithm performance parameters and data characteristics at the same time.
  • the device determines the first compression algorithm based on hierarchical storage characteristics and data characteristics, including:
  • #A Determine a first compression algorithm set based on hierarchical storage features, where the first compression algorithm set includes at least one compression algorithm
  • #B Determine a second compression algorithm set based on data characteristics, where the second compression algorithm set includes at least one compression algorithm
  • #C Determine the first compression algorithm, and the first compression algorithm matches the hierarchical storage feature and the data feature at the same time.
  • The device takes parameters fed back by the storage system, such as data importance, access frequency, and retention time, and quantifies them as requirements on the minimum compression and decompression throughput or the compression rate of the compression algorithm, so as to screen out the first compression algorithm set. The device also uses the data features to screen out the second compression algorithm set, and takes the intersection of the two sets; the intersection includes at least one compression algorithm.
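The set-based screening in steps #A to #C can be sketched as follows. The algorithm names, their ratio and throughput figures, and the helper `select_candidates` are all hypothetical placeholders chosen for illustration, not measurements or part of the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlgoProfile:
    name: str
    ratio: float             # typical original size / compressed size
    throughput_mb_s: float   # typical compression throughput
    data_types: frozenset    # data types the algorithm supports


# Placeholder catalogue; every number here is made up for the example.
CATALOG = [
    AlgoProfile("fast_general", 2.0, 400.0, frozenset({"bytes", "text"})),
    AlgoProfile("dense_general", 4.0, 20.0, frozenset({"bytes", "text"})),
    AlgoProfile("float_array", 3.5, 150.0, frozenset({"float"})),
    AlgoProfile("image_lossless", 3.0, 60.0, frozenset({"image"})),
]


def select_candidates(min_ratio: float, min_throughput: float,
                      data_type: str) -> set[str]:
    # Step #A: algorithms meeting the requirements derived from tiering.
    by_tier = {a.name for a in CATALOG
               if a.ratio >= min_ratio and a.throughput_mb_s >= min_throughput}
    # Step #B: algorithms that support the data's type.
    by_data = {a.name for a in CATALOG if data_type in a.data_types}
    # Step #C: only algorithms in both sets remain candidates.
    return by_tier & by_data


print(select_candidates(min_ratio=3.0, min_throughput=10.0, data_type="text"))
```

Keeping the two screens separate makes it easy to evaluate them in either order, which matches the statement that the sequence is not limited.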
  • the device determines a first selection criterion of the compression algorithm based on the hierarchical storage feature, for example, the first selection criterion is the compression throughput and reconstruction throughput conditions required to be achieved by the compression algorithm; based on the data feature, A second selection criterion of the compression algorithm is determined, for example, the second selection criterion is that the compression algorithm needs to support the data type, dimension and so on.
  • the device selects one compression algorithm from multiple compression algorithms meeting the first selection standard and the second selection standard to compress the first stored data.
  • the device can sort multiple compression algorithms that meet the above-mentioned hierarchical storage characteristics and data characteristics, and select the best compression algorithm therefrom.
  • the sorting index may include at least one of the following: compression ratio, compression throughput, reconstruction throughput, and feature matching degree.
  • The device can perform data feature matching on the intersection of the first compression algorithm set and the second compression algorithm set to determine a third compression algorithm set, and then sort the third compression algorithm set.
  • The sorting index may be any of the index parameters listed above, and the best compression algorithm is selected from the sorted set.
  • The device scores each compression algorithm according to how well it matches the feature analysis result, and keeps the batch of compression algorithms whose scores exceed a certain threshold.
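The scoring and sorting step might look like the sketch below. It assumes each index (compression ratio, compression throughput, reconstruction throughput, feature matching degree) has already been normalised to the range 0 to 1; the weights, threshold, candidate names, and metric values are hypothetical.

```python
def rank_algorithms(candidates: dict[str, dict[str, float]],
                    weights: dict[str, float],
                    threshold: float) -> list[str]:
    """Score each candidate, drop those below the threshold, best first."""
    scored = []
    for name, metrics in candidates.items():
        score = sum(weights[key] * metrics[key] for key in weights)
        if score >= threshold:
            scored.append((score, name))
    return [name for _, name in sorted(scored, reverse=True)]


# Illustrative, made-up normalised metrics for three candidates.
candidates = {
    "algo_a": {"ratio": 0.9, "compress_tp": 0.2, "decompress_tp": 0.4, "match": 0.8},
    "algo_b": {"ratio": 0.6, "compress_tp": 0.7, "decompress_tp": 0.8, "match": 0.9},
    "algo_c": {"ratio": 0.3, "compress_tp": 0.9, "decompress_tp": 0.9, "match": 0.2},
}
weights = {"ratio": 0.4, "compress_tp": 0.2, "decompress_tp": 0.2, "match": 0.2}

ranking = rank_algorithms(candidates, weights, threshold=0.6)
print(ranking)
```

The first element of the ranking would serve as the first compression algorithm; changing the weights shifts the balance between space saving and speed.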
  • The device may determine the first compression algorithm based on the hierarchical storage feature and the data feature at the same time, may first use the hierarchical storage feature and then the data feature, or may first use the data feature and then the hierarchical storage feature. The order is not limited in this embodiment of the present application.
  • By determining, from the hierarchical storage feature, the performance requirements that the compression algorithm must meet, and combining them with the data feature, the present application can better select a first compression algorithm suitable for the first data. This meets the recompression requirements of many types of unstructured data and avoids the drawbacks of existing recompression schemes that are strongly tied to specific compression algorithms.
  • S330 Compress the first data according to the first compression algorithm to obtain compressed data.
  • the device compresses the first data according to a first compression algorithm to obtain compressed data, including:
  • #D decompress the first data, obtain intermediate data, and save the first operating parameters during decompression
  • #E Compress the intermediate data according to the first compression algorithm to obtain compressed data.
  • The intermediate data may be the original uncompressed data or the data obtained after decoding the first data, depending on how far the first data is decompressed. This application does not limit the specific form of the intermediate data.
  • the first operating parameter may refer to storage format information of the first data, first metadata information, and related parameters of the original compression algorithm.
  • the first operating parameter can be understood as the information involved in the parsing process of the first data, which helps to restore the intermediate data to the first data.
  • To restore the compressed data to the first data, the compressed data is first decompressed, and the intermediate data obtained from decompression is then compressed using the first operating parameter recorded during parsing to obtain the first data.
  • In this way, the present application recompresses the first data without user participation and saves the corresponding operating parameters, so that when the user needs the first data, the compressed data can be restored to the first data, thereby avoiding damage to the first data.
  • the device decompresses the compressed data to obtain intermediate data; according to the first operating parameter, compresses the intermediate data and then stores it in an original storage format to obtain the first data.
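The recompress-and-restore cycle can be sketched with Python's standard codecs. In this illustration the first data is a zlib stream, the first compression algorithm is LZMA, and the "first operating parameter" is reduced to a zlib level found by trial recompression; these choices, and the function names `recompress` and `restore`, are assumptions of the sketch rather than the application's method.

```python
import lzma
import zlib


def recompress(first_data: bytes) -> tuple[bytes, dict]:
    """Decompress a zlib stream, record how to rebuild it, recompress with LZMA."""
    intermediate = zlib.decompress(first_data)
    # Search for a zlib level that reproduces the original stream byte for byte.
    level = next((lv for lv in range(10)
                  if zlib.compress(intermediate, lv) == first_data), None)
    if level is None:
        # Without reproducible parameters, exact restoration is not possible.
        raise ValueError("cannot reproduce the original stream; keep it as is")
    params = {"codec": "zlib", "level": level}
    return lzma.compress(intermediate), params


def restore(compressed: bytes, params: dict) -> bytes:
    """Rebuild the original zlib stream from the recompressed data."""
    intermediate = lzma.decompress(compressed)
    return zlib.compress(intermediate, params["level"])


payload = b"tiered storage " * 2000
first_data = zlib.compress(payload, 6)
packed, params = recompress(first_data)
assert restore(packed, params) == first_data
```

Exact reproduction depends on using the same compressor implementation on both sides, which is one reason recording the operating parameters matters.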
  • the present application supports restoring the compressed data to the first data, thereby avoiding damage to the first data.
  • the device decompresses the compressed data to obtain the first data.
  • the decompression process may be understood as directly decompressing the compressed data to obtain the first data, or it may be that the process of decompressing the compressed data requires multiple steps to obtain the first data.
  • the device determines that the first data needs to be compressed based on hierarchical storage characteristics.
  • By evaluating the hierarchical storage feature, the first data is compressed only when compression is determined to be needed, which avoids compressing data that requires no compression. The user therefore does not have to choose whether to compress the first data; the device makes that judgment for the user, so user participation is avoided.
  • the device stores the compressed data in a storage device.
  • the present application can reduce the occupied space of data storage.
  • In this way, the present application can judge and compress the first data on its own, so as to meet the recompression requirements of many types of unstructured data and help users easily save storage space.
  • the storage system includes: a storage device #410 and a processing device #420.
  • the processing device #420 is configured to execute the data processing methods described in the above method embodiments.
  • The processing device #420 can be used to obtain the hierarchical storage feature and the data feature of the first data, to determine the first compression algorithm according to the hierarchical storage feature and the data feature, and to compress the first data according to the first compression algorithm to obtain compressed data.
  • FIG. 5 is a schematic diagram of a data processing device #500 provided by the present application. The content shown in FIG. 5 is a further description of the processing device #420 in FIG. 4 .
  • The processing device #420 includes a storage tiering module #510, a file processing module #520, an algorithm selection module #530, and a compression module #540.
  • The processing device #420 also includes a main control module #550.
  • The storage tiering module #510 is configured to acquire the hierarchical storage feature of the first data. Specifically, the storage tiering module #510 interfaces with the storage system to acquire the tiered storage information that the storage system holds for the first data.
  • the file processing module #520 is configured to acquire data feature information of the first data, such as data type, data dimension, data size, and data content feature. It should be understood that in this embodiment of the application, the file processing module #520 has the following advantages:
  • a unified parsing and reconstruction interface is used to perform parsing work on the first stored data.
  • The main control module #550 of the device encapsulates a unified format parsing and reconstruction interface for use by each format parsing algorithm, so that all algorithms share the same parsing and restoration interface. Because the interface is unified, the main control module #550 can call a specified algorithm using only the name of the format parsing algorithm, which decouples the various format parsing algorithms from the main control module #550.
  • Each format parsing algorithm can be compiled into an independent dynamic library for the main control module to load and call.
  • Since data is generally compressed before storage, when the device recompresses the first data it needs to decide, according to the original compression method and the candidate recompression algorithm, whether to decompress the first data: 1) The recompression algorithm operates on data already compressed by a specified algorithm. In this case, the application only needs to parse the format of the first stored data to obtain the metadata and the first data. 2) The recompression algorithm targets uncompressed data. In this case, after the application obtains the first data by parsing the format of the first stored data, the main control module #550 of the device loads the corresponding original compression algorithm to decompress the first data and obtain intermediate data.
  • the compressed data can be accurately restored to the first data.
  • Compression algorithms usually have multiple optional parameters, such as compression level, prediction mode, etc.
  • The compressed results usually differ under different parameter combinations, but these parameters are generally not recorded during storage. Because the compressor parameters chosen at restoration may differ from the original ones, the restored user data may differ from the original user data. Although the content stored by the user is not lost, a user without knowledge of data compression and storage may believe that recompression caused data loss.
  • This application therefore analyzes and records the parameter selection of the compressor, so that user data can be restored accurately from the recorded parameters when the format is restored.
  • the algorithm selection module #530 is configured to determine a first compression algorithm based on hierarchical storage features and data features. The specific process is as described in the foregoing method embodiments, and will not be repeated here.
  • the algorithm selection module #530 can be used in the following scenarios:
  • the algorithm selection module #530 of the present application supports selection of a compression algorithm that best matches the overall data characteristics and performance requirements of the first data.
  • each data block is compressed.
  • the compression algorithm used by each data block is the same.
  • the characteristics of different parts of a complete data may be quite different. For example, in an image of a person, the person is usually located in the middle area of the image, while the surrounding images are generally quite different from the person.
  • the algorithm selection module #530 of the present application can select different compression algorithms for different data blocks, thereby improving the overall compression rate.
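Per-block selection can be sketched with three standard-library codecs. Note the swapped-in technique: this sketch picks the codec for each block by trial compression and keeping the smallest output, whereas the application describes selecting by data features and performance requirements. The block size and the helper names are assumptions.

```python
import bz2
import lzma
import zlib

# Name -> (compress, decompress) for each candidate codec.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def compress_blocks(data: bytes, block_size: int = 64 * 1024) -> list[tuple[str, bytes]]:
    """Compress each block with whichever codec yields the smallest output."""
    blocks = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        name, payload = min(
            ((n, compress(block)) for n, (compress, _) in CODECS.items()),
            key=lambda item: len(item[1]),
        )
        # Store the codec name with the payload so the block can be decoded.
        blocks.append((name, payload))
    return blocks


def decompress_blocks(blocks: list[tuple[str, bytes]]) -> bytes:
    """Decode every block with the codec recorded for it and join the result."""
    return b"".join(CODECS[name][1](payload) for name, payload in blocks)


data = bytes(range(256)) * 400 + b"abc" * 30000
blocks = compress_blocks(data)
assert decompress_blocks(blocks) == data
print([name for name, _ in blocks])
```

Recording the chosen codec next to each block is what lets different regions of one object use different algorithms while remaining decodable.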
  • the compression module #540 is configured to use the first compression algorithm to re-compress the first data to obtain compressed data.
  • the compression module #540 is responsible for compressing and decompressing data.
  • The compression module #540 has the following features:
  • This application supports having the main control module #550 encapsulate a unified compression and decompression interface for use by each compressor, so that all compressors share the same interface. Because the interface is unified, the main control module #550 can call a designated compressor using only its name, which decouples the compressors from the main control module #550.
  • Since each compressor is independent, upgrading or maintaining one compressor does not affect the others, and each compressor can evolve independently, which achieves decoupling within the compression module.
  • the main control module #550 is responsible for command line parameter parsing, multi-thread scheduling, file processing and corresponding algorithm loading and calling of the compression algorithm module.
  • the main control module #550 has the following characteristics:
  • The main control module #550 provides a unified base class that contains the common parameters and the methods to be implemented, such as parameter parsing, compression, and decompression.
  • Each compression algorithm overrides the inherited methods according to its own characteristics, which decouples the main control module #550 from the compression module and allows different compression algorithms to be called in a uniform way.
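The base-class arrangement can be sketched as below. The class names, the `registry` dictionary, the `level` parameter, and `get_compressor` are hypothetical; the sketch only shows how a common base class lets the controller look up a compressor by name and call it through one interface.

```python
import bz2
import zlib
from abc import ABC, abstractmethod


class Compressor(ABC):
    """Unified base class: common parameters plus methods to override."""

    registry: dict[str, type["Compressor"]] = {}
    name = ""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Each subclass registers itself under its name.
        Compressor.registry[cls.name] = cls

    def __init__(self, level: int = 6):
        self.level = level

    @abstractmethod
    def compress(self, data: bytes) -> bytes: ...

    @abstractmethod
    def decompress(self, data: bytes) -> bytes: ...


class ZlibCompressor(Compressor):
    name = "zlib"

    def compress(self, data: bytes) -> bytes:
        return zlib.compress(data, self.level)

    def decompress(self, data: bytes) -> bytes:
        return zlib.decompress(data)


class Bz2Compressor(Compressor):
    name = "bz2"

    def compress(self, data: bytes) -> bytes:
        # bz2 accepts levels 1..9 only, so clamp the shared parameter.
        return bz2.compress(data, max(1, min(9, self.level)))

    def decompress(self, data: bytes) -> bytes:
        return bz2.decompress(data)


def get_compressor(name: str, **params) -> Compressor:
    """The controller needs only the compressor's name to obtain it."""
    return Compressor.registry[name](**params)


codec = get_compressor("bz2", level=5)
assert codec.decompress(codec.compress(b"x" * 1000)) == b"x" * 1000
```

Adding a new compressor means adding one subclass; the calling code does not change.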
  • this application can add a thread pool to the main control module #550, and then assign tasks that can be processed in parallel, including format parsing and restoration, data compression and decompression, to each thread for processing, thereby improving the overall re-compression efficiency.
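A thread pool for independent compression tasks can be sketched with the standard library. The helper `compress_many` and the worker count are assumptions. In CPython, the zlib module releases the global interpreter lock while compressing, so worker threads can overlap on large inputs.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_many(payloads: list[bytes], workers: int = 4) -> list[bytes]:
    """Compress independent payloads in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(zlib.compress, payloads))


payloads = [bytes([i]) * 50000 for i in range(8)]
compressed = compress_many(payloads)
assert [zlib.decompress(c) for c in compressed] == payloads
```

The same pattern applies to format parsing and restoration, provided the tasks do not depend on each other.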
  • the main control module #550 needs to preliminarily screen the compressors according to the type of data and the characteristics of each compressor.
  • the type of data can generally be obtained through relevant information such as metadata, but it is relatively difficult to obtain relevant indicators of the compressor, such as compression ratio, compression throughput, and decompression throughput.
  • An embodiment of the present application provides a computer-readable storage medium, which stores instructions, and when the instructions are run on a computer, the computer is made to execute the data processing method executed by the processing device in the foregoing method embodiments.
  • An embodiment of the present application provides a computer program product, which, when running on a computer, causes the computer to execute the data processing method executed by the processing device in the foregoing method embodiments.
  • a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer.
  • an application running on a computing device and the computing device can be components.
  • One or more components can reside within a process and/or thread of execution and a component can be localized on one computer and/or distributed between two or more computers.
  • these components can execute from various computer readable media having various data structures stored thereon.
  • A component may communicate through local and/or remote processes, for example based on a signal having one or more data packets (e.g., data from two components interacting with another component in a local system, in a distributed system, and/or across a network such as the Internet, where the interaction with other systems takes place via the signal).
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • Multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • If the functions described above are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • The technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application.
  • The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

Abstract

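The present application provides a data processing method and apparatus. The method is applied to a storage system that includes a storage device and a processing device, and is executed by the processing device. The method includes: obtaining a hierarchical storage feature and a data feature of first data, where the hierarchical storage feature includes at least one of importance, access frequency, and retention time, and the data feature includes at least one of data type, data dimension, data size, and data content feature; determining a first compression algorithm according to the hierarchical storage feature and the data feature; and compressing the first data according to the first compression algorithm to obtain compressed data. With this method and apparatus, the present application can meet the recompression requirements of different types of unstructured data without requiring the user to take part in the recompression process, thereby better achieving the goal of saving storage space.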

Description

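A data processing method and apparatus
This application claims priority to Chinese patent application No. 202110978326.1, entitled "A data processing method and apparatus" and filed with the China Patent Office on August 25, 2021, which is incorporated herein by reference in its entirety.
Technical Field
The present application relates to the field of data storage, and more specifically, to a data processing method and apparatus.
Background
Unstructured data is data whose structure is irregular or incomplete, that has no predefined data model, and that is inconvenient to represent with databases, two-dimensional logical tables, and the like. The massive volume of office documents, hypertext markup language (HTML), images, audio, and video produced by today's computerized information systems is all unstructured data. Such data accounts for about 80% of the total data volume, and the amount of unstructured data keeps growing, roughly doubling every year.
Although users choose to compress data before storing it, further recompressing stored data that is accessed infrequently can save additional storage space and lower storage costs.
At present, recompression schemes for unstructured data are generally limited to a few specific algorithms and require user participation, which demands that users understand data compression. More importantly, current recompression schemes cannot adequately meet users' compression needs for the many types of unstructured data.
Therefore, there is an urgent need for a data processing method that applies to multiple types of unstructured data and requires no user participation.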
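Summary
The present application provides a data processing method and apparatus that can meet the recompression requirements of different types of unstructured data without requiring the user to take part in the recompression process, thereby better achieving the goal of saving storage space.
According to a first aspect, a data processing method is provided. The method is applied to a storage system that includes a storage device and a processing device, and is executed by the processing device. The method includes: obtaining a hierarchical storage feature and a data feature of first data, where the hierarchical storage feature includes at least one of importance, access frequency, and retention time, and the data feature includes at least one of data type, data dimension, data size, and data content feature; determining a first compression algorithm according to the hierarchical storage feature and the data feature; and compressing the first data according to the first compression algorithm to obtain compressed data.
Based on the hierarchical storage feature and the data feature of the first data, the embodiments of the present application screen out a compression algorithm suitable for the first data and use it to compress the first data. Without user participation, the method can compress data of any type in a reasonable way and can meet the recompression requirements of many types of unstructured data, thereby recompressing (or compressing) data in the storage system more efficiently.
With reference to the first aspect, in some implementations of the first aspect, determining the first compression algorithm according to the hierarchical storage feature and the data feature includes: determining a compression algorithm performance parameter according to the hierarchical storage feature; and determining the first compression algorithm according to the compression algorithm performance parameter and the data feature.
By determining, from the hierarchical storage feature, the performance requirements that the compression algorithm must meet, and combining them with the data feature, the present application can better select a first compression algorithm suitable for the first data. This meets the recompression requirements of many types of unstructured data and avoids the drawbacks of existing recompression schemes that are strongly tied to specific compression algorithms.
With reference to the first aspect, in some implementations of the first aspect, before the first compression algorithm is determined, the method further includes: determining, according to the hierarchical storage feature, that the first data needs to be compressed.
By evaluating the hierarchical storage feature of the first data, the present application compresses the first data only when compression is determined to be needed, which avoids compressing data that requires no compression. The user therefore does not have to choose whether to compress the first data; the judgment is made on the user's behalf, so user participation is avoided.
With reference to the first aspect, in some implementations of the first aspect, the first data is compressed data, and compressing the first data according to the first compression algorithm includes: decompressing the first data to obtain intermediate data, and saving a first operating parameter used during decompression; and compressing the intermediate data according to the first compression algorithm to obtain the compressed data.
With the above technical solution, the present application recompresses the first data without user participation and saves the corresponding operating parameter, so that when the user needs the first data, the compressed data can be restored to the first data, thereby avoiding damage to the first data.
With reference to the first aspect, in some implementations of the first aspect, the method further includes: decompressing the compressed data to obtain the intermediate data; and compressing the intermediate data according to the first operating parameter to obtain the first data.
With the above technical solution, the present application supports restoring the compressed data to the first data, thereby avoiding damage to the first data.
With reference to the first aspect, in some implementations of the first aspect, the method further includes: decompressing the compressed data to obtain the first data.
With reference to the first aspect, in some implementations of the first aspect, the method further includes: storing the compressed data in the storage device.
With the above technical solution, the present application can reduce the space occupied by stored data.
With reference to the first aspect, in some implementations of the first aspect, the compression algorithm performance parameter includes at least one of the following: compression rate and throughput.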
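According to a second aspect, a data processing apparatus is provided. The apparatus includes a file processing module, a compression module, an algorithm selection module, and a storage tiering module. The storage tiering module is configured to obtain a hierarchical storage feature of first data, where the hierarchical storage feature includes at least one of importance, access frequency, and retention time. The file processing module is configured to obtain a data feature of the first data, where the data feature includes at least one of data type, data dimension, data size, and data content feature. The algorithm selection module is configured to determine a first compression algorithm according to the hierarchical storage feature and the data feature. The compression module is configured to compress the first data according to the first compression algorithm to obtain compressed data.
With reference to the second aspect, in some implementations of the second aspect, the algorithm selection module is configured to determine a compression algorithm performance parameter according to the hierarchical storage feature, and to determine the first compression algorithm according to the compression algorithm performance parameter and the data feature.
With reference to the second aspect, in some implementations of the second aspect, the storage tiering module is further configured to determine, according to the hierarchical storage feature, that the first data needs to be compressed.
With reference to the second aspect, in some implementations of the second aspect, the first data is compressed data, and the compression module is configured to decompress the first data to obtain intermediate data and save a first operating parameter used during decompression, and to compress the intermediate data according to the first compression algorithm to obtain the compressed data.
With reference to the second aspect, in some implementations of the second aspect, the compression module is further configured to decompress the compressed data to obtain the intermediate data, and to compress the intermediate data according to the first operating parameter to obtain the first data.
With reference to the second aspect, in some implementations of the second aspect, the compression module is further configured to decompress the compressed data to obtain the first data.
With reference to the second aspect, in some implementations of the second aspect, the storage device is configured to store the compressed data.
With reference to the second aspect, in some implementations of the second aspect, the compression algorithm performance parameter includes at least one of the following: compression rate and throughput.
According to a third aspect, a computer-readable storage medium is provided. The medium stores instructions that, when run on a computer, cause the computer to execute the data processing method according to the first aspect or any possible implementation of the first aspect.
According to a fourth aspect, a computer program product is provided. When the computer program product runs on a computer, it causes the computer to execute the data processing method according to the first aspect or any possible implementation of the first aspect.
According to a fifth aspect, a computer device is provided. The computer device includes a processor and a memory. The memory is configured to store computer program instructions, and the processor invokes the computer program instructions in the memory to execute the data processing method according to the first aspect or any possible implementation of the first aspect.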
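Brief Description of Drawings
FIG. 1 is a schematic diagram of an application scenario provided by the present application.
FIG. 2 is a schematic diagram of an existing data processing method.
FIG. 3 is a schematic diagram of a data processing method provided by the present application.
FIG. 4 is a schematic diagram of a data processing apparatus provided by the present application.
FIG. 5 is a schematic diagram of another data processing apparatus provided by the present application.
Detailed Description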
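The technical solutions of the present application are described below with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of an application scenario provided by the present application. In the scenario shown in FIG. 1, users access data through application programs. The computers that run these applications are called "application servers". The application server 100 may be a physical machine or a virtual machine. Physical application servers include, but are not limited to, desktop computers, servers, laptops, and mobile devices. The application server accesses the storage system through the Fibre Channel switch 110 to read and write data. However, the switch 110 is optional; the application server 100 may also communicate with the storage system 120 directly over the network. Alternatively, the Fibre Channel switch 110 may be replaced with an Ethernet switch, an InfiniBand switch, a RoCE (RDMA over converged ethernet) switch, or the like.
The storage system 120 shown in FIG. 1 is a centralized storage system. A centralized storage system is characterized by a single unified entry point through which all data from external devices must pass; this entry point is the engine 121 of the centralized storage system. The engine 121 is the core component of the centralized storage system, and many advanced functions of the storage system are implemented in it.
As shown in FIG. 1, the engine 121 contains one or more controllers; FIG. 1 uses an engine with two controllers as an example. A mirror channel exists between controller 0 and controller 1, so after controller 0 writes a piece of data into its memory 124, it can send a copy of the data to controller 1 through the mirror channel, and controller 1 stores the copy in its own local memory 124. Controller 0 and controller 1 thus back each other up: when controller 0 fails, controller 1 can take over its services, and when controller 1 fails, controller 0 can take over its services, which prevents a hardware failure from making the whole storage system 120 unavailable. When four controllers are deployed in the engine 121, a mirror channel exists between any two controllers, so any two controllers back each other up.
The engine 121 also contains a front-end interface 125 and a back-end interface 126. The front-end interface 125 is used to communicate with the application server 100 and thereby provide storage services to it. The back-end interface 126 is used to communicate with the hard disks 134 to expand the capacity of the storage system. Through the back-end interface 126, the engine 121 can connect more hard disks 134 and thus form a very large storage resource pool.
In terms of hardware, as shown in FIG. 1, controller 0 includes at least a processor 123 and a memory 124. The processor 123 is a central processing unit (CPU) that processes data access requests from outside the storage system (from servers or other storage systems) as well as requests generated inside the storage system. For example, when the processor 123 receives write requests sent by the application server 100 through the front-end port 125, it temporarily stores the data in those requests in the memory 124. When the total amount of data in the memory 124 reaches a certain threshold, the processor 123 sends the data stored in the memory 124 through the back-end port to the hard disks 134 for persistent storage.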
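The memory 124 is internal memory that exchanges data directly with the processor. It can read and write data at any time, is fast, and serves as temporary data storage for the operating system or other running programs. The memory includes at least two kinds of memory; for example, the memory may be random access memory or read-only memory (ROM). For example, the random access memory is dynamic random access memory (DRAM) or storage class memory (SCM). DRAM is a semiconductor memory and, like most random access memory (RAM), is a volatile memory device. SCM is a hybrid storage technology that combines the characteristics of traditional storage devices and memory; it provides faster reads and writes than a hard disk, but is slower to access than DRAM and cheaper than DRAM. DRAM and SCM are only examples in this embodiment; the memory may also include other random access memory, such as static random access memory (SRAM). The read-only memory may be, for example, programmable read-only memory (PROM) or erasable programmable read-only memory (EPROM). In addition, the memory 124 may be a dual in-line memory module (DIMM), that is, a module composed of dynamic random access memory (DRAM), or may be a solid state disk (SSD). In practice, controller 0 may be configured with multiple memories 124 and with memories 124 of different types. This embodiment does not limit the number or type of the memory 113. The memory 124 may also be configured to have a power-protection function, which means that the data stored in the memory 124 is not lost when the system loses power and is powered on again. Memory with a power-protection function is called non-volatile memory.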
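The memory 124 stores software programs, and the processor 123 manages the hard disks by running the software programs in the memory 124, for example by abstracting the hard disks into a storage resource pool and then dividing the pool into LUNs for servers to use. A LUN here is in effect the hard disk that a server sees. Some centralized storage systems are themselves also file servers and can provide shared file services to servers.
The hardware components and software structure of controller 1 (and of other controllers not shown in FIG. 1) are similar to those of controller 0 and are not described again here.
FIG. 1 shows a centralized storage system in which disks and controllers are separated. In this system, the engine 121 may have no hard disk slots; the hard disks 134 are placed in the disk enclosure 130, and the back-end interface 126 communicates with the disk enclosure 130. The back-end interface 126 exists in the engine 121 in the form of an adapter card, and one engine 121 can use two or more back-end interfaces 126 at the same time to connect multiple disk enclosures. Alternatively, the adapter card may be integrated on the motherboard, in which case the adapter card can communicate with the processor 112 through the PCIe bus.
Note that FIG. 1 shows only one engine 121. In practice, however, the storage system may contain two or more engines 121, with redundancy or load balancing among them.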
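The disk enclosure 130 includes a control unit 131 and several hard disks 134. The control unit 131 can take several forms. In one case, the disk enclosure 130 is a smart disk enclosure and, as shown in FIG. 1, the control unit 131 includes a CPU and memory. The CPU performs operations such as address translation and reading and writing data. The memory temporarily stores data to be written to the hard disks 134, or data read from the hard disks 134 to be sent to the controller. In another case, the control unit 131 is a programmable electronic component, such as a data processing unit (DPU). A DPU has the generality and programmability of a CPU but is more specialized and can run efficiently on network packets, storage requests, or analysis requests. A DPU is distinguished from a CPU by a greater degree of parallelism (it needs to process a large number of requests). Optionally, the DPU here may be replaced with a processing chip such as a graphics processing unit (GPU) or an embedded neural-network processing unit (NPU). Typically, there may be one control unit 131, or two or more. When the disk enclosure 130 contains at least two control units 131, an ownership relationship may exist between the hard disks 134 and the control units 131. If such a relationship exists, each control unit can access only the hard disks that belong to it, which often involves forwarding read/write requests between control units 131 and lengthens the data access path. In addition, if storage space is insufficient, adding new hard disks 134 to the disk enclosure 130 requires rebinding the ownership relationship between the hard disks 134 and the control units 131; the operation is complicated, so the storage space scales poorly. In another implementation, therefore, the functions of the control unit 131 can be offloaded to the network interface card 104. In other words, in this implementation the disk enclosure 130 has no control unit 131 inside; instead, the network interface card 104 performs data reads and writes, address translation, and other computing functions. In this case, the network interface card 104 is a smart NIC. It may contain a CPU and memory. In some application scenarios, the network interface card 104 may also have a persistent memory medium, such as persistent memory (PM), non-volatile random access memory (NVRAM), or phase change memory (PCM). The CPU performs operations such as address translation and reading and writing data. The memory temporarily stores data to be written to the hard disks 134, or data read from the hard disks 134 to be sent to the controller. It may also be a programmable electronic component, such as a data processing unit (DPU). A DPU has the generality and programmability of a CPU but is more specialized and can run efficiently on network packets, storage requests, or analysis requests. A DPU is distinguished from a CPU by a greater degree of parallelism (it needs to process a large number of requests). Optionally, the DPU here may be replaced with a processing chip such as a graphics processing unit (GPU) or an embedded neural-network processing unit (NPU). No ownership relationship exists between the network interface card 104 and the hard disks 134 in the disk enclosure 130; the network interface card 104 can access any hard disk 134 in the disk enclosure 130, so adding hard disks when storage space is insufficient is more convenient.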
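Depending on the type of communication protocol between engine 121 and disk enclosure 130, disk enclosure 130 may be a SAS disk enclosure, an NVMe disk enclosure, an IP disk enclosure, or another type of disk enclosure. A SAS disk enclosure uses the SAS 3.0 protocol, and each enclosure supports 25 SAS hard disks. Engine 121 connects to disk enclosure 130 through an onboard SAS interface or a SAS interface module. An NVMe disk enclosure is more like a complete computer system: NVMe hard disks are inserted in the enclosure, and the enclosure connects to engine 121 through an RDMA port.
The application scenario shown in FIG. 1 is only an example. The technical solutions of the embodiments of this application also apply to other types of storage systems, for example storage systems with an integrated disk-and-controller architecture, storage systems with a separated disk-and-controller architecture, and distributed storage systems.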
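To make the technical solutions of this application easier to understand, the concepts and related technologies involved are briefly described below.
First, unstructured data is data whose structure is irregular or incomplete, that has no predefined data model, and that is not conveniently represented using a database or two-dimensional logical tables.
Second, tiered storage means that a storage system stores data on storage devices of different performance, using different storage methods, according to indicators such as the importance, access frequency, retention time, and capacity of the stored data.
Third, the compression ratio is the ratio of the data size before compression to the data size after compression, and it is one of the main indicators for evaluating a compression algorithm.
Fourth, recompression is further compression of data that is stored in a fixed format. In general, the data stored in a fixed format must first be decoded and restored, and is then compressed again with a new algorithm, thereby reducing the storage space.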
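Users usually compress data before storing it to save storage space. Data compression is a technique that, without losing useful information, reduces the amount of data to improve the efficiency of transmission, storage, and processing, or reorganizes the data according to some method to reduce redundancy and storage space.
In general, different types of unstructured data use different storage formats (or container formats). For example, office document data is usually stored as PDF or ZIP and compressed with methods such as DEFLATE or LZW; image data is usually stored as TIFF, PNG, or JPEG and compressed with methods such as DEFLATE, JPEG, or WEBP; audio and video data is usually stored as MPEG, AVI, or RMVB and compressed with methods such as MPEG4, H264, or H265.
Data can be classified as hot or cold according to how frequently it is accessed over a period of time. Hot data is online data that compute nodes need to access frequently, for example data from the last six months that users often query. Cold data is offline data that is accessed infrequently, such as backups for disaster recovery or data that must be retained for a period to comply with legal requirements, for example enterprise backup data, business and operation logs, and call detail records and statistics.
Hot data and cold data are stored differently in a storage system. Current storage systems widely use tiered storage, which not only reduces the space that non-critical data occupies on primary local disks but also improves the storage performance of the whole system. A storage system also subdivides data further, storing more frequently accessed data on high-performance storage devices and less frequently accessed data on lower-performance storage devices, thereby reducing storage cost. Note that most unstructured data is infrequently accessed.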
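For such infrequently accessed data, an algorithm with a high compression ratio can generally be chosen to save storage space. In practice, however, users usually compress data before storing it, whereas existing compression algorithms mainly target uncompressed data. Designing a recompression algorithm for already-compressed data is difficult and yields limited gains; the recompressed data may even end up larger.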
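The limited gain from compressing already-compressed data can be observed with a small experiment. The following is only an illustrative sketch: the sample data is synthetic, and zlib (DEFLATE) from the Python standard library stands in for whatever compressor the user originally applied.

```python
import random
import zlib


def compression_ratio(before: bytes, after: bytes) -> float:
    """Size before compression divided by size after compression."""
    return len(before) / len(after)


# Synthetic, text-like sample data (an assumption made only for this demo).
rng = random.Random(0)
words = [b"tier", b"cold", b"hot", b"block", b"index", b"chunk", b"meta", b"data"]
raw = b" ".join(rng.choice(words) for _ in range(20000))

once = zlib.compress(raw, 9)    # first pass works on uncompressed data
twice = zlib.compress(once, 9)  # second pass works on already-compressed data

first_ratio = compression_ratio(raw, once)
second_ratio = compression_ratio(once, twice)
print(f"first pass: {first_ratio:.2f}, second pass: {second_ratio:.2f}")
```

The second pass sees input that already looks close to random, so its ratio stays near 1 and can fall slightly below 1 because of container overhead.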
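Current technical solutions for recompressing already-compressed data generally restore the compressed data to some degree and then recompress the restored data.
FIG. 2 shows existing data recompression methods. FIG. 2(a) shows the flow for recompressing JPEG image data. This method recompresses already-compressed data mainly by replacing the original encoding method: as shown in FIG. 2(a), data stream 1 is restored to the level of encoding method 1, the data is re-encoded with encoding method 2, and data stream 2 is output. FIG. 2(b) shows the flow for recompressing DEFLATE data: as shown in FIG. 2(b), data stream 1 is restored all the way to the original input data, a new compression algorithm recompresses the original input data, and data stream 2 is output. FIG. 2(c) shows a conventional recompression process. After obtaining the original (uncompressed) data, the user first chooses a storage format, such as TIFF, PDF, JPEG, or MPEG, and the data is then compressed and stored with a compression algorithm that is either chosen by the user or specified by the storage format. If the user wants to compress the data further, the user must choose a lossy or lossless compression algorithm according to the importance of the data, choose a matching storage format, and recompress the data before storing it again.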
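Current recompression frameworks for unstructured data have the following characteristics:
1) They are tied to a specific compression algorithm, for example recompression of JPEG image data or recompression of DEFLATE-compressed data.
2) The user generally has to take part in the recompression process, and data may be lost because the user lacks the necessary expertise or makes a mistake.
3) They are independent of the storage system.
Existing recompression methods therefore cannot meet the need to recompress different types of unstructured data. In addition, the user still has to decide whether to recompress data, which data to recompress, and which recompression approach and algorithm to use. Most users lack knowledge of data compression and storage, so this poses a considerable challenge for them.
In view of the above technical problems, this application provides a data processing method and apparatus. By providing a recompression framework for unstructured data, this application can meet users' needs to recompress different types of unstructured data without requiring users to take part in the recompression process, helping users save storage space easily.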
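FIG. 3 shows a data processing method provided in this application. The method is applied to a storage system that includes a storage apparatus and a processing apparatus, and the method is performed by the processing apparatus. Details are shown in FIG. 3.
Note that in the embodiments of this application, first stored data includes first data and first metadata. The first metadata describes information about the first data, the first data is user data with substantive content, and the first stored data is the form of data obtained after the storage system processes and stores the first data. The first data may be compressed or uncompressed. This is stated once here and is not repeated below.
S310: Obtain the tiered storage features and data features of the first data.
The tiered storage features include at least one of the following: importance, access frequency, and retention time. The data features include at least one of the following: data type, data dimensions, data size, and data content features.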
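The first data denotes user data with substantive content. When a user stores the first data in the storage system, the storage system determines the tiered storage features of the first data based on information such as data importance, access frequency, and retention time. The tiered storage features can be used to filter the compression algorithms suitable for the first data.
Data dimensions refer to information such as the length, width, and height of the first data. Data type refers to the type of the first data, such as integer, floating-point, or character. Data content features are characteristics of the first data itself, and they depend on the feature analysis method used.
For example, the methods for analysing the data content features of the first data include at least one of the following: statistical analysis, principal component analysis (PCA), cluster analysis, and hypothesis testing.
The data content features of the first data differ depending on the analysis method. For example, with statistical analysis, the data content features are mainly the fitted data distribution, mean, variance, and so on; with covariance analysis, they are mainly the degree of correlation between different dimensions; with PCA, they are mainly the most important dimensions of the data. Other methods include hypothesis testing and cluster analysis.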
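The apparatus interfaces with the storage system (or tiered storage system) to obtain the tiered storage features that the storage system has configured for the first data, and can then use those features to filter the compression algorithms suitable for the first data.
Specifically, because the apparatus interfaces directly with the storage system, the whole recompression process is transparent to upper-layer applications and users, who do not need to perform any additional operations. This reduces the burden on users.
The tiered storage features that the storage system configures for the first data can be used not only to decide whether the first data needs to be recompressed, but also to decide how aggressively to recompress it, that is, to determine the performance requirements, such as throughput and compression ratio, that the recompression algorithm must meet.
Note that when the tiered storage features configured for the first data involve multiple features, for example access frequency, importance, and retention time, the apparatus can weight these features to obtain a composite score for the first data, or to determine a composite performance requirement parameter. Based on the composite score or performance requirement parameter, the apparatus decides whether the first data needs to be recompressed and how aggressively.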
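The weighting step can be sketched as follows. The feature scales, the weights, the thresholds, and the direction in which each feature pushes the score are all hypothetical choices made for illustration; the description does not fix any of them.

```python
from dataclasses import dataclass


@dataclass
class TieringFeatures:
    importance: float        # 0.0 (unimportant) .. 1.0 (critical)
    access_frequency: float  # 0.0 (never read) .. 1.0 (read constantly)
    retention: float         # 0.0 (short-lived) .. 1.0 (kept for years)


# Hypothetical weights; a real system would tune or learn these.
WEIGHTS = {"importance": 0.2, "access_frequency": 0.6, "retention": 0.2}


def composite_score(f: TieringFeatures) -> float:
    """Higher score means the data is colder and a better recompression target."""
    return (WEIGHTS["importance"] * (1.0 - f.importance)
            + WEIGHTS["access_frequency"] * (1.0 - f.access_frequency)
            + WEIGHTS["retention"] * f.retention)


def recompression_level(f: TieringFeatures) -> str:
    """Map the composite score to a decision (thresholds are assumptions)."""
    score = composite_score(f)
    if score < 0.4:
        return "none"        # hot data: leave it alone
    if score < 0.7:
        return "balanced"    # moderate ratio, higher throughput
    return "max_ratio"       # cold data: favour compression ratio


archive = TieringFeatures(importance=0.2, access_frequency=0.05, retention=0.9)
hot = TieringFeatures(importance=0.9, access_frequency=0.95, retention=0.1)
print(recompression_level(archive), recompression_level(hot))
```

A single score keeps the later selection logic simple; a system that needs separate throughput and ratio targets could return a pair of requirement values instead.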
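The apparatus restores the first stored data to some degree to obtain the first metadata, and obtains the first data based on the first metadata. From the first metadata it obtains some of the data features of the first data, and by analysing the content of the first data it obtains the data content features. In this way, the apparatus obtains the data features of the first data from the first metadata and the first data. For example, the apparatus parses the storage format of the first stored data to obtain the first metadata, and obtains the first data based on the first metadata.
The first data generally appears in memory as a one-dimensional byte array, so the first metadata is needed to describe information about the first data, such as its data type, size, and dimensions and the compression algorithm used.
The storage format of the first stored data can be understood as storing the first metadata and the first data in a prescribed way at specified locations to obtain the first stored data. In other words, the storage format is the way the first metadata and the first data are stored. For example, the tagged image file format (TIFF) mainly contains three parts: the image file header (IFH), the image file directory (IFD), and directory entries (DE). The IFH records version information, byte order, and the offset of the first IFD; the IFD records the number of DEs and the offset address of each DE; each DE records related first metadata and information such as a pointer to the actual first data.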
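As an illustration of format parsing, the sketch below reads the TIFF image file header and the directory entries of the first IFD. It follows the classic TIFF layout (2-byte byte-order mark, 2-byte magic number 42, 4-byte offset of the first IFD, then 12-byte directory entries); the function names and the synthetic sample buffer are assumptions made for this example, and no pixel data is decoded.

```python
import struct


def parse_tiff_header(buf: bytes):
    """Return (endian, first_ifd_offset) from the 8-byte image file header."""
    order = buf[:2]
    if order == b"II":
        endian = "<"   # little-endian
    elif order == b"MM":
        endian = ">"   # big-endian
    else:
        raise ValueError("not a TIFF byte-order mark")
    magic, first_ifd = struct.unpack_from(endian + "HI", buf, 2)
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    return endian, first_ifd


def parse_ifd(buf: bytes, endian: str, offset: int):
    """Return the directory entries of the IFD that starts at `offset`."""
    (count,) = struct.unpack_from(endian + "H", buf, offset)
    entries = []
    for i in range(count):
        # Each entry is 12 bytes: tag, field type, value count, value or offset.
        tag, field_type, n_values, value = struct.unpack_from(
            endian + "HHII", buf, offset + 2 + 12 * i)
        entries.append({"tag": tag, "type": field_type,
                        "count": n_values, "value_or_offset": value})
    return entries


# Tiny synthetic header plus one IFD with two entries (no pixel data).
sample = (b"II" + struct.pack("<HI", 42, 8)
          + struct.pack("<H", 2)
          + struct.pack("<HHII", 256, 4, 1, 640)   # tag 256: ImageWidth = 640
          + struct.pack("<HHII", 259, 3, 1, 5)     # tag 259: Compression = 5 (LZW)
          + struct.pack("<I", 0))                  # offset 0: no further IFD

endian, first_ifd = parse_tiff_header(sample)
entries = parse_ifd(sample, endian, first_ifd)
```

Entries such as the Compression tag are the kind of first metadata that tells the apparatus which original compression algorithm was applied to the first data.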
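S320: Determine a first compression algorithm according to the tiered storage features and the data features.
Specifically, the apparatus determines the first compression algorithm based on the tiered storage features and the data features as follows:
1- Determine compression algorithm performance parameters according to the tiered storage features.
2- Determine the first compression algorithm according to the compression algorithm performance parameters and the data features.
Optionally, the compression algorithm performance parameters include at least one of the following: compression ratio and throughput.
Specifically, the apparatus uses the tiered storage features reported by the storage system to determine the compression algorithm performance parameters, that is, the requirements that a compression algorithm suitable for the first data must meet, such as the minimum compression and decompression throughput. It thereby identifies at least one compression algorithm that satisfies those performance parameters. The apparatus then determines the first compression algorithm according to the performance parameters and the data features; the first compression algorithm must satisfy both.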
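In one possible implementation, the apparatus determines the first compression algorithm based on the tiered storage features and the data features as follows:
#A: Determine a first compression algorithm set based on the tiered storage features; the first compression algorithm set includes at least one compression algorithm.
#B: Determine a second compression algorithm set based on the data features; the second compression algorithm set includes at least one compression algorithm.
#C: Determine the first compression algorithm, which matches both the tiered storage features and the data features.
The first compression algorithm set and the second compression algorithm set intersect, and the first compression algorithm belongs to the intersection.
Specifically, the apparatus takes the parameters reported by the storage system, such as data importance, access frequency, and retention time, and quantifies them into requirements on the minimum compression and decompression throughput or the compression ratio of a compression algorithm, thereby filtering out the first compression algorithm set. The apparatus also uses the data features to filter out the second compression algorithm set, and then determines the intersection of the two sets; the intersection includes at least one compression algorithm.
For example, the apparatus determines a first selection criterion based on the tiered storage features, such as the compression throughput and reconstruction throughput that the algorithm must reach, and a second selection criterion based on the data features, such as the data type and dimensions that the algorithm must support. From the compression algorithms that meet both criteria, the apparatus selects one to compress the first stored data.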
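In one possible implementation, the apparatus can rank the compression algorithms that match both the tiered storage features and the data features and pick the best one. The ranking criteria may include at least one of the following: compression ratio, compression throughput, reconstruction throughput, and degree of feature match.
In one possible implementation, the apparatus can perform data feature matching on the intersection of the first and second compression algorithm sets to determine a third compression algorithm set, rank the third set using the criteria listed above, and select the best compression algorithm from it. For example, the apparatus scores each compression algorithm according to how well it matches the feature analysis results and keeps the algorithms whose scores exceed a threshold.
Note that the apparatus may determine the first compression algorithm based on the tiered storage features and the data features at the same time, or first on the tiered storage features and then on the data features, or first on the data features and then on the tiered storage features. The embodiments of this application do not limit the order.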
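The set-based selection described above (filter by performance requirements, filter by data features, intersect, then rank) can be sketched as follows. The catalog entries and every number in them are placeholders chosen for illustration, not measurements, and the ranking key is one arbitrary choice among the criteria listed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompressorProfile:
    name: str
    ratio: float            # expected compression ratio
    decompress_mbps: float  # expected reconstruction throughput
    data_kinds: frozenset   # kinds of data the compressor supports


# Hypothetical catalog; all figures are placeholders.
CATALOG = [
    CompressorProfile("lzw", 1.5, 300.0, frozenset({"generic", "image"})),
    CompressorProfile("webp-lossless", 3.0, 150.0, frozenset({"image"})),
    CompressorProfile("jpeg-ls", 5.0, 60.0, frozenset({"image"})),
    CompressorProfile("h264", 50.0, 100.0, frozenset({"video"})),
]


def select(catalog, min_ratio, min_decompress_mbps, data_kind):
    """Return the best profile in the intersection of both sets, or None."""
    # First set: algorithms meeting the performance parameters derived
    # from the tiered storage features.
    first_set = {c for c in catalog
                 if c.ratio >= min_ratio
                 and c.decompress_mbps >= min_decompress_mbps}
    # Second set: algorithms that support the data features.
    second_set = {c for c in catalog if data_kind in c.data_kinds}
    candidates = first_set & second_set
    if not candidates:
        return None
    # Rank by compression ratio first, reconstruction throughput second.
    return max(candidates, key=lambda c: (c.ratio, c.decompress_mbps))


cold_choice = select(CATALOG, min_ratio=2.0, min_decompress_mbps=50.0, data_kind="image")
warm_choice = select(CATALOG, min_ratio=2.0, min_decompress_mbps=100.0, data_kind="image")
```

Because the two filters are independent, they can run in either order or at the same time, which matches the statement that the order is not limited.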
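By determining, from the tiered storage features, the performance requirements that a compression algorithm must meet, and combining them with the data features, this application can better select a first compression algorithm suited to the first data. This satisfies the recompression needs of many types of unstructured data and avoids the drawbacks of existing recompression schemes that are tightly coupled to a particular compression algorithm.
S330: Compress the first data according to the first compression algorithm to obtain compressed data.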
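If the first data is compressed data, compressing the first data according to the first compression algorithm to obtain the compressed data includes:
#D: Decompress the first data to obtain intermediate data, and save the first operation parameters used during decompression.
#E: Compress the intermediate data according to the first compression algorithm to obtain the compressed data.
The intermediate data may be the original uncompressed data or data obtained by decoding the first data, depending on how far the first data is decompressed; this application does not limit the specific form of the intermediate data. The first operation parameters may be the storage format information of the first data, the first metadata, and parameters of the original compression algorithm. They can be understood as the information involved in parsing the first data; after the first data has been decompressed, this information makes it possible to restore the intermediate data to the first data.
For example, to be able to restore the first data, the apparatus needs to record the storage format chosen by the user, whether the first data was compressed and, if so, which compression algorithm was chosen; when decompressing the first data, it also needs to evaluate and record the parameters of that compression algorithm. To restore the compressed data to the first data, the apparatus first decompresses the compressed data and then compresses the resulting intermediate data using the first operation parameters recorded during parsing, obtaining the first data.
With the above technical solution, this application recompresses the first data without user involvement and saves the corresponding operation parameters, so that when the user needs the first data, the compressed data can be restored to the first data, avoiding damage to the first data.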
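In one possible implementation, the apparatus decompresses the compressed data to obtain the intermediate data, compresses the intermediate data according to the first operation parameters, and stores the result in the original storage format to obtain the first data. In this way, this application supports restoring the compressed data to the first data and avoids damaging the first data.
In one possible implementation, the apparatus decompresses the compressed data to obtain the first data.
This decompression may obtain the first data directly from the compressed data, or it may take several steps to obtain the first data.
In one possible implementation, before determining the first compression algorithm, the apparatus determines, based on the tiered storage features, that the first data needs to be compressed.
By evaluating the tiered storage features of the first data, this application compresses the first data only when compression is needed, and avoids compressing it when it is not. The user does not have to decide whether the first data should be compressed; the apparatus makes that decision on the user's behalf.
In one possible implementation, the apparatus stores the compressed data in the storage apparatus. This reduces the space occupied by stored data.
By obtaining the tiered storage features and data features of the first data, determining a suitable compression algorithm from them, and compressing the first data with that algorithm, this application can decide on and perform compression of the first data autonomously, without user involvement. This satisfies the recompression needs of many types of unstructured data and helps users save storage space easily.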
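The storage system #400 provided in this application is described below with reference to FIG. 4. Details are shown in FIG. 4.
The storage system includes storage apparatus #410 and processing apparatus #420. Processing apparatus #420 is configured to perform the data processing method described in the foregoing method embodiments.
As an example, processing apparatus #420 can be used to obtain the tiered storage features and data features of the first data; to determine the first compression algorithm according to the tiered storage features and the data features; and to compress the first data according to the first compression algorithm to obtain compressed data.
The above is only an example; for details, refer to the foregoing method embodiments.
FIG. 5 is a schematic diagram of a data processing apparatus #500 provided in this application. FIG. 5 further describes processing apparatus #420 of FIG. 4.
Processing apparatus #420 includes storage tiering module #510, file processing module #520, algorithm selection module #530, and compression module #540.
Optionally, processing apparatus #420 further includes main control module #550.
The functions of each module are described further below.
Storage tiering module #510 is configured to obtain the tiered storage features of the first data. Specifically, storage tiering module #510 interfaces with the storage system to obtain the storage system's tiering information for the first data.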
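File processing module #520 is configured to obtain the data features of the first data, for example data type, data dimensions, data size, and data content features. In the embodiments of this application, file processing module #520 has the following advantages:
1) It parses the first stored data through unified parsing and reconstruction interfaces.
Main control module #550 wraps unified format parsing and reconstruction interfaces for all format parsing algorithms to use, so that every algorithm uses the same parsing and restoration interfaces. Because the interfaces are unified, main control module #550 can invoke a specified algorithm by its name alone, which decouples the format parsing algorithms from main control module #550.
2) Each format parsing algorithm is compiled as an independent dynamic library.
Because the format parsing algorithms are decoupled from the main control module, this application can compile each of them as an independent dynamic library for the main control module to load and invoke.
3) It parses the first stored data to different degrees.
Because data has usually been compressed before it is stored, when the apparatus recompresses the first data it must decide, according to the original compression method and the candidate recompression algorithms, whether to decompress the first data: (a) The recompression algorithm can recompress data that was compressed by a specified algorithm. In this case, this application only needs to parse the format of the first stored data to obtain the metadata and the first data. (b) The recompression algorithm targets uncompressed data. In this case, after the first data is obtained by parsing the format of the first stored data, main control module #550 loads the corresponding original compression algorithm to decompress the first data and obtain the intermediate data.
4) It can restore the compressed data to the first data exactly.
A compression algorithm usually has several optional parameters, such as compression level and prediction mode. Different parameter combinations usually produce different outputs, but these parameters are generally not recorded when the data is stored. If different compressor parameters are chosen, the restored user data may differ from the initial user data. Although the data the user stored has not been lost, a user who lacks knowledge of data compression and storage may believe that recompression caused data loss. To address this, if the user compressed the data before storing it, this application analyses and records the compressor's parameter choices when decompressing the user data, so that during format restoration the user data can be restored exactly from the recorded parameters.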
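The idea of recording compressor parameters so that the original stream can be rebuilt exactly can be sketched with the Python standard library. In this sketch zlib plays the role of the user's original compressor and LZMA plays the role of the first compression algorithm; both choices, and the brute-force level search, are assumptions made for illustration rather than the method of this application.

```python
import lzma
import zlib


def detect_zlib_level(stream: bytes, payload: bytes):
    """Find a zlib level that reproduces `stream` byte for byte, if any."""
    for level in range(10):
        if zlib.compress(payload, level) == stream:
            return level
    return None


def recompress(first_data: bytes):
    """Take a zlib stream; return (LZMA stream, recorded operation parameters)."""
    intermediate = zlib.decompress(first_data)
    level = detect_zlib_level(first_data, intermediate)
    if level is None:
        # Without reproducible parameters an exact restore is impossible,
        # so the caller should keep the original stream instead.
        raise ValueError("original stream cannot be reproduced exactly")
    return lzma.compress(intermediate), {"codec": "zlib", "level": level}


def restore(recompressed: bytes, params: dict) -> bytes:
    """Rebuild the original zlib stream from the recompressed data."""
    intermediate = lzma.decompress(recompressed)
    return zlib.compress(intermediate, params["level"])


payload = b"remote sensing tile " * 500 + bytes(range(256))
first_data = zlib.compress(payload, 3)   # what the user originally stored
packed, params = recompress(first_data)
restored = restore(packed, params)
```

The detected level may differ from the one the user actually chose when several levels happen to produce identical output; what matters for the restore is only that the recorded level reproduces the stored bytes.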
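Algorithm selection module #530 is configured to determine the first compression algorithm based on the tiered storage features and the data features. The process is as described in the foregoing method embodiments and is not repeated here.
In the embodiments of this application, algorithm selection module #530 can be used in the following scenarios:
1) Adaptive selection of compression algorithms for different types of unstructured data.
Unstructured data comes in many kinds, and the data features of different kinds differ considerably. Apart from some general-purpose compression algorithms that may support different data types, most compression algorithms can be applied only to specific types of data, for example WEBP for image data and H264 for video data. This application can therefore select a suitable compression algorithm for each type of unstructured data.
2) Adaptive selection of compression algorithms for unstructured data of the same type.
For unstructured data of the same type, both general-purpose compression algorithms and algorithms specific to that category can usually be used, but their compression ratios may differ greatly. For example, for a remote sensing image under lossless compression, the general-purpose algorithm LZW may achieve a compression ratio of 1.5, the general image algorithm WEBP may achieve 3.0, and JPEG-LS may reach 5.0 or higher. Algorithm selection module #530 therefore supports selecting the compression algorithm that best matches the overall data features of the first data and the performance requirements.
3) Adaptive selection of compression algorithms for the individual parts of a large piece of data after it is split.
Large data is usually divided into several independent blocks for compression, with each block compressed separately; in that case every block uses the same compression algorithm. In fact, different parts of one piece of data may have quite different characteristics. In a portrait, for example, the person is usually in the middle of the image, and the surrounding area generally differs considerably from the person. By splitting the data in advance, algorithm selection module #530 can choose different compression algorithms for different blocks, improving the overall compression ratio.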
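Per-block selection can be sketched with three standard-library codecs. Trying every codec on every block and keeping the smallest output is a deliberately simple stand-in for feature-based selection; the one-byte tags and the block size are assumptions made for this example.

```python
import bz2
import lzma
import zlib

# Tag -> (compress, decompress). The tags are arbitrary labels for this sketch.
CODECS = {
    b"Z": (zlib.compress, zlib.decompress),
    b"B": (bz2.compress, bz2.decompress),
    b"X": (lzma.compress, lzma.decompress),
}


def compress_blocks(data: bytes, block_size: int = 64 * 1024):
    """Split `data` into blocks and keep the smallest encoding of each block."""
    out = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        tag, packed = min(
            ((t, enc(block)) for t, (enc, _) in CODECS.items()),
            key=lambda pair: len(pair[1]),
        )
        out.append((tag, packed))   # the tag records which codec won
    return out


def decompress_blocks(blocks) -> bytes:
    """Decode each block with the codec named by its tag and reassemble."""
    return b"".join(CODECS[tag][1](packed) for tag, packed in blocks)


data = b"abc" * 50000 + bytes(range(256)) * 500
blocks = compress_blocks(data)
```

Storing the tag next to each block is what makes mixed-codec data decodable later; exhaustive trial costs one compression pass per codec per block, which a feature-based predictor would avoid.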
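Compression module #540 is configured to recompress the first data using the first compression algorithm to obtain the compressed data.
In the embodiments of this application, compression module #540 is responsible for compressing and decompressing data. Compression module #540 has the following characteristics:
1) All compressors use unified compression and decompression interfaces.
Main control module #550 wraps unified compression and decompression interfaces for all compressors to use, so that every compressor uses the same interfaces. Because the interfaces are unified, main control module #550 can invoke a specified compressor by its name alone, which decouples the compressors from main control module #550.
2) Each compressor is compiled as an independent dynamic library.
Because the compressors are decoupled from main control module #550, this application supports compiling each of them as an independent dynamic library for main control module #550 to load and invoke. Compiling the compressors independently has the following advantages:
a) Because the compressors are independent of one another, upgrading or maintaining one compressor does not affect the others. Each compressor can evolve independently, which decouples the compressors within the compression module.
b) Adding a new compression algorithm or retiring an old one does not affect other compression algorithms or the data recompression apparatus, which makes it easier to extend compression module #540 with new functions.
c) When recompressing data in a specific scenario, only the corresponding compressor needs to be loaded, which keeps the system lightweight.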
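A unified compressor interface with lookup by name can be sketched as an abstract base class plus a registry. This in-process registry is only a stand-in for the independently compiled dynamic libraries described above; the class names, the `register` decorator, and the `load` helper are hypothetical.

```python
from abc import ABC, abstractmethod
import bz2
import zlib


class Compressor(ABC):
    """Unified interface: common construction, compress, and decompress."""

    name: str

    def __init__(self, **params):
        self.params = params   # compressor-specific options such as level

    @abstractmethod
    def compress(self, data: bytes) -> bytes: ...

    @abstractmethod
    def decompress(self, data: bytes) -> bytes: ...


REGISTRY = {}


def register(cls):
    """Class decorator that makes a compressor available under its name."""
    REGISTRY[cls.name] = cls
    return cls


@register
class ZlibCompressor(Compressor):
    name = "zlib"

    def compress(self, data: bytes) -> bytes:
        return zlib.compress(data, self.params.get("level", 6))

    def decompress(self, data: bytes) -> bytes:
        return zlib.decompress(data)


@register
class Bz2Compressor(Compressor):
    name = "bz2"

    def compress(self, data: bytes) -> bytes:
        return bz2.compress(data, self.params.get("level", 9))

    def decompress(self, data: bytes) -> bytes:
        return bz2.decompress(data)


def load(name: str, **params) -> Compressor:
    """What the main control logic would call: only the name is needed."""
    return REGISTRY[name](**params)


codec = load("zlib", level=9)
roundtrip = codec.decompress(codec.compress(b"x" * 1000))
```

Adding or retiring a compressor touches only its own class and registry entry, which mirrors the independence argument made for separate dynamic libraries.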
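Main control module #550 is responsible for command-line argument parsing, multithreaded scheduling, and loading and invoking the algorithms of the file processing and compression algorithm modules. Main control module #550 has the following characteristics:
1) A unified interface for invoking compression algorithms.
To invoke different compressors uniformly, the parameters common to all compressors are extracted first, such as the input and output data streams and the data dimensions, type, and length. Parameters specific to individual compressors, such as the level that controls compression ratio or speed, or the prediction method, can be passed in through default values, adaptive selection by the algorithm, or command-line input. Borrowing the relationship between base classes and derived classes in programming, main control module #550 provides a unified base class that contains the common parameters and the methods to be implemented, such as parameter parsing, compression, and decompression. Each compression algorithm overrides the inherited methods according to its own characteristics. This decouples main control module #550 from the compressor module and allows different compression algorithms to be invoked uniformly.
2) A unified interface for invoking format handling.
3) A thread pool and multithreaded scheduling.
To make full use of computing resources, this application can add a thread pool to main control module #550 and assign tasks that can be processed in parallel, including format parsing and restoration and data compression and decompression, to individual threads, improving overall recompression efficiency.
4) Estimating the compression ratio and throughput of each compressor in common scenarios.
Because the system integrates many compressors, main control module #550 needs to pre-filter them according to the type of the data and the characteristics of each compressor. The data type can generally be obtained from metadata and related information, whereas compressor metrics such as compression ratio, compression throughput, and decompression throughput are harder to obtain. Earlier test results can be used to estimate each compressor's performance on various types of data sets, and the estimates are stored in a location accessible to main control module #550; when data is recompressed, the earlier estimates can be updated dynamically. When some data needs to be recompressed, main control module #550 consults these results and loads one or more relevant compressors to choose from.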
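The thread pool and the dynamically updated estimate can be sketched together. The sketch uses `concurrent.futures.ThreadPoolExecutor` and zlib from the Python standard library; the `RatioEstimate` class, its running-mean update rule, and the prior value are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import zlib


class RatioEstimate:
    """Running mean of the compression ratio observed for one compressor."""

    def __init__(self, prior: float):
        self.mean = prior      # starting value, e.g. from earlier test runs
        self.samples = 1

    def update(self, observed: float) -> None:
        self.samples += 1
        self.mean += (observed - self.mean) / self.samples


def compress_blocks_parallel(blocks, estimate: RatioEstimate, workers: int = 4):
    """Compress independent blocks on a thread pool and refresh the estimate."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        packed = list(pool.map(zlib.compress, blocks))   # input order is preserved
    for raw, out in zip(blocks, packed):
        estimate.update(len(raw) / len(out))
    return packed


blocks = [bytes([i]) * 4096 for i in range(8)]
estimate = RatioEstimate(prior=2.0)
packed = compress_blocks_parallel(blocks, estimate)
```

Each block is an independent task, so the same pool could equally run format parsing or decompression jobs; the estimate is updated after the pool finishes to keep the shared state out of the worker threads.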
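The above descriptions of the modules are only examples. The modules are used to perform the steps described in the foregoing method embodiments; for details, refer to the foregoing description, which is not repeated here.
An embodiment of this application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the data processing method performed by the processing apparatus in the foregoing method embodiments.
An embodiment of this application provides a computer program product that, when run on a computer, causes the computer to perform the data processing method performed by the processing apparatus in the foregoing method embodiments.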
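The terms "component", "module", "system", and the like used in this specification denote computer-related entities, hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to, a process running on a processor, a processor, an object, an executable file, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device itself may be components. One or more components may reside within a process and/or thread of execution, and a component may be located on one computer and/or distributed between two or more computers. In addition, these components may execute from various computer-readable media that store various data structures. Components may communicate through local and/or remote processes, for example according to a signal having one or more data packets (for example, data from two components interacting with another component in a local system, in a distributed system, and/or across a network such as the Internet that interacts with other systems by way of the signal).
A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the particular application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.
A person skilled in the art will clearly understand that, for convenience and brevity, the specific working processes of the systems, apparatuses, and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.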
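In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a division by logical function; in an actual implementation there may be other divisions, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented as software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
The above are only specific implementations of this application, and the protection scope of this application is not limited to them. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed in this application shall fall within the protection scope of this application. The protection scope of this application shall therefore be subject to the protection scope of the claims.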

Claims (17)

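  1. A data processing method, wherein the method is applied to a storage system, the storage system comprises a storage apparatus and a processing apparatus, the method is performed by the processing apparatus, and the method comprises:
    obtaining tiered storage features and data features of first data, wherein the tiered storage features comprise at least one of the following features: importance, access frequency, and retention time, and the data features comprise at least one of the following features: data type, data dimensions, data size, and data content features;
    determining a first compression algorithm according to the tiered storage features and the data features;
    compressing the first data according to the first compression algorithm to obtain compressed data.
  2. The method according to claim 1, wherein the determining a first compression algorithm according to the tiered storage features and the data features comprises:
    determining compression algorithm performance parameters according to the tiered storage features;
    determining the first compression algorithm according to the compression algorithm performance parameters and the data features.
  3. The method according to claim 2, wherein before the determining a first compression algorithm, the method further comprises:
    determining, according to the tiered storage features, that the first data needs to be compressed.
  4. The method according to any one of claims 1 to 3, wherein the first data is compressed data, and the compressing the first data according to the first compression algorithm comprises:
    decompressing the first data to obtain intermediate data, and saving first operation parameters of the decompression;
    compressing the intermediate data according to the first compression algorithm to obtain the compressed data.
  5. The method according to claim 4, wherein the method further comprises:
    decompressing the compressed data to obtain the intermediate data;
    compressing the intermediate data according to the first operation parameters to obtain the first data.
  6. The method according to any one of claims 1 to 3, wherein the method further comprises:
    decompressing the compressed data to obtain the first data.
  7. The method according to any one of claims 1 to 6, wherein the method further comprises:
    storing the compressed data in the storage apparatus.
  8. The method according to any one of claims 2 to 7, wherein the compression algorithm performance parameters comprise at least one of the following: compression ratio and throughput.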
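  9. A data processing apparatus, wherein the apparatus comprises:
    a storage tiering module, configured to obtain tiered storage features of first data, wherein the tiered storage features comprise at least one of the following features: importance, access frequency, and retention time;
    a file processing module, configured to obtain data features of the first data, wherein the data features comprise at least one of the following features: data type, data dimensions, data size, and data content features;
    an algorithm selection module, configured to determine a first compression algorithm according to the tiered storage features and the data features;
    a compression module, configured to compress the first data according to the first compression algorithm to obtain compressed data.
  10. The apparatus according to claim 9, wherein the algorithm selection module is configured to:
    determine compression algorithm performance parameters according to the tiered storage features;
    determine the first compression algorithm according to the compression algorithm performance parameters and the data features.
  11. The apparatus according to claim 10, wherein the storage tiering module is further configured to:
    determine, according to the tiered storage features, that the first data needs to be compressed.
  12. The apparatus according to any one of claims 9 to 11, wherein the first data is compressed data, and the compression module is configured to:
    decompress the first data to obtain intermediate data, and save first operation parameters of the decompression;
    compress the intermediate data according to the first compression algorithm to obtain the compressed data.
  13. The apparatus according to claim 12, wherein the compression module is further configured to:
    decompress the compressed data to obtain the intermediate data;
    compress the intermediate data according to the first operation parameters to obtain the first data.
  14. The apparatus according to any one of claims 9 to 11, wherein the compression module is further configured to:
    decompress the compressed data to obtain the first data.
  15. The apparatus according to any one of claims 10 to 14, wherein the compression algorithm performance parameters comprise at least one of the following: compression ratio and throughput.
  16. A computer-readable storage medium storing instructions that, when run on a computer,
    cause the computer to perform the data processing method according to any one of claims 1 to 8.
  17. A computer device, wherein the computer device comprises a processor and a memory;
    the memory is configured to store computer program instructions;
    the processor invokes the computer program instructions in the memory to perform the method according to any one of claims 1 to 8.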
PCT/CN2022/077413 2021-08-25 2022-02-23 Data processing method and apparatus WO2023024459A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP22859818.1A EP4354310A4 (en) 2021-08-25 2022-02-23 DATA PROCESSING METHOD AND DEVICE
US18/412,995 US20240154623A1 (en) 2021-08-25 2024-01-15 Data processing method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110978326.1 2021-08-25
CN202110978326.1A CN115905136A (zh) 2021-08-25 2021-08-25 Data processing method and apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/412,995 Continuation US20240154623A1 (en) 2021-08-25 2024-01-15 Data processing method and apparatus

Publications (1)

Publication Number Publication Date
WO2023024459A1 true WO2023024459A1 (zh) 2023-03-02

Family

ID=85322345

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/077413 WO2023024459A1 (zh) 2021-08-25 2022-02-23 Data processing method and apparatus

Country Status (4)

Country Link
US (1) US20240154623A1 (zh)
EP (1) EP4354310A4 (zh)
CN (1) CN115905136A (zh)
WO (1) WO2023024459A1 (zh)


Also Published As

Publication number Publication date
CN115905136A (zh) 2023-04-04
EP4354310A1 (en) 2024-04-17
US20240154623A1 (en) 2024-05-09
EP4354310A4 (en) 2024-08-21

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 22859818; Country of ref document: EP; Kind code of ref document: A1
WWE Wipo information: entry into national phase. Ref document number: 2022859818; Country of ref document: EP
ENP Entry into the national phase. Ref document number: 2022859818; Country of ref document: EP; Effective date: 20240109
NENP Non-entry into the national phase. Ref country code: DE