WO2023279833A1 - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
WO2023279833A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
preset value
memory
deduplication
rate
Prior art date
Application number
PCT/CN2022/091692
Other languages
English (en)
French (fr)
Inventor
陈克云
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202111057198.3A external-priority patent/CN115599591A/zh
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to EP22836590.4A priority Critical patent/EP4357900A1/en
Publication of WO2023279833A1 publication Critical patent/WO2023279833A1/zh
Priority to US18/405,231 priority patent/US20240143449A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1446Point-in-time backing up or restoration of persistent data
    • G06F11/1448Management of the data involved in backup or backup restore
    • G06F11/1453Management of the data involved in backup or backup restore using de-duplication of the data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/174Redundancy elimination performed by the file system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/174Redundancy elimination performed by the file system
    • G06F16/1748De-duplication implemented within the file system, e.g. based on file segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608Saving storage space on storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/064Management of blocks
    • G06F3/0641De-duplication techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/065Replication mechanisms

Definitions

  • the present application relates to the field of computer technology, in particular to a data processing method and device.
  • backup usually refers to the process of copying all or part of the data set in the file system or database system from the disk or storage array of the application host to other storage media in the data center.
  • Deduplication and compression are computationally expensive operations that occupy the computing resources of the application host. To avoid affecting the normal production business of the application host, the backup usually occurs during a time period when the business is not busy, such as 12:00 am to 8:00 am.
  • the above method is limited by the limited computing resources of the application host, and cannot make full use of time resources for backup, resulting in a low amount of backup data.
  • the present application provides a data processing method and device, which are used to reduce the storage space occupied by the data while increasing the amount of backup data.
  • The embodiment of the present application provides a data processing method, which can be executed by a computing device (such as a server). In this method, the computing device first obtains the data to be backed up (referred to as first data). For example, the first data may be obtained from a memory or a hard disk of the computing device, or from other computing devices. The computing device then determines the deduplication rate of the first data and the current working state of the computing device.
  • the working state here may include a busy state and an idle state, and the working state may be determined according to the usage of computing resources in the computing device.
  • When the computing device is in a busy state, it further judges whether the deduplication rate of the first data exceeds a certain threshold (referred to as the first preset value). When the deduplication rate of the first data exceeds the first preset value, deduplication is performed on the first data to obtain second data; when the deduplication rate of the first data does not exceed the first preset value, the first data may be stored in the first memory of the computing device.
  • When the computing resource utilization rate of the computing device is lower than the first preset value and the data deduplication rate is lower than the second preset value, the data is not deduplicated, which reduces computing resource overhead and ensures backup bandwidth. When the utilization rate of computing resources is lower than the first preset value but the data deduplication rate is not lower than the second preset value, the data is deduplicated to ensure the deduplication rate, reduce the storage space occupied by backup data, and reduce storage costs.
  • Data backup can also be performed during periods when the utilization rate of computing resources is low, such as lower than the second preset value, so that as much time as possible can be used for backup, thereby increasing the amount of data backed up.
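  • The busy-state deduplication decision described above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the names `WorkState` and `foreground_dedup_action`, the string return values, and the choice to deduplicate directly in the idle state are assumptions made for the example.

```python
from enum import Enum


class WorkState(Enum):
    BUSY = "busy"
    IDLE = "idle"


def foreground_dedup_action(state: WorkState, dedup_rate: float,
                            first_preset: float) -> str:
    """Pick an action for the first data (hypothetical helper).

    Returns "deduplicate" or "store_in_first_memory".
    """
    if state is WorkState.IDLE:
        # Assumption: in the idle state resources are available,
        # so the data is deduplicated directly.
        return "deduplicate"
    # Busy state: deduplicate only when the deduplication rate
    # exceeds the first preset value.
    if dedup_rate > first_preset:
        return "deduplicate"
    # Otherwise keep the raw data in the first memory.
    return "store_in_first_memory"
```

  • Data parked in the first memory by this decision remains available for later processing once resources free up.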
  • The computing device may also determine the compression rate of the second data. If the compression rate of the second data is greater than or equal to the second preset value and the computing resource utilization rate is less than the third preset value, the second data is compressed; if the compression rate of the second data is greater than or equal to the second preset value and the computing resource utilization rate is greater than or equal to the third preset value, part of the second data is compressed.
  • After the data is deduplicated, it can be compressed according to the compression rate of the deduplicated second data. If the compression rate is high but the utilization rate of computing resources is also high, that is, not lower than the third preset value, the second data may be left uncompressed to reduce the consumption of computing resources and ensure backup bandwidth; when the utilization rate of computing resources is low, such as lower than the third preset value, the second data may be compressed to ensure compression efficiency, reduce the storage space occupied by backup data, and increase logical backup bandwidth.
  • The computing device further includes a second memory, whose performance is lower than that of the first memory. If the compression rate of the second data is less than the second preset value, the second data is stored in the second memory. If the compression rate of the second data is greater than or equal to the second preset value, at least part of the second data is compressed; the compressed data is stored in the second memory, and the uncompressed data in the second data is stored in the first memory.
  • When the compression rate of the second data is relatively low, the second data may be stored directly without compression to reduce computing resource overhead and ensure backup bandwidth.
  • When the compression rate of the second data is high, at least part of the second data may be compressed, so as to ensure the compression rate, reduce the storage space occupied by the backup data, and increase the logical backup bandwidth.
  • The data still to be compressed is stored in the first memory with higher performance, and the data that does not need to be compressed or has already been compressed is stored in the second memory. The data to be compressed can later be obtained from the first memory, which improves the performance of reading data.
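  • The compression and placement rules above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `compress_and_place`, the chunk-list representation, the `partial_fraction` parameter that decides how much is compressed under high utilization, and the use of zlib are all hypothetical choices that the text does not specify.

```python
import zlib
from typing import Dict, List


def compress_and_place(chunks: List[bytes], compression_rate: float,
                       utilization: float, second_preset: float,
                       third_preset: float,
                       partial_fraction: float = 0.5) -> Dict[str, List[bytes]]:
    """Decide what to compress and where each chunk goes (hypothetical)."""
    placement: Dict[str, List[bytes]] = {"first_memory": [], "second_memory": []}
    if compression_rate < second_preset:
        # Low compression rate: store as-is in the slower second memory.
        placement["second_memory"].extend(chunks)
        return placement
    if utilization < third_preset:
        n_compress = len(chunks)  # resources available: compress everything
    else:
        # Resources are tight: compress only part of the data (assumed fraction).
        n_compress = int(len(chunks) * partial_fraction)
    for i, chunk in enumerate(chunks):
        if i < n_compress:
            placement["second_memory"].append(zlib.compress(chunk))
        else:
            # Not yet compressed: keep in the faster first memory.
            placement["first_memory"].append(chunk)
    return placement
```

  • Keeping the not-yet-compressed chunks in the faster memory means a later background pass reads them at higher speed.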
  • The computing device may determine the deduplication rate of the first data in one of the following manners: according to the attribute parameters of the first data; according to the deduplication rate of the target data corresponding to the first data, where the target data is data selected based on preset conditions; or according to the similarity between the first data and the data already stored by the computing device.
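  • As one possible reading of the similarity-based manner, the deduplication rate could be estimated by fingerprinting fixed-size chunks and counting how many are already known. The text does not say how similarity is computed, so the fixed-size chunking, the SHA-256 fingerprints, and the name `estimate_dedup_rate` are assumptions made for this sketch.

```python
import hashlib
from typing import Set


def estimate_dedup_rate(data: bytes, stored_fingerprints: Set[str],
                        chunk_size: int = 4096) -> float:
    """Fraction of chunks that duplicate stored data or earlier chunks."""
    if not data:
        return 0.0
    seen = set(stored_fingerprints)  # copy so the caller's index is untouched
    total = 0
    duplicates = 0
    for offset in range(0, len(data), chunk_size):
        fingerprint = hashlib.sha256(data[offset:offset + chunk_size]).hexdigest()
        total += 1
        if fingerprint in seen:
            duplicates += 1
        else:
            seen.add(fingerprint)
    return duplicates / total
```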
  • data deduplication and/or data compression may be performed on the first data, and the idle state is determined based on the usage of computing resources of the computing device.
  • computing resources include processor resources and/or memory.
  • When the processor utilization exceeds the fourth preset value and/or the memory utilization exceeds the fifth preset value, the working state of the computing device may be a busy state; when the processor utilization does not exceed the fourth preset value and/or the memory utilization does not exceed the fifth preset value, the working state of the computing device may be an idle state.
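  • The busy/idle determination can be sketched as a simple threshold check. Because the text says "and/or", the sketch exposes both readings through a hypothetical `require_both` flag; the function name `is_busy` and the representation of utilization as a fraction between 0 and 1 are also assumptions.

```python
def is_busy(cpu_util: float, mem_util: float,
            fourth_preset: float, fifth_preset: float,
            require_both: bool = False) -> bool:
    """Return True when the computing device counts as busy."""
    cpu_high = cpu_util > fourth_preset
    mem_high = mem_util > fifth_preset
    if require_both:
        # "and" reading: both resources must be over their thresholds.
        return cpu_high and mem_high
    # "or" reading: either resource over its threshold is enough.
    return cpu_high or mem_high
```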
  • the computing device may acquire uncompressed data from the first memory, and compress the uncompressed data, and the compressed data may be stored in the first memory.
  • the first preset condition includes that the computing device is in an idle state, or a preset time is reached, or the amount of data in the first memory reaches a sixth preset value.
  • Performing the background compression under the first preset condition avoids affecting the normal production business.
  • the background compression can further ensure the data compression rate and reduce the storage space occupied by the backup data.
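  • The trigger for background compression can be sketched as follows. This is an illustration only: the names `first_condition_met` and `background_compress`, the treatment of "a preset time is reached" as a time-of-day comparison, and the use of zlib are assumptions.

```python
import zlib
from datetime import time
from typing import List


def first_condition_met(is_idle: bool, now: time, preset_time: time,
                        first_memory_bytes: int, sixth_preset: int) -> bool:
    """True when any one of the three trigger conditions holds."""
    return (is_idle
            or now >= preset_time
            or first_memory_bytes >= sixth_preset)


def background_compress(pending: List[bytes]) -> List[bytes]:
    """Compress the uncompressed chunks read from the first memory."""
    return [zlib.compress(chunk) for chunk in pending]
```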
  • the computing device may obtain the first data that has not been deduplicated from the first memory, and perform deduplication and compression on the first data to obtain the third data;
  • the third data may be stored in a second memory, and the performance of the second memory is lower than that of the first memory; wherein, the first preset condition includes that the computing device is in an idle state, or a preset time is reached, or the amount of data in the first memory reaches a sixth preset value.
  • Performing background deduplication and compression under the second preset condition avoids affecting the normal production business.
  • the background deduplication and compression can further ensure the deduplication rate and compression rate of data, and reduce the cost of backup data.
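  • Background deduplication followed by compression can be sketched as follows. The text does not fix a chunking or fingerprinting scheme, so the fixed-size chunks, the SHA-256 fingerprints, the zlib compression, the in-memory `index` dictionary standing in for the second memory, and the helper names are all assumptions.

```python
import hashlib
import zlib
from typing import Dict, List


def background_dedup_and_compress(first_data: bytes, index: Dict[str, bytes],
                                  chunk_size: int = 4096) -> List[str]:
    """Store each unique chunk once, compressed; return the chunk recipe."""
    recipe: List[str] = []
    for offset in range(0, len(first_data), chunk_size):
        chunk = first_data[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint not in index:
            # New chunk: compress it before it goes to the second memory.
            index[fingerprint] = zlib.compress(chunk)
        recipe.append(fingerprint)
    return recipe


def restore(recipe: List[str], index: Dict[str, bytes]) -> bytes:
    """Rebuild the original data from the recipe and the chunk index."""
    return b"".join(zlib.decompress(index[fp]) for fp in recipe)
```

  • The recipe lists fingerprints in original order, so repeated chunks cost only one stored copy while the data remains fully restorable.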
  • the embodiment of the present application also provides a data processing device, the data processing device has the function of implementing the behavior in the method example of the first aspect above, and the beneficial effects can be referred to the description of the first aspect, which will not be repeated here.
  • the functions described above may be implemented by hardware, or may be implemented by executing corresponding software on the hardware.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • the structure of the data processing device includes an acquisition module, a determination module, a judgment module and a processing module. These modules can perform the corresponding functions in the method example of the first aspect above. For details, refer to the detailed description in the method example, and details are not repeated here.
  • The present application also provides a computing device. The computing device includes a processor and a memory, and may also include a communication interface. The processor executes the program instructions in the memory to implement the method provided by the first aspect or any possible implementation of the first aspect.
  • the computing device may be a computing node, a server, or a controller in the storage system, or may be a computing device requiring data interaction.
  • the memory is coupled to the processor and holds program instructions and data necessary to determine data processing.
  • the communication interface is used to communicate with other devices, such as acquiring first data.
  • the present application provides a computing device system, where the computing device system includes at least one computing device.
  • Each computing device includes memory and a processor.
  • the processor of at least one computing device is used to access the codes in the memory to execute the method provided by the first aspect or any possible implementation manner of the first aspect.
  • the present application provides a computer-readable storage medium.
  • When the program stored in the computer-readable storage medium is executed by a computing device, the computing device executes the method provided by the aforementioned first aspect or any possible implementation of the first aspect.
  • the program is stored in the storage medium.
  • the storage medium includes but not limited to volatile memory, such as random access memory, and nonvolatile memory, such as flash memory, hard disk drive (hard disk drive, HDD), and solid state drive (solid state drive, SSD).
  • The present application provides a program product for a computing device.
  • The program product includes computer instructions; when the instructions are executed by a computing device, the computing device executes the method provided by the aforementioned first aspect or any possible implementation of the first aspect.
  • The computer program product may be a software installation package. If the method provided by the aforementioned first aspect or any possible implementation of the first aspect needs to be used, the computer program product may be downloaded and executed on a computing device.
  • the present application also provides a computer chip, the chip is connected to the memory, and the chip is used to read and execute the software program stored in the memory, and implement the above first aspect and each possibility of the first aspect.
  • FIG. 1 is a schematic diagram of a backup system provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of another system architecture provided by an embodiment of the present application.
  • FIG. 4 is a schematic flow diagram of a data processing method provided in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a data processing method provided in an embodiment of the present application.
  • FIG. 6 is a schematic flow diagram of a background deduplication method provided by an embodiment of the present application.
  • FIG. 7 is a schematic flow diagram of a background compression method provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a data processing device provided by the present application.
  • the data processing method provided by the embodiment of the present application can be applied to the backup system shown in FIG. 1 .
  • the system includes a client 10 and a storage device 20 .
  • The client 10, which is used to provide the data to be backed up, may be a computing device deployed on the user side, or may be software on the computing device.
  • a computing device can be a physical machine or a virtual machine. Physical machines include but are not limited to desktop computers, servers (such as application servers, file servers, database servers, etc.), notebook computers, and mobile devices.
  • the software can be an application program installed on the computing device, such as backup software.
  • The main function of the backup software is to manage the backup strategy, such as when to start a backup, which content on the client 10 to back up, and where to back up the data.
  • The client 10 can communicate with the storage device 20 through a network or other means. The network generally refers to any telecommunications or computer network, including, for example, an enterprise intranet, a wide area network (WAN), a local area network (LAN), a personal area network (PAN), or the Internet.
  • the storage device 20 is configured to provide services such as storage resources and computing resources for the client 10.
  • the storage device 20 can be used as a backup medium to store data in the client 10.
  • FIG. 1 only shows one client 10 and storage device 20 for simplicity. In practical applications, the embodiment of the present application does not limit the number of clients 10 and storage devices 20 .
  • the storage device 20 in this application may be a centralized storage system, and for another example, the storage device 20 may be any storage device in a distributed storage system.
  • FIG. 2 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • the system architecture includes an application server 100 , a switch 110 , and a storage system 120 .
  • the client 10 in FIG. 1 may be the application server 100 in FIG. 2
  • the storage device 20 in FIG. 1 may be the storage system 120 in FIG. 2 .
  • the computers running these applications are called “application servers”.
  • The application server accesses data in the storage system 120 through the optical fiber switch 110.
  • The switch 110 is only an optional device, and the application server 100 can also directly communicate with the storage system 120 through the network.
  • For the network, refer to the description above; details are not repeated here.
  • the storage system 120 shown in FIG. 2 is a centralized storage system.
  • the characteristic of the centralized storage system is that there is a unified entrance, and all data from external devices must pass through this entrance, and this entrance is the engine 121 of the centralized storage system.
  • the engine 121 is the most core component in the centralized storage system, where many advanced functions of the storage system are implemented.
  • As shown in FIG. 2, there are one or more controllers in the engine 121.
  • FIG. 2 takes an engine that includes two controllers as an example. There is a mirror channel between controller 0 and controller 1, so that the two controllers can back up each other.
  • the engine 121 also includes a front-end interface 125 and a back-end interface 126 , wherein the front-end interface 125 is used to communicate with the application server 100 to provide storage services for the application server 100 .
  • the back-end interface 126 is used to communicate with the hard disk 134 to expand the capacity of the storage system. Through the back-end interface 126, the engine 121 is connected with more hard disks 134, thereby forming a very large storage resource pool.
  • the controller 0 includes at least a processor 123 and a memory 124 .
  • The processor 123 is implemented by a central processing unit (CPU), a hardware logic circuit, a processing core, an ASIC, an AI chip, or a programmable logic device (PLD). The PLD can be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), a system on chip (SoC), or any combination thereof.
  • the processor 123 is configured to process data access requests from outside the storage system 120 (server or other storage systems), and is also configured to process requests generated inside the storage system 120 .
  • the processor 123 may store the data in memory first, such as in the memory 124 .
  • The processor 123 sends the data stored in the memory 124 to the hard disk 134 through the back-end interface 126 for persistent storage.
  • The processor 123 is also used for computing or processing data, such as data deduplication, data compression, data verification, virtualization of storage space, and address translation. Only one processor 123 is shown in FIG. 2. In practical applications, there are often multiple processors 123, and one processor 123 has one or more processor cores. This embodiment does not limit the number of processors and the number of processor cores.
  • the memory 124 refers to an internal memory directly exchanging data with the processor 123. It can read and write data at any time, and the speed is very fast. It is used as a temporary data storage for the operating system or other running programs.
  • The memory 124 includes at least two types of memory; for example, it can be a random access memory or a read-only memory (ROM).
  • The random access memory is, for example, dynamic random access memory (DRAM) or storage class memory (SCM).
  • DRAM is a semiconductor memory, which, like most RAM, is a volatile memory device.
  • SCM is a composite storage technology that combines the characteristics of traditional storage devices and memory.
  • Storage-class memory provides faster read and write speeds than hard disks, but its access speed is slower than that of DRAM, and it costs less than DRAM.
  • The DRAM and the SCM are only illustrative examples in this embodiment of the present application, and the memory may also include other random access memories, such as static random access memory (SRAM).
  • The read-only memory may be, for example, programmable read-only memory (PROM) or erasable programmable read-only memory (EPROM).
  • The memory 124 may also be a dual in-line memory module (DIMM), that is, a module composed of DRAM.
  • multiple memories 124 and different types of memories 124 may be configured in the storage system 120 . This embodiment does not limit the quantity and type of the memory 124 .
  • Controller 1 (and other controllers not shown in FIG. 2) is similar to controller 0 and is not described again here.
  • Figure 2 shows a centralized storage system with separate disk control.
  • the engine 121 may not have a hard disk slot, the hard disk 134 needs to be placed in the hard disk enclosure 130 , and the back-end interface 126 communicates with the hard disk enclosure 130 .
  • the back-end interface 126 exists in the engine 121 in the form of an adapter card, and one engine 121 can use two or more back-end interfaces 126 to connect multiple hard disk enclosures at the same time.
  • The adapter card can also be integrated on the motherboard, in which case the adapter card can communicate with the processor 123 through a peripheral component interconnect express (PCI-E) bus.
  • the storage system 120 can also be a storage system with integrated disk control, the engine 121 has a hard disk slot, and the hard disk 134 can be directly deployed in the engine 121, that is, the hard disk 134 and the engine 121 are deployed in the same device.
  • The hard disk 134 is usually used for persistently storing data. The hard disk 134 can be a non-volatile memory, such as a read-only memory (ROM), a hard disk drive (HDD), a storage-class memory (SCM), or a solid state drive (SSD).
  • the performance of the SSD and the performance of the SCM are higher than that of the HDD, and the performance of the memory in this application is mainly considered from the aspects of operation speed and/or access delay.
  • The storage device 20 includes at least two hard disks with different performance (referred to as a first memory and a second memory). For example, the first memory and the second memory are respectively an SSD and an HDD, an SSD and an SCM, or an SCM and an HDD. For convenience of description, the following takes the first memory being an SSD and the second memory being an HDD as an example.
  • the hardware architecture of the application server 100 is not shown in FIG. 2 .
  • the application server 100 may include a processor, a memory, a network card, and optionally, a hard disk.
  • For the types and functions of the components included in the application server 100, refer to the relevant introduction to the storage system 120; details are not repeated here.
  • FIG. 3 is a schematic diagram of the system architecture of a distributed storage system provided by the embodiment of the present application.
  • The distributed storage system includes a server cluster.
  • the server cluster includes one or more servers 140 (three servers 110 are shown in FIG. 3 , but not limited to three servers 140 ), and the servers 140 can communicate with each other.
  • a virtual machine 107 can also be created on the server 140 , the required computing resources come from the local processor 123 and memory 124 of the server 140 , and various applications can run in the virtual machine 107 .
  • the client 10 may be a virtual machine 107, or may also be a computing device (not shown in FIG. 3 ) other than the server 110 .
  • the server 110 includes at least a processor 112 , a memory 113 , a network card 114 and a hard disk 105 .
  • the processor 112, the memory 113, the network card 114 and the hard disk 105 are connected through a bus.
  • Buses include but are not limited to: a PCIe bus, a double data rate (DDR) bus, an interconnection bus supporting multiple protocols (hereinafter referred to as a multi-protocol interconnection bus, which will be introduced in detail below), a serial advanced technology attachment (SATA) bus, a serial attached SCSI (SAS) bus, a controller area network (CAN) bus, a Compute Express Link (CXL) standard bus, etc.
  • the centralized storage system and the distributed storage system mentioned above are only examples, and the data processing method provided in the embodiment of the present application is also applicable to other centralized storage systems and distributed storage systems.
  • the data processing method provided in the embodiment of the present application can be applied to data backup scenarios.
  • The computing device determines the deduplication and compression mode according to the usage of computing resources and the data characteristics of the data to be backed up. In this way, the production business of the computing device is not affected, backup tasks are no longer limited to the backup time specified by the backup policy, and as many backup tasks as possible can be executed. This increases the amount of backed-up data and uses available computing resources to perform data deduplication, which ensures the deduplication rate and reduces the storage space occupied by data.
  • source-end implementation means that the client 10 executes the data processing method provided by the embodiment of this application.
  • Implementation at the target end means that the storage device 20 executes the data processing method provided by the embodiment of the present application.
  • the computing device mentioned in this method may be the client 10 or the storage device 20 in FIG. 1 .
  • the method includes:
  • Step 400: the processor acquires the data to be backed up (for example, first data).
  • The processor can obtain the data to be backed up, for example, from a data write request of an application program.
  • For example, the data carried in a write data request triggered by the application program is temporarily stored in the storage (such as internal memory or hard disk) of the client 10, and the data to be backed up is then obtained from that storage.
  • On the storage device 20 side, in addition to obtaining the data to be backed up locally, the storage device 20 may also obtain the data to be backed up from a data write request received from the client 10.
  • Step 401: determine the working state of the computing device according to the usage of computing resources.
  • The computing resources here refer to the resources on the computing device used for operations such as computing and data processing, such as CPU and memory.
  • Computing resources in this application include, but are not limited to, one or more items such as CPU and memory.
  • For example, if the computing resource is a CPU, the usage of the computing resource can be characterized by parameters such as CPU utilization or CPU idle rate.
  • If the computing resource is memory, the usage of the computing resource can be characterized by parameters such as memory utilization, memory idle rate, memory usage, and remaining memory.
  • the working status of the computing device can be determined based on the usage of the above computing resources, and the working status of the computing device includes but is not limited to: an idle state and a busy state.
  • the CPU utilization rate is used as an example to represent the usage of computing resources in the following description. Exemplarily, if the CPU utilization is lower than a certain threshold (such as a preset value A), it is determined that the computing device is in an idle state; otherwise, if the CPU utilization is greater than or equal to a preset value A, it is determined that the computing device is in a busy state.
  • the CPU utilization rate can be determined by calling a related system function, which will not be emphasized here, and any method for determining the CPU utilization rate is applicable to this embodiment of the application.
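  • As an illustration only, the threshold comparison of step 401 can be sketched as follows. The names `WorkingState` and `working_state`, and the default value 0.7 for preset value A, are assumptions made for this sketch and do not come from the application; how the CPU utilization is sampled is left to the caller.

```python
from enum import Enum


class WorkingState(Enum):
    IDLE = "idle"
    BUSY = "busy"


def working_state(cpu_utilization: float, preset_a: float = 0.7) -> WorkingState:
    """Classify the device as idle or busy from a CPU utilization fraction.

    preset_a stands in for "preset value A"; 0.7 is an arbitrary example.
    """
    if not 0.0 <= cpu_utilization <= 1.0:
        raise ValueError("cpu_utilization must be a fraction between 0 and 1")
    # Below the threshold: idle. Greater than or equal to it: busy.
    if cpu_utilization < preset_a:
        return WorkingState.IDLE
    return WorkingState.BUSY
```

  • A utilization exactly equal to preset value A is classified as busy, matching the "greater than or equal to" wording above.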
  • Step 402: determine whether the computing device is in a busy state; if yes, execute step 403; otherwise, execute step 406.
  • In step 402, it may alternatively be determined whether the computing device is in an idle state; if it is in an idle state, execute step 406; otherwise, execute step 403.
  • Step 403: determine the data features of the first data.
  • the data characteristics of the first data include but are not limited to a deduplication rate, where the deduplication rate can be understood as a degree of duplication between the first data and stored data. If the first data and the stored data contain 75% of the same data, then the deduplication rate of the first data is 75%.
  • Method 1 for determining the deduplication rate: determine it according to an attribute parameter (such as a first attribute parameter) of the first data.
  • The first attribute parameter of the first data may be generated by an application program, such as backup software or the application program that generates the first data, and the processor may predict the deduplication rate of the first data according to the first attribute parameter that the application program generated for the first data.
  • the first attribute parameter may be the backup type of the first data.
  • the backup types include full backup, differential backup and incremental backup.
  • the full backup refers to a complete copy of all data at a certain point in time, that is, to back up all data of the backed up content.
  • Differential backup is based on the previous full backup and only backs up newly generated or changed data. It is worth noting that multiple differential backups may be performed between two adjacent full backups. These multiple backups are all based on the same full backup. Therefore, most of the data backed up by the multiple differential backups may be identical.
  • the base of incremental backup is different from that of differential backup. Incremental backup is based on the previous backup, and backs up newly generated or changed data. Therefore, the data backed up by incremental backup is different each time.
  • The first attribute parameter can also be another parameter, such as an identifier used to indicate the deduplication rate.
  • For example, when the backup type of the first data is full backup or differential backup, the first attribute parameter can be set to an identifier indicating a high deduplication rate; when the backup type of the first data is incremental backup, the first attribute parameter can be set to an identifier indicating a low deduplication rate, and so on, which is not limited in this embodiment of the present application.
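  • A minimal sketch of Method 1, assuming the backup type is available as a plain string. The function name `first_attribute_parameter` and the identifiers `"high"` and `"low"` are hypothetical and chosen only for this example.

```python
HIGH = "high"  # identifier indicating a high deduplication rate (assumed encoding)
LOW = "low"    # identifier indicating a low deduplication rate (assumed encoding)

_BACKUP_TYPE_TO_LEVEL = {
    "full": HIGH,          # full backups repeat most previously stored data
    "differential": HIGH,  # successive differentials share the same full-backup base
    "incremental": LOW,    # each incremental carries only data changed since the last backup
}


def first_attribute_parameter(backup_type: str) -> str:
    """Map a backup type to an identifier of the expected deduplication rate."""
    try:
        return _BACKUP_TYPE_TO_LEVEL[backup_type.lower()]
    except KeyError:
        raise ValueError(f"unknown backup type: {backup_type!r}") from None
```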
  • Method 2 for determining the deduplication rate: determine it according to the deduplication rate of the target data corresponding to the first data.
  • The deduplication rate of the first data may be determined according to the deduplication rate of the target data corresponding to the first data; for example, the deduplication rate of the target data is used as the deduplication rate of the first data.
  • The target data here can be data other than the first data.
  • For example, the target data corresponding to the first data can be data that precedes the first data, such as the data carried in the write data request preceding that of the first data; alternatively, the target data is part of the first data.
  • The deduplication rate of the target data can be the actual deduplication rate obtained by deduplicating the target data, or a deduplication rate determined in another way, such as through Method 1 above or Method 3 below, which is not limited in this embodiment of the present application.
  • For example, a file is divided into multiple data blocks, and the data included in one or more of these data blocks is used as the target data of the file.
  • For another example, one or more data blocks can be collected periodically, and the data contained in these data blocks is called the target data of that period.
  • The deduplication rate of the target data is determined and used as the deduplication rate of the other data in the same cycle.
  • each cycle may be at equal time intervals, or each cycle may include the same number of data blocks. For example, the duration of each cycle is T, or each cycle includes N data blocks.
  • the first data may be the file, or part of the data in the file, which is not limited in this embodiment of the present application.
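  • The periodic sampling described for Method 2 can be sketched as below, under the assumption that each cycle contains a fixed number of data blocks and that the first few blocks of each cycle serve as the target data. The helper names, the use of SHA-1, and the in-memory set of stored fingerprints are illustrative assumptions.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    """SHA-1 digest used as the block fingerprint in this sketch."""
    return hashlib.sha1(block).hexdigest()


def predicted_rates_per_cycle(blocks, stored_fingerprints, cycle_len, sample_len):
    """Return one predicted deduplication rate per cycle.

    Each cycle holds cycle_len blocks; its first sample_len blocks are the
    target data. The measured duplicate fraction of the target data is used
    as the deduplication rate of the other data in the same cycle.
    """
    if not 0 < sample_len <= cycle_len:
        raise ValueError("require 0 < sample_len <= cycle_len")
    rates = []
    for start in range(0, len(blocks), cycle_len):
        sample = blocks[start:start + sample_len]
        hits = sum(fingerprint(b) in stored_fingerprints for b in sample)
        rates.append(hits / len(sample))
    return rates
```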
  • Method 3 for determining the deduplication rate: determine it according to the similarity between the first data and the stored data.
  • The similarity between the first data and the stored data can be determined from the similar fingerprints of the two. The process can include: first calculating the similar fingerprint of the first data, for example by dividing the first data into multiple blocks and performing a hash operation on each block to obtain a hash value of each block; these hash values form the fingerprint of the first data. Then, the similar fingerprint of the first data is matched against the similar fingerprints of the stored data to determine the maximum value of the similarity between the first data and the stored data.
  • the similar fingerprints of the stored data can be determined in the above manner, which will not be repeated here.
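  • One possible reading of Method 3 is sketched below. The application does not define the similarity measure, so this example assumes it is the fraction of the first data's block hashes that also occur in a stored item; the fixed block size, SHA-1, and the function names are likewise assumptions.

```python
import hashlib


def similar_fingerprint(data: bytes, block_size: int = 4) -> set:
    """Split data into fixed-size blocks and collect the hash of each block."""
    return {
        hashlib.sha1(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    }


def max_similarity(first_data: bytes, stored_items, block_size: int = 4) -> float:
    """Maximum similarity between first_data and any stored item.

    Similarity is taken here as the share of first_data's block hashes
    that also appear in the stored item's fingerprint (an assumption).
    """
    fp = similar_fingerprint(first_data, block_size)
    if not fp:
        return 0.0
    return max(
        (len(fp & similar_fingerprint(item, block_size)) / len(fp) for item in stored_items),
        default=0.0,
    )
```

  • The returned maximum can then be compared with a threshold such as preset value B in step 404.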
  • Step 402 and step 403 can be executed at the same time, or step 403 can be executed before or after step 402, which is not limited in this embodiment of the application.
  • Step 404: judge whether the deduplication rate of the first data exceeds a first preset value; if not, execute step 405; otherwise, execute step 406.
  • If the backup type of the first data is differential backup or full backup, this can indicate that the deduplication rate of the first data is relatively high, which in this application may be specifically expressed as the deduplication rate of the first data exceeding the first preset value. If the backup type of the first data is incremental backup, the deduplication rate of the first data does not exceed the first preset value.
  • If the maximum value of the similarity exceeds a preset value B, it is considered that the deduplication rate of the first data exceeds the first preset value.
  • If the maximum value of the similarity is less than or equal to the preset value B, it is considered that the deduplication rate of the first data does not exceed the first preset value.
  • Step 405: store the first data without performing deduplication or compression.
  • For example, if the first data is data block 1, the first data can be stored in the SSD (see 3-1-2 in Figure 5), and the metadata of the first data can be generated (see 3-1-2 in Figure 5).
  • The metadata can be used to record the storage location of the first data in the SSD, and information such as whether the first data has been deduplicated or whether background deduplication is required.
  • Subsequently, the first data can be read from the SSD according to the metadata to perform background deduplication.
  • step 405 is only an example of an optional implementation, and the present application may also compress data with a deduplication rate lower than the first preset value, which will be described below and will not be repeated here.
  • Step 406: perform deduplication on the first data, and record the deduplicated data as second data.
  • For example, if the first data is data block 2 and the deduplication rate of the first data exceeds the first preset value, deduplication is performed on the first data (see 3-2 in FIG. 5).
  • the deduplication method includes at least file-level deduplication and sub-file-level deduplication (also called block-level deduplication).
  • In file-level deduplication, deduplication is performed in units of files.
  • File-level deduplication, also known as single-instance storage (SIS), detects and removes duplicate file copies. It stores only one copy of the file, and all other copies are replaced with pointers to that single copy.
  • File-level deduplication is simple and fast, but it cannot eliminate duplicate content within files. For example, if two 10 MB PowerPoint presentation files differ only in the title page, they are not considered duplicate files, and the two files are stored separately.
  • Sub-file-level deduplication refers to decomposing a file/object into data blocks of fixed size or variable size, and performing deduplication operations in units of data blocks. Sub-file level deduplication removes duplicate data between files.
  • Fixed-length block deduplication divides files into fixed-length blocks and uses a hash algorithm to find duplicate data. Fixed-length blocks are simple, but may miss a lot of duplicate data, because similar data may have different block boundaries. For example, if a person's name is added to the title page of a document, the entire document is shifted and all blocks change, making it impossible to detect duplicate data.
  • In variable-length segment deduplication, if there is a change in one segment, only the boundaries of this segment are adjusted, leaving the remaining segments unchanged. Compared with the fixed-length block method, this method improves the ability to identify duplicate data segments.
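  • The boundary behaviour of variable-length segmentation can be illustrated with a toy content-defined chunker. This is a simplified sketch, not the chunking algorithm of the application: it hashes a small sliding window with CRC-32 and cuts wherever the hash is divisible by an arbitrary divisor, whereas practical implementations typically use a rolling hash together with minimum and maximum chunk sizes. The function name and both parameters are assumptions.

```python
import zlib


def cdc_chunks(data: bytes, window: int = 4, divisor: int = 8) -> list:
    """Split data at content-defined boundaries.

    A cut is placed before position i whenever the CRC-32 of the preceding
    `window` bytes is divisible by `divisor`, so cut points depend only on
    nearby content and not on absolute offsets.
    """
    chunks, start = [], 0
    for i in range(window, len(data)):
        if zlib.crc32(data[i - window:i]) % divisor == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])  # trailing chunk (the whole input if no cut was found)
    return chunks
```

  • Because each cut depends only on the preceding window of bytes, inserting bytes at the front of the data can change only the first chunk; every later chunk reappears unchanged and can still be matched by its fingerprint.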
  • For example, a preset algorithm (such as the CDC algorithm) is used to divide the data into data blocks, and a hash operation (such as SHA1) is performed on each data block.
  • The resulting hash value serves as the fingerprint information of the data block. Data blocks with the same content have the same fingerprint information, so whether the contents of data blocks are the same can be confirmed through fingerprint matching.
  • Each data block is fingerprint-matched against the data blocks that already exist in the storage. A data block that has been stored before is not stored again; only its hash value is recorded as index information, and the index information of the data block is mapped to its specific storage location.
  • For example, data 2 (D0, D1, D2, D3, D4, D5) is deduplicated. Assuming that D1 and D3 are duplicate data blocks, D1 and D3 are removed to obtain the second data (D0, D2, D4, D5). This process guarantees that the same data block is stored on the physical medium only once, which achieves the purpose of deduplicating data.
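  • The fingerprint lookup described above can be sketched as follows, using the D0 to D5 example. The dictionary `store` stands in for the physical medium and its fingerprint index, and the function name `deduplicate` is hypothetical.

```python
import hashlib


def deduplicate(blocks, store):
    """Deduplicate blocks against store (fingerprint -> block bytes).

    Returns (recipe, new_blocks): recipe lists the fingerprint of every
    input block in order, so the original data can be rebuilt from the
    store; new_blocks holds only the blocks that were not stored before.
    """
    recipe, new_blocks = [], []
    for block in blocks:
        fp = hashlib.sha1(block).hexdigest()
        if fp not in store:
            store[fp] = block  # first occurrence: keep the data itself
            new_blocks.append(block)
        recipe.append(fp)      # every block is recorded by its fingerprint
    return recipe, new_blocks
```

  • With D1 and D3 already present in the store, only D0, D2, D4 and D5 are returned as new data, which corresponds to the second data in the example.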
  • The actual deduplication rate of the first data can be monitored and used as a basis for obtaining the data deduplication rate in step 403.
  • Whether to perform deduplication on subsequent data may be determined according to the actual deduplication rate of the first data.
  • Step 407: determine the compression rate of the second data.
  • Method 1 for determining the compression rate: determine it according to a second attribute parameter of the second data.
  • The second attribute parameter includes but is not limited to the file type. If the file type is audio or video, the compression rate of the second data may be relatively low. If the file type is a text file or an office file, the compression rate of the second data is relatively high. It should be noted that the file type is only an example, and the second attribute parameter may also be another parameter, which is not limited in this embodiment of the present application.
  • The second attribute parameter can be generated by the application program for the first data. Since the second data is the deduplicated form of the first data, the second attribute parameter of the first data is also the second attribute parameter of the second data.
  • Method 2 for determining the compression rate: determine it according to the compression rate of the target data corresponding to the second data.
  • the present application may determine the compression ratio of the second data according to the compression ratio of the target data corresponding to the second data, for example, use the compression ratio of the target data as the compression ratio of the second data.
  • the target data may be some data in the second data, or the target data may be data other than the second data.
  • For example, the target data may be the data carried in the write IO request preceding that of the first data (the first data includes the second data), where this data belongs to the same file as the first data.
  • The compression rate of the target data may be the actual compression rate of the target data, or a predicted compression rate, such as a compression rate of the target data determined through a method for determining the compression rate, which is not limited in this embodiment of the present application.
  • Step 408: judge whether the compression rate of the second data is less than a second preset value; if it is, execute step 409; otherwise, execute step 410.
  • Step 409: store the second data without performing compression.
  • The second data can be stored in the HDD (see 5-1 in FIG. 5). With this design, since the second data is not compressed, the overhead of computing resources can be reduced.
  • Step 410: judge whether the CPU utilization is less than a preset value (such as a third preset value); if so, go to step 411; otherwise, go to step 412.
  • Step 411: compress the second data, and record the compressed data as third data.
  • a compression algorithm may be used to compress the second data (see 5-2 in FIG. 5) (see 6 in FIG. 5).
  • The compression algorithm may be the Shannon-Fano algorithm, Huffman coding, arithmetic coding, LZ77/LZ78 coding, or the like. The embodiment of this application does not limit the compression algorithm: any existing algorithm that can compress data, and any compression algorithm that may be applied in the future, is applicable to the embodiment of this application.
  • the compression algorithm may be specified by the user or adaptive by the system, which is not limited in this embodiment of the present application.
  • the compressed data (that is, the third data) may be stored in the HDD (refer to 7 in FIG. 5 , the same parts will not be described in detail below), so as to complete the data backup of the first data.
  • the processor may monitor the actual compression rate of the second data as a basis for obtaining the data compression rate in step 407, and optionally, may determine whether to perform compression on subsequent data according to the actual compression rate of the second data.
  • Step 412: compress part of the second data.
  • Some data (such as D0 and D2) in the second data (such as D0, D2, D4, D5) can be selected for compression, and the rest of the data (such as D4 and D5) is temporarily left uncompressed.
  • The selection may be random, or may follow a preset rule, such as selecting the data at the front positions, which is not limited in this embodiment of the present application.
  • the uncompressed part of the data can be stored in the SSD first (see 5-3-2 in Figure 5), and record the metadata of the data (see 5-3-3 in Figure 5 ), the metadata may include information such as the storage location of the data in the SSD, and whether the data is compressed. Subsequently, the processor can read the data from the SSD according to the metadata and compress it.
  • In this way, when the CPU utilization is high, CPU overhead can be reduced as much as possible and the impact on the backup bandwidth is reduced, while the data reduction rate is still guaranteed, which further reduces the storage space occupied by data and reduces storage costs.
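  • Steps 408 to 412 can be summarized in the sketch below. The function names, the returned action strings, and the use of `zlib` as the compression algorithm are assumptions for illustration; the front-position selection rule is one of the options mentioned above.

```python
import zlib


def compression_action(compression_rate, cpu_utilization, second_preset, third_preset):
    """Choose how to handle the second data (steps 408 to 412)."""
    if compression_rate < second_preset:
        return "store_uncompressed"  # step 409: poorly compressible data
    if cpu_utilization < third_preset:
        return "compress_all"        # step 411: enough CPU headroom
    return "compress_part"           # step 412: CPU is heavily used


def compress_front(blocks, count):
    """Compress the first `count` blocks now and defer the rest.

    Returns (compressed, deferred); the deferred blocks would be written
    to the faster tier and compressed later in the background.
    """
    compressed = [zlib.compress(block) for block in blocks[:count]]
    deferred = list(blocks[count:])
    return compressed, deferred
```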
  • background deduplication may also be performed on non-deduplicated data stored in the SSD.
  • The background deduplication process can include: (1) the post-deduplication module determines the non-deduplicated data (data block 1 in Figure 5) according to the metadata record; (2) the non-deduplicated data is obtained from the SSD; (3) the data is deduplicated; (4) optionally, the deduplicated data is compressed, and the processed data is stored in the HDD, which completes the backup of the data.
  • the post-deduplication module may be a software module or a hardware module, which is not limited in this embodiment of the present application.
  • The present application can also perform background compression on the uncompressed data stored in the SSD. As shown in FIG. 7, the background compression process can include: (1) the processor determines the uncompressed data according to the metadata record; (2) the uncompressed data is obtained from the SSD; (3) compression and other processing are performed on the data, and the processed data is stored in the HDD.
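  • The background compression pass can be sketched as below. The metadata layout (a list of dictionaries with `location` and `compressed` keys), the dictionaries standing in for the SSD and HDD tiers, and the use of `zlib` are all assumptions made for this example.

```python
import zlib


def background_compress(metadata, ssd, hdd):
    """Compress data that was stored uncompressed on the SSD tier.

    metadata: list of records {"location": key, "compressed": bool}
    ssd, hdd: dictionaries standing in for the two storage tiers.
    Returns the number of items moved from the SSD to the HDD.
    """
    moved = 0
    for record in metadata:
        if record["compressed"]:
            continue                               # (1) only uncompressed data is selected
        raw = ssd.pop(record["location"])          # (2) read the data from the SSD
        hdd[record["location"]] = zlib.compress(raw)  # (3) compress and store to the HDD
        record["compressed"] = True
        moved += 1
    return moved
```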
  • the efficiency of data backup can be improved through hierarchical storage, and at the same time, the deduplication rate and reduction rate of data can be guaranteed, and the storage space occupied by data can be further reduced.
  • Determining the deduplication mode based on the deduplication rate is only an example. The data characteristics of the first data may also be other parameters, such as the priority of the data or the priority of the backup task, or other parameters may be used to determine the deduplication mode, which is not limited in this embodiment of the present application.
  • The above-mentioned method of hierarchically storing backup data is only an example. In the above data processing, hierarchical storage may not be performed, for example, if the hard disk of the storage device 20 includes storage of only one performance level (only SSD or only HDD).
  • Steps 407 to 412 are optional steps and are not mandatory. For example, in the backup scenario of this application, backup data may be deduplicated only, without compression, which is not limited in the embodiment of this application. Therefore, these steps are all shown in dashed boxes in FIG. 4.
  • With the above method, data can be deduplicated when the computing device is busy but the deduplication rate of the data is high; when the computing device is busy and the deduplication rate of the data is low, the data is not deduplicated, which reduces computing resource overhead. In this way, data backup can be performed without affecting normal production and business operations even when few computing resources are idle, which helps improve backup bandwidth, ensures the data deduplication rate, reduces the storage space occupied by data, and reduces storage costs.
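  • The decision made in steps 402 to 406 can be condensed into a single predicate. This is a sketch under the assumption that the working state and the predicted deduplication rate have already been determined; the function name is hypothetical.

```python
def should_deduplicate_inline(busy: bool, dedup_rate: float, first_preset: float) -> bool:
    """Return True if the first data should be deduplicated right away.

    Idle device: always deduplicate (step 402 leads to step 406).
    Busy device: deduplicate only if the predicted deduplication rate
    exceeds the first preset value (step 404); otherwise the data is
    stored as is and left for background deduplication (step 405).
    """
    if not busy:
        return True
    return dedup_rate > first_preset
```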
  • Embodiment 1: data processing is performed on the client 10 side.
  • The backup software on the client 10 obtains the data to be backed up (still referred to as the first data) from a local storage (such as internal memory or hard disk), and determines the deduplication mode according to the usage of the computing resources of the client 10.
  • the first data is sent to the storage device 20.
  • The client 10 may notify the storage device 20 that the first data is not compressed, and the storage device 20 then performs deduplication and/or compression processing on the first data.
  • The deduplication process performed by the client 10 may include the following, taking the first data as an example. The client 10 divides the first data into multiple data blocks, for example block 1, block 2, block 3 and block 4, based on the hash algorithm. The client 10 then calculates the fingerprint (that is, the hash value) of each data block and sends the fingerprint of each data block to the storage device 20.
  • The storage device 20 traverses the local fingerprint library, which includes the fingerprint of every data block stored by the storage device 20, and queries whether the fingerprint library contains the fingerprint of block 1 (fp1), the fingerprint of block 2 (fp2), the fingerprint of block 3 (fp3), and the fingerprint of block 4 (fp4) of the first data. If a fingerprint exists in the library, the corresponding data block is a repeated data block. For example, block 1 and block 3 are repeated data blocks, and block 2 and block 4 are non-repeated data blocks. Afterwards, the storage device 20 sends the query result to the client 10.
  • the query result here is used to indicate whether there is a repeated data block in the first data, identification information of the repeated data block, and the like.
  • the client 10 determines the repeated data block in the first data according to the query result.
  • The metadata of a data block includes but is not limited to: the fingerprint of the data block, the offset of the data block in the first data, and the length of the data block; the metadata is not limited in this embodiment of the present application.
  • the client 10 sends the metadata of the repeated data blocks and the data of the non-repeated data blocks to the storage device 20 .
  • the second data includes metadata of repeated data blocks and data of non-repeated data blocks.
  • For example, the second data includes fp1, fp3, the data of data block 2 and the data of data block 4.
  • Alternatively, data and metadata are separated: for example, the second data includes only the data of data block 2 and the data of data block 4, and the second data is sent to the storage device 20 together with the metadata of data block 1 and the metadata of data block 3.
  • the data volume of the second data is smaller than that of the first data, thereby reducing data transmission volume, reducing resource overhead and network load for backup, and improving logical backup bandwidth.
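  • The exchange between the client 10 and the storage device 20 can be sketched as two functions, one per side. The function names, the dictionary layout of each entry, and the use of SHA-1 are illustrative assumptions; in this sketch every entry carries metadata, and only non-repeated blocks also carry their data.

```python
import hashlib


def fp(block: bytes) -> str:
    """Fingerprint (hash value) of a data block."""
    return hashlib.sha1(block).hexdigest()


def storage_query(fingerprint_library: set, fingerprints: list) -> list:
    """Storage device side: report which fingerprints are already stored."""
    return [f in fingerprint_library for f in fingerprints]


def client_build_second_data(blocks, query_result):
    """Client side: metadata for every block, data only for non-repeated blocks."""
    second_data, offset = [], 0
    for block, duplicate in zip(blocks, query_result):
        entry = {"fingerprint": fp(block), "offset": offset, "length": len(block)}
        if not duplicate:
            entry["data"] = block  # non-repeated block: the data itself is sent
        second_data.append(entry)
        offset += len(block)
    return second_data
```

  • For four blocks where block 1 and block 3 are already stored, only block 2 and block 4 carry data, so less data is transmitted than when sending the first data in full.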
  • the compression method is determined according to the compression rate of the second data and the usage of computing resources of the client 10, still taking the CPU utilization rate as an example:
  • The second data is sent to the storage device 20, and the storage device 20 is notified that the second data is not compressed; the second data is subsequently compressed by the storage device 20 in order to reduce the storage resources used for backing up the second data.
  • The second data is compressed, and the compressed data (such as third data) is sent to the storage device 20.
  • the client 10 may notify the storage device 20 of the compression information of the third data, such as a compression algorithm, a checksum, and the like. If the storage device 20 determines that the third data is compressed data, the third data may not be compressed any more.
  • Embodiment 2: data processing is performed on the storage device 20 side.
  • The client 10 sends the data to be backed up (for example, first data) to the storage device 20, and correspondingly, the storage device 20 receives the first data sent by the client 10.
  • Here, the first data is data that has not been deduplicated or compressed.
  • the storage device 20 determines the deduplication mode according to the usage of computing resources of the storage device 20, such as taking CPU utilization as an example:
  • If the CPU utilization rate of the storage device 20 is higher than the first preset value, deduplication is not performed.
  • The first data is stored in the SSD, and metadata of the first data is generated to record that the first data is not deduplicated.
  • When the storage device 20 detects that a preset condition (such as a first preset condition) is met, it acquires non-deduplicated data, such as the first data, from the SSD based on the metadata record, performs deduplication and compression on the first data, and stores the deduplicated and compressed data in the HDD.
  • The first preset condition includes but is not limited to: (1) the CPU utilization rate of the storage device 20 is lower than a preset value (such as a fifth preset value).
  • The fifth preset value may be equal to the first preset value, or the two preset values may be different, which is not limited in this embodiment of the present application.
  • (3) The amount of data stored in the SSD reaches a preset value (for example, recorded as a sixth preset value).
  • the first data is deduplicated, and the deduplicated data is recorded as second data.
  • the second data can be stored in HDD.
  • the compression method is determined according to the compression rate of the second data and the usage of computing resources of the storage device 20, still taking the CPU utilization rate as an example:
  • the second data may not be compressed, and optionally, the storage device 20 stores the second data in the HDD.
  • the storage device 20 stores the compressed data to HDD.
  • When the storage device 20 detects that a preset condition (such as a second preset condition) is met, it acquires uncompressed data from the SSD based on the metadata records and compresses it; optionally, the compressed data is stored in the HDD.
  • The second preset condition includes but is not limited to: (1) the CPU utilization rate of the storage device 20 is lower than a preset value (such as a seventh preset value).
  • The seventh preset value may be equal to the fourth preset value, or the two preset values may be different, which is not limited in this embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a data processing device 800 provided in the present application.
  • The data processing device 800 includes an acquisition module 801, a determination module 802, a judgment module 803, and a processing module 804.
  • the data processing apparatus 800 is configured to execute the method executed by the computing device in the above method embodiments.
  • The acquisition module 801 is configured to obtain the first data; for the specific implementation, please refer to the description of step 400 in FIG. 4, which will not be repeated here.
  • a determining module 802 configured to determine a deduplication rate of the first data and a working state of the computing device. Please refer to the description of step 401 and step 403 in FIG. 4 for the specific implementation manner, which will not be repeated here.
  • a judging module 803 configured to judge whether the deduplication rate of the first data exceeds a first preset value when the computing device is in a busy state.
  • The processing module 804 is configured to, in response to the above judgment, perform deduplication on the first data to obtain second data when the deduplication rate of the first data exceeds the first preset value; for the specific implementation, please refer to the description of step 406 in FIG. 4, which will not be repeated here. The processing module 804 is also configured to store the first data in the first memory of the computing device when the deduplication rate of the first data does not exceed the first preset value; for the specific implementation, please refer to the description of step 405 in FIG. 4, which will not be repeated here.
  • the determining module 802 is also configured to determine the compression rate of the second data; for a specific implementation, please refer to the description of step 407 in FIG. 4 , which will not be repeated here.
  • The processing module 804 is configured to compress the second data when the compression rate of the second data is not lower than a second preset value and the utilization rate of computing resources is lower than a third preset value; and to compress at least part of the second data when the compression rate of the second data is not lower than the second preset value and the utilization rate of computing resources is not lower than the third preset value. For the specific implementation, please refer to the description of step 408 to step 411 in FIG. 4, which will not be repeated here.
  • The processing module 804 is further configured to store the second data in a second memory when the compression rate of the second data is lower than a second preset value, where the performance of the second memory is lower than that of the first memory; or, when the compression rate of the second data is not lower than the second preset value, to compress at least part of the second data and, optionally, store the compressed data in the second memory and store the uncompressed data of the second data in the first memory.
  • The determining module 802 is specifically configured to determine the deduplication rate of the first data according to the attribute parameters of the first data; or according to the deduplication rate of the target data corresponding to the first data, where the target data is data selected based on preset conditions; or according to the similarity between the first data and the data stored in the computing device.
  • the processing module 804 is further configured to perform deduplication and/or data compression on the first data when the computing device is in an idle state, where the idle state is based on computing resources of the computing device The usage is determined.
  • the computing resources include processor resources and/or memory.
  • for example, when the processor utilization rate exceeds a fourth preset value, and/or the memory utilization rate exceeds a fifth preset value, the working state of the computing device is a busy state; when the processor utilization rate does not exceed the fourth preset value, and/or the memory utilization rate does not exceed the fifth preset value, the working state of the computing device is an idle state.
  • the present application also provides a data processing device, which can be the engine 121 or the storage system 120 shown in FIG. 2, or the server 110 shown in FIG. 3, and which is used to carry out the above-mentioned related method steps so as to implement the method executed by the computing device in the foregoing embodiments; refer to the relevant description of any method embodiment in FIG. 4 to FIG. 7, which is not repeated here.
  • the embodiment of the present application also provides a computer storage medium in which computer instructions are stored; when the computer instructions are run on the computing device, the computing device executes the above-mentioned related method steps to implement the method executed by the computing device in the foregoing embodiments. Refer to the relevant description of any method embodiment in FIG. 4 to FIG. 7; details are not repeated here.
  • an embodiment of the present application also provides a computer program product which, when running on a computer, causes the computer to execute the above-mentioned related steps, so as to implement the method performed by the computing device in the above-mentioned embodiments; refer to the relevant description of any method embodiment in FIG. 4 to FIG. 7, which is not repeated here.
  • an embodiment of the present application also provides a device, which may specifically be a chip, a component or a module, and which may include a connected processor and memory. The memory is used to store computer-executable instructions; when the device is running, the processor can execute the computer-executable instructions stored in the memory, so that the chip executes the method executed by the computing device in the above method embodiments. Refer to the relevant description of any method embodiment in FIG. 4 to FIG. 7; for the sake of brevity, details are not repeated here.
  • the data processing device, computer storage medium, computer program product and chip provided in the embodiments of the present application are all used to execute the method corresponding to the computing device provided above; therefore, for the beneficial effects they can achieve, refer to the beneficial effects of the corresponding method provided above, which are not repeated here.
  • the present application further provides a distributed system including a first computing device and a second computing device, where the first computing device is configured to send data to be backed up to the second computing device.
  • the first computing device includes a data processing device configured to execute any one of the methods shown in FIG. 4 to FIG. 7 , and details are omitted here for brevity.
  • the above embodiments may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented using software, they may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, the processes or functions according to the embodiments of the present application will be generated in whole or in part.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device including a server, a data center, and the like integrated with one or more available media.
  • the available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a DVD), or a semiconductor medium (such as a solid state disk (SSD)), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

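This application provides a data processing method and apparatus. A computing device obtains first data and determines the deduplication rate of the first data and the working state of the computing device itself. When the computing device is in a busy state and the deduplication rate of the first data is greater than or equal to a first preset value, the computing device deduplicates the first data, which preserves the deduplication rate and reduces the storage space the data occupies. When the deduplication rate of the first data is lower than the first preset value, the computing device does not deduplicate the first data and may store it directly, which lowers the consumption of computing resources, allows as much data as possible to be backed up, preserves backup bandwidth, and increases the amount of data backed up.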

Description

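Data processing method and apparatus
Cross-reference to related applications
This application claims priority to Chinese patent application No. 202110773758.9, filed with the China National Intellectual Property Administration on July 8, 2021 and entitled "Deduplication and compression method and apparatus", which is incorporated herein by reference in its entirety. This application also claims priority to Chinese patent application No. 202111057198.3, filed with the China National Intellectual Property Administration on September 9, 2021 and entitled "Data processing method and apparatus", which is incorporated herein by reference in its entirety.
Technical field
This application relates to the field of computer technologies, and in particular to a data processing method and apparatus.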
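Background
In information data management, backup usually refers to the process, within a data center, of copying all or part of a data set in a file system or database system from the disks or storage array of an application host to other storage media.
At present, the data processing methods commonly used during backup include data deduplication and data compression (hereinafter "deduplication and compression"). Deduplication and compression are computationally expensive and occupy computing resources of the application host. To avoid affecting the host's normal production services, backup usually runs during off-peak hours, for example from midnight to 8 a.m.
Constrained by the limited computing resources of the application host, this approach cannot make full use of the available time for backup, so the amount of data backed up is low.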
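Summary
This application provides a data processing method and apparatus, intended to increase the amount of data backed up while reducing the storage space the data occupies.
According to a first aspect, an embodiment of this application provides a data processing method that may be performed by a computing device (such as a server). In the method, the computing device first obtains data to be backed up (referred to as first data), for example from the memory or a hard disk of the computing device, or from another computing device. The computing device then determines the deduplication rate of the first data and the current working state of the computing device. The working state may be a busy state or an idle state and may be determined according to the usage of computing resources in the computing device. When the computing device is in the busy state, it further judges whether the deduplication rate of the first data exceeds a threshold (referred to as a first preset value). When the deduplication rate of the first data exceeds the first preset value, the computing device deduplicates the first data to obtain second data; when the deduplication rate of the first data does not exceed the first preset value, the computing device may store the first data in a first memory of the computing device.
With the foregoing method, when the computing resource utilization of the computing device is lower than a first preset value and the data deduplication rate is lower than a second preset value, the data is not deduplicated, which reduces computing resource overhead and preserves backup bandwidth; when the computing resource utilization is lower than the first preset value but the data deduplication rate is not lower than the second preset value, the data is deduplicated, which preserves the deduplication rate, reduces the storage space occupied by backup data, and lowers storage cost. In addition, provided that normal production services are not affected, data can also be backed up during periods when computing resource utilization is low, for example lower than the second preset value, so that as much time as possible is used for backup and the amount of data backed up increases.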
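In a possible implementation, after obtaining the second data, the computing device may further determine the compression rate of the second data. If the compression rate of the second data is greater than or equal to a second preset value and the computing resource utilization is less than a third preset value, the second data is compressed; if the compression rate of the second data is greater than or equal to the second preset value and the computing resource utilization is greater than or equal to the third preset value, part of the second data is compressed.
With the foregoing method, after deduplication the data can be compressed according to the compression rate of the deduplicated second data. If the compression rate is high but computing resource utilization is also high, for example not lower than the third preset value, the second data may be left uncompressed to reduce the consumption of computing resources and preserve backup bandwidth. When computing resource utilization is low, for example lower than the third preset value, the second data is compressed to preserve the compression rate, reduce the storage space occupied by backup data, and raise the logical backup bandwidth.
In a possible implementation, the computing device further includes a second memory whose performance is lower than that of the first memory. If the compression rate of the second data is less than the second preset value, the second data is stored in the second memory. If the compression rate of the second data is greater than or equal to the second preset value, at least part of the second data is compressed, the compressed data is stored in the second memory, and the uncompressed part of the second data is stored in the first memory.
With the foregoing method, when the compression rate of the second data is low, the second data need not be compressed and is stored directly, which reduces computing resource overhead and preserves backup bandwidth. When the compression rate of the second data is high, at least part of the second data can be compressed, which preserves the compression rate, reduces the storage space occupied by backup data, and raises the logical backup bandwidth. In addition, with tiered storage, data awaiting compression is stored in the higher-performance first memory, and data that needs no compression or has already been compressed is stored in the second memory, so that background compression can read the data awaiting compression from the first memory quickly, which improves read performance.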
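In a possible implementation, the computing device may determine the deduplication rate of the first data in any of the following ways: according to an attribute parameter of the first data; according to the deduplication rate of target data corresponding to the first data, where the target data is data selected based on a preset condition; or according to the similarity between the first data and data already stored by the computing device.
The foregoing method provides flexibility in determining the data deduplication rate.
In a possible implementation, when the computing device is in the idle state, the first data may be deduplicated and/or compressed, where the idle state is determined based on the usage of the computing resources of the computing device.
In a possible implementation, the computing resources include processor resources and/or memory. For example, when processor utilization exceeds a fourth preset value and/or memory utilization exceeds a fifth preset value, the working state of the computing device is the busy state; when processor utilization does not exceed the fourth preset value and/or memory utilization does not exceed the fifth preset value, the working state of the computing device is the idle state.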
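In a possible implementation, when a first preset condition is met, the computing device may obtain uncompressed data from the first memory and compress it, and the compressed data may be stored in the second memory. The first preset condition includes: the computing device is in the idle state, a preset time is reached, or the amount of data in the first memory reaches a sixth preset value.
With the foregoing method, background compression is performed only when the first preset condition is met, which avoids affecting normal production services. In addition, background compression further preserves the compression rate and reduces the storage space occupied by backup data.
In a possible implementation, when a second preset condition is met, the computing device may obtain the first data that has not been deduplicated from the first memory and deduplicate and compress it to obtain third data. Optionally, the third data may be stored in the second memory, whose performance is lower than that of the first memory. The second preset condition includes: the computing device is in the idle state, a preset time is reached, or the amount of data in the first memory reaches the sixth preset value.
With the foregoing method, background deduplication and compression are performed only when the second preset condition is met, which avoids affecting normal production services. In addition, background deduplication and compression further preserve the deduplication rate and the compression rate and reduce the storage space occupied by backup data.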
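According to a second aspect, an embodiment of this application further provides a data processing apparatus that has the function of implementing the behavior in the method examples of the first aspect; for the beneficial effects, see the description of the first aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the function. In a possible design, the data processing apparatus includes an obtaining module, a determining module, a judging module, and a processing module. These modules can perform the corresponding functions in the method examples of the first aspect; see the detailed description in the method examples.
According to a third aspect, this application further provides a computing device that includes a processor and a memory and may further include a communication interface. The processor executes program instructions in the memory to perform the method provided in the first aspect or any possible implementation of the first aspect. The computing device may be a computing node, server, or controller in a storage system, or any computing device that needs to exchange data. The memory is coupled to the processor and stores the program instructions and data necessary for the data processing. The communication interface is used to communicate with other devices, for example to obtain the first data.
According to a fourth aspect, this application provides a computing device system that includes at least one computing device. Each computing device includes a memory and a processor. The processor of the at least one computing device is configured to access code in the memory to perform the method provided in the first aspect or any possible implementation of the first aspect.
According to a fifth aspect, this application provides a computer-readable storage medium; when the computer-readable storage medium is executed by a computing device, the computing device performs the method provided in the first aspect or any possible implementation of the first aspect. The storage medium stores a program. The storage medium includes but is not limited to volatile memory, such as random access memory, and non-volatile memory, such as flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
According to a sixth aspect, this application provides a computing device program product that includes computer instructions; when the instructions are executed by a computing device, the computing device performs the method provided in the first aspect or any possible implementation of the first aspect. The computer program product may be a software installation package that can be downloaded and executed on a computing device whenever the method provided in the first aspect or any possible implementation of the first aspect is needed.
According to a seventh aspect, this application further provides a computer chip. The chip is connected to a memory and is configured to read and execute a software program stored in the memory to perform the method described in the first aspect and each possible implementation of the first aspect.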
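Brief description of the drawings
FIG. 1 is a schematic diagram of a backup system according to an embodiment of this application;
FIG. 2 is a schematic diagram of a system architecture according to an embodiment of this application;
FIG. 3 is a schematic diagram of another system architecture according to an embodiment of this application;
FIG. 4 is a schematic flowchart of a data processing method according to an embodiment of this application;
FIG. 5 is a schematic diagram of a data processing method according to an embodiment of this application;
FIG. 6 is a schematic flowchart of a background deduplication approach according to an embodiment of this application;
FIG. 7 is a schematic flowchart of a background compression approach according to an embodiment of this application;
FIG. 8 is a schematic structural diagram of a data processing apparatus according to this application.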
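Detailed description
The data processing method provided in the embodiments of this application can be applied to the backup system shown in FIG. 1. As shown in FIG. 1, the system includes a client 10 and a storage device 20.
The client 10 provides the data to be backed up. It may be a computing device deployed on the user side or software on that computing device. The computing device may be a physical machine or a virtual machine. Physical machines include but are not limited to desktop computers, servers (such as application servers, file servers, and database servers), laptops, and mobile devices. The software may be an application installed on the computing device, such as backup software, whose main function is to manage backup policies: when backup starts, what content on the client 10 is backed up, where the data is backed up to, and so on. The client 10 may communicate with the storage device 20 over a network or by other means. The network generally denotes any telecommunications or computer network, including, for example, an intranet, a wide area network (WAN), a local area network (LAN), a personal area network (PAN), or the Internet.
The storage device 20 provides services such as storage resources and computing resources to the client 10. In this application, the storage device 20 may serve as the backup medium that stores the data of the client 10.
It should be noted that FIG. 1 shows only one client 10 and one storage device 20 for brevity; in practice, the embodiments of this application do not limit the number of clients 10 or storage devices 20. For example, the storage device 20 may be a centralized storage system, or it may be any storage device in a distributed storage system.
Two specific system architectures are described below.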
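FIG. 2 is a schematic diagram of a system architecture according to an embodiment of this application. As shown in FIG. 2, the architecture includes an application server 100, a switch 110, and a storage system 120. The client 10 in FIG. 1 may be the application server 100 in FIG. 2, and the storage device 20 in FIG. 1 may be the storage system 120 in FIG. 2.
Users access data through applications. The computer that runs these applications is called an "application server". The application server accesses the storage system 120 through the Fibre Channel switch 110 to read and write data. The switch 110 is optional, however; the application server 100 may also communicate with the storage system 120 directly over a network. For the network, see the earlier description.
The storage system 120 shown in FIG. 2 is a centralized storage system. A centralized storage system has a single entry point through which all data from external devices must pass; this entry point is the engine 121 of the centralized storage system. The engine 121 is the core component of the centralized storage system, and many advanced functions of the storage system are implemented in it.
As shown in FIG. 2, the engine 121 contains one or more controllers. FIG. 2 uses an engine with two controllers as an example. Controller 0 and controller 1 are connected by a mirror channel, so that the two controllers can back each other up. The engine 121 further includes a front-end interface 125 and a back-end interface 126. The front-end interface 125 communicates with the application server 100 and provides storage services to it. The back-end interface 126 communicates with the hard disks 134 to expand the capacity of the storage system. Through the back-end interface 126, the engine 121 can connect more hard disks 134 and thereby form a very large storage resource pool.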
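In terms of hardware, as shown in FIG. 2, controller 0 includes at least a processor 123 and a memory 124. The processor 123 may be implemented as a central processing unit (CPU), a hardware logic circuit, a processing core, an ASIC, an AI chip, or a programmable logic device (PLD). The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), a system on chip (SoC), or any combination thereof.
The processor 123 handles data access requests from outside the storage system 120 (from a server or another storage system) as well as requests generated inside the storage system 120. For example, when the processor 123 receives, through the front-end interface 125, a write request sent by the application server 100, it may first store the data in memory, for example in the memory 124. When the amount of data in the memory 124 reaches a certain threshold, the processor 123 sends the data stored in the memory 124 through the back-end interface 126 to the hard disks 134 for persistent storage. The processor 123 also computes on or processes data, for example deduplication, compression, data verification, storage space virtualization, and address translation. FIG. 2 shows only one processor 123; in practice there are often several processors 123, each with one or more processor cores. This embodiment does not limit the number of processors or processor cores.
The memory 124 is internal memory that exchanges data directly with the processor 123. It can be read and written at any time and at high speed, and it serves as temporary data storage for the operating system and other running programs. The memory 124 includes at least two kinds of memory; for example, it may be random access memory or read-only memory (ROM). The random access memory may be, for example, dynamic random access memory (DRAM) or storage class memory (SCM). DRAM is a semiconductor memory and, like most RAM, is a volatile memory device. SCM is a hybrid storage technology that combines characteristics of traditional storage devices and memory: it offers faster reads and writes than a hard disk, but it is slower than DRAM and also cheaper than DRAM. DRAM and SCM are only examples in this embodiment; the memory may also include other random access memory, such as static random access memory (SRAM). The read-only memory may be, for example, programmable read-only memory (PROM) or erasable programmable read-only memory (EPROM). The memory 124 may also be a dual in-line memory module (DIMM), that is, a module composed of DRAM. In practice, the storage system 120 may be configured with multiple memories 124 and with memories 124 of different types. This embodiment does not limit the number or type of the memories 124.
The hardware components and software structure of controller 1 (and of other controllers not shown in FIG. 2) are similar to those of controller 0 and are not described again here.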
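FIG. 2 shows a centralized storage system in which the disks and the controllers are separated. In this system, the engine 121 may have no hard disk slots; the hard disks 134 are placed in a disk enclosure 130, and the back-end interface 126 communicates with the disk enclosure 130. The back-end interface 126 exists in the engine 121 in the form of an adapter card, and one engine 121 may use two or more back-end interfaces 126 at the same time to connect multiple disk enclosures. Alternatively, the adapter card may be integrated on the motherboard, in which case it can communicate with the processor 123 over a Peripheral Component Interconnect Express (PCI-E) bus. Optionally, the storage system 120 may instead integrate the disks and the controllers: the engine 121 has hard disk slots, and the hard disks 134 are deployed directly in the engine 121, that is, the hard disks 134 and the engine 121 are in the same device.
The hard disk 134 is typically used to store data persistently. It may be non-volatile memory, for example read-only memory (ROM), a hard disk drive (HDD), storage-class memory (SCM), or a solid state drive (SSD). The performance of SSD and of SCM is higher than that of HDD; in this application, memory performance is assessed mainly in terms of operation speed and/or access latency. In one possible architecture, the storage device 20 includes at least two kinds of hard disks with different performance (referred to as the first memory and the second memory), for example SSD and HDD, or SSD and SCM, or SCM and HDD, respectively. For ease of description, the following uses SSD as the first memory and HDD as the second memory.
Note that FIG. 2 does not show the hardware architecture of the application server 100. In terms of hardware, the application server 100 may include a processor, memory, and a network interface card, and optionally a hard disk. For the types and roles of these components, see the description of the storage system 120.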
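The data processing method provided in the embodiments of this application applies not only to centralized storage systems but also to distributed storage systems. FIG. 3 is a schematic diagram of the system architecture of a distributed storage system according to an embodiment of this application. The distributed storage system includes a server cluster. As shown in FIG. 3, the server cluster includes one or more servers 110 (FIG. 3 shows three servers 110, but the number is not limited to three), and the servers 110 can communicate with one another. Optionally, a virtual machine 107 may be created on a server 110; the computing resources it needs come from the local processor 112 and memory 113 of the server 110, and various applications can run in the virtual machine 107. The client 10 may be the virtual machine 107, or a computing device other than the servers 110 (not shown in FIG. 3).
In terms of hardware, as shown in FIG. 3, the server 110 includes at least a processor 112, a memory 113, a network interface card 114, and a hard disk 105, which are connected by a bus. The bus includes but is not limited to: a PCIe bus, a double data rate (DDR) bus, an interconnect bus supporting multiple protocols (hereinafter "multi-protocol interconnect bus", described in detail below), a serial advanced technology attachment (SATA) bus, a serial attached SCSI (SAS) bus, a Controller Area Network (CAN) bus, and a Compute Express Link (CXL) standard bus. For the roles and specific types of the processor 112, the memory 113, the network interface card 114, and the hard disk 105, see the description of FIG. 2.
The centralized and distributed storage systems mentioned above are only examples; the data processing method provided in the embodiments of this application also applies to other centralized and distributed storage systems.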
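The data processing method provided in the embodiments of this application can be applied to data backup scenarios. When the method is applied, the computing device chooses the deduplication and compression approach according to the usage of its computing resources and the data characteristics of the data to be backed up. This ensures that the production services of the computing device are not affected, and it frees backup tasks from the backup window set by the backup policy, so that as many backup tasks as possible can run and the amount of data backed up increases. At the same time, computing resources are used for deduplication wherever possible, which preserves the deduplication rate and reduces the storage space the data occupies.
In this application, depending on where the data processing method is performed, it can be implemented at the source or at the target. Taking FIG. 1 as an example, source-side implementation means that the client 10 performs the data processing method provided in the embodiments of this application; target-side implementation means that the storage device 20 performs it.
The data processing method provided in the embodiments of this application is described below with reference to FIG. 4, using the system of FIG. 1 as an example. Accordingly, the computing device mentioned in the method may be the client 10 or the storage device 20 in FIG. 1. As shown in FIG. 4, the method includes the following steps.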
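Step 400: the processor obtains the data to be backed up (referred to as first data).
The processor can obtain the data to be backed up in several ways. For example, on the client 10 side, it may obtain the data from a write request of an application. Alternatively, the data carried in a write request triggered by an application may first be stored temporarily in a memory of the client 10 (such as main memory or a hard disk) and then obtained from that memory. As another example, on the storage device 20 side, besides obtaining the data locally as described above, the storage device 20 may obtain the data to be backed up from a write request received from the client 10.
Step 401: determine the working state of the computing device according to the usage of computing resources.
Computing resources here are the resources on the computing device used for computation, data processing, and similar operations, such as CPU and memory. In this application, the computing resources include but are not limited to one or more of CPU and memory. When the computing resource is the CPU, its usage may be characterized by parameters such as CPU utilization or CPU idle rate; the embodiments of this application do not limit this, and any parameter that reflects CPU usage is applicable. When the computing resource is memory, its usage may be characterized by parameters such as memory utilization, memory idle rate, memory used, or memory remaining.
The working state of the computing device can be determined from this usage. The working state includes but is not limited to an idle state and a busy state. For ease of description, the following uses CPU utilization to represent computing resource usage. For example, if CPU utilization is lower than a threshold (such as preset value A), the computing device is determined to be in the idle state; conversely, if CPU utilization is greater than or equal to preset value A, the computing device is determined to be in the busy state. CPU utilization can be obtained by calling the relevant system functions, which is not described in detail here; any way of determining CPU utilization is applicable to the embodiments of this application.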
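The threshold test in step 401 can be sketched as follows. This is a minimal illustration only: the function name `working_state`, the enum `WorkState`, and the default value 0.7 for preset value A are assumptions made for the example and do not come from the application, which leaves the threshold and the way CPU utilization is sampled open.

```python
from enum import Enum


class WorkState(Enum):
    IDLE = "idle"
    BUSY = "busy"


def working_state(cpu_utilization: float, preset_a: float = 0.7) -> WorkState:
    """Classify the device from CPU utilization expressed as a fraction in [0, 1].

    preset_a stands in for "preset value A"; 0.7 is an arbitrary example value.
    """
    if not 0.0 <= cpu_utilization <= 1.0:
        raise ValueError("cpu_utilization must be a fraction between 0 and 1")
    # Below the threshold the device is idle; at or above it, busy.
    return WorkState.IDLE if cpu_utilization < preset_a else WorkState.BUSY
```

Note that the boundary case (utilization exactly equal to preset value A) counts as busy, matching the "greater than or equal to" wording of the step.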
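Step 402: judge whether the computing device is in the busy state. If it is, perform step 403; otherwise, perform step 406.
It can be understood that step 402 may equally judge whether the computing device is in the idle state: if it is idle, perform step 406; otherwise, perform step 403.
Step 403: determine the data characteristics of the first data.
The data characteristics of the first data include but are not limited to the deduplication rate, which can be understood as the degree to which the first data repeats data that is already stored. For example, if 75% of the first data is identical to stored data, the deduplication rate of the first data is 75%. Several ways of determining the deduplication rate of the first data are described below.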
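Deduplication rate, approach 1: determine it according to an attribute parameter of the first data (referred to as a first attribute parameter).
The first attribute parameter of the first data may be generated by an application, such as the backup software or the application that produced the first data. The processor can predict the deduplication rate of the first data from the first attribute parameter that the application generated for it.
For example, the first attribute parameter may be the backup type of the first data. As those skilled in the art know, backup types include full backup, differential backup, and incremental backup. A full backup is a complete copy of all data at a point in time, that is, all of the data in scope is backed up. A differential backup takes the last full backup as its baseline and backs up only newly generated or changed data. Note that several differential backups may run between two consecutive full backups, all against the same full backup, so most of the data they back up may be the same. An incremental backup uses a different baseline: it takes the previous backup as its baseline and backs up data that is new or changed since then, so the data in each incremental backup is different.
It should be noted that the backup type is only an example. The first attribute parameter may be another parameter, such as an identifier indicating whether the deduplication rate is high or low. For example, when the backup software determines that the backup type of the first data is full or differential, it may set the first attribute parameter to an identifier indicating a high deduplication rate; when the backup type is incremental, it may set the first attribute parameter to an identifier indicating a low deduplication rate. The embodiments of this application do not limit this.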
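Deduplication rate, approach 2: determine it according to the deduplication rate of target data corresponding to the first data.
In this application, the deduplication rate of the first data may be determined from the deduplication rate of target data corresponding to the first data, for example by taking the deduplication rate of the target data as that of the first data. The target data may be data outside the first data, for example data that precedes the first data, such as the data carried in the write request immediately before the first data; or the target data may be part of the first data. The deduplication rate of the target data may be the actual rate obtained by deduplicating the target data, or a rate determined in another way, for example by approach 1 above or approach 3 below. The embodiments of this application do not limit this.
As an example, take a file as the granularity and split the file into multiple data blocks. The data in one or more of these blocks may serve as the target data of the file. To improve accuracy, one or more data blocks may instead be sampled periodically; the data in these blocks is the target data for that period. The deduplication rate of the target data is then determined and used as the deduplication rate of the other data in the same period. Suppose a file is 1 GB and is split into 256 data blocks of 4 MB each, and the target data of each period is the first data block of the period. Using the actual deduplication rate of the target data: the first 4 MB block of the period is deduplicated and its actual deduplication rate is determined. If 3 MB of it is removed as duplicate, the deduplication rate is 75%, so the deduplication rate of the remaining data in the period is also taken to be 75%. How deduplication is performed is described in detail below. The periods may be equal time intervals, or each period may contain the same number of data blocks; for example, each period has duration T, or each period contains N data blocks.
In the above, the first data may be the file or part of the data in the file; the embodiments of this application do not limit this.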
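The sampling idea in approach 2 can be illustrated with a small sketch. It assumes, purely for the example, that each period holds a fixed number of blocks, that the first block of a period is the target data, and that a block's actual deduplication rate is measured by hashing fixed-size chunks with SHA-1 and looking them up in a set of already stored fingerprints. The names `block_dedup_rate` and `estimate_period_rates` are hypothetical.

```python
import hashlib


def block_dedup_rate(block: bytes, stored: set[str], chunk_size: int = 4096) -> float:
    """Fraction of the block's bytes whose chunks are already stored."""
    if not block:
        return 0.0
    removed = 0
    for off in range(0, len(block), chunk_size):
        chunk = block[off:off + chunk_size]
        if hashlib.sha1(chunk).hexdigest() in stored:
            removed += len(chunk)
    return removed / len(block)


def estimate_period_rates(blocks: list[bytes], stored: set[str],
                          blocks_per_period: int, chunk_size: int = 4096) -> list[float]:
    """Measure only the first block of each period and reuse its rate
    as the estimate for every other block in that period."""
    rates: list[float] = []
    for start in range(0, len(blocks), blocks_per_period):
        rate = block_dedup_rate(blocks[start], stored, chunk_size)
        period_len = min(blocks_per_period, len(blocks) - start)
        rates.extend([rate] * period_len)
    return rates
```

With a sampled block in which three of four chunks are already stored, `block_dedup_rate` returns 0.75, which mirrors the "3 MB removed out of 4 MB" example in the text; the other blocks of the period inherit that estimate without being hashed.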
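Deduplication rate, approach 3: determine it according to the similarity between the first data and stored data.
For example, the similarity between the first data and stored data can be determined from their similarity fingerprints. The procedure may be as follows. First compute the similarity fingerprint of the first data: split the first data into multiple blocks, hash each block to obtain its hash value, and let these hash values form the fingerprint of the first data. Then match the similarity fingerprint of the first data against the similarity fingerprints of the stored data and determine the maximum similarity between the first data and the stored data. The similarity fingerprints of the stored data can be determined in the same way.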
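A minimal sketch of approach 3 follows. The application does not say how two fingerprints are compared, so the Jaccard index over the sets of block hashes is used here as an assumed similarity measure; the fixed block size and the names `fingerprint` and `max_similarity` are likewise assumptions for illustration.

```python
import hashlib


def fingerprint(data: bytes, block_size: int = 4096) -> frozenset[str]:
    """Similarity fingerprint: the set of SHA-1 hashes of the data's blocks."""
    return frozenset(
        hashlib.sha1(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    )


def max_similarity(new_fp: frozenset[str], stored_fps: list[frozenset[str]]) -> float:
    """Largest Jaccard similarity between the new fingerprint and any stored one."""
    best = 0.0
    for fp in stored_fps:
        union = new_fp | fp
        if union:
            best = max(best, len(new_fp & fp) / len(union))
    return best
```

The returned maximum is what step 404 later compares against preset value B.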
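It should be noted that this application does not strictly limit the order of step 402 and step 403. For example, step 402 and step 403 may be performed at the same time, or step 403 may be performed before or after step 402.
Step 404: judge whether the deduplication rate of the first data exceeds the first preset value. If it does not, perform step 405; otherwise, perform step 406.
In approach 1, taking the backup type as the first attribute parameter: if the backup type of the first data is differential or full, the deduplication rate of the first data is considered high, which in this application means that it exceeds the first preset value. If the backup type of the first data is incremental, the deduplication rate of the first data does not exceed the first preset value.
In approach 3, if the maximum similarity between the first data and the stored data is higher than preset value B, the deduplication rate of the first data is considered to exceed the first preset value. Conversely, if the maximum similarity is less than or equal to preset value B, the deduplication rate of the first data is considered not to exceed the first preset value.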
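The two ways of answering step 404 described above can be written as small predicates. The string labels for the backup types, the function names, and the example value 0.5 for preset value B are assumptions for illustration.

```python
HIGH_DEDUP_BACKUP_TYPES = {"full", "differential"}
KNOWN_BACKUP_TYPES = HIGH_DEDUP_BACKUP_TYPES | {"incremental"}


def exceeds_first_preset_by_backup_type(backup_type: str) -> bool:
    """Approach 1: full and differential backups are treated as high deduplication rate."""
    if backup_type not in KNOWN_BACKUP_TYPES:
        raise ValueError(f"unknown backup type: {backup_type!r}")
    return backup_type in HIGH_DEDUP_BACKUP_TYPES


def exceeds_first_preset_by_similarity(max_similarity: float, preset_b: float = 0.5) -> bool:
    """Approach 3: only a maximum similarity strictly above preset value B counts."""
    return max_similarity > preset_b
```

A `True` result leads to step 406 (deduplicate); a `False` result leads to step 405 (store without deduplication).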
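Step 405: do not deduplicate or compress; store the first data.
Referring to FIG. 5, suppose the first data is data block 1. If the deduplication rate of the first data does not exceed the first preset value, the first data is not deduplicated or compressed (see 3-1-1 in FIG. 5). In an optional implementation, the first data may be stored in the SSD (see 3-1-2 in FIG. 5) and metadata may be generated for it (see 3-1-3 in FIG. 5). The metadata may record where the first data is stored in the SSD and whether the first data has been deduplicated or needs background deduplication. Later, the first data can be read from the SSD according to the metadata for background deduplication. This design reduces the consumption of computing resources and the impact on backup bandwidth while the computing device is busy, and the background deduplication mechanism preserves the deduplication rate and further reduces the storage space the data occupies.
It should be noted that step 405 is only an example of an optional implementation. This application may also compress data whose deduplication rate is lower than the first preset value, as described below.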
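Step 406: deduplicate the first data; the deduplicated data is referred to as second data.
Continuing with FIG. 5, suppose the first data is data block 2. If the deduplication rate of the first data exceeds the first preset value, the first data is deduplicated (see 3-2 in FIG. 5).
Specifically, by deduplication granularity, deduplication methods include at least file-level deduplication and sub-file-level deduplication (also called block-level deduplication). (1) File-level deduplication, also called single-instance storage (SIS), operates on whole files: it detects and removes duplicate copies of a file and stores only one copy, replacing the other copies with pointers to that unique copy. File-level deduplication is simple and fast, but it cannot eliminate duplicate content inside files. For example, two 10 MB PowerPoint presentations that differ only in the title page are not treated as duplicates, and both are stored. (2) Sub-file-level deduplication splits a file or object into fixed-size or variable-size data blocks and deduplicates block by block, which removes duplicate data across files. It has two implementations: fixed-length blocks and variable-length segments. Fixed-length block deduplication divides the file into blocks of a fixed length and uses a hash algorithm to find duplicates. It is simple but can miss a lot of duplicate data, because the block boundaries of similar data may differ. Consider adding a person's name to the title page of a document: the whole document shifts, every block changes, and the duplicate data can no longer be detected. In variable-length segment deduplication, if one segment changes, only that segment's boundary is adjusted and the remaining segments are unchanged, which improves the ability to identify duplicate segments compared with the fixed-block method.
The deduplication process is described below using variable-size block-level deduplication as an example. First, a preset algorithm (such as a CDC algorithm) determines the block boundaries within a stretch of data and splits the data into multiple blocks, whose sizes may differ. Next, each block is hashed (for example with SHA-1). The resulting hash value acts as the fingerprint of the block: blocks with the same content have the same fingerprint, so matching fingerprints tells whether two blocks have the same content. Then the fingerprint of each block is matched against the blocks already in storage. A block that has been stored before is not stored again; only its hash value is recorded as index information, and a mapping associates the index information with the physical location of the block. A new block that has not been stored before is first stored physically and then recorded under its hash index. In FIG. 5, data block 2 (D0, D1, D2, D3, D4, D5) is deduplicated; assuming D1 and D3 are duplicate blocks, they are removed, yielding the second data (D0, D2, D4, D5). This procedure ensures that identical data blocks are stored on the physical medium only once, which achieves the goal of removing duplicate data.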
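The fingerprint-index part of this procedure can be sketched as below. The sketch deliberately skips content-defined chunking and takes the blocks as already cut; it uses SHA-1 as in the text. The class name `BlockStore` and its methods are hypothetical, and an in-memory dictionary stands in for the physical medium and the fingerprint index.

```python
import hashlib


class BlockStore:
    """Toy block store: each distinct block is kept once, keyed by its SHA-1 fingerprint."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}  # fingerprint -> physically stored block

    def write(self, chunks: list[bytes]) -> tuple[list[str], list[bytes]]:
        """Store a sequence of blocks.

        Returns the recipe (ordered fingerprints needed to rebuild the data)
        and the blocks that were new and therefore physically stored.
        """
        recipe: list[str] = []
        new_blocks: list[bytes] = []
        for chunk in chunks:
            fp = hashlib.sha1(chunk).hexdigest()
            if fp not in self.blocks:
                self.blocks[fp] = chunk      # new block: store it once
                new_blocks.append(chunk)
            recipe.append(fp)                # duplicates are recorded by index only
        return recipe, new_blocks

    def read(self, recipe: list[str]) -> bytes:
        """Rebuild the original data from its recipe."""
        return b"".join(self.blocks[fp] for fp in recipe)
```

If D1 and D3 are already in the store, writing (D0, D1, D2, D3, D4, D5) physically stores only (D0, D2, D4, D5), as in the FIG. 5 example, while the recipe still allows the full data to be read back.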
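In an implementation, the actual deduplication rate of the first data is monitored and used as a basis for obtaining the deduplication rate in step 403. Optionally, whether subsequent data is deduplicated may be decided according to the actual deduplication rate of the first data.
Step 407: determine the compression rate of the second data.
Two ways of determining the compression rate of the second data are described below.
Compression rate, approach 1: determine it according to a second attribute parameter of the second data.
For example, the second attribute parameter includes but is not limited to the file type. If the file type is audio or video, this indicates that the compression rate of the first data is probably low. If the file type is a text file or an office file, this indicates that the compression rate of the first data is high. The file type is only an example; the second attribute parameter may be another parameter, and the embodiments of this application do not limit this. In addition, like the first attribute parameter, the second attribute parameter may be generated by an application for the first data. Because the second data is in fact the first data after deduplication, the second attribute parameter of the first data is also the second attribute parameter of the second data.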
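Compression rate, approach 2: determine it according to the compression rate of target data corresponding to the second data.
Similar to approach 2 for the deduplication rate, the compression rate of the second data may be determined from the compression rate of target data corresponding to the second data, for example by taking the compression rate of the target data as that of the second data. The target data may be part of the second data or data outside the second data, for example the data carried in the previous write I/O request before the first data (which contains the second data) and belonging to the same file as the first data; for how the target data is determined, see the earlier description. The compression rate of the target data may be its actual compression rate or a predicted compression rate, for example one determined by approach 1. The embodiments of this application do not limit this.
Step 408: judge whether the compression rate of the second data is less than the second preset value. If it is, perform step 409; otherwise, perform step 410.
Step 409: do not compress; store the second data.
Referring to FIG. 5, if the compression rate of the second data (for example D0, D2, D4, D5) is low, then in an optional implementation the second data may be stored in the HDD (see 5-1 in FIG. 5). Because the second data is not compressed, this design reduces computing resource overhead.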
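Step 410: judge whether CPU utilization is less than a preset value (the third preset value). If it is, perform step 411; otherwise, perform step 412.
Step 411: compress the second data; the compressed data is referred to as third data.
In the embodiments of this application, a compression algorithm may be used to compress the second data (see 5-2 and 6 in FIG. 5). The compression algorithm may be, for example, Shannon-Fano coding, Huffman coding, arithmetic coding, or LZ77/LZ78 coding. The embodiments of this application do not limit the compression algorithm; any existing algorithm that can compress data, and any compression algorithm that may be used in the future, is applicable. The compression algorithm may be specified by the user or chosen adaptively by the system, which is also not limited. In an implementation, the compressed data (the third data) may be stored in the HDD (see 7 in FIG. 5; the same applies below), which completes the backup of the first data.
In an implementation, the processor may monitor the actual compression rate of the second data and use it as a basis for obtaining the compression rate in step 407. Optionally, whether subsequent data is compressed may be decided according to the actual compression rate of the second data.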
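As a rough illustration of step 411 together with the monitoring of the actual compression rate, the sketch below uses Python's standard `zlib` module (DEFLATE, which combines LZ77 and Huffman coding). The application does not define "compression rate" numerically; here it is assumed to mean the fraction of space saved, and the function name `compress_second_data` is hypothetical.

```python
import zlib


def compress_second_data(data: bytes, level: int = 6) -> tuple[bytes, float]:
    """Compress the data and report the observed space saving.

    The saving is 1 - compressed_size / original_size; it can be negative
    for incompressible input because of the container overhead.
    """
    compressed = zlib.compress(data, level)
    saved = 1.0 - len(compressed) / len(data) if data else 0.0
    return compressed, saved
```

The observed saving could be fed back as the "actual compression rate" that informs the estimate in step 407 for later data of the same file.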
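Step 412: compress part of the second data.
As shown at 5-3-1 in FIG. 5, part of the second data (for example D0 and D2 out of D0, D2, D4, D5) may be selected for compression, while the rest (for example D4 and D5) is left uncompressed for now. The selection may be random or follow a preset rule, such as choosing the data that comes first; the embodiments of this application do not limit this. For the compression itself, see the earlier description.
In an implementation, the uncompressed part may first be stored in the SSD (see 5-3-2 in FIG. 5), and metadata for it is recorded (see 5-3-3 in FIG. 5). The metadata may include where the data is stored in the SSD and whether it has been compressed. Later, the processor can read the data from the SSD according to the metadata and compress it. With this design, CPU overhead and the impact on backup bandwidth are kept low while CPU utilization is high, and the data reduction rate is still preserved, which further reduces the storage space the data occupies and lowers storage cost.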
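Steps 408 to 412 and the tiered placement can be combined into one sketch. The thresholds 0.3 and 0.7, the dictionaries standing in for SSD and HDD, the "compress the first half" selection rule, and the names `Tiers` and `store_second_data` are all assumptions made for the example; the text only requires that some part be compressed and the rest be recorded for later.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Tiers:
    ssd: dict[str, bytes] = field(default_factory=dict)   # first memory (faster)
    hdd: dict[str, bytes] = field(default_factory=dict)   # second memory (slower)
    pending_compression: list[str] = field(default_factory=list)  # metadata


def store_second_data(blocks: dict[str, bytes], compression_rate: float,
                      cpu_utilization: float, tiers: Tiers,
                      second_preset: float = 0.3, third_preset: float = 0.7) -> str:
    if compression_rate < second_preset:        # step 409: not worth compressing
        tiers.hdd.update(blocks)
        return "stored_uncompressed"
    if cpu_utilization < third_preset:          # step 411: compress everything
        for name, data in blocks.items():
            tiers.hdd[name] = zlib.compress(data)
        return "compressed_all"
    names = list(blocks)                        # step 412: compress the leading part
    half = (len(names) + 1) // 2
    for name in names[:half]:
        tiers.hdd[name] = zlib.compress(blocks[name])
    for name in names[half:]:
        tiers.ssd[name] = blocks[name]          # keep raw data on the fast tier
        tiers.pending_compression.append(name)  # remember it for background work
    return "compressed_part"
```

For the blocks (D0, D2, D4, D5) under high CPU utilization, this compresses D0 and D2 to the HDD and parks D4 and D5 on the SSD with a pending-compression record, as in the FIG. 5 example.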
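To further reduce the storage space the data occupies, in an optional implementation the embodiments of this application may also perform background deduplication on data stored in the SSD that has not been deduplicated. As shown in FIG. 6, the background deduplication procedure may include: (1) a post-deduplication module identifies the data that has not been deduplicated (such as data block 1 in FIG. 5) from the metadata records; (2) it obtains the data from the SSD; (3) it deduplicates the data; (4) optionally, it further processes the deduplicated data, for example by compressing it, and then stores the processed data in the HDD, which completes the backup of that data. The post-deduplication module may be a software module or a hardware module; the embodiments of this application do not limit this. This design preserves the deduplication rate and further reduces the storage space the data occupies.
Similarly, this application may perform background compression on uncompressed data stored in the SSD. As shown in FIG. 7, the background compression procedure may include: (1) the processor identifies the uncompressed data from the metadata records; (2) it obtains the uncompressed data from the SSD; (3) it compresses or otherwise processes the data and stores the result in the HDD. With this design, tiered storage improves the efficiency of data backup while preserving the deduplication rate and the reduction rate of the data, which further reduces the storage space the data occupies.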
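The two background flows can be sketched as one pass over the metadata records. This is a simplified illustration: deduplication is done at whole-record granularity with SHA-1, compression uses `zlib`, and the record fields (`ssd_key`, `needs_dedup`, `done`, `hdd_key`) as well as the function name `background_pass` are hypothetical.

```python
import hashlib
import zlib


def background_pass(metadata: list[dict], ssd: dict[str, bytes], hdd: dict[str, bytes]) -> None:
    """Move deferred data from the SSD to the HDD, deduplicating and compressing on the way."""
    for record in metadata:
        if record["done"]:
            continue
        data = ssd.pop(record["ssd_key"])           # (2) fetch the data from the SSD
        if record["needs_dedup"]:
            key = hashlib.sha1(data).hexdigest()    # (3) deduplicate by fingerprint
            if key in hdd:                          # duplicate: keep only a reference
                record["hdd_key"] = key
                record["done"] = True
                continue
        else:
            key = record["ssd_key"]
        hdd[key] = zlib.compress(data)              # compress, then store on the HDD
        record["hdd_key"] = key
        record["done"] = True
```

After the pass, every processed record points at its HDD location and the SSD space it used is released, which is the effect the tiered design aims for.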
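It should be noted that: (1) determining the deduplication approach from the deduplication rate is only an example. The data characteristics of the first data may be other parameters, such as the priority of the data or the priority of the backup task, and other parameters may be used to determine the deduplication approach; the embodiments of this application do not limit this. (2) Storing backup data in tiers is only an example. Tiered storage may be omitted during the data processing described above. For example, if the hard disks of the storage device 20 include memory of only one performance level, such as only SSD or only HDD, then the data that has not been deduplicated, the deduplicated data, the uncompressed data, and the compressed data may all be stored in that same memory; the embodiments of this application do not limit this. (3) Steps 407 to 412 are optional and need not be performed. For example, in the backup scenario of this application, the backup data may be deduplicated without being compressed; the embodiments of this application do not limit this. These steps are therefore shown in dashed boxes in FIG. 4.
With the foregoing approach, data is deduplicated when computing resource utilization is low but the data deduplication rate is high, and data is not deduplicated when computing resource utilization is low and the data deduplication rate is low, which reduces computing resource overhead. In this way, the periods of low computing resource utilization can be used for data backup without affecting normal production services, which helps raise backup bandwidth while preserving the deduplication rate, reducing the storage space the data occupies, and lowering storage cost.
The data processing method of the embodiments of this application is described in detail below with reference to specific embodiments.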
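Embodiment 1: data processing is performed on the client 10 side.
Taking backup software as an example, the backup software on the client 10 obtains the data to be backed up (still referred to as first data) from local memory (such as main memory or a hard disk) and determines the deduplication approach according to the usage of the computing resources of the client 10. Taking CPU utilization as an example:
1) If the CPU utilization of the client 10 is higher than the first preset value, the first data is sent to the storage device 20. Optionally, the client 10 may notify the storage device 20 that the first data is uncompressed, and the storage device 20 subsequently deduplicates and/or compresses the first data.
2) If the CPU utilization does not exceed the first preset value and the deduplication rate of the first data does not exceed the second preset value, part of the first data is deduplicated and the rest is sent to the storage device.
3) If the CPU utilization does not exceed the first preset value and the deduplication rate of the first data exceeds the second preset value, the first data is deduplicated, and the deduplicated data is referred to as second data.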
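The three cases above amount to a small decision table, sketched here. The return labels, the function name `choose_dedup_mode`, and the example thresholds are assumptions for illustration; utilization and deduplication rate are expressed as fractions.

```python
def choose_dedup_mode(cpu_utilization: float, dedup_rate: float,
                      first_preset: float = 0.7, second_preset: float = 0.5) -> str:
    """Pick the client-side action for the first data.

    first_preset bounds CPU utilization; second_preset bounds the deduplication rate.
    """
    if cpu_utilization > first_preset:
        return "defer_to_storage"   # case 1: send raw data, the storage device handles it
    if dedup_rate > second_preset:
        return "dedup_all"          # case 3: deduplicate the whole first data
    return "dedup_part"             # case 2: deduplicate part, send the rest as is
```

The boundary values fall on the "does not exceed" side, so a CPU utilization exactly at the first preset value still allows client-side deduplication.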
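The deduplication process on the client 10 may be as follows. Taking the first data as an example, the client 10 splits the first data into multiple data blocks based on a hash algorithm, for example block 1, block 2, block 3, and block 4. It then computes the fingerprint (that is, the hash value) of each data block and sends the fingerprints to the storage device 20.
The storage device 20 searches its local fingerprint library, which contains the fingerprint of every data block the storage device 20 has stored, for the fingerprint of block 1 (fp1), block 2 (fp2), block 3 (fp3), and block 4 (fp4) of the first data. If a fingerprint is present, the corresponding data block is a duplicate. For example, if the library contains fp1 and fp3 but not fp2 and fp4, then block 1 and block 3 are duplicate blocks and block 2 and block 4 are not. The storage device 20 then sends the query result to the client 10. The query result indicates whether the first data contains duplicate data blocks, the identifiers of the duplicate data blocks, and so on.
The client 10 determines the duplicate data blocks in the first data from the query result. For a duplicate data block, it may send only the metadata of the block, which includes but is not limited to the fingerprint of the block, the offset of the block within the first data, and the length of the block; the embodiments of this application do not limit the metadata. The client 10 then sends the metadata of the duplicate blocks and the data of the non-duplicate blocks to the storage device 20. In other words, the second data includes the metadata of the duplicate blocks and the data of the non-duplicate blocks; in the example above, the second data includes fp1, fp3, the data of block 2, and the data of block 4. Alternatively, data and metadata may be treated separately: the second data includes only the data of block 2 and block 4, and the second data is sent to the storage device 20 together with the metadata of block 1 and block 3. Either way, after deduplication the second data is smaller than the first data, which reduces the amount of data transmitted, lowers the resource overhead and network load of backup, and raises the logical backup bandwidth.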
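The exchange between the client and the storage device can be simulated in a few lines. In this sketch both sides run in one process, the fingerprint is SHA-1, and the payload entries are plain dictionaries; the function names and the field names `fingerprint`, `offset`, `length`, and `data` are hypothetical choices for the example.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    return hashlib.sha1(block).hexdigest()


def query_duplicates(fps: list[str], fingerprint_library: set[str]) -> list[bool]:
    """Storage side: report, per fingerprint, whether the block is already stored."""
    return [fp in fingerprint_library for fp in fps]


def build_payload(blocks: list[bytes], duplicate_flags: list[bool]) -> list[dict]:
    """Client side: metadata only for duplicate blocks, metadata plus data for new ones."""
    payload, offset = [], 0
    for block, is_duplicate in zip(blocks, duplicate_flags):
        entry = {"fingerprint": fingerprint(block), "offset": offset, "length": len(block)}
        if not is_duplicate:
            entry["data"] = block       # only new blocks travel over the network
        payload.append(entry)
        offset += len(block)
    return payload
```

With blocks 1 and 3 already in the fingerprint library, only the bytes of blocks 2 and 4 appear in the payload, which is where the reduction in transmitted data comes from.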
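After the second data is obtained, the compression approach is determined according to the compression rate of the second data and the usage of the computing resources of the client 10. Again taking CPU utilization as an example:
(1) If the compression rate of the second data is less than the third preset value, the second data is sent to the storage device 20, and the storage device is notified that the second data is uncompressed. The storage device 20 subsequently compresses the second data to reduce the storage resources used to back it up.
(2) If the compression rate of the second data is greater than or equal to the third preset value and the CPU utilization is lower than the fourth preset value, the second data is compressed and the compressed data (the third data) is sent to the storage device 20. Optionally, the client 10 may notify the storage device 20 of the compression information of the third data, such as the compression algorithm and checksum. Having determined that the third data is already compressed, the storage device 20 need not compress it again.
(3) If the compression rate of the second data is greater than or equal to the third preset value and the CPU utilization is higher than the fourth preset value, part of the second data is compressed. The compressed data and the remaining uncompressed part of the second data are sent to the storage device 20, and the storage device 20 is notified which data is uncompressed; the storage device 20 can subsequently compress that part.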
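Embodiment 2: data processing is performed on the storage device 20 side.
The client 10 sends the data to be backed up (the first data) to the storage device 20, and the storage device 20 receives it. Note that the first data has been neither deduplicated nor compressed.
The storage device 20 determines the deduplication approach according to the usage of its own computing resources. Taking CPU utilization as an example:
1) If the CPU utilization of the storage device 20 is higher than the first preset value, deduplication is not performed. Optionally, the first data is stored in the SSD, and metadata is generated for the first data to record that it has not been deduplicated.
In an implementation, when the storage device 20 detects that a preset condition (referred to as the first preset condition) is met, it obtains the data that has not been deduplicated, such as the first data, from the SSD based on the metadata records, deduplicates and compresses it, and stores the deduplicated and compressed data in the HDD.
The first preset condition includes but is not limited to: (1) the CPU utilization of the storage device 20 is lower than a preset value (the fifth preset value), which may or may not be equal to the first preset value; the embodiments of this application do not limit this. (2) A preset time is reached. (3) The amount of data stored in the SSD reaches a preset value (referred to as the sixth preset value).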
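2) If the CPU utilization of the storage device 20 does not exceed the first preset value and the deduplication rate of the first data does not exceed the second preset value, part of the first data is deduplicated and the rest is sent to the storage device. Optionally, the deduplicated data is stored in the HDD, the uncompressed part is stored in the SSD, and metadata is generated for that part to record that it is uncompressed.
3) If the CPU utilization of the storage device 20 does not exceed the first preset value and the deduplication rate of the first data exceeds the second preset value, the first data is deduplicated, and the deduplicated data is referred to as second data. Optionally, the second data may be stored in the HDD.
For how the storage device 20 deduplicates the first data, see the description above; the difference is that the deduplication takes place entirely inside the storage device 20.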
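After the first data is deduplicated into the second data, the compression approach is determined according to the compression rate of the second data and the usage of the computing resources of the storage device 20. Again taking CPU utilization as an example:
(1) If the compression rate of the second data is less than the third preset value, the second data need not be compressed. Optionally, the storage device 20 stores the second data in the HDD.
(2) If the compression rate of the second data is greater than or equal to the third preset value and the CPU utilization is lower than the fourth preset value, the second data is compressed. Optionally, the storage device 20 stores the compressed data in the HDD.
(3) If the compression rate of the second data is greater than or equal to the third preset value and the CPU utilization is higher than the fourth preset value, part of the second data is compressed. Optionally, the compressed data is stored in the HDD, the remaining uncompressed part of the second data is stored in the SSD, and metadata is generated for the uncompressed part; for the metadata, see the description above.
In an implementation, when the storage device 20 detects that a preset condition (referred to as the second preset condition) is met, it obtains the uncompressed data from the SSD based on the metadata records and compresses it. Optionally, the compressed data is stored in the HDD.
The second preset condition includes but is not limited to: (1) the CPU utilization of the storage device 20 is lower than a preset value (the seventh preset value), which may or may not be equal to the fourth preset value; the embodiments of this application do not limit this. (2) A preset time is reached. (3) The amount of data stored in the SSD reaches the sixth preset value.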
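The trigger for background work can be expressed as a disjunction of the three listed conditions. In this sketch the "preset time" is modelled as an hour of the day and the SSD fill level as a byte count; these modelling choices, the class name `BackgroundTrigger`, and its field names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class BackgroundTrigger:
    cpu_preset: float        # e.g. the fifth or seventh preset value
    scheduled_hour: int      # the "preset time", modelled as an hour 0-23
    ssd_bytes_preset: int    # e.g. the sixth preset value

    def should_run(self, cpu_utilization: float, hour: int, ssd_bytes: int) -> bool:
        """True when any one of the three conditions is met."""
        return (
            cpu_utilization < self.cpu_preset
            or hour == self.scheduled_hour
            or ssd_bytes >= self.ssd_bytes_preset
        )
```

Because the conditions are alternatives, a full SSD can force background deduplication or compression even while the CPU is busy, which keeps the fast tier from filling up.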
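Based on the same inventive concept as the method embodiments, an embodiment of this application further provides a data processing apparatus. FIG. 8 is a schematic structural diagram of a data processing apparatus 800 according to this application. As shown in FIG. 8, the data processing apparatus 800 includes an obtaining module 801, a determining module 802, a judging module 803, and a processing module 804. The data processing apparatus 800 is configured to perform the method performed by the computing device in the foregoing method embodiments.
The obtaining module 801 is configured to obtain the first data. For the specific implementation, see the description of step 400 in FIG. 4.
The determining module 802 is configured to determine the deduplication rate of the first data and the working state of the computing device. For the specific implementation, see the descriptions of step 401 and step 403 in FIG. 4.
The judging module 803 is configured to judge, when the computing device is in the busy state, whether the deduplication rate of the first data exceeds the first preset value. For the specific implementation, see the descriptions of step 402 and step 404 in FIG. 4.
The processing module 804 is configured to, in response to the judgment, deduplicate the first data to obtain second data when the deduplication rate of the first data exceeds the first preset value; for the specific implementation, see the description of step 406 in FIG. 4. It is further configured to store the first data in the first memory of the computing device when the deduplication rate of the first data does not exceed the first preset value; for the specific implementation, see the description of step 405 in FIG. 4.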
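Optionally, the determining module 802 is further configured to determine the compression rate of the second data; for the specific implementation, see the description of step 407 in FIG. 4. The processing module 804 is configured to compress the second data when the compression rate of the second data is not lower than the second preset value and the computing resource utilization is lower than the third preset value, or to compress at least part of the second data when the compression rate of the second data is not lower than the second preset value and the computing resource utilization is not lower than the third preset value. For the specific implementation, see the descriptions of step 408 to step 411 in FIG. 4.
Optionally, the processing module 804 is further configured to store the second data in the second memory when the compression rate of the second data is lower than the second preset value, where the performance of the second memory is lower than that of the first memory; or, when the compression rate of the second data is not lower than the second preset value, to compress at least part of the second data and, optionally, store the compressed data in the second memory and store the uncompressed part of the second data in the first memory.
Optionally, the determining module 802 is specifically configured to determine the deduplication rate of the first data according to an attribute parameter of the first data; or according to the deduplication rate of target data corresponding to the first data, where the target data is data selected based on a preset condition; or according to the similarity between the first data and data already stored by the computing device.
Optionally, the processing module 804 is further configured to deduplicate and/or compress the first data when the computing device is in the idle state, where the idle state is determined based on the usage of the computing resources of the computing device.
Optionally, the computing resources include processor resources and/or memory. For example, when processor utilization exceeds the fourth preset value and/or memory utilization exceeds the fifth preset value, the working state of the computing device is the busy state; when processor utilization does not exceed the fourth preset value and/or memory utilization does not exceed the fifth preset value, the working state of the computing device is the idle state.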
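As a possible implementation, this application further provides a data processing apparatus, which may be the engine 121 or the storage system 120 shown in FIG. 2, or the server 110 shown in FIG. 3, and which is configured to carry out the related method steps above so as to implement the method performed by the computing device in the foregoing embodiments; see the description of any of the method embodiments in FIG. 4 to FIG. 7.
As a possible implementation, an embodiment of this application further provides a computer storage medium that stores computer instructions. When the computer instructions run on a computing device, the computing device carries out the related method steps above so as to implement the method performed by the computing device in the foregoing embodiments; see the description of any of the method embodiments in FIG. 4 to FIG. 7.
As a possible implementation, an embodiment of this application further provides a computer program product. When the computer program product runs on a computer, the computer carries out the related steps above so as to implement the method performed by the computing device in the foregoing embodiments; see the description of any of the method embodiments in FIG. 4 to FIG. 7.
As a possible implementation, an embodiment of this application further provides an apparatus, which may specifically be a chip, a component, or a module, and which may include a processor and a memory connected to each other. The memory stores computer-executable instructions; when the apparatus runs, the processor can execute the computer-executable instructions stored in the memory, so that the chip performs the method performed by the computing device in the foregoing method embodiments. See the description of any of the method embodiments in FIG. 4 to FIG. 7; for brevity, details are not repeated here.
The data processing apparatus, computer storage medium, computer program product, and chip provided in the embodiments of this application are all used to perform the method corresponding to the computing device provided above; for the beneficial effects they can achieve, see the beneficial effects of the corresponding method provided above.
As another possible implementation, this application further provides a distributed system that includes a first computing device and a second computing device, where the first computing device is configured to send data to be backed up to the second computing device. The first computing device includes a data processing apparatus configured to perform any of the methods in FIG. 4 to FIG. 7; for brevity, details are not repeated here.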
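From the description of the foregoing implementations, those skilled in the art will understand that the division into the functional modules above is used only as an example for convenience and brevity of description. In practice, the functions above may be assigned to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to perform all or some of the functions described above.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (for example coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example infrared, radio, or microwave) means. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example a floppy disk, hard disk, or magnetic tape), an optical medium (for example a DVD), or a semiconductor medium (for example a solid state disk (SSD)).
The foregoing is only a specific implementation of this application, but the protection scope of this application is not limited to it. Any equivalent modification or replacement that a person skilled in the art could readily conceive within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.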

Claims (16)

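  1. A data processing method, applied to a computing device, the method comprising:
    obtaining first data;
    determining a deduplication rate of the first data and a working state of the computing device;
    when the working state of the computing device is a busy state, judging whether the deduplication rate of the first data exceeds a first preset value;
    in response to the judgment, when the deduplication rate of the first data exceeds the first preset value, deduplicating the first data to obtain second data; and when the deduplication rate of the first data does not exceed the first preset value, storing the first data in a first memory of the computing device.
  2. The method according to claim 1, wherein the working state is determined according to the usage of computing resources in the computing device, and after the second data is obtained, the method further comprises:
    determining a compression rate of the second data;
    if the compression rate of the second data is not lower than a second preset value and the utilization of the computing resources is lower than a third preset value, compressing the second data;
    if the compression rate of the second data is not lower than the second preset value and the utilization of the computing resources is not lower than the third preset value, compressing at least part of the second data.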
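  3. The method according to claim 2, wherein the computing device further comprises a second memory, and the performance of the second memory is lower than that of the first memory;
    the method further comprises:
    if the compression rate of the second data is lower than the second preset value, storing the second data in the second memory;
    if the compression rate of the second data is not lower than the second preset value, compressing at least part of the second data, storing the compressed data in the second memory, and storing the uncompressed part of the second data in the first memory.
  4. The method according to any one of claims 1 to 3, wherein the determining a deduplication rate of the first data comprises:
    determining it according to an attribute parameter of the first data; or
    determining it according to a deduplication rate of target data corresponding to the first data, wherein the target data is data selected based on a preset condition; or
    determining it according to a similarity between the first data and data already stored by the computing device.
  5. The method according to any one of claims 1 to 4, further comprising:
    when the computing device is in an idle state, deduplicating and/or compressing the first data.
  6. The method according to any one of claims 2 to 5, wherein the computing resources comprise processor resources and/or memory.
  7. The method according to claim 6, wherein the determining a working state of the computing device comprises:
    when processor utilization exceeds a fourth preset value and/or memory utilization exceeds a fifth preset value, determining that the computing device is in the busy state;
    when the processor utilization does not exceed the fourth preset value and/or the memory utilization does not exceed the fifth preset value, determining that the computing device is in the idle state.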
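  8. A data processing apparatus, the apparatus comprising:
    an obtaining module, configured to obtain first data;
    a determining module, configured to determine a deduplication rate of the first data and a working state of the computing device;
    a judging module, configured to judge, when the working state of the computing device is a busy state, whether the deduplication rate of the first data exceeds a first preset value;
    a processing module, configured to, in response to the judgment, deduplicate the first data to obtain second data when the deduplication rate of the first data exceeds the first preset value, and further configured to store the first data in a first memory of the computing device when the deduplication rate of the first data does not exceed the first preset value.
  9. The apparatus according to claim 8, wherein
    the determining module is configured to determine the working state according to the usage of computing resources in the computing device;
    the determining module is further configured to determine a compression rate of the second data;
    the processing module is configured to compress the second data when the compression rate of the second data is not lower than a second preset value and the utilization of the computing resources is lower than a third preset value;
    and to compress part of the second data when the compression rate of the second data is not lower than the second preset value and the utilization of the computing resources is not lower than the third preset value.
  10. The apparatus according to claim 8 or 9, wherein the apparatus further comprises a second memory, and the performance of the second memory is lower than that of the first memory;
    the processing module is further configured to store the second data in the second memory if the compression rate of the second data is lower than the second preset value; or, if the compression rate of the second data is not lower than the second preset value, to compress at least part of the second data, store the compressed data in the second memory, and store the uncompressed part of the second data in the first memory.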
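  11. The apparatus according to any one of claims 8 to 10, wherein the determining module is specifically configured to:
    determine the deduplication rate according to an attribute parameter of the first data; or according to a deduplication rate of target data corresponding to the first data, wherein the target data is data selected based on a preset condition; or according to a similarity between the first data and data already stored by the computing device.
  12. The apparatus according to any one of claims 8 to 11, wherein the processing module is further configured to deduplicate and/or compress the first data when the computing device is in an idle state.
  13. The apparatus according to any one of claims 9 to 12, wherein the computing resources comprise processor resources and/or memory.
  14. The apparatus according to claim 13, wherein the determining module is specifically configured to:
    determine that the computing device is in the busy state when processor utilization exceeds a fourth preset value and/or memory utilization exceeds a fifth preset value;
    determine that the computing device is in the idle state when the processor utilization does not exceed the fourth preset value and/or the memory utilization does not exceed the fifth preset value.
  15. A computing device, comprising a processor and a memory;
    the memory is configured to store computer program instructions;
    the processor invokes the computer program instructions in the memory to perform the method according to any one of claims 1 to 7.
  16. A computer-readable storage medium, wherein when the computer-readable storage medium is executed by a computing device, the computing device performs the method according to any one of claims 1 to 7.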
PCT/CN2022/091692 2021-07-08 2022-05-09 Data processing method and apparatus WO2023279833A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP22836590.4A EP4357900A1 (en) 2021-07-08 2022-05-09 Data processing method and apparatus
US18/405,231 US20240143449A1 (en) 2021-07-08 2024-01-05 Data Processing Method and Apparatus

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202110773758 2021-07-08
CN202110773758.9 2021-07-08
CN202111057198.3 2021-09-09
CN202111057198.3A CN115599591A (zh) 2021-07-08 2021-09-09 Data processing method and apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/405,231 Continuation US20240143449A1 (en) 2021-07-08 2024-01-05 Data Processing Method and Apparatus

Publications (1)

Publication Number Publication Date
WO2023279833A1 true WO2023279833A1 (zh) 2023-01-12

Family

ID=84801226

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/091692 WO2023279833A1 (zh) 2021-07-08 2022-05-09 Data processing method and apparatus

Country Status (3)

Country Link
US (1) US20240143449A1 (zh)
EP (1) EP4357900A1 (zh)
WO (1) WO2023279833A1 (zh)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999605A (zh) * 2012-11-21 2013-03-27 重庆大学 一种通过优化数据放置来减少数据碎片的方法和装置
US20160259591A1 (en) * 2013-12-24 2016-09-08 Hitachi, Ltd. Storage system and deduplication control method
CN105022741A (zh) * 2014-04-23 2015-11-04 苏宁云商集团股份有限公司 压缩方法和系统以及云存储方法和系统
CN105389387A (zh) * 2015-12-11 2016-03-09 上海爱数信息技术股份有限公司 一种基于压缩的重复数据删除性能及重删率提升的方法和系统
CN107632786A (zh) * 2017-09-20 2018-01-26 杭州宏杉科技股份有限公司 一种数据重删的管理方法及装置
WO2021012162A1 (zh) * 2019-07-22 2021-01-28 华为技术有限公司 存储系统数据压缩的方法、装置、设备及可读存储介质
CN110727404A (zh) * 2019-09-27 2020-01-24 苏州浪潮智能科技有限公司 一种基于存储端的数据重删方法、设备以及存储介质

Also Published As

Publication number Publication date
US20240143449A1 (en) 2024-05-02
EP4357900A1 (en) 2024-04-24

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22836590

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022836590

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2022836590

Country of ref document: EP

Effective date: 20240117

NENP Non-entry into the national phase

Ref country code: DE