WO2023040305A1 - Data backup system and device - Google Patents

Data backup system and device

Info

Publication number
WO2023040305A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
file
data processing
processing device
storage device
Prior art date
Application number
PCT/CN2022/092467
Other languages
English (en)
French (fr)
Inventor
杜翔
罗先强
陈克云
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to EP22868683.8A (published as EP4394606A1)
Publication of WO2023040305A1
Priority to US18/606,476 (published as US20240220371A1)

Classifications

    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1451 Management of the data involved in backup or backup restore by selection of backup contents
    • G06F11/1453 Management of the data involved in backup or backup restore using de-duplication of the data
    • G06F11/1464 Management of the backup or restore process for networked environments
    • G06F3/0608 Saving storage space on storage systems
    • G06F3/0641 De-duplication techniques
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Definitions

  • the present application relates to the field of computer technology, in particular to a data backup system and device.
  • backup usually refers to the process of copying all or part of the data set in the file system or database system from the disk or storage array of the business host to other storage media.
  • data can be backed up to a remote storage device through backup software deployed on the business host.
  • the backup software provider usually integrates deduplication technology at the source side (that is, the business host side) to remove duplicate data from the backup data, reducing the amount of data transmitted between the business host and the storage device and thereby increasing the logical backup bandwidth.
  • however, the above deduplication operation consumes considerable CPU computing resources on the service host, which may have a significant impact on the service performance of the service host.
  • the present application provides a data backup system and device, which are used to reduce performance impact on service hosts on the basis of ensuring backup bandwidth.
  • an embodiment of the present application provides a data backup system, and the system includes a data processing apparatus and a storage device.
  • the data processing device may be used to receive a write request sent by a service host (e.g., the first host), where the write request is used to request that the data to be backed up (e.g., the first data) be written into the storage device.
  • the data processing device may perform a deduplication and compression operation on the first data carried in the write request to obtain the data after the deduplication and compression operation (for example, called the second data), and send the second data to the storage device.
  • the storage device is used to receive the second data sent by the data processing device and store the second data. So far, the backup of the first data is completed.
  • the calculation operations, such as writing the data to be backed up (for example, the first data) into the internal memory and deduplicating and compressing the data to be backed up, all take place in the data processing device without consuming the CPU resources of the first host, thereby reducing the impact on the production environment of the first host and improving the CPU utilization of the first host.
  • the data processing device may be a network card or a data processing unit (data processing unit, DPU).
  • the data processing device when the data processing device is a network card or a DPU, it can be integrated or installed in a service host in a pluggable manner, making deployment more convenient.
  • the data processing device After the data processing device receives the write request, it stores the first data carried in the write request into the memory of the data processing device, and then returns a write request completion response to the first host;
  • the data processing device is specifically configured to obtain the first data from the internal memory, and delete data blocks in the first data that are duplicated with data blocks already stored in the storage device.
  • the data processing device performs computing operations such as deduplication and compression on the data to be backed up, thereby reducing the consumption of CPU resources of the first host and improving the backup efficiency of the backup task on the first host.
  • the data processing device after receiving the write request, is further configured to return the write request completion response to the first host after storing the first data in the memory of the data processing device; Acquire the first data from the internal memory and store it in the persistent storage medium of the data processing device; when performing deduplication and compression operations on the first data, the data processing device is specifically used to retrieve the data from the persistent Acquiring the first data from a storage medium; deleting a data block in the first data that is duplicated with a data block already stored in the storage device.
  • the data processing device can temporarily write the file data to be backed up to a local persistent storage medium, such as a disk, and since the file data has been stored persistently, logical data backup is completed.
  • data backup can be completed in the data processing device.
  • the backup process only depends on the computing power of the data processing device and the read/write bandwidth and size of the disk, and is no longer affected by the bandwidth performance from the host to the storage device and the processing capacity of the storage device. For large data backup scenarios, the backup performance can be significantly improved and the backup window can be shortened.
  • since the logical backup requires neither deduplicating and compressing the data first nor sending it to a storage device for storage, it involves no network communication overhead, which can significantly shorten the backup window and improve backup efficiency.
  • the data processing apparatus further includes a first file system, and the first file system is the same as the second file system of the storage device. The write request sent by the first host is a write request based on the second file system, for example, one used to write the first data into a first file in the second file system; when storing the first data to the persistent storage medium of the data processing device, the data processing device is specifically configured to store the first data to the persistent storage medium through the first file system.
  • the data processing device stores and manages the data of the file to be backed up sent by the host through the local file system.
  • the data processing device is further configured to receive a file creation request, where the file creation request is used to request to create the first file in the second file system; and send the file creation request to the storage device;
  • the storage device is further configured to create the first file requested by the file creation request in the second file system of the storage device, and to generate mapping address information of the first file, where the mapping address information is used to indicate that the data of the first file is located at the data processing device, or to indicate the access path of the data of the first file in the data processing device; and to send a creation success response to the data processing device;
  • the data processing device is further configured to: receive the creation success response sent by the storage device, and create the first file in the first file system. The write request sent by the first host is used to request to write the first data into the first file in the second file system; when storing the first data to the persistent storage medium of the data processing device, the data processing device is specifically configured to write the first data into the first file in the first file system.
  • the storage device obtains the data of the file according to the mapping address information of the file; acting as the entry point for data access, it can serve data access from more devices and provide flexibility in data access.
  • the data processing device is further configured to: delete the first file stored in the data processing device after storing the data of the first file in the storage device through a deduplication and compression operation;
  • the storage device is used to modify the mapping address information of the first file to indicate that the data of the first file is stored in the storage device.
  • the data processing device deletes the data after storing the data in the storage device, which can improve the utilization rate of the storage medium.
  • the data processing device is further configured to receive a first read request sent by the second host, where the first read request is used to read at least part of the data of the first file, and to forward the read request to the storage device; the storage device is configured to send a second read request to the data processing device when it determines, according to the mapping address information of the first file, that the data of the first file is located in the data processing device; the data processing device is further configured to return at least part of the data of the first file according to the second read request.
  • the data processing device provides backup services for the host; that is, on the basis of storing the host's data to be backed up to the local persistent storage medium of the data processing device, it can also provide data access services for other devices, providing flexibility in data access.
  • the embodiment of the present application also provides a data processing device, the data processing device has the function of implementing the behavior in the method example of the first aspect above, and the beneficial effects can be referred to the description of the first aspect, which will not be repeated here.
  • the functions described above may be implemented by hardware, or may be implemented by executing corresponding software on the hardware.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • the structure of the data processing device includes a receiving module, a processing module, and a sending module. These modules can perform the corresponding functions in the method example of the first aspect above. For details, refer to the detailed description in the method example, and details are not repeated here.
  • the present application also provides a computing device, where the computing device includes a processor and a memory, and may also include a communication interface, and the processor executes the program instructions in the memory to implement the method provided by the above first aspect or any possible implementation of the first aspect.
  • the memory is coupled with the processor, and stores necessary program instructions and data during the data backup process.
  • the communication interface is used to communicate with other devices, such as receiving a write request sent by the first host, or sending second data to the storage device.
  • the present application provides a computer-readable storage medium.
  • when the computer-readable storage medium is executed by a computing device, the computing device executes the method provided by the aforementioned first aspect or any possible implementation of the first aspect.
  • the program is stored in the storage medium.
  • the storage medium includes but not limited to volatile memory, such as random access memory, and nonvolatile memory, such as flash memory, hard disk drive (hard disk drive, HDD), and solid state drive (solid state drive, SSD).
  • the present application provides a program product for a computing device.
  • the program product includes computer instructions which, when executed by a computing device, cause the computing device to execute the method provided by the aforementioned first aspect or any possible implementation of the first aspect.
  • the computer program product may be a software installation package; if the method provided by the aforementioned first aspect or any possible implementation of the first aspect needs to be used, the computer program product may be downloaded and executed on a computing device.
  • the present application also provides a computer chip, where the chip is connected to a memory, and the chip is used to read and execute the software program stored in the memory to implement the method provided by the above first aspect and each possible implementation of the first aspect.
  • FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 2 is a schematic flow diagram of a data backup method provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of another system architecture provided by an embodiment of the present application.
  • FIG. 4 is a schematic flow diagram of another data backup method provided in the embodiment of the present application.
  • FIG. 5 is a schematic flow diagram of creating a file provided by the embodiment of the present application.
  • FIG. 6 is a schematic flow chart of file data migration provided by the embodiment of the present application.
  • FIG. 7 is a schematic flowchart of a data access method provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a data processing device provided by an embodiment of the present application.
  • the file system is a structured form of storing and organizing data files. All data in a computer is 0s and 1s, and a series of 0/1 combinations stored on a hardware medium is impossible for users to distinguish and manage directly. Therefore, this application uses the concept of a "file" to organize these data: data used for the same purpose is composed into different types of files according to the structure required by different applications. Usually different suffixes denote different types, and each file is then given a name that is easy to understand and remember. When there are many files, they are grouped according to some division method, and each group of files is placed in the same directory (or folder). A directory may in turn contain subdirectories (subfolders) in addition to files, and all files and directories form a tree structure.
  • This tree structure has a dedicated name: the file system.
  • Common examples include FAT/FAT32/NTFS on Windows and EXT2/EXT3/EXT4/XFS/BtrFS on Linux.
  • the names of these directories, subdirectories, and files are joined with special separator characters (such as "\" for Windows/DOS and "/" for Unix-like systems);
  • such a string of characters is called a file path, such as "/etc/systemd/system.conf" in Linux or "C:\Windows\System32\taskmgr.exe" in Windows.
  • a path is a unique identifier for accessing a specific file.
  • D:\data\file.exe under Windows is the path of a file, representing the file.exe file in the data directory of the D partition.
  • FIG. 1 is a schematic structural diagram of a backup system provided by an embodiment of the present invention.
  • the system includes a host 110 ( Figure 1 only shows one host 110, but this embodiment of the application does not limit it), a data processing device 120, and a storage device 130 ( Figure 1 only shows one storage device 130, but this embodiment of the present application does not limit it).
  • the host 110 may be a computing device deployed on the user side, and the computing device may be a physical machine or a virtual machine.
  • Physical machines include but are not limited to desktop computers, servers (such as application servers, file servers, database servers, etc.), notebook computers, and mobile devices.
  • Host 110, as a business host that needs to back up data, is generally equipped with backup software.
  • the host 110 backs up the data in the host 110 by running the backup software.
  • the backup software is also provided with a backup policy.
  • the backup policy may be a predefined backup policy or a policy set by the user, and the backup policy may include, for example: the backup start time, the data to be backed up, and the target storage device for backing up the data.
  • the host 110 sends the data to be backed up in the host to the data processing device by running the backup software and according to the backup policy set in the backup software, which will be described in detail below.
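  • As an illustration of such a policy, the following minimal C sketch models the three fields named above (backup start time, data to be backed up, target storage device). The field and function names are assumptions for illustration, not taken from this application.

```c
#include <stdio.h>
#include <time.h>

/* Illustrative backup policy record; field names are assumptions. */
struct backup_policy {
    time_t start_time;          /* backup start time */
    const char *source_path;    /* data to be backed up */
    const char *target_device;  /* target storage device */
};

/* Returns 1 when the policy says a backup should start now. */
static int backup_due(const struct backup_policy *p, time_t now)
{
    return now >= p->start_time;
}

int main(void)
{
    struct backup_policy p = {
        .start_time    = time(NULL),   /* start immediately, for the example */
        .source_path   = "/mnt/FS0",
        .target_device = "storage-device-130",
    };
    if (backup_due(&p, time(NULL)))
        printf("backing up %s to %s\n", p.source_path, p.target_device);
    return 0;
}
```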
  • the data processing device 120 is connected between the host 110 and the storage device 130 , and is used to process the data sent by the host 110 , for example, perform deduplication and compression processing, and send the processed data to the storage device 130 .
  • the specific process of data processing by the data processing device 120 will be described in detail below.
  • the data processing device 120 may be a data processor (data processing unit, DPU), a smart network card (smartnic), or other components, which are not limited in this embodiment of the present application.
  • the data processing device 120 includes a processor 121 , a memory 122 , a front-end interface 123 , and a back-end interface 124 .
  • the processor 121 , the memory 122 , the front-end interface 123 and the back-end interface 124 are connected through a bus 125 .
  • the processor 121 may be implemented by a central processing unit (CPU), a hardware logic circuit, a processing core, an application-specific integrated circuit (ASIC) chip, an AI chip, or a programmable logic device (PLD).
  • the above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), a system on chip (SoC), or any combination thereof.
  • the processor 121 may be configured to process a data backup request or a data restoration request from the host 110 .
  • when the processor 121 receives a data backup request sent by the host 110 through the front-end interface 123, it temporarily saves the data to be backed up carried in the data backup request to the memory 122.
  • the processor 121 then persistently stores the data held in the memory 122.
  • in this embodiment, two ways of data persistence are provided. One is to send the data to the storage device 130 for persistent storage; when persisting in this way, the processor 121 first performs deduplication and compression processing on the data and then sends it to the storage device 130 for storage. The specific process of deduplication and compression will be described in detail below.
  • the other is to first persist the data to the local hard disk, then deduplicate and compress the data persisted to the hard disk, and send the deduplicated and compressed data to the storage device 130 for storage.
  • Only one processor 121 is shown in FIG. 1 . In practical applications, there are usually multiple processors 121 , and one processor 121 has one or more processor cores. This embodiment does not limit the number of processors and the number of processor cores.
  • the memory 122 refers to internal memory that exchanges data directly with the processor 121; it can be read and written at any time, is very fast, and serves as temporary data storage for the operating system and other running programs.
  • the memory 122 includes at least two types of memory, for example, the memory 122 can be either a random access memory or a read only memory (ROM).
  • the random access memory is, for example, dynamic random access memory (DRAM), or storage class memory (SCM).
  • DRAM is a semiconductor memory, which, like most RAM, is a volatile memory device.
  • SCM is a composite storage technology that combines the characteristics of traditional storage devices and memory.
  • Storage-class memory can provide faster read and write speeds than hard disks, but its access speed is slower than DRAM and its cost is lower than DRAM.
  • the DRAM and the SCM are only illustrative examples in this embodiment of the present application, and the memory 122 may also include other random access memories, such as static random access memory (static random access memory, SRAM) and the like.
  • the read-only memory, for example, may be a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), and the like.
  • the memory 122 may also be a dual in-line memory module (DIMM), that is, a module composed of DRAM.
  • multiple memories 122 and different types of memories 122 may be configured in the data processing device 120 . This embodiment does not limit the quantity and type of the memory 122 .
  • the front-end interface 123 is used to transmit data between the data processing device 120 and the host 110 .
  • the front-end interface 123 may be a Peripheral Component Interconnect Express (PCIe) interface, and the data processing device 120 and the host 110 are connected through a PCIe bus.
  • the front-end interface 123 can also be another type of interface, such as a non-volatile memory express (NVMe) interface, which is not limited in this application; any method for realizing communication between the two is applicable to this embodiment of the application.
  • the backend interface 124 is used to transmit data between the data processing device 120 and the storage device 130 .
  • the backend interface may be a network card, and the network card is connected to a network, so that the data processing device 120 and the storage device 130 can communicate through the network.
  • the network can be wired or wireless communication.
  • a network generally refers to any telecommunications or computer network, including, for example, an intranet, a wide area network (WAN), a local area network (LAN), a personal area network (PAN), the Internet, or a wireless network (such as Wi-Fi or 5th generation (5G) communication technology).
  • the data processing device 120 may communicate with the storage device 130 using various network protocols, such as TCP/IP protocol, UDP/IP protocol, RDMA protocol, and the like.
  • the data processing device 120 may also communicate with the storage device 130 through a fiber optic switch.
  • the fiber optic switch can also be replaced with an Ethernet switch, an InfiniBand switch, a converged Ethernet-based remote direct memory access (RDMA over converged ethernet, RoCE) switch, and the like.
  • the bus 125 includes but is not limited to: a PCIe bus, a double data rate (DDR) bus, an interconnection bus supporting multiple protocols (hereinafter referred to as a multi-protocol interconnection bus, which will be described in detail below), a serial advanced technology attachment (SATA) bus, a serial attached SCSI (SAS) bus, a controller area network (CAN) bus, a compute express link (CXL) standard bus, and the like.
  • the storage device 130 is configured to provide data backup services for the host 110 .
  • the storage device 130 may be, but is not limited to, a storage area network (SAN) device or a network attached storage (NAS) device. If the storage device 130 is a NAS device, it may be used to provide a file-level sharing service for the host.
  • it should be noted that FIG. 1 only shows one storage device 130. In practical applications, the storage device 130 in the embodiment of the present application may be a storage device in a centralized storage system, or any storage device 130 in a distributed storage system.
  • FIG. 2 is a schematic flow chart of a data backup method provided in the embodiment of the present application. As shown in FIG. 2, the method includes the following steps:
  • Step 201 The host 110 sends a write request to the DPU 120 in response to the backup command, for backing up the data indicated by the backup command to the storage device 130 .
  • the write request carries data that needs to be backed up to the storage device 130 .
  • the backup instruction may be generated in response to a user's backup operation, or may be automatically generated by the host 110 according to a preset backup policy, for example, backing up a specified file system, specified file, or specified database every hour.
  • the backup software calls the open function to open the file to be backed up under the file system, and then triggers the fwrite system call to transfer the data of the file.
  • the fwrite system call carries information such as the data of the specified file (or the storage address of the data) and the path name of the file, so as to request, through the fwrite system call, that the data of the specified file be written into the corresponding file of the file system of the storage device 130, completing the data backup of the file.
  • the write request may be the fwrite system call.
  • when the DPU 120 accesses the host 110 through the PCIe interface using virtio-fs paravirtualization technology, the operating systems of the DPU 120 and the host 110 can communicate directly. In this case, the DPU 120 can recognize and receive the fwrite system call sent by the host 110.
  • the write request may also be a message in another protocol frame format generated based on the fwrite system call. For example, when the DPU 120 accesses the host 110 through other methods such as an NVMe interface, the write request may be an RPC call generated based on the fwrite system call.
  • fwrite here refers to writing file data.
  • Generating a write request based on fwrite is only an example. The embodiment of the present application does not limit the type and generation method of the write request.
  • the write request can also be generated based on the write system call.
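  • The following is a minimal host-side sketch of this write path, assuming the storage device's file system is mounted at /mnt/FS0 as in the examples below: the backup software simply opens a file under the mount point and calls fwrite, and the resulting write requests are what the data processing device receives. The path and data are illustrative only.

```c
/* Illustrative host-side write path: the transport below the mount point
 * (virtio-fs, NVMe, etc.) is transparent to the application. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char buf[] = "data to be backed up";

    /* /mnt/FS0 is assumed to be the mount point of the storage device's
     * file system, as in the examples in this description. */
    FILE *f = fopen("/mnt/FS0/vm0.vmdk", "wb");
    if (!f) { perror("fopen"); return EXIT_FAILURE; }

    /* fwrite() ends up as write requests that the data processing
     * device (DPU) receives and buffers in its memory. */
    if (fwrite(buf, 1, sizeof(buf) - 1, f) != sizeof(buf) - 1) {
        perror("fwrite");
        fclose(f);
        return EXIT_FAILURE;
    }
    fclose(f);  /* flushes any remaining buffered data */
    return 0;
}
```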
  • Step 202 the DPU 120 stores the data in the write request into the memory 122 .
  • step 203 the DPU 120 sends a request success response to the host 110, which is used to indicate that the DPU 120 has received the write request. It should be noted that step 203 is an optional step and is not required to be executed, so it is shown in a dashed box in FIG. 2 .
  • step 204 the DPU 120 obtains the data to be backed up (for example, the first data) from the memory 122, performs deduplication and compression operations on the first data, and records the deduplicated and compressed data as the second data.
  • the DPU 120 may temporarily save the data in the write requests in the memory 122 , and then, the DPU 120 acquires the first data from the memory 122 .
  • the first data here may be a continuous piece of data with a preset length, or variable-length data, which is not limited in this embodiment of the present application.
  • the deduplication and compression operation refers to data deduplication and/or data compression.
  • data deduplication refers to the use of algorithms to eliminate duplicate data, thereby reducing the storage space occupied by data. In this application, if duplicate data is detected during backup, it will be discarded, and then a pointer will be created to point to the data copy that has been backed up, which can reduce the amount of data transmitted between the DPU 120 and the storage device 130, and reduce the network load.
  • the deduplication method includes at least file-level deduplication and sub-file-level deduplication (also called block-level deduplication). In file-level deduplication, deduplication is performed in units of files. Let's introduce them respectively:
  • File-level deduplication, also known as single-instance storage (SIS), detects and removes duplicate file copies. It stores only one copy of the file, and all other copies are replaced with pointers to that only copy. File-level deduplication is simple and fast, but it cannot eliminate duplicate content within files. For example, if two 10 MB PowerPoint presentation files differ only in the title page, they will not be considered duplicate files, and the two files will be stored separately.
  • Sub-file-level deduplication refers to decomposing a file/object into data blocks of fixed size or variable size, and performing deduplication operations in units of data blocks. Sub-file level deduplication removes duplicate data between files.
  • Fixed-length block deduplication divides files into fixed-length blocks and uses a hash algorithm to find duplicate data. Fixed-length blocks are simple, but may miss a lot of duplicate data, because similar data may have different block boundaries: imagine adding a person's name to the title page of a document; the entire document will be shifted and all blocks will change, making it impossible to detect the duplicate data.
  • In variable-length segment deduplication, if there is a change in one segment, only the boundaries of that segment are adjusted, leaving the remaining segments unchanged. Compared with the fixed-length block method, this method improves the ability to identify duplicate data segments.
  • the deduplication process is described as follows: first, the boundaries of each block in the first data are determined according to a preset algorithm (such as the content-defined chunking (CDC) algorithm, which is not limited in this embodiment of the present application), thereby dividing the first data into multiple blocks, where the size of each block may differ;
  • then the hash value of each block is calculated. The hash value acts as the fingerprint information of the data block, and data blocks with the same content have the same fingerprint information, so whether the contents of two data blocks are the same can be confirmed by matching fingerprints. Fingerprint matching is performed against the data blocks that have already been stored, and data blocks that have already been stored are not stored again.
  • the data deduplication process in this application may include:
  • specifically, the DPU 120 divides the first data into multiple data blocks based on the CDC algorithm, for example, block 1, block 2, block 3, and block 4. It then calculates the fingerprint (i.e., the hash value) of each data block and sends the fingerprint of each data block to the storage device 130.
  • the storage device 130 traverses the local fingerprint library, which includes the fingerprints of the data blocks already stored by the storage device 130, and queries whether the fingerprint of block 1 (fp1), the fingerprint of block 2 (fp2), the fingerprint of block 3 (fp3), and the fingerprint of block 4 (fp4) of the first data exist in the fingerprint library.
  • the storage device 130 sends the query result to the DPU 120 .
  • the query result here is used to indicate whether there is a repeated data block in the first data, identification information of the repeated data block, and the like.
  • the DPU 120 determines the repeated data blocks in the first data according to the query result, and generates metadata for each repeated data block, the metadata including but not limited to: the fingerprint of the data block, the offset of the data block in the first data, and the length of the data block; the content of the metadata is not limited in this embodiment of the present application.
  • the DPU 120 can send the metadata of the repeated data block and the data of the non-duplicated data block to the storage device 130 .
  • in this example, the deduplicated data includes fp1, fp3, the data of data block 2, and the data of data block 4. Deduplication reduces the data transmission volume, the resource overhead of backup, and the network burden, and increases the logical backup bandwidth.
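  • The following C sketch illustrates the fingerprint-based deduplication flow described above. For brevity it substitutes fixed-size chunking for the CDC algorithm, a 64-bit FNV-1a hash for a real fingerprint function, and a local array for the storage device's fingerprint library; a production implementation would use CDC and a cryptographic hash.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CHUNK 4096      /* fixed block size, standing in for CDC boundaries */
#define FP_TABLE 1024

/* 64-bit FNV-1a, standing in for a cryptographic fingerprint. */
static uint64_t fnv1a(const unsigned char *p, size_t n)
{
    uint64_t h = 1469598103934665603ULL;
    while (n--) { h ^= *p++; h *= 1099511628211ULL; }
    return h;
}

/* Toy fingerprint library standing in for the storage device's index. */
static uint64_t known_fp[FP_TABLE];
static size_t known_n;

/* Return 1 if fp was already known; otherwise record it and return 0. */
static int fp_known(uint64_t fp)
{
    for (size_t i = 0; i < known_n; i++)
        if (known_fp[i] == fp) return 1;
    if (known_n < FP_TABLE) known_fp[known_n++] = fp;
    return 0;
}

/* Split data into chunks; "send" full data only for new blocks and just
 * the fingerprint (metadata) for duplicates. */
static void dedup(const unsigned char *data, size_t len)
{
    for (size_t off = 0; off < len; off += CHUNK) {
        size_t n = len - off < CHUNK ? len - off : CHUNK;
        uint64_t fp = fnv1a(data + off, n);
        if (fp_known(fp))
            printf("offset %zu: duplicate, send fingerprint %016llx only\n",
                   off, (unsigned long long)fp);
        else
            printf("offset %zu: new block, send %zu bytes of data\n", off, n);
    }
}

int main(void)
{
    unsigned char buf[3 * CHUNK];
    memset(buf, 'A', sizeof(buf));  /* three identical blocks */
    dedup(buf, sizeof(buf));        /* only the first block is sent in full */
    return 0;
}
```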
  • the present application can also use a compression algorithm to compress the first data.
  • the compression algorithm can be, for example, the Shannon-Fano algorithm, Huffman coding, arithmetic coding, or LZ77/LZ78 coding; this application does not limit this, and any existing data compression algorithm, as well as any compression algorithm that may be applied in the future, is applicable to the embodiments of the present application.
  • the compression algorithm may be specified by the user, or may be adaptively compressed by the system according to a preset policy, which is not limited in this embodiment of the present application.
  • the DPU 120 may perform only data deduplication on the first data.
  • the DPU 120 may also perform only data compression on the first data.
  • alternatively, the DPU 120 may deduplicate the data first and then compress the deduplicated data, etc., which is not limited in this embodiment of the present application.
  • Step 205 the DPU 120 sends the second data to the storage device 130 .
  • Step 206 the storage device 130 stores the second data.
  • specifically, the storage device 130 writes the data of the non-duplicated data blocks to the storage medium and generates metadata for the data; the metadata may include the storage location of the data in the storage device 130, fingerprints, etc.
  • the DPU 120 may send each piece of second data to the storage device 130 as it is produced, or may aggregate multiple pieces of second data into a data block of a specified size before sending; that is, it may repeat the above steps to obtain the second data corresponding to multiple pieces of first data, aggregate them, and send them to the storage device 130 together, thereby reducing the number of write IOs, as sketched below.
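  • A minimal sketch of this aggregation: pieces of second data are buffered and transmitted only when a fixed-size block fills. The 64 KiB block size and the send_block() stand-in are assumptions for illustration, not values from this application.

```c
#include <stdio.h>
#include <string.h>

#define AGG_SIZE 65536   /* assumed aggregation block size */

static unsigned char agg_buf[AGG_SIZE];
static size_t agg_used;

/* Stand-in for transmitting one aggregated block to the storage device. */
static void send_block(const unsigned char *p, size_t n)
{
    (void)p;
    printf("send %zu bytes in one write IO\n", n);
}

/* Queue one piece of second data; flush whenever the block fills. */
static void agg_write(const unsigned char *p, size_t n)
{
    while (n > 0) {
        size_t room = AGG_SIZE - agg_used;
        size_t take = n < room ? n : room;
        memcpy(agg_buf + agg_used, p, take);
        agg_used += take;
        p += take;
        n -= take;
        if (agg_used == AGG_SIZE) {
            send_block(agg_buf, agg_used);
            agg_used = 0;
        }
    }
}

int main(void)
{
    unsigned char piece[40000] = {0};   /* one piece of second data */
    agg_write(piece, sizeof(piece));    /* buffered only */
    agg_write(piece, sizeof(piece));    /* fills 64 KiB: one send */
    if (agg_used)
        send_block(agg_buf, agg_used);  /* final partial flush */
    return 0;
}
```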
  • in the above method, writing the data to be backed up into the internal memory 122 and computing operations such as deduplicating the data to be backed up all occur in the DPU 120, which does not consume the CPU resources of the host 110, reduces the impact on the production environment of the host 110, and improves the CPU utilization of the host 110.
  • Fig. 3 shows a schematic diagram of the structure of the data processing device 220.
  • compared with the data processing device 120 in FIG. 1, a hard disk 126 is added in FIG. 3.
  • The hard disk 126 can also be referred to as secondary storage.
  • the hard disk 126 can be a non-volatile memory, such as a read-only memory (ROM), a hard disk drive (HDD), or a solid state drive (SSD).
  • DPU 220 may be used to provide backup service for host 110
  • hard disk 126 may be used to store backup data sent by host 110 .
  • the size of the hard disk 126 provided in the DPU 220 can be determined according to the size of the data that the host 110 needs to back up each time. For example, if the host 110 generates 1TB of data in one backup cycle, the hard disk 126 may be at least 1TB in size.
  • the following description takes the case where the data processing device 220 is the DPU 220 as an example.
  • the implementation of the method in this system can be divided into two processes. The first process is to create a file system/file for the DPU220 (step 401-step 406). In the second process, the DPU 220 backs up the files of the host 110 to the storage device 130 (step 407-step 413).
  • step 401 the DPU 220 creates a file system.
  • the DPU 220 "formats" the hard disk 126 to establish a local file system for managing the local storage space (including the hard disk 126 ).
  • the file system may be of any type (such as ext3, zfs, etc.), which is not limited in this embodiment of the present application.
  • the type of the local file system of the DPU 220 and the type of the file system of the first host may be the same or different.
  • the type of the local file system of the DPU 220 and the type of the file system on the storage device 130 may be the same , may also be different, which is not limited in the embodiments of the present application.
  • the embodiment of the present application does not limit the timing for the DPU 220 to create the local file system.
  • Step 402 the host 110 sends a create request to the DPU 220 to request to create an object under the file system of the storage device 130.
  • Objects here refer to directories or files in the file system.
  • the host 110 can mount the file system of the storage device 130 to a local directory of the host 110, and then the host 110 can perform operations on the file system of the storage device 130, such as creating files and directories. Taking creating a file as an example, the host 110 may send a creation request for requesting to create a file under the file system of the storage device 130, so that the storage device 130 creates the file in its local file system.
  • the host 110 mounts the file system whose root directory is /FS0/ in the storage device 130 to the /mnt/ directory of the host 110.
  • the application program of the host 110 calls the open function to request to create an object (directory or file) under /mnt/FS0/.
  • the open system call will carry the specified path name, object name and object type.
  • the open request is open("mnt/FS0/vm0.vmdk", O_CREAT), which means creating vm0.vmdk under the path mnt/FS0/.
  • step 401 and step 402 can be executed at the same time, or step 401 can be executed before step 402, or step 401 can be executed after step 402. This is not limited.
  • Step 403 DPU 220 sends the creation request to storage device 130 .
  • Step 404 the storage device 130 creates an object under the specified path in response to the creation request.
  • after receiving the creation request, the storage device 130 creates the file under the corresponding path of its file system; continuing to refer to (b) of FIG. 5, the storage device 130 creates vm0.vmdk. If the creation is successful, a creation success response is sent to the DPU 220. If the creation fails, for example because a file with the same name already exists in the local file system of the storage device 130, the storage device 130 returns a creation failure response to the DPU 220. The following takes successful creation as an example.
  • the storage device 130 generates file attribute information for the created file, the file attribute information is used to indicate the file attribute, and the file attribute includes at least two types: normal file (regular) and stub file (stub) , wherein, a normal file means that the file data is stored locally in the storage device 130 .
  • the stub file means that the data of the file is not stored locally on the storage node, but is stored at the mapping address of the stub file.
  • the mapping address is used to indicate the actual storage location of the file data.
  • the mapping address includes a device identifier, which is used to uniquely indicate a device.
  • it may also include a file path, etc. This embodiment of the present application does not limit this, as long as it can indicate the data source.
  • the mapping address of FS0/vm0.vmdk on the storage device 130 is DPU220, or the mapping address is DPU220:/FS0/vm0.vmdk.
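  • The per-file attribute and mapping address described above could be modeled as in the following sketch; the record layout and field names are illustrative assumptions, not the actual format used by this application.

```c
#include <stdio.h>

/* A file is either "regular" (data held locally on the storage device)
 * or a "stub" whose mapping address names the device, and optionally
 * the path, where the data actually lives. */
enum file_attr { FILE_REGULAR, FILE_STUB };

struct file_meta {
    enum file_attr attr;
    char map_device[32];   /* e.g. "DPU220"; unused for regular files */
    char map_path[128];    /* e.g. "/FS0/vm0.vmdk", optional */
};

int main(void)
{
    struct file_meta m = { FILE_STUB, "DPU220", "/FS0/vm0.vmdk" };
    if (m.attr == FILE_STUB)
        printf("data of this file is at %s:%s\n", m.map_device, m.map_path);
    return 0;
}
```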
  • step 405 the storage device 130 sends a response to the DPU 220 indicating that the creation is successful.
  • step 406 the DPU 220 creates the same object under the specified path corresponding to the local file system of the DPU 220 based on the aforementioned creation request.
  • in response to the above open("mnt/FS0/vm0.vmdk", O_CREAT) request, the DPU 220 creates the same object in a certain directory of its local file system: as shown in (b) of FIG. 5, the DPU 220 creates an FS0 directory under the data/ directory and creates vm0.vmdk in the FS0 directory. At this point, a new file has been created.
  • One or more files or directories can be created through the above method, for example, vm1.vmdk in (c) of FIG. 5 can also be created based on the above method.
  • the DPU 220 will create the same file system (such as FS0:/) and files as the host 110 and the storage device 130 . Subsequently, the DPU 220 may store the backup data of the same file in the host 110 into a corresponding file in the local file system.
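  • The creation flow of steps 402-406 can be summarized in the following sketch: the DPU forwards the create request to the storage device and mirrors the object in its local file system only after a creation success response. The helper functions are hypothetical stand-ins for the actual RPC and local file system calls.

```c
#include <stdio.h>

/* Steps 403/404: forward the create request; the storage device creates
 * the file as a stub whose mapping address points back at the DPU. */
static int storage_create(const char *path)
{
    printf("storage device: create %s as a stub file\n", path);
    return 0;   /* 0 = creation success response */
}

/* Step 406: mirror the same object in the DPU's local file system. */
static int local_create(const char *path)
{
    printf("DPU local file system: create %s\n", path);
    return 0;
}

/* Handle an open(..., O_CREAT) style request from the host (step 402). */
static int dpu_handle_create(const char *path)
{
    if (storage_create(path) != 0)
        return -1;   /* e.g. a file with the same name already exists */
    return local_create(path);
}

int main(void)
{
    return dpu_handle_create("/FS0/vm0.vmdk");
}
```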
  • Step 407 the host 110 sends a write request to the DPU 220 in response to the backup command, for backing up the data indicated by the backup command to the storage device 130 .
  • Step 408 the DPU 220 stores the data in the write request into the memory 122 .
  • Step 409 the DPU 220 sends a request success response to the host 110 .
  • Step 407 to Step 409 are the same as Step 201 to Step 203 respectively, and will not be repeated here.
  • Step 410 the DPU 220 stores the data to be backed up in the memory 122 to the hard disk 126 .
  • the DPU220 writes the data to be backed up into corresponding files in the local file system for persistent storage.
  • the backup software of the host 110 triggers the fwrite system call to request to write the data of vm0.vmdk into vm0.vmdk of the storage device 130 .
  • in response to the fwrite system call, the DPU 220 first writes the data of vm0.vmdk into the local memory 122, then obtains at least part of the data of vm0.vmdk from the memory 122 and writes it into vm0.vmdk of the local file system of the DPU 220, completing the persistent storage of vm0.vmdk.
  • before writing, the DPU 220 may first determine whether the vm0.vmdk file exists in the local file system, and if so, write the data of vm0.vmdk into the vm0.vmdk of the local file system. If it does not exist, the file can be created in the local file system through the above steps. Alternatively, if the vm0.vmdk file does not exist in the local file system, or the hard disk 126 has no remaining storage space, the DPU 220 may also send the data of the vm0.vmdk file directly to the storage device 130.
  • the trigger condition may be that the amount of data in the memory 122 reaches a preset threshold, or that a preset time is reached; for example, the DPU 220 periodically writes the data in the memory 122 to the hard disk 126.
  • alternatively, the DPU 220 triggers writing the data in the internal memory 122 into the hard disk 126 according to an instruction of the host 110; for example, the host 110 sends indication information (for example, called first indication information) to the DPU 220, and the first indication information is used to instruct the DPU 220 to persistently store the data in the memory 122.
  • the backup software of the host 110 may trigger an fsync call based on a preset policy, and the fsync call is used to instruct writing data into a persistent storage medium.
  • the preset policy may be, but is not limited to: 1) periodically triggering the fsync call; 2) taking the file as a unit, triggering an fsync call for each file; 3) triggering the fsync call a preset period of time after sending the write request; and the like, which is not limited in this embodiment of the present application.
  • after receiving the fsync call, the DPU 220 writes the data in the memory 122 into the hard disk 126.
  • after the DPU 220 writes the data in the memory 122 into the hard disk 126, it returns an fsync call success response to the host.
  • the DPU 220 may write all data in the current internal memory 122 to the hard disk 126 , or may write data belonging to a specified file to the hard disk 126 .
  • for example, the backup software may send an fsync call carrying the file handle corresponding to vm0.vmdk to the DPU 220, to instruct the DPU to write all data of vm0.vmdk in the memory 122 into the hard disk 126, as sketched below.
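  • A minimal host-side sketch of this persistence trigger, assuming the mount point /mnt/FS0 from the earlier examples: the backup software writes the file data and then issues one fsync per file (policy 2 above), which returns only after the DPU has flushed the file's data from the memory 122 to the hard disk 126.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/FS0/vm0.vmdk", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char buf[] = "backup data";
    if (write(fd, buf, sizeof(buf) - 1) < 0) { perror("write"); return 1; }

    /* One fsync per file: the call succeeds once the DPU has written the
     * file's data from memory 122 to hard disk 126. */
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}
```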
  • Step 411 the DPU 220 obtains the data to be backed up from the hard disk 126 (for example, the first data), and performs deduplication and compression operations on the first data to obtain deduplication and compressed data (for example, the second data).
  • in the background, the DPU 220 sequentially reads the data in the local file system starting from the head of the file for aggregation, and deduplicates and compresses each piece of aggregated data (such as the first data) to obtain the second data.
  • the specific manner of performing the deduplication and compression operation on the first data by the DPU 220 may refer to the foregoing description, which will not be repeated here.
  • Step 412 the DPU 220 sends the second data to the storage device 130 .
  • Step 413 the storage device 130 writes the data into a corresponding file in the local file system of the storage device 130 .
  • it should be noted that the data sent by the DPU 220 to the storage device 130 each time may be partial data of a file; that is to say, the DPU 220 and the storage device 130 may repeat steps 411 to 413 multiple times to complete the migration of the data of a complete file. In this process, as an optional implementation, the DPU 220 can also record the offset position of the migrated data in the file, in other words, the offset position of the data to be read next time, as a cursor for the migration task.
  • for example, if the data size of the file vm0.vmdk is 100M and the DPU reads the 0-60M range of the file from the hard disk 126 the first time, it records the offset position 60M, and the next time it can start reading data from the 60M position of the file.
  • the DPU 220 may notify the storage device 130 of the offset position of the migrated data in the file.
  • the DPU 220 may send indication information (such as second indication information) for indicating that the file data has been completely migrated to the storage device 130 .
  • correspondingly, the storage device 130 can modify the file attribute of the file to "regular", that is, a normal file, indicating that the data of the file no longer points to the local data of the DPU 220.
  • the indication information may be in the same data packet as the data of the file.
  • specifically, a data packet includes a header and a payload, and the header includes but is not limited to one or more of the following: the offset position of the migrated data, and the second indication information.
  • the second indication information may consist of 1 bit, and the value of this bit indicates whether the data of the file has been completely sent to the storage device 130; for example, a value of 0 means that it has been completely sent, and a value of 1 means that it has not been completely sent.
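  • Such a header, together with the migration cursor kept by the DPU, might look like the following sketch; the exact layout is an assumption, and the 100M/60M figures simply reuse the earlier example.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative packet header: offset of the migrated data in the file,
 * plus the 1-bit second indication information (0 = file fully sent,
 * 1 = not completely sent, per the example encoding above). */
struct migrate_hdr {
    uint64_t offset;            /* offset of this payload within the file */
    unsigned not_complete : 1;  /* second indication information */
};

int main(void)
{
    uint64_t file_size = 100u << 20;  /* 100M file, as in the example */
    uint64_t cursor = 0;              /* migration cursor kept by the DPU */
    uint64_t step = 60u << 20;        /* first read covers 0-60M */

    struct migrate_hdr h;
    h.offset = cursor;
    cursor += step;
    h.not_complete = (cursor < file_size);  /* more data still to send */

    printf("sent [%llu, %llu), not_complete=%u\n",
           (unsigned long long)h.offset,
           (unsigned long long)cursor,
           h.not_complete);
    return 0;
}
```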
  • the storage device 130 sends a successful execution response to the DPU 220 , and the DPU 220 can delete the data of the file locally on the DPU 220 after receiving the response.
  • the DPU 220 sends the data of vm0.vmdk to the storage device 130, and the storage device 130 writes the data of vm0.vmdk into vm0.vmdk.
  • the storage device 130 modifies the file attribute of vm0.vmdk to a normal file, and sends a successful execution response to the DPU 220, and then the DPU 220 can delete the vm0.vmdk in the local file system.
  • in the above method, the DPU 220 can first temporarily write the file data to be backed up into the local file system, that is, into the hard disk 126. Since the file data has been stored persistently, the logical data backup is completed. In this way, data backup can be completed in the DPU 220; the backup process depends only on the computing power of the DPU 220 and the read/write bandwidth and size of the disk, and is no longer limited by the bandwidth performance from the host to the storage device 130 or the processing capability of the storage device 130. For large data backup scenarios, this can significantly improve backup performance and shorten the backup window. In addition, since at backup time the file data does not need to be deduplicated and compressed, nor sent to the storage device 130 for storage, the backup involves no network communication overhead, which can significantly shorten the backup window and improve backup efficiency.
  • the embodiment of the present application may also provide a data access method.
  • FIG. 7 shows a system provided by the embodiment of the present application. As shown in FIG. 7, the system includes host 0, host 1, DPU0, DPU1, host 2, and storage device 130, wherein host 0, DPU0, and storage device 130 are respectively the host 110, DPU 220, and storage device 130 in FIG. 3, and host 1 is connected to DPU1.
  • Fig. 7 shows a schematic flow diagram corresponding to the data access method applied to the system, the flow includes:
  • the storage device 130 receives a data access request (for example, referred to as a first access request) sent by the host 1 through the DPU1.
  • the data access request is used to request access to data of FS0/vm1.vmdk.
  • the file attribute of the file vm1.vmdk in the file system of the storage device 130 is a stub file, and the mapping address is DPU0.
  • the first access request may access part of the data of vm1.vmdk, and the storage device 130 can judge whether this part of the data is stored locally in the storage device 130 according to the recorded offset position of the migrated data of vm1.vmdk. If it is stored locally, the storage device 130 can directly send that part of the data to DPU1; otherwise, the storage device 130 retrieves the data from DPU0.
  • the storage device 130 sends a data access request (such as a second access request) to DPU0.
  • DPU0 receives the second data access request.
  • the second access request is used to request access to data of FS0/vm1.vmdk.
  • the first access request and the second access request may be the same or different, which is not limited in this embodiment of the present application.
  • the storage device 130 can directly forward the first access request to DPU0, and at this time, the second access request is the first access request.
  • conversely, when DPU0 cannot recognize the first access request, the storage device 130 generates, based on the first access request, a second access request that DPU0 can recognize; in this case the second access request is different from the first access request.
  • DPU0 obtains the data of vm1.vmdk from the local file system of DPU0 in response to the second access request, and sends the data to the storage device 130 .
  • the storage device 130 receives the data sent by DPU0.
  • the storage device 130 may write the data of vm1.vmdk into the vm1.vmdk of the local file system of the storage node.
  • 4) The storage device 130 sends the data of vm1.vmdk to host 1.
  • The present application does not limit which devices may communicate with the storage device in this manner. For example, the storage device 130 can also receive a data access request sent by host 2, obtain the data requested by host 2 through the above method, and return it to host 2, which will not be described again here.
  • The foregoing method provides flexibility in data access during the data backup process. A minimal sketch of this read path follows.
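As a rough illustration of the read path, the sketch below resolves a read against a stub file using the migrated-offset record described in step 1). The class and callback names (StubResolver, fetch_from_dpu) are hypothetical; the embodiment does not prescribe such an interface.

    class StubResolver:
        """Hypothetical sketch of the stub-file read path of FIG. 7."""

        def __init__(self, local_files, stubs, migrated_up_to, fetch_from_dpu):
            self.local_files = local_files        # file name -> bytes held by the storage device
            self.stubs = stubs                    # file name -> mapping address (e.g. "DPU0")
            self.migrated_up_to = migrated_up_to  # file name -> offset of already-migrated data
            self.fetch_from_dpu = fetch_from_dpu  # callback: (dpu, name, offset, length) -> bytes

        def read(self, name, offset, length):
            # A regular file, or a stub whose requested range has already been
            # migrated, is served from local storage.
            if name not in self.stubs or offset + length <= self.migrated_up_to.get(name, 0):
                return self.local_files[name][offset:offset + length]
            # Otherwise the data still lives on the device named by the mapping
            # address, so a second access request is issued to it.
            data = self.fetch_from_dpu(self.stubs[name], name, offset, length)
            # Optionally, the fetched range could also be written into the local
            # file system, as in step 3).
            return data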
  • Based on the same inventive concept as the method embodiments, an embodiment of the present application further provides a data processing device, which is configured to execute the methods performed by the DPU in the above method embodiments. As shown in FIG. 8, the data processing device 800 includes a receiving module 801, a processing module 802, and a sending module 803; specifically, in the data processing device 800, the modules are connected through a communication path.
  • The receiving module 801 is configured to receive a write request sent by the first host, the write request carrying the first data that needs to be backed up to the storage device 130. For the specific implementation, refer to the description of step 201 in FIG. 2 or of step 407 in FIG. 4, which will not be repeated here.
  • The processing module 802 is configured to perform a deduplication and compression operation on the first data to obtain second data. For the specific implementation, refer to the description of step 204 in FIG. 2 or of step 411 in FIG. 4, which will not be repeated here.
  • The sending module 803 is configured to send the second data to the storage device 130. For the specific implementation, refer to the description of step 205 in FIG. 2 or of step 412 in FIG. 4, which will not be repeated here. A sketch of how these three modules might be wired together is given below.
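Purely as an aid to reading, the three modules can be pictured as a receive, deduplicate-and-compress, send pipeline, as in the sketch below. The class name, the fixed-size chunking, and the in-memory fingerprint set are assumptions for the sketch; in the embodiment the fingerprint library is kept by the storage device 130 and queried over the network.

    import hashlib
    import zlib

    class Device800Sketch:
        """Hypothetical wiring of modules 801 (receive), 802 (process) and 803 (send)."""

        def __init__(self, send_to_storage, chunk_size=4096):
            self.send_to_storage = send_to_storage  # sending module 803, as a callback
            self.chunk_size = chunk_size
            self.known_fps = set()                  # stand-in for the remote fingerprint library

        def on_write_request(self, first_data):
            # Receiving module 801: accept the first host's write request.
            second_data = self.process(first_data)
            # Sending module 803: forward the reduced data to the storage device.
            self.send_to_storage(second_data)
            return "write-ok"  # completion response for the first host

        def process(self, data):
            # Processing module 802: deduplication followed by compression.
            out = []
            for i in range(0, len(data), self.chunk_size):
                chunk = data[i:i + self.chunk_size]
                fp = hashlib.sha1(chunk).hexdigest()
                if fp in self.known_fps:
                    out.append(("ref", fp))  # duplicate block: keep only its fingerprint
                else:
                    self.known_fps.add(fp)
                    out.append(("blk", fp, zlib.compress(chunk)))  # new block: compress it
            return out

    # Example use: device = Device800Sketch(print); device.on_write_request(b"data" * 4096)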
  • As a possible implementation, the data processing device 800 is a network interface card or a DPU.
  • As a possible implementation, after the receiving module 801 receives the write request, the processing module 802 is further configured to store the first data in the memory 122 of the data processing device (refer to the description of step 202 in FIG. 2 or of step 408 in FIG. 4), and the sending module is further configured to return a write request completion response to the first host (refer to the description of step 203 in FIG. 2 or of step 409 in FIG. 4); these steps will not be repeated here.
  • As another possible implementation, after the receiving module 801 receives the write request, the processing module 802 is further configured to store the first data in the memory 122 of the data processing device (refer to the description of step 408 in FIG. 4), and is further configured to obtain the first data from the memory 122 and store it in the persistent storage medium of the data processing device (refer to the description of step 410 in FIG. 4); these steps will not be repeated here. A small sketch of this two-stage staging follows.
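Below is a small sketch of that memory-then-disk staging, under the assumption of a simple byte buffer and a single spill file; the names WriteStagingBuffer and spill_path are illustrative only.

    import os

    class WriteStagingBuffer:
        """Hypothetical sketch of the memory-then-disk staging of steps 408 and 410."""

        def __init__(self, spill_path, threshold=8 * 1024 * 1024):
            self.buf = bytearray()        # stands in for the memory 122
            self.spill_path = spill_path  # stands in for a file on the hard disk 126
            self.threshold = threshold

        def write(self, data):
            self.buf += data
            # The completion response can be returned as soon as the data is in memory.
            if len(self.buf) >= self.threshold:
                self.flush()              # threshold-triggered persistence
            return "write-ok"

        def flush(self):
            # fsync-style persistence: move the buffered data to the persistent medium.
            with open(self.spill_path, "ab") as f:
                f.write(self.buf)
                f.flush()
                os.fsync(f.fileno())
            self.buf.clear()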
  • As a possible implementation, the data processing device 800 further includes a first file system, where the first file system is the same as the second file system of the storage device 130.
  • The receiving module 801 is further configured to receive a file creation request, the file creation request being used to request creation of a first file under the file system of the storage device 130 (refer to the description of step 402 in FIG. 4). The sending module 803 is further configured to send the file creation request to the storage device (refer to the description of step 403 in FIG. 4). The receiving module 801 is further configured to receive a creation success response sent by the storage device 130 (refer to the description of step 405 in FIG. 4). The processing module 802 is further configured to create the first file in the first file system (refer to the description of step 406 in FIG. 4). These steps will not be repeated here; a sketch of this create handshake follows.
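The ordering of that handshake can be pictured as follows. The mirror_create function and the two dictionaries standing in for the first and second file systems are assumptions for the sketch; the point is only that the storage device creates the stub first, and the device creates its local copy only after the creation success response.

    def mirror_create(dpu_fs, storage_fs, path):
        """Hypothetical sketch of the file-creation handshake (steps 402 to 406)."""
        if path in storage_fs:
            return False  # the storage device rejects duplicate names, so creation fails

        # The storage device creates the file first, marks it as a stub, and records
        # a mapping address pointing back at the data processing device.
        storage_fs[path] = {"attr": "stub", "mapping": "DPU0", "data": b""}

        # Only after the creation success response does the device create the same
        # file in its own first file system.
        dpu_fs[path] = {"attr": "regular", "data": b""}
        return True

    # Example use: create FS0/vm0.vmdk on both sides.
    dpu_fs, storage_fs = {}, {}
    assert mirror_create(dpu_fs, storage_fs, "FS0/vm0.vmdk")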
  • If the write request sent by the first host requests that the first data be stored in the first file of the second file system, then, when storing the first data in the persistent storage medium of the data processing device, the processing module 802 is specifically configured to write the first data into the first file of the first file system. For the specific implementation, refer to the description of step 411 in FIG. 4, which will not be repeated here.
  • As a possible implementation, the processing module 802 is further configured to: after the data of the first file has undergone the deduplication and compression operation and been stored in the storage device, delete the first file stored in the device.
  • As a possible implementation, the receiving module 801 is further configured to receive a read request sent by the storage device 130 for requesting to read at least part of the data of the first file (refer to the description of step 2 in FIG. 7), and the sending module is further configured to send the at least part of the data of the first file to the storage device 130 (refer to the description of step 3 in FIG. 7); these steps will not be repeated here.
  • An embodiment of the present application further provides a computer storage medium. The computer storage medium stores computer instructions, and when the computer instructions are run on a storage apparatus, the storage apparatus performs the above related method steps to implement the method performed by the DPU 120 in the above embodiments (refer to the description of steps 201 to 205 in FIG. 2), or performs the above related method steps to implement the method performed by the DPU 220 in the above embodiments (refer to the description of steps 401 to 403 and steps 405 to 412 in FIG. 4), which will not be repeated here.
  • An embodiment of the present application further provides a computer program product. When the computer program product runs on a computer, it causes the computer to perform the above related steps to implement the method performed by the DPU 120 in the above embodiments (refer to the description of steps 201 to 205 in FIG. 2), or to implement the method performed by the DPU 220 in the above embodiments (refer to the description of steps 401 to 403 and steps 405 to 412 in FIG. 4), which will not be repeated here.
  • In addition, an embodiment of the present application further provides an apparatus, which may specifically be a chip, a component, or a module, and which may include a processor and a memory that are connected, where the memory is used to store computer-executable instructions. When the apparatus runs, the processor can execute the computer-executable instructions stored in the memory, so that the chip performs the method performed by the DPU 120 in the above method embodiments (refer to the description of steps 201 to 205 in FIG. 2), or the method performed by the DPU 220 in the above embodiments (refer to the description of steps 401 to 403 and steps 405 to 412 in FIG. 4), which will not be repeated here.
  • The data processing device, computer storage medium, computer program product, and chip provided in the embodiments of the present application are all used to perform the methods corresponding to the DPU 120, the DPU 220, or the storage device 130 provided above. Therefore, for the beneficial effects they can achieve, refer to the beneficial effects of the corresponding methods provided above, which will not be repeated here.
  • In the several embodiments provided in this application, it should be understood that the disclosed apparatuses and methods may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into modules or units is only a division by logical function; in actual implementation there may be other ways of division, for example, multiple units or components may be combined or integrated into another apparatus, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between apparatuses or units may be electrical, mechanical, or in other forms. A unit described as a separate component may or may not be physically separate, and a component shown as a unit may be one physical unit or multiple physical units, which may be located in one place or distributed to multiple different places. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • In addition, the functional units (or modules) in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a readable storage medium. Based on such an understanding, the technical solutions of the embodiments of the present application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or some of the steps of the methods in the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
  • Optionally, the computer-executable instructions in the embodiments of the present application may also be referred to as application program code, which is not specifically limited in the embodiments of the present application.
  • All or some of the above embodiments may be implemented by software, hardware, firmware, or any combination thereof. When software is used for implementation, they may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are wholly or partly generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or a data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
  • The various illustrative logic units and circuits described in the embodiments of the present application may be implemented or operated by a general-purpose processor, a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination of the above designed to perform the described functions. The general-purpose processor may be a microprocessor; optionally, the general-purpose processor may also be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented by a combination of computing devices, for example, a digital signal processor and a microprocessor, multiple microprocessors, one or more microprocessors together with a digital signal processor core, or any other similar configuration.
  • The steps of the methods or algorithms described in the embodiments of the present application may be directly embedded in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may be stored in a RAM memory, a flash memory, a ROM memory, an EPROM memory, an EEPROM memory, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium in the art. For example, the storage medium may be connected to the processor, so that the processor can read information from the storage medium and write information to the storage medium; optionally, the storage medium may also be integrated into the processor. The processor and the storage medium may be disposed in an ASIC.
  • These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are performed on the computer or the other programmable device to produce computer-implemented processing, and the instructions executed on the computer or the other programmable device provide steps for implementing the functions specified in one or more procedures of the flowcharts and/or one or more blocks of the block diagrams.
  • Although the present application has been described with reference to specific features and embodiments, it is apparent that various modifications and combinations may be made without departing from the spirit and scope of the present application. Accordingly, the specification and drawings are merely exemplary descriptions of the present application as defined by the appended claims, and are deemed to cover any and all modifications, variations, combinations, or equivalents within the scope of the present application. A person skilled in the art can make various changes and variations to the present application without departing from its scope, and the present application is intended to cover such changes and variations provided that they fall within the scope of the claims of the present application and their equivalent technologies.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application provides a data backup system and apparatus. The system includes a data processing device and a storage device. The data processing device is configured to receive a write request sent by a first host, the write request carrying first data that needs to be backed up to the storage device; to perform a deduplication and compression operation on the first data to obtain second data; and to write the second data into the storage device. In the above manner, writing the data to be backed up into memory and performing computing operations such as deduplication on the data to be backed up all take place inside the data processing device, consuming no CPU resources of the first host, thereby reducing the impact on the production environment of the first host and improving the CPU utilization of the first host.


Claims (18)

  1. A data backup system, wherein the system comprises a data processing device and a storage device,
    the data processing device is configured to receive a write request sent by a first host, the write request carrying first data that needs to be backed up to the storage device;
    the data processing device is configured to perform a deduplication and compression operation on the first data to obtain second data;
    the data processing device is configured to write the second data into the storage device.
  2. The system according to claim 1, wherein the data processing device is a network interface card or a data processing unit (DPU).
  3. The system according to claim 1 or 2, wherein after receiving the write request, the data processing device is further configured to store the first data in a memory of the data processing device and then return a write request completion response to the first host;
    when performing the deduplication operation on the first data, the data processing device is specifically configured to:
    obtain the first data from the memory, and delete, from the first data, data blocks that duplicate data blocks already stored in the storage device.
  4. The system according to claim 1 or 2, wherein after receiving the write request, the data processing device is further configured to store the first data in a memory of the data processing device and then return a write request completion response to the first host;
    and to obtain the first data from the memory and store it in a persistent storage medium of the data processing device;
    the data processing device further comprises the persistent storage medium, and when performing the deduplication and compression operation on the first data, the data processing device is specifically configured to:
    obtain the first data from the persistent storage medium; and delete, from the first data, data blocks that duplicate data blocks already stored in the storage device, to generate the second data.
  5. The system according to claim 4, wherein the data processing device further comprises a first file system, the first file system is the same as a second file system of the storage device, and the write request is a write request based on the second file system;
    when storing the first data in the persistent storage medium of the data processing device, the data processing device is specifically configured to store the first data in the persistent storage medium of the data processing device through the first file system.
  6. The system according to claim 4 or 5, wherein the data processing device is further configured to: receive a file creation request, and forward the file creation request to the storage device;
    the storage device is further configured to: create, in the second file system of the storage device, a first file whose creation is requested by the file creation request, and generate mapping address information of the first file, the mapping address information indicating that the data of the first file is located in the data processing device, or indicating an access path of the data of the first file in the data processing device; and send a creation success response to the data processing device;
    the data processing device is further configured to: receive the creation success response sent by the storage device, and create the first file in the first file system;
    the write request is used to request that the first data be stored in the first file of the second file system;
    when storing the first data in the persistent storage medium of the data processing device, the data processing device is specifically configured to: write the first data into the first file.
  7. The system according to claim 6, wherein the data processing device is further configured to:
    after the data of the first file has undergone the deduplication and compression operation and been stored in the storage device, delete the first file stored in the data processing device;
    the storage device is configured to: modify the mapping address information of the first file to indicate that the data of the first file is stored in the storage device.
  8. The system according to claim 6, wherein
    the storage device is further configured to: receive a first read request sent by a second host, the first read request being used to read at least part of the data of the first file; and, when determining that at least part of the data of the first file is located in the data processing device, send to the data processing device a second read request for reading the at least part of the data of the first file;
    the data processing device is further configured to: read the at least part of the data of the first file from the data processing device according to the second read request, and send it to the storage device.
  9. A data processing device, wherein the device comprises:
    a receiving module, configured to receive a write request sent by a first host, the write request carrying first data that needs to be backed up to a storage device;
    a processing module, configured to perform a deduplication and compression operation on the first data to obtain second data;
    a sending module, configured to send the second data to the storage device.
  10. The device according to claim 9, wherein the device is a network interface card or a data processing unit (DPU).
  11. The device according to claim 9 or 10, wherein after the receiving module receives the write request, the processing module is further configured to store the first data in a memory of the data processing device;
    the sending module is further configured to return a write request completion response to the first host;
    when performing the deduplication operation on the first data, the processing module is specifically configured to:
    obtain the first data from the memory, and delete, from the first data, data blocks that duplicate data blocks already stored in the storage device.
  12. The device according to claim 9 or 10, wherein after the receiving module receives the write request, the processing module is further configured to store the first data in a memory of the data processing device; the sending module is further configured to return a write request completion response to the first host;
    the processing module is further configured to obtain the first data from the memory and store it in a persistent storage medium of the data processing device;
    when performing the deduplication and compression operation on the first data, the processing module is specifically configured to:
    obtain the first data from the persistent storage medium; and delete, from the first data, data blocks that duplicate data blocks already stored in the storage device.
  13. The device according to claim 12, wherein the device further comprises a first file system, the first file system is the same as a second file system of the storage device, and the write request is a write request based on the second file system;
    when storing the first data in the persistent storage medium of the data processing device, the processing module is specifically configured to store the first data in the persistent storage medium of the data processing device through the first file system.
  14. The device according to claim 12 or 13, wherein the receiving module is further configured to receive a file creation request;
    the sending module is further configured to send the file creation request to the storage device;
    the receiving module is further configured to receive a creation success response sent by the storage device;
    the processing module is further configured to create the first file in the first file system;
    the write request is used to request that the first data be stored in a first file of the second file system;
    when storing the first data in the persistent storage medium of the data processing device, the processing module is specifically configured to write the first data into the first file of the first file system.
  15. The device according to claim 14, wherein the processing module is further configured to: after the data of the first file has undergone the deduplication and compression operation and been stored in the storage device, delete the first file stored in the device.
  16. The device according to claim 14, wherein
    the receiving module is further configured to receive a read request sent by the storage device, the read request being used to request to read at least part of the data of the first file;
    the sending module is further configured to send the at least part of the data of the first file to the storage device.
  17. A computing device, wherein the computing device comprises a processor and a memory;
    the memory is configured to store computer program instructions;
    the processor invokes and executes the computer program instructions in the memory to perform the functions of the data processing device according to any one of claims 9 to 16.
  18. A computer-readable storage medium, wherein the computer-readable storage medium stores program code, and the program code comprises instructions for performing the functions of the data processing device according to any one of claims 9 to 16.
PCT/CN2022/092467 2021-09-18 2022-05-12 Data backup system and apparatus WO2023040305A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP22868683.8A EP4394606A1 (en) 2021-09-18 2022-05-12 Data backup system and apparatus
US18/606,476 US20240220371A1 (en) 2021-09-18 2024-03-15 Data backup system and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111101810.2A 2021-09-18 2021-09-18 Data backup system and apparatus
CN202111101810.2 2021-09-18

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/606,476 Continuation US20240220371A1 (en) 2021-09-18 2024-03-15 Data backup system and apparatus

Publications (1)

Publication Number Publication Date
WO2023040305A1 2023-03-23

Family

ID=85575186

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/092467 WO2023040305A1 (zh) 2021-09-18 2022-05-12 一种数据备份系统及装置

Country Status (4)

Country Link
US (1) US20240220371A1 (zh)
EP (1) EP4394606A1 (zh)
CN (1) CN115840662A (zh)
WO (1) WO2023040305A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116301663B * 2023-05-12 2024-06-21 New H3C Technologies Co., Ltd. Data storage method and apparatus, and host

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8060476B1 (en) * 2008-07-14 2011-11-15 Quest Software, Inc. Backup systems and methods for a virtual computing environment
US20120084523A1 (en) * 2010-09-30 2012-04-05 Littlefield Duncan A Data recovery operations, such as recovery from modified network data management protocol data
CN105320773A * 2015-11-03 2016-02-10 PLA University of Science and Technology Distributed data deduplication system and method based on the Hadoop platform
CN108268219A * 2018-02-01 2018-07-10 Hangzhou MacroSAN Technologies Co., Ltd. Method and apparatus for processing IO requests
US10114829B1 (en) * 2015-06-26 2018-10-30 EMC IP Holding Company LLC Managing data cache for file system realized within a file


Also Published As

Publication number Publication date
EP4394606A1 (en) 2024-07-03
CN115840662A (zh) 2023-03-24
US20240220371A1 (en) 2024-07-04

Similar Documents

Publication Publication Date Title
US10489059B2 (en) Tier-optimized write scheme
US8694469B2 (en) Cloud synthetic backups
US10031703B1 (en) Extent-based tiering for virtual storage using full LUNs
US9977746B2 (en) Processing of incoming blocks in deduplicating storage system
US20180011657A1 (en) Use of predefined block pointers to reduce duplicate storage of certain data in a storage subsystem of a storage server
US10055420B1 (en) Method to optimize random IOS of a storage device for multiple versions of backups using incremental metadata
US9207875B2 (en) System and method for retaining deduplication in a storage object after a clone split operation
US10437682B1 (en) Efficient resource utilization for cross-site deduplication
US9904480B1 (en) Multiplexing streams without changing the number of streams of a deduplicating storage system
WO2023165196A1 Log storage acceleration method and apparatus, electronic device, and non-volatile readable storage medium
JP2015503777A Single-instancing method using file clones and file storage apparatus using the same
US10042719B1 (en) Optimizing application data backup in SMB
US9996426B1 (en) Sparse segment trees for high metadata churn workloads
US20240220371A1 (en) Data backup system and apparatus
US20240241798A1 (en) Multi-phase file recovery from cloud environments
CN115525602A Data processing method and related apparatus
WO2022262381A1 Data compression method and apparatus
WO2023050856A1 Data processing method and storage system
US20230236725A1 (en) Method to opportunistically reduce the number of SSD IOs, and reduce the encryption payload, in an SSD based cache in a deduplication file system
CN115495412A Query system and apparatus
US10521400B1 (en) Data reduction reporting in storage systems
US10762044B2 (en) Managing deletion of data in a data storage system
WO2023279833A1 Data processing method and apparatus
WO2022267627A1 Data processing method and related apparatus
US11836388B2 (en) Intelligent metadata compression

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22868683; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 2022868683; Country of ref document: EP)
ENP Entry into the national phase (Ref document number: 2022868683; Country of ref document: EP; Effective date: 20240328)
NENP Non-entry into the national phase (Ref country code: DE)