WO2023109880A1 - Service recovery method, data processing unit and related device - Google Patents

Publication number
WO2023109880A1
Authority
WO
WIPO (PCT)
Prior art keywords
dpu
interface card
memory
host
software
Prior art date
Application number
PCT/CN2022/139182
Other languages
French (fr)
Chinese (zh)
Inventor
冷超
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2023109880A1 publication Critical patent/WO2023109880A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/1438 Restarting or rejuvenating

Definitions

  • the present application relates to the technical field of data processing, and in particular to a service restoration method, a data processing unit and related equipment.
  • the computing power of a central processing unit (central processing unit, CPU)
  • the data processing unit (data processing unit, DPU) interface card can be used as the offload engine of the CPU, and can achieve more efficient data processing capabilities by cooperating with the CPU to process services.
  • the DPU interface card may include an application specific integrated circuit (ASIC), a processor, and a memory.
  • the ASIC and the processor may provide computing power for the CPU, and the memory may temporarily store business information of the DPU interface card.
  • when the operating system of the DPU interface card is upgraded, or when the memory in the DPU interface card fails, for example a row, column, or storage array (bank) failure in the memory, the operating system of the DPU interface card is restarted. The restart causes the loss of service information stored in the memory of the DPU interface card, resulting in service interruption.
  • the embodiment of the present application provides a service recovery method, so as to enable the DPU interface card to recover services after the software in the DPU interface card is restarted.
  • the present application also provides corresponding data processing units, computing devices, interface cards, computer-readable storage media, and computer program products.
  • the embodiment of the present application provides a service recovery method. The method is executed by a DPU interface card that is coupled with the host, for example through a bus. When the DPU interface card resumes services, it specifically obtains the service information stored in the memory of the host after the software (such as the operating system) of the DPU interface card is restarted. The service information is information generated by the DPU interface card processing the input/output (IO) sent by the processor of the host before the software restarted, so that the DPU interface card can resume and process the interrupted service according to the service information stored in the memory of the host.
  • because the DPU interface card can use the service information stored in the memory of the host to restore the service promptly after the software restarts, it does not need to rely on its local memory to restore the interrupted service. In this way, even if the software (such as the operating system) of the DPU interface card restarts because of a version upgrade, a memory failure, or another reason, and the service information in the local memory is lost, the DPU interface card can still use the host memory to quickly restore the service, reducing the impact on the service.
  • the DPU interface card is coupled to the host based on the PCIe bus, and the PCIe link between the DPU interface card and the host is not disconnected during the restart process of the software in the DPU interface card. In this way, in the process of restarting the software, the host may not perceive the state change of the DPU interface card, thereby reducing the impact on the host.
  • the software restart of the DPU interface card is triggered by a failure of the DPU interface card or by an upgrade of the previous version of the software of the DPU interface card, so as to realize fault recovery or a software upgrade of the DPU interface card.
  • the DPU interface card may also restart the software after receiving an instruction to restart the software sent by the host.
  • the DPU interface card restarts the software, so as to realize the repair of the faulty memory of the DPU interface card.
  • the DPU interface card restarts the operating system of the DPU interface card.
  • the DPU interface card can decide whether to restart the operating system of the DPU interface card according to the failure of the memory, so as to implement constraints on restarting the operating system of the DPU interface card.
  • when the memory of the DPU interface card fails and the failed memory meets a preset condition, the DPU interface card can also restart only the service components in the kernel of its operating system that use the failed memory area. In this way, the DPU interface card does not need to restart the entire operating system, thereby reducing the impact of the faulty memory on the DPU interface card as much as possible.
  • the preset condition that the failed memory in the DPU interface card satisfies may be, for example, that the size of the failed memory does not exceed a preset size, such as the number of failed rows (or columns) in the memory not exceeding a preset number of rows (or a preset number of columns).
  • the impact of the faulty memory on the DPU interface card is relatively small, so that the DPU interface card can implement fault recovery without restarting the entire operating system.
  • the preset condition satisfied by the faulty memory may specifically be that the system component using the faulty memory is a preset system component, so that when the faulty memory affects a specific system component, the DPU interface card can restart only that part of the system components to achieve fault recovery.
  • the preset condition that the faulty memory satisfies may specifically be that the number of system components using the faulty memory does not exceed a preset number. In this case, the faulty memory affects only a small number of service components and does not affect the rest. Therefore, the DPU interface card can achieve fault recovery by restarting only the affected service components, without restarting the entire operating system and all service components.
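The preset conditions above can be sketched as a small decision function. This is a minimal illustration in Python and not the method claimed in the application: the thresholds, the component names, and the choice to require all three conditions at once are assumptions made for the example (the text presents each condition as a possible alternative).

```python
from dataclasses import dataclass, field
from enum import Enum


class RestartScope(Enum):
    COMPONENTS_ONLY = "restart only the affected service components"
    FULL_OS = "restart the whole operating system"


@dataclass
class MemoryFault:
    failed_rows: int                                       # failed rows (or columns)
    affected_components: set = field(default_factory=set)  # components using that memory


# Hypothetical values; the source only says these are "preset".
MAX_FAILED_ROWS = 4
MAX_AFFECTED_COMPONENTS = 2
RESTARTABLE_COMPONENTS = {"network_protocol", "file_system"}


def choose_restart_scope(fault: MemoryFault) -> RestartScope:
    """Pick the narrowest restart that the (assumed) preset conditions allow."""
    small_enough = fault.failed_rows <= MAX_FAILED_ROWS
    few_components = len(fault.affected_components) <= MAX_AFFECTED_COMPONENTS
    all_restartable = fault.affected_components <= RESTARTABLE_COMPONENTS  # subset test
    if small_enough and few_components and all_restartable:
        return RestartScope.COMPONENTS_ONLY
    return RestartScope.FULL_OS
```

A fault that touches one row used only by the network protocol component would select the component-level restart here, while a fault spanning many rows falls back to a full operating system restart.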
  • when the DPU interface card obtains the service information stored in the memory of the host, it may specifically obtain a first address identifier from a target storage area, where the first address identifier is used to identify a memory area in the host, and the memory area is a storage area in the memory of the host for storing service information. The target storage area can be a storage area in a volatile memory or in a non-volatile memory, and the data stored in it is not lost when the operating system in the DPU interface card is restarted. In this way, after the operating system is restarted, the DPU interface card can access the memory area of the host according to the first address identifier to obtain the service information.
  • the target storage area can be located inside the DPU interface card, for example as a logic block in a CPLD included in the DPU interface card, or outside the DPU interface card, for example as a storage area in an external memory connected to the DPU interface card.
  • the DPU interface card may also apply to the host for a memory area for storing service information, and obtain the first address identifier of the memory area.
  • the first address identifier may be, for example, the first address of the memory area, so that the DPU interface card can store service information in the memory area according to the first address identifier. In this way, after the software in the DPU interface card is restarted, the DPU interface card can use the service information stored in the memory area to implement service recovery.
  • the DPU interface card may also obtain configuration information from the memory area, and the configuration information is used to configure the DPU interface card. The configuration information may specifically include a second address identifier of the sending queue (such as the first address of the sending queue) and a third address identifier of the completion queue (such as the first address of the completion queue). The sending queue is used to store the IO sent by the processor of the host, and the completion queue is used to store the execution result of the DPU interface card for that IO.
  • the configuration information may also include a communication format, a communication protocol version, etc. during data exchange between the DPU interface card and the host, or the configuration information may also include other content.
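As an illustration of how such configuration information could be laid out in a memory area, the following Python sketch packs the two queue addresses and a protocol version into a fixed binary record. The field names, field widths, and byte order are assumptions for the example; the application does not specify a format.

```python
import struct
from dataclasses import dataclass

# Hypothetical layout: two 64-bit addresses and a 16-bit protocol version,
# little-endian, without padding.
_LAYOUT = struct.Struct("<QQH")


@dataclass(frozen=True)
class CardConfig:
    send_queue_addr: int        # second address identifier (sending queue)
    completion_queue_addr: int  # third address identifier (completion queue)
    protocol_version: int       # communication protocol version

    def pack(self) -> bytes:
        """Serialize the record so it can be written to the memory area."""
        return _LAYOUT.pack(
            self.send_queue_addr, self.completion_queue_addr, self.protocol_version
        )

    @classmethod
    def unpack(cls, raw: bytes) -> "CardConfig":
        """Rebuild the record from bytes read back after a restart."""
        return cls(*_LAYOUT.unpack(raw[: _LAYOUT.size]))
```

Reading the record back with `CardConfig.unpack` after a restart would give the card the queue addresses it needs without the host having to reconfigure it.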
  • the embodiment of the present application further provides a data processing unit DPU device, configured to execute the service recovery method described in the first aspect or any implementation manner of the first aspect.
  • the present application provides a computing device. The computing device includes a host and a DPU (data processing unit) interface card, wherein the host includes a memory and a processor, and the DPU interface card is used to perform the following operations: after the software of the DPU interface card is restarted, obtain the service information stored in the memory of the host, where the service information is information generated by the DPU interface card processing the input/output (IO) sent by the processor before the software restarted; and resume services according to the service information.
  • the DPU interface card is used to execute the service restoration method described in the first aspect or any implementation manner of the first aspect.
  • the present application provides a data processing unit DPU interface card. The DPU interface card includes a printed circuit board, an interface, and a data processing unit DPU chip. The interface card communicates with the host through the interface, and the interface communicates with the DPU chip. The DPU chip is used to obtain the service information stored in the memory of the host after the software of the DPU interface card is restarted, where the service information is information generated by the DPU interface card processing the input/output (IO) sent by the processor of the host before the software restarted, and to resume the service according to the service information.
  • the DPU chip in the DPU interface card may be used to execute the service restoration method described in the first aspect or any implementation manner of the first aspect.
  • the present application provides a data processing unit DPU chip, which is applied to a DPU interface card. The DPU interface card is coupled to a host, and the DPU chip includes an acquisition circuit and a processing circuit. The acquisition circuit is used to obtain the service information stored in the memory of the host after the software of the DPU interface card is restarted, where the service information is information generated by the processing circuit processing the input/output (IO) sent by the processor of the host before the software restarted. The processing circuit is used to restore services according to the service information.
  • the obtaining circuit and the processing circuit cooperate with each other and may be used to execute the service recovery method as described in the first aspect or any implementation manner of the first aspect.
  • FIG. 1 is a schematic diagram of the architecture of an exemplary DPU interface card provided by an embodiment of the present application;
  • FIG. 2 is a schematic flowchart of a service recovery method provided in an embodiment of the present application;
  • FIG. 3 is a schematic structural diagram of a data processing unit DPU device provided in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a hardware structure of a computing device provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a DPU interface card provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a DPU chip provided by an embodiment of the present application.
  • the DPU interface card is used as the offload engine of the CPU. While assisting the CPU in processing services, it usually stores service data temporarily in the local memory of the DPU interface card. If the memory of the DPU interface card fails, or a version upgrade of the software (such as the operating system) causes the software of the DPU interface card to restart, the data in the memory of the DPU interface card is lost. For example, in actual applications, some memory areas of the DPU interface card will inevitably fail, and the DPU software needs to be restarted in order to repair the memory failure. After the restart, it is difficult for the DPU interface card to continue processing the service because the service data is missing.
  • an embodiment of the present application provides a service recovery method to recover services interrupted by the DPU interface card.
  • specifically, the service information generated by the DPU interface card when processing services is stored in advance in the memory of the host, so that after the DPU interface card restarts the software (for example, because a memory failure or a software version upgrade triggers the DPU interface card to restart the operating system), the service information can be obtained from the memory of the host, and the DPU interface card can resume and process the interrupted service by using the service information.
  • in this way, the DPU interface card can use the host memory to quickly restore services without relying on its local memory, reducing the impact on services.
  • the above service recovery method may be applied to the DPU interface card 100 shown in FIG. 1.
  • the DPU interface card 100 is coupled to the host 200 through a peripheral component interconnect express (PCIe) bus or other buses.
  • the DPU interface card 100 includes a printed circuit board 1011, an interface 1012, a DPU chip 101 and software 1013, and the interface 1012 and the DPU chip 101 are installed on the printed circuit board 1011.
  • the interface 1012 may be a PCIe interface.
  • the software 1013 can be the operating system of the DPU interface card 100, and the operating system includes a system service component unit 102, a soft reset (soft reset) unit 103, a memory supervision unit 104, and a microkernel (micro kernel) unit 105. Further, the DPU interface card 100 may also include a micro reset (micro reset) unit 106 and the like.
  • the specific implementation of the software shown in FIG. 1 can be located in the memory of the DPU interface card; the software shown in FIG. 1 can also be embedded, which is not limited in the embodiment of the present invention.
  • the DPU chip 101 is used to control the DPU interface card 100 to provide storage services for the host 200, that is, to process storage-type services, such as non-volatile memory express (NVMe), virtiofs (see https://virtio-fs.gitlab.io/), and virtio_scsi (for details, see https://www.ovirt.org/develop/release-management/features/storage/virtio-scsi.html) services.
  • the DPU chip 101 may also control the DPU interface card 100 to provide computing services for the host 200, that is, to process computing-type services.
  • the DPU chip 101 can control the DPU interface card 100 to resume the interrupted service (such as the above-mentioned storage type service or calculation type service, etc.).
  • the system service component unit 102 includes multiple service components in the kernel of the operating system. The service components can use the memory in the DPU interface card 100 to provide services for the operating system in the DPU interface card 100, such as the driver component, file system component, memory management component, and network protocol component shown in FIG. 1.
  • the driver component is used to drive the DPU interface card 100 to perform data communication with the host 200, and may include a driver framework and a group of entity service drivers; the file system component is used to provide file system services, such as storing, reading, and managing data in the form of files; the memory management component is used to provide memory management services, such as allocation, reclamation, and isolation of memory areas; the network protocol component is used to provide network protocol services, such as the Hypertext Transfer Protocol (HTTP).
  • the soft reset unit 103 is used to reset the hardware units in the DPU interface card 100, and restart software such as the operating system of the DPU interface card 100.
  • the memory supervision unit 104 is used to perform fault monitoring, fault repair, fault isolation, redundant area replacement, and the like on the memory in the DPU interface card 100, so that the memory in the DPU interface card 100 has reliability, availability, and serviceability (reliability, availability, serviceability, RAS).
  • the microkernel unit 105 is configured to manage the resources in the DPU interface card 100 and split the service components of the kernel so that the service components of the kernel can be restarted separately.
  • the micro-reset unit 106 is configured to control, through the microkernel architecture, the individual restart of service components in the kernel of the operating system.
  • the DPU interface card 100 can assist the host 200 to process services.
  • when the memory supervision unit 104 detects that the memory in the DPU interface card 100 fails, if the memory failure causes service interruption and the software of the DPU interface card 100 restarts, the DPU interface card 100 can use the service information stored in the memory of the host 200 to restore the service.
  • the memory supervision unit 104 can isolate the faulty memory location or replace the failed unit, and trigger the soft reset unit 103 to reset the hardware units in the DPU interface card 100 and restart the operating system in the DPU interface card 100.
  • the microkernel unit 105 initializes the reset hardware units, and restarts each service component in the system service component unit 102.
  • the DPU chip 101 obtains the service information stored in the memory area 201 of the host 200, and uses the service information of the DPU interface card 100 to restore the service.
  • the memory supervision unit 104 can correct the failed memory location and trigger the micro reset unit 106 to execute the micro reset process, specifically to restart the service components that use the failed part of the memory. For example, the network protocol component can be restarted by the micro reset unit 106 (the remaining, unaffected service components do not need to be restarted), so as to restore the normal network communication function of the DPU interface card 100.
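One way to picture the micro reset process is as a lookup from the failed memory range to the service components that use it. The Python sketch below is illustrative only: the memory map, the address ranges, and the component names are hypothetical and are not taken from the application.

```python
# Hypothetical memory map: the address range each service component uses.
COMPONENT_MEMORY = {
    "driver": range(0x0000, 0x1000),
    "file_system": range(0x1000, 0x2000),
    "memory_management": range(0x2000, 0x3000),
    "network_protocol": range(0x3000, 0x4000),
}


def components_to_restart(fault_start: int, fault_end: int) -> list:
    """Return the components whose memory overlaps [fault_start, fault_end)."""
    return sorted(
        name
        for name, region in COMPONENT_MEMORY.items()
        if fault_start < region.stop and region.start < fault_end
    )
```

With this map, a fault confined to the network protocol component's range selects only that component, and the other components keep running.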
  • the coupling between the DPU interface card 100 and the host 200 via the PCIe bus is taken as an example for illustration. In practical applications, the DPU interface card 100 can also be coupled with the host 200 in other ways. This embodiment does not limit it.
  • the architecture of the DPU interface card 100 shown in FIG. 1 is only an exemplary illustration. In actual applications, the DPU interface card 100 can also adopt other architectures; for example, the DPU interface card 100 can also include more service components of other types.
  • referring to FIG. 2, which is a schematic flowchart of a service restoration method in the embodiment of the present application. This method may be applied to the DPU interface card 100 shown in FIG. 1 above, or may also be applied to other applicable DPU interface cards. The following uses the DPU interface card 100 shown in FIG. 1 as an example for description.
  • the service recovery method shown in Figure 2 may specifically include:
  • the DPU interface card 100 applies for a memory area 201 from the host 200, and acquires a first address identifier of the applied memory area 201.
  • the first address identifier may be, for example, the first address of the memory area 201, or may be other identification information that indicates the memory area 201, such as its last address, which is not limited in this embodiment.
  • the DPU interface card 100 may apply for a section of memory area from the host 200 in advance, so that the applied memory area can be used later to store service information related to the services processed by the DPU interface card 100.
  • the DPU chip 101 in the DPU interface card 100 can send a request to the host 200 to apply for a memory area, so that the host 200 can respond to the request by determining a memory area 201 of a preset size from the available memory, allocating it to the DPU interface card 100, and returning the first address identifier of the memory area 201 to the DPU chip 101.
  • alternatively, the host 200 may actively allocate the memory area 201 for the DPU interface card 100 and send the corresponding first address identifier to the DPU interface card 100.
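The request-and-allocate exchange can be sketched with a toy host model. In this Python illustration the host memory is a `bytearray`, the allocator is a simple bump allocator, and the first address identifier is an offset into that array; all of these, along with the class and function names, are assumptions for the example rather than details from the application.

```python
class Host:
    """Toy host whose memory is a bytearray managed by a bump allocator."""

    def __init__(self, size: int) -> None:
        self.memory = bytearray(size)
        self._next_free = 0

    def allocate_region(self, length: int) -> int:
        """Reserve `length` bytes and return the region's first address."""
        if length <= 0 or self._next_free + length > len(self.memory):
            raise MemoryError("no free region of the requested size")
        first_address = self._next_free
        self._next_free += length
        return first_address


def request_memory_area(host: Host, preset_size: int = 256) -> int:
    """Card-side request; the returned offset plays the role of the
    first address identifier."""
    return host.allocate_region(preset_size)
```

A real card would issue this request over the bus and receive a host physical or IO virtual address; the offset here only stands in for that value.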
  • the DPU interface card 100 stores the first address identifier in the target storage area.
  • the DPU interface card 100 may be configured with a target storage area, and when the DPU interface card 100 restarts software such as an operating system, the data stored in the target storage area may not be lost.
  • a complex programmable logic device (complex programmable logic device, CPLD) may be configured in the DPU interface card 100, and the DPU interface card 100 may store the first address identifier in a logic block in the CPLD (that is, the above-mentioned target storage area).
  • the target storage area can be implemented by a non-volatile memory, such as at least one of an electrically alterable read only memory (EAROM), an electrically erasable programmable read only memory (EEPROM), and a flash memory.
  • the target storage area may also be implemented by a volatile memory, such as at least one of static random access memory (static random access memory, SRAM) and dynamic random access memory (dynamic random access memory, DRAM).
  • the target storage area can also be deployed outside the DPU interface card 100. For example, the DPU interface card 100 can be externally connected to a non-volatile memory or a volatile memory, so that the DPU interface card 100 can write the acquired first address identifier into the external non-volatile memory or volatile memory.
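The role of the target storage area, namely keeping the first address identifier across a software restart while local working state is lost, can be shown with a very small model. The Python class below is a sketch under stated assumptions: the attribute and method names are hypothetical, and a plain dictionary stands in for the CPLD logic block or external memory.

```python
class DpuCard:
    """Minimal model of a card with restart-safe and volatile storage."""

    def __init__(self) -> None:
        self.target_storage = {}  # stands in for a CPLD logic block; survives restart
        self.local_memory = {}    # volatile working state; cleared on restart

    def remember_region(self, first_address: int) -> None:
        """Store the first address identifier in the target storage area."""
        self.target_storage["first_address"] = first_address

    def restart_software(self) -> None:
        """Model a software restart: only the volatile state is lost."""
        self.local_memory = {}

    def region_address(self):
        """Return the stored first address identifier, or None if unset."""
        return self.target_storage.get("first_address")
```

After `restart_software()`, `region_address()` still returns the stored value, which is what lets the card locate the host memory area again.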
  • the execution order of step S202 and step S203 is not limited in this embodiment. For example, the DPU interface card 100 may also execute step S203 first and then execute step S202, or execute the two steps simultaneously.
  • the DPU interface card 100 stores the service information generated by processing the service in the memory area 201 of the host 200 according to the first address identifier, where the service information is information generated by the DPU interface card 100 processing the IO sent by the processor of the host 200.
  • the DPU interface card 100 can start to assist the host 200 to process one or more services.
  • the processor in the host 200 can send the input/output (input output, IO) corresponding to the service to the sending queue for storage (the processor may send one or more IOs), so that the DPU interface card 100 can read the IO from the sending queue of the host 200, and parse and execute it. The data obtained by parsing and executing the IO can be temporarily stored in the memory of the DPU interface card 100.
  • the DPU interface card 100 can also store the IO-related information generated during the execution of the IO into the memory area 201 of the host 200.
  • the IO-related information is business information, including, for example, IO execution stages, key states of IO execution, and the like.
  • in this way, the DPU interface card 100 can subsequently read service information such as the IO execution stage and the key state of IO execution from the memory area 201, and use the service information to continue executing the IO to realize service recovery. The DPU interface card 100 thereby avoids re-executing the IO, reducing the service recovery delay.
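The save-then-resume behaviour can be sketched as a checkpoint written to the host memory area after each execution stage. This Python example is a simplified illustration: the stage names, the JSON record format, and the fixed 256-byte region are assumptions, and a module-level `bytearray` stands in for memory area 201.

```python
import json

HOST_REGION = bytearray(256)  # stands in for memory area 201 on the host
STAGES = ["parsed", "data_transferred", "completed"]  # hypothetical IO stages


def save_checkpoint(io_id: int, next_stage: int) -> None:
    """Write the IO id and the index of the next stage to the host region."""
    record = json.dumps({"io": io_id, "stage": next_stage}).encode()
    HOST_REGION[:] = record.ljust(len(HOST_REGION), b"\0")


def load_checkpoint():
    """Read the checkpoint back, or return None if nothing was saved."""
    raw = bytes(HOST_REGION).rstrip(b"\0")
    return json.loads(raw) if raw else None


def run_io(io_id: int, start_stage: int = 0, crash_after=None) -> list:
    """Execute stages from start_stage; optionally stop early to model a restart."""
    executed = []
    for index in range(start_stage, len(STAGES)):
        executed.append(STAGES[index])
        save_checkpoint(io_id, index + 1)
        if crash_after == index:
            break  # the software restarts here; local state is lost
    return executed
```

If `run_io(7, crash_after=0)` is interrupted after the first stage, calling `load_checkpoint()` and passing its `stage` value as `start_stage` continues with the remaining stages instead of starting the IO over.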
  • the DPU interface card 100 may store at least part of the service information generated during the execution of the IO in the memory area 201, so as to reduce the resources that the DPU interface card 100 consumes in processing services. For example, in the initial stage of IO execution, the DPU interface card 100 may not store the current execution stage of the IO and the key states of IO execution in the memory of the host 200. Correspondingly, if the service later needs to be resumed based on the IO, the DPU interface card 100 can re-execute the IO to resume the service.
  • since the DPU interface card 100 had not started to execute the IO, or had only just started, before restarting software such as the operating system, even if the DPU interface card 100 re-executes the IO while resuming the service, the impact on the service recovery delay is small.
  • the DPU interface card 100 can store information such as the key state of IO execution and the IO execution stage in the memory area 201 of the host 200. In this way, if services later need to be resumed based on this IO, the DPU interface card 100 can continue to execute the IO according to the information saved on the host 200 without re-executing it, thereby reducing the service recovery delay.
  • the DPU interface card 100 may determine whether to store the service information generated by executing the IO into the memory area 201 according to the size of the IO. For example, when the size of the IO read by the DPU interface card 100 from the sending queue does not exceed a preset threshold, the DPU interface card 100 may not need to send the service information generated by processing the IO to the memory of the host 200 for storage during the execution of the IO. In this way, even if the DPU interface card 100 restores the service by re-executing the IO, the cost is relatively small.
  • when the size of the IO exceeds the preset threshold, the DPU interface card 100 can store service information such as the key state of IO execution and the IO execution stage in the memory of the host 200, so that the DPU interface card 100 does not have to re-execute the IO, reducing the service recovery delay.
  • the DPU interface card 100 can also consider the IO size, the IO execution progress, and other aspects together when determining whether to send the service information generated by executing the IO to the memory of the host 200 for storage.
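A combined policy of this kind can be written as a short predicate. The following Python sketch is one possible reading, with hypothetical threshold values: small IOs are never checkpointed, and larger IOs are checkpointed only once they are past their initial stage.

```python
# Hypothetical thresholds; the source only refers to a "preset threshold".
SIZE_THRESHOLD_BYTES = 64 * 1024
PROGRESS_THRESHOLD = 0.1  # fraction of the IO already executed


def should_checkpoint(io_size_bytes: int, progress: float) -> bool:
    """Decide whether to store service information for this IO on the host."""
    if io_size_bytes <= SIZE_THRESHOLD_BYTES:
        return False  # cheap to re-execute from scratch
    return progress > PROGRESS_THRESHOLD  # skip the initial stage of execution
```

The trade-off is between the cost of writing to host memory on every IO and the cost of re-executing an IO after a restart; other policies, such as weighting the two factors, would fit the description equally well.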
  • when the DPU interface card 100 is executing the IO, it can also send the IO execution result obtained during the IO execution to the memory of the host 200 for storage. In this way, when the IO is interrupted because the operating system of the DPU interface card 100 is restarted, the DPU interface card 100 can continue to process the IO from the interrupted position according to the IO execution result and the above-mentioned service information stored in the memory of the host 200, so that the service recovery delay can be further reduced.
  • the software in the DPU interface card 100 is restarted when a restart condition is met, and the restart affects the service processing or service recovery of the DPU interface card 100.
  • the software may be, for example, the operating system in the DPU interface card 100, or may be other software.
  • the following description takes the case in which the software is an operating system as an example.
  • the DPU interface card 100 may restart the operating system in some scenarios.
  • the conditions that trigger an operating system restart may include the following examples:
  • Example 1: it is detected that the memory in the DPU interface card 100 fails.
  • the DPU interface card 100 can sense in real time (or periodically) whether the memory in the DPU interface card 100 fails, such as sensing that at least one row, column or bank in the memory has failed and caused an uncorrectable error (uncorrected errors, UCE) in data access (or sensing other failures), and report the location information of the memory failure to the memory supervision unit 104.
  • the memory supervision unit 104 may isolate or replace the failed memory part according to the location information of the failure, and trigger the soft reset unit 103 to perform a soft reset process. Then, the soft reset unit 103 can reset the hardware units (such as the DPU chip 101) in the DPU interface card 100 and restart the operating system of the DPU interface card 100.
  • the soft reset unit 103 can reset all hardware units in the DPU interface card 100, in which case the PCIe link between the DPU interface card 100 and the host 200 is disconnected.
  • alternatively, the soft reset unit 103 can reset only the hardware units other than the PCIe core. Because the PCIe core is not reset, it stays connected to the host 200, so that the PCIe link between the DPU interface card 100 and the host 200 is not disconnected.
  • the PCIe core is used to establish a PCIe link with the host.
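  • The two reset scopes described above can be sketched as follows. This is only an illustration under stated assumptions: the unit names and the `soft_reset` helper are hypothetical and do not come from the application, which does not specify how the soft reset unit 103 selects the units to reset.

```python
from typing import Iterable, List

PCIE_CORE = "pcie_core"  # hypothetical name for the unit that holds the PCIe link


def soft_reset(units: Iterable[str], keep_pcie_link: bool) -> List[str]:
    """Return the hardware units that would be reset.

    With keep_pcie_link=True the PCIe core is skipped, so the link to the
    host stays up; otherwise every unit is reset and the link is dropped.
    """
    return [u for u in units if not (keep_pcie_link and u == PCIE_CORE)]


units = ["dpu_chip", "memory_controller", PCIE_CORE]
print(soft_reset(units, keep_pcie_link=True))   # PCIe core is left untouched
print(soft_reset(units, keep_pcie_link=False))  # full reset, link is dropped
```

  • Keeping the PCIe core out of the reset list is what lets the host remain unaware of the restart in this sketch.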
  • the microkernel unit 105 can initialize the hardware unit, and restart each service component in the system service component unit 102 to start each system service of the kernel, such as the kernel driver services, file system services, memory management services, and network protocol services.
  • the above takes the case where a failure detected by the DPU interface card 100 triggers the restart of the operating system as an example.
  • the DPU interface card 100 can also be triggered to restart the operating system in other ways.
  • Example 2: The previous version of the operating system of the DPU interface card 100 is upgraded.
  • the host 200 can generate an upgrade instruction for the previous version of the operating system of the DPU interface card 100 and send it to the DPU interface card 100, so that the DPU interface card 100 can carry out the operating system upgrade procedure according to the received upgrade instruction. For example, the DPU interface card 100 can read the new version of the operating system from the host 200 according to the upgrade instruction and replace the previous version with the new version; after confirming that the version upgrade is complete, the DPU interface card 100 can start and run the new version of the operating system.
  • the host 200 can periodically issue upgrade instructions to update the operating system of the DPU interface card 100 periodically; or the host 200 can generate an upgrade instruction according to the user's upgrade operation for the operating system of the DPU interface card 100 and send it to the DPU interface card 100, and so on.
  • Example 3: An instruction to restart the operating system is received.
  • the host 200 can generate a restart instruction according to the user's restart operation for the operating system of the DPU interface card 100 and send it to the DPU interface card 100, so that, after receiving the restart instruction, the DPU interface card 100 restarts and runs the operating system.
  • the DPU interface card 100 may also restart the operating system when other possible conditions are met; for example, the operating system of the DPU interface card 100 can automatically trigger its own restart when an error occurs during operation. Alternatively, what is restarted may be other software in the DPU interface card 100, which is not limited in this embodiment.
  • After the DPU interface card 100 restarts the operating system, the service data temporarily stored in the internal memory of the DPU interface card 100 is lost, so service processing on the DPU interface card 100 may be interrupted. For this reason, in this embodiment, the DPU interface card 100 continues with the following steps to recover and resume processing of the interrupted service.
  • the DPU chip 101 in the DPU interface card 100 can obtain the first address identifier (for example, the first address of the memory area 201) from the target storage area. The first address identifier indicates the memory area 201 that the DPU interface card 100 applied for from the host 200 in advance, so the DPU chip 101 can access the memory area 201 of the host according to the first address identifier and read the service information stored there.
  • the DPU interface card 100 restores the service according to the acquired service information.
  • the acquired service information may specifically be the data generated when the DPU interface card 100 executed the unfinished IO, so that the DPU chip 101 can acquire the unfinished IO from the host 200 and, according to the current execution stage of the IO and the key state of the IO execution stored in the memory area 201, continue executing the IO from the current execution stage, thereby restoring processing of the service.
  • the DPU chip 101 may continue to execute the IO from the interrupted position of the current execution stage according to the IO execution result stored in the memory area 201 .
  • the memory area 201 may not record the relevant information of the IO. At this time, the DPU chip 101 can directly re-execute the IO.
  • the service information stored in the memory area 201 may cover only some of the unfinished IOs, for example the execution stage of each such IO and the execution results obtained in that stage; for other IOs that the DPU interface card 100 has started but not completed, no related information may be stored in the memory area 201. Therefore, after the DPU interface card 100 acquires an unfinished IO from the sending queue of the host 200, it may check whether the service information stored in the memory area 201 includes information related to that IO. If such information is found, the DPU interface card 100 can continue executing the IO from the point of interruption according to that information; if not, the DPU interface card 100 can re-execute the IO.
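  • The resume-or-re-execute decision described above can be sketched as follows. This is a minimal illustration, not the implementation of the application: the `IoRecord` layout, the integer stage numbering and the `plan_recovery` helper are assumptions, and the memory area 201 is modelled as a plain dictionary keyed by a hypothetical IO id.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class IoRecord:
    """Hypothetical per-IO record kept in the host memory area."""
    stage: int             # execution stage that was in progress
    partial_result: bytes  # execution result saved for that stage


def plan_recovery(unfinished: Iterable[int],
                  memory_area: Dict[int, IoRecord]) -> Dict[int, int]:
    """Map each unfinished IO id to the stage at which execution restarts.

    An IO with a saved record resumes at its recorded stage; an IO without
    one is re-executed from stage 0.
    """
    plan = {}
    for io_id in unfinished:
        record = memory_area.get(io_id)
        plan[io_id] = record.stage if record is not None else 0
    return plan


# IO 7 has a saved record, IO 8 does not.
memory_area = {7: IoRecord(stage=2, partial_result=b"\x01\x02")}
print(plan_recovery([7, 8], memory_area))
```

  • In this sketch IO 7 resumes at its recorded stage, while IO 8, which has no record, starts again from the beginning.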
  • the DPU interface card 100 can quickly restore the service using the service information stored in the memory of the host 200, thereby reducing the impact on the service.
  • the DPU interface card 100 may leave the PCIe core un-reset, so that the DPU interface card 100 keeps its PCIe link with the host 200 through the PCIe core, and the PCIe link between the DPU interface card 100 and the host 200 is not disconnected.
  • the host 200 may not perceive the fault state of the DPU interface card 100 and the change of the upgrade state, thereby reducing the impact on the host 200.
  • the service recovery process of the DPU interface card 100 has relatively low requirements on hardware and operating systems, and can be compatible with various types of computing devices and operating systems, thereby improving the universality of solution implementation.
  • the DPU interface card 100 can directly trigger the soft reset unit 103 to reset the hardware unit and restart software such as the operating system when a memory failure occurs. In other implementations, fault recovery may instead be realized by restarting only some service components in the kernel of the operating system.
  • when the DPU interface card 100 detects a memory failure, it can further determine whether the failed memory meets the preset condition. When the failed memory meets the preset condition, the DPU interface card 100 restarts the service component in the kernel of the operating system that uses the faulty memory and determines the IO corresponding to the data stored in the faulty memory, so as to resume service operation by re-executing the IO; alternatively, the DPU interface card 100 can continue executing the IO based on the relevant information of the IO stored in the memory area 201 to resume service operation.
  • the DPU interface card 100 can implement fault recovery without restarting the entire operating system and reconfiguring the DPU interface card 100 , thereby reducing the cost of fault recovery.
  • the DPU interface card 100 can resume the interrupted service through the method of the embodiment shown in FIG. 2 .
  • different processing methods can be used to repair the fault according to the fault condition of the memory of the DPU interface card 100 , and the flexibility of the DPU interface card 100 to repair the faulty memory can be improved.
  • the preset condition that the faulty memory satisfies may specifically be that the size of the faulty memory does not exceed a preset size, for example that the number of faulty rows (or columns) in the memory does not exceed a preset number of rows (or a preset number of columns).
  • the faulty memory portion has relatively little impact on the DPU interface card 100 , therefore, the DPU interface card 100 can implement fault recovery without restarting the entire operating system.
  • the preset condition that the faulty memory satisfies may specifically be that the system component using the faulty memory is a preset system component, so that when the faulty memory affects a specific system component, the DPU interface card 100 can isolate or replace the faulty portion of memory and restart that system component to effect fault recovery.
  • the preset condition that the faulty memory satisfies may specifically be that the number of system components using the faulty memory does not exceed a preset number. At this point, the faulty memory only affects a small number of service components, but does not affect the rest of the service components. Therefore, the DPU interface card 100 can restart the affected service components without restarting the entire operating system (or other software) and all service components for fault recovery.
  • the preset condition satisfied by the faulty memory may also be other conditions, which are not limited in this embodiment.
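  • The three example preset conditions and the resulting choice between restarting only the affected service components and restarting the whole operating system can be sketched as follows. The thresholds, component names and function names are hypothetical; the application does not fix any of these values, and it presents the conditions as alternatives, so the sketch takes the condition as a parameter.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class MemoryFault:
    faulty_rows: int                     # number of failed rows (or columns)
    affected_components: FrozenSet[str]  # service components using that memory


def size_within_limit(fault: MemoryFault, max_rows: int = 4) -> bool:
    """Condition 1: the faulty region is no larger than a preset size."""
    return fault.faulty_rows <= max_rows


def only_preset_components(
        fault: MemoryFault,
        preset: FrozenSet[str] = frozenset({"file_system", "network_protocol"}),
) -> bool:
    """Condition 2: every affected component is one of the preset components."""
    return fault.affected_components <= preset


def few_components(fault: MemoryFault, max_components: int = 2) -> bool:
    """Condition 3: no more than a preset number of components are affected."""
    return len(fault.affected_components) <= max_components


def choose_recovery(fault: MemoryFault,
                    condition: Callable[[MemoryFault], bool]) -> str:
    """Restart only the affected components if the condition holds, else the OS."""
    return "restart_components" if condition(fault) else "restart_os"


fault = MemoryFault(faulty_rows=2, affected_components=frozenset({"file_system"}))
print(choose_recovery(fault, size_within_limit))
```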
  • the DPU interface card 100 may complete corresponding configuration in advance, so as to realize normal communication between the DPU interface card 100 and the host 200 .
  • the DPU interface card 100 and the host 200 can be pre-configured with a unified data communication format, communication protocol version, command parsing rules, and so on, and the DPU interface card 100 can be configured with a sending queue (SQ) and a completion queue (CQ) in the host memory, where the sending queue stores at least one IO sent by the processor in the host 200 to the DPU interface card 100 for processing services, and the completion queue stores the execution results of the IO fed back by the DPU interface card 100.
  • the DPU interface card 100 may lose its original configuration after restarting the software; therefore, in a further possible implementation, after obtaining the first address identifier, the DPU interface card 100 may also store the configuration information used to configure the DPU interface card 100 in the memory area 201 according to the first address identifier.
  • the DPU interface card 100 may be manually configured by a technician in advance, so that the DPU interface card 100 may generate a corresponding configuration file based on the configuration operation of the technician and send it to the memory area 201 .
  • alternatively, the configuration file may be generated by the host 200, which automatically uses it to configure the DPU interface card 100, and the configuration file is written by the host 200; this embodiment does not limit this.
  • the configuration information includes the second address identifier of the send queue and the third address identifier of the completion queue in the host 200, so that, once configured, the DPU interface card 100 can access the send queue in the host 200 according to the second address identifier and can send the final result of an executed IO to the completion queue of the host 200 according to the third address identifier, thereby resuming processing of the interrupted service.
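  • One way to picture the configuration information is as a small fixed-layout record saved in the host memory area and read back after the restart. The sketch below assumes a layout of two little-endian 64-bit addresses and models the memory area as a `bytearray`; the layout, the offset and all names (`QueueConfig`, `save_config`, `load_config`) are illustrative assumptions, not details given in the application.

```python
import struct
from dataclasses import dataclass

# Assumed layout: two little-endian 64-bit addresses, SQ first, then CQ.
_LAYOUT = struct.Struct("<QQ")


@dataclass(frozen=True)
class QueueConfig:
    sq_addr: int  # second address identifier: start of the send queue
    cq_addr: int  # third address identifier: start of the completion queue


def save_config(memory_area: bytearray, offset: int, cfg: QueueConfig) -> None:
    """Write the configuration into the host memory area before a restart."""
    memory_area[offset:offset + _LAYOUT.size] = _LAYOUT.pack(cfg.sq_addr, cfg.cq_addr)


def load_config(memory_area: bytearray, offset: int) -> QueueConfig:
    """Read the configuration back after the software has restarted."""
    return QueueConfig(*_LAYOUT.unpack_from(memory_area, offset))


area = bytearray(64)  # stand-in for the host memory area
save_config(area, 16, QueueConfig(sq_addr=0x1000, cq_addr=0x2000))
restored = load_config(area, 16)
print(hex(restored.sq_addr), hex(restored.cq_addr))
```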
  • the service recovery method provided by the embodiment of the present application is introduced above with reference to FIG. 1 and FIG. 2 .
  • the functions of the data processing unit DPU device provided by the embodiment of the present application and the computing equipment for implementing the data processing unit are introduced in conjunction with the accompanying drawings.
  • FIG. 3 shows a schematic structural diagram of a data processing unit (DPU) device.
  • the DPU device 300 shown in FIG. 3 is coupled with a host (not shown in FIG. 3 ), and the DPU device 300 includes:
  • the acquiring module 301 is configured to acquire the service information stored in the memory of the host after the software of the DPU device 300 is restarted, where the service information is information generated when, before the software is restarted, the DPU device 300 processes an input/output (IO) request sent by the processor of the host;
  • a recovery module 302, configured to recover services according to the service information.
  • the restarted software may be, for example, an operating system, a component in the kernel of the operating system, or other software except the operating system.
  • the DPU device 300 is coupled with the host through a Peripheral Component Interconnect Express (PCIe) bus, and, during the software restart, the PCIe link between the DPU device 300 and the host is not disconnected.
  • the restart of the software of the DPU device 300 is triggered by a failure of the DPU device 300 or is triggered by a software upgrade of a previous version of the software of the DPU device 300 .
  • the DPU device 300 further includes the startup module 303, configured to restart the software when the memory of the DPU device 300 fails.
  • the startup module 303 is configured to restart the operating system of the DPU device 300 when the memory of the DPU device 300 fails and the failed memory does not meet a preset condition.
  • the startup module 303 is further configured to restart, when the memory of the DPU device 300 fails and the failed memory satisfies the preset condition, the service component in the kernel of the operating system of the DPU device 300 that uses the failed memory.
  • the obtaining module 301 is configured to:
  • the DPU device 300 further includes:
  • An application module 304 configured to apply for the memory area from the host and obtain the first address identifier of the memory area before the software is restarted;
  • the storage module 305 is configured to store the service information in the memory area according to the first address identifier.
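  • The interplay of the application module, the storage module and the acquiring module can be sketched with a toy model. Everything here is hypothetical: the `Host` and `DpuCard` classes, the dictionary standing in for the target storage area that survives a restart, and the base address are assumptions made purely for illustration.

```python
from typing import Dict


class Host:
    """Toy host whose memory outlives a restart of the DPU software."""

    def __init__(self) -> None:
        self._areas: Dict[int, Dict[int, bytes]] = {}
        self._next_base = 0x1000  # arbitrary illustrative base address

    def alloc_area(self, size: int) -> int:
        """Reserve a memory area and return its first address."""
        base = self._next_base
        self._next_base += size
        self._areas[base] = {}
        return base

    def write(self, base: int, io_id: int, info: bytes) -> None:
        self._areas[base][io_id] = info

    def read(self, base: int) -> Dict[int, bytes]:
        return dict(self._areas[base])


class DpuCard:
    def __init__(self, host: Host, target_storage: Dict[str, int]) -> None:
        self.host = host
        self.target_storage = target_storage   # assumed to survive restarts
        self.local_cache: Dict[int, bytes] = {}  # lost on restart

    def apply_for_area(self, size: int) -> None:
        # Application module: apply for the area and keep its first address.
        self.target_storage["first_address"] = self.host.alloc_area(size)

    def store_service_info(self, io_id: int, info: bytes) -> None:
        # Storage module: mirror the service information into host memory.
        self.local_cache[io_id] = info
        self.host.write(self.target_storage["first_address"], io_id, info)

    def restart_software(self) -> None:
        self.local_cache.clear()  # local service data is lost

    def fetch_service_info(self) -> Dict[int, bytes]:
        # Acquiring module: read the information back via the first address.
        return self.host.read(self.target_storage["first_address"])


card = DpuCard(Host(), target_storage={})
card.apply_for_area(4096)
card.store_service_info(1, b"stage=2")
card.restart_software()
print(card.fetch_service_info())
```

  • The point of the sketch is only that the first address identifier is kept somewhere that survives the restart, so the service information in host memory remains reachable after the local copy is gone.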
  • the DPU device 300 may also obtain configuration information from the memory area, where the configuration information is used to configure the DPU device 300 and includes the second address identifier of the send queue and the third address identifier of the completion queue in the memory of the host; the send queue is used to store the IO, and the completion queue is used to store the result of the DPU device 300 executing the IO.
  • Because the DPU device 300 shown in FIG. 3 can implement the method shown in FIG. 2, for the specific implementation of the DPU device 300 and its technical effects, refer to the relevant descriptions in the foregoing embodiments; details are not repeated here.
  • the DPU device 300 shown in FIG. 3 may be implemented by an ASIC, or by a general-purpose CPU and an ASIC, or by software, or by a combination of software and hardware, which is not limited in this embodiment of the present invention.
  • Figure 4 provides a computing device.
  • the computing device 400 includes a host 401 and a DPU interface card 402 , and the host 401 and the DPU interface card 402 are coupled through a bus 403 .
  • the bus 403 may be a peripheral component interconnect standard (peripheral component interconnect, PCI) bus or an extended industry standard architecture (extended industry standard architecture, EISA) bus, etc.
  • the bus can be divided into address bus, data bus, control bus and so on. For ease of representation, only one thick line is used in FIG. 4 , but it does not mean that there is only one bus or one type of bus.
  • the host 401 includes a memory 4011 and a processor 4012 , and the memory 4011 and the processor 4012 may be coupled through a bus 4013 .
  • the bus 4013 can be a PCI bus or an EISA bus, etc.
  • the bus can be divided into address bus, data bus, control bus and so on. For ease of representation, only one thick line is used in FIG. 4 , but it does not mean that there is only one bus or one type of bus.
  • the processor 4012 can be any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), a digital signal processor (DSP), and the like.
  • the memory 4011 may include a volatile memory, such as a random access memory (RAM). The memory 4011 may also include a non-volatile memory, such as a read-only memory (ROM), flash memory, a hard disk drive (HDD) or a solid state drive (SSD).
  • the DPU interface card 402 may specifically be used to implement the method executed by the DPU interface card 100 in the embodiment shown in FIG. 2 above.
  • Computing device 400 may be a server, a storage array, or a distributed storage system.
  • FIG. 5 shows a schematic structural diagram of a DPU interface card.
  • the DPU interface card 500 includes a printed circuit board 501, an interface 502 and a DPU chip 503, and the DPU interface card 500 communicates with the host through the interface 502, and the interface 502 and the DPU chip 503 are installed on the printed circuit board.
  • the interface 502 and the DPU chip 503 can communicate through lines on the printed circuit board, or through cable communication, or bus communication, or the interface 502 and the DPU chip 503 are integrated together.
  • one way of integrating the interface 502 and the DPU chip 503 is to package them in a single chip.
  • the DPU interface card 500 is used to implement the service recovery method performed by the DPU interface card 100 in the above embodiment shown in FIG. 2 .
  • the printed circuit board 501 the interface 502 and the DPU chip 503
  • FIG. 6 shows a schematic structural diagram of a DPU chip.
  • the DPU chip 600 is applied to a DPU interface card (not shown in Figure 6) coupled with the host, such as the DPU interface card 100 in the foregoing embodiment;
  • the DPU chip 600 includes an acquisition circuit 601 and a processing circuit 602. The acquisition circuit 601 implements the data acquisition function of the DPU chip 600, such as acquiring the service information stored in the memory of the host, where the service information is information generated when, before the software on the DPU interface card restarts, the processing circuit 602 processes an input/output (IO) request sent by the processor of the host. The processing circuit 602 implements the data processing function of the DPU chip 600, such as recovering the service according to the service information obtained by the acquisition circuit 601.
  • the DPU chip 600 may be an application specific integrated circuit (ASIC).
  • the acquisition circuit 601 and the processing circuit 602 cooperate with each other, and can be used to implement the service recovery method executed by the DPU chip 101 in the DPU interface card 100 in the embodiment shown in FIG. 2 above.
  • the embodiment of the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be any available medium that a computing device can store, or a data storage device such as a data center that includes one or more available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media (eg, solid state hard disk), etc.
  • the computer-readable storage medium includes instructions, and the instructions instruct a computing device to execute the above service recovery method.
  • the embodiment of the present application also provides a computer program product.
  • the computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on the computing device, the processes or functions according to the embodiments of the present application will be generated in whole or in part.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer or data center to another website, computer or data center by wire (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (such as infrared, radio, or microwave).
  • the computer program product may be a software installation package, and if any of the aforementioned service recovery methods needs to be used, the computer program product may be downloaded and executed on the computing device.

Abstract

The present application provides a service recovery method. The method is executed by a DPU interface card, and specifically, after software (e.g. an operating system, another application program, etc.) of the DPU interface card is restarted, the DPU interface card acquires service information stored in a memory of a host, wherein the service information is information generated when the DPU interface card processes, before the software is restarted, an IO sent by a processor of the host, such that the DPU interface card recovers a service according to the service information stored in the memory of the host. Therefore, a DPU interface card does not need to depend on a local memory to recover an interrupted service, such that even if service data of a service is lost due to a version upgrade of software (e.g. an operating system, etc.) of the DPU interface card, a memory fault of the DPU interface card or other reasons, the DPU interface card can also quickly recover the service by using a memory of a host, thereby reducing the impact on the service.

Description

A service recovery method, data processing unit and related device
This application claims priority to Chinese patent application No. 202111540861.5, filed with the China National Intellectual Property Administration on December 16, 2021 and entitled "A data processing unit processing method, data processing unit processing and system", and to Chinese patent application No. 202210269274.5, filed with the China National Intellectual Property Administration on March 18, 2022 and entitled "A service recovery method, data processing unit and related device", both of which are incorporated herein by reference in their entirety.
Technical Field
The present application relates to the technical field of data processing, and in particular to a service recovery method, a data processing unit and a related device.
Background
As service processing requirements grow, ever higher computing power is demanded of the central processing unit (CPU). At present, a data processing unit (DPU) interface card can serve as an offload engine for the CPU and can achieve more efficient data processing by processing services in cooperation with the CPU. The DPU interface card may include an application-specific integrated circuit (ASIC), a processor, and a memory; the ASIC and the processor can provide computing power for the CPU, and the memory can temporarily store service information of the DPU interface card.
In practical application scenarios, an upgrade of the operating system of the DPU interface card, or a memory failure in the DPU interface card, such as a row, column or bank failure in the memory, causes the operating system of the DPU interface card to restart. As a result, the service information stored in the memory of the DPU interface card is lost, which leads to service interruption.
Summary
In view of this, embodiments of the present application provide a service recovery method, so that a DPU interface card can recover services after the software in the DPU interface card is restarted. The present application also provides a corresponding data processing unit, computing device, interface card, computer-readable storage medium, and computer program product.
In a first aspect, an embodiment of the present application provides a service recovery method. The method is executed by a DPU interface card that is coupled to a host, for example through a bus. To recover a service, the DPU interface card, after its software (such as an operating system) is restarted, obtains service information stored in the memory of the host. The service information is information generated when, before the software is restarted, the DPU interface card processes an IO sent by the processor of the host. The DPU interface card can thus resume processing the interrupted service according to the service information stored in the memory of the host.
Because the DPU interface card can use the service information stored in the memory of the host to recover the service promptly after restarting the software, it does not need to rely on its local memory to recover the interrupted service. Thus, even if the software of the DPU interface card (such as an operating system) is restarted because of a version upgrade, a memory failure or another reason, and service information is lost, the DPU interface card can still use the host memory to recover the service quickly, reducing the impact on the service.
In a possible implementation, the DPU interface card is coupled to the host through a PCIe bus, and the PCIe link between the DPU interface card and the host is not disconnected while the software in the DPU interface card is restarting. In this way, the host need not perceive the state change of the DPU interface card during the software restart, which reduces the impact on the host.
In a possible implementation, the software restart of the DPU interface card is triggered by a failure of the DPU interface card or by an upgrade of the previous version of the software of the DPU interface card, so as to repair the failure of the DPU interface card or upgrade its software.
In other implementations, the DPU interface card may also restart the software after receiving an instruction to restart the software sent by the host.
In a possible implementation, when the memory of the DPU interface card fails, the DPU interface card restarts the software in order to repair the faulty memory of the DPU interface card.
In a possible implementation, when the memory of the DPU interface card fails and the failed memory does not satisfy a preset condition, the DPU interface card restarts its operating system. In this way, when a memory failure occurs, the DPU interface card can decide whether to restart its operating system according to the nature of the failure, which constrains the conditions under which the operating system of the DPU interface card is restarted.
In a possible implementation, when the memory of the DPU interface card fails and the failed memory satisfies the preset condition, the DPU interface card may instead restart the service component in the kernel of its operating system that uses the failed memory region. In this way, the DPU interface card does not need to restart the entire operating system, which keeps the impact of the faulty memory on the DPU interface card as small as possible.
For example, the preset condition satisfied by the failed memory in the DPU interface card may be that the size of the faulty memory does not exceed a preset size, for example that the number of faulty rows (or columns) in the memory does not exceed a preset number of rows (or a preset number of columns). In this case, the faulty memory has relatively little impact on the DPU interface card, so the DPU interface card can repair the fault without restarting the entire operating system.
Alternatively, the preset condition satisfied by the faulty memory may be that the system component using the faulty memory is a preset system component, so that when the faulty memory affects a specific system component, the DPU interface card can restart that system component to repair the fault.
Alternatively, the preset condition satisfied by the faulty memory may be that the number of system components using the faulty memory does not exceed a preset number. In this case, the faulty memory affects only a small number of service components and leaves the rest unaffected, so the DPU interface card can repair the fault by restarting only the affected service components, without restarting the entire operating system and all service components.
In a possible implementation, when obtaining the service information stored in the memory of the host, the DPU interface card may obtain a first address identifier from a target storage area. The first address identifier identifies a memory area in the host, which is a section of the host's memory used to store the service information. The target storage area may be a storage area in a volatile memory or in a non-volatile memory, and the data stored in it is not lost when the operating system in the DPU interface card restarts. In this way, after restarting the operating system, the DPU interface card can access the memory area of the host according to the first address identifier to obtain the service information.
For example, the memory area may be located inside the DPU interface card, for example as a logic block in a CPLD included in the DPU interface card. Alternatively, the memory area may be located outside the DPU interface card, for example as a storage region in an external memory connected to the DPU interface card.
In a possible implementation, before the software is restarted, the DPU interface card may also apply to the host for a memory area for storing the service information and obtain the first address identifier of that memory area. The first address identifier may be, for example, the first address of the memory area, so that the DPU interface card can store the service information in the memory area according to the first address identifier. In this way, after the software in the DPU interface card is restarted, the DPU interface card can use the service information stored in the memory area to recover the service.
In a possible implementation, the DPU interface card may also obtain configuration information from the memory area. The configuration information is used to configure the DPU interface card and may specifically include a second address identifier of a send queue in the memory of the host (such as the first address of the send queue) and a third address identifier of a completion queue (such as the first address of the completion queue). The send queue is used to store the IO sent by the processor of the host, and the completion queue is used to store the result of the DPU interface card executing that IO.
Optionally, the configuration information may also include the communication format, the communication protocol version, and the like used for data exchange between the DPU interface card and the host, or the configuration information may include other content.
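As an illustration only, the configuration information described above could be modeled as a small record. The class `DpuConfig`, the function `parse_config` and all field names are assumptions made for this sketch; the application does not define a concrete layout.

```python
from dataclasses import dataclass, field


@dataclass
class DpuConfig:
    """Hypothetical container for the configuration information."""
    send_queue_addr: int          # second address identifier (e.g. first address of the send queue)
    completion_queue_addr: int    # third address identifier (e.g. first address of the completion queue)
    comm_format: str = ""         # optional communication format
    protocol_version: str = ""    # optional communication protocol version
    other: dict = field(default_factory=dict)  # any other content


def parse_config(raw):
    """Build a DpuConfig from a dict read out of the memory area."""
    known = {"send_queue_addr", "completion_queue_addr",
             "comm_format", "protocol_version"}
    other = {k: v for k, v in raw.items() if k not in known}
    return DpuConfig(
        send_queue_addr=raw["send_queue_addr"],
        completion_queue_addr=raw["completion_queue_addr"],
        comm_format=raw.get("comm_format", ""),
        protocol_version=raw.get("protocol_version", ""),
        other=other,
    )
```

The two queue addresses are mandatory in this sketch because the text lists them as the specific content of the configuration information, while the remaining fields are described as optional.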
In a second aspect, an embodiment of the present application further provides a data processing unit (DPU) apparatus, configured to perform the service recovery method described in the first aspect or any implementation of the first aspect.
In a third aspect, the present application provides a computing device. The computing device includes a host and a DPU (data processing unit) interface card, where the host includes a memory and a processor, and the DPU interface card is configured to perform the following operations: after the software of the DPU interface card is restarted, obtaining service information stored in the memory of the host, where the service information is information generated by the DPU interface card processing input/output (IO) sent by the processor before the software was restarted; and recovering the service according to the service information. For example, the DPU interface card is configured to perform the service recovery method described in the first aspect or any implementation of the first aspect.
In a fourth aspect, the present application provides a data processing unit (DPU) interface card. The DPU interface card includes a printed circuit board, an interface, and a DPU chip. The interface card communicates with a host through the interface, and the interface and the DPU chip are mounted on the printed circuit board. The DPU chip is configured to: after the software of the DPU interface card is restarted, obtain service information stored in the memory of the host, where the service information is information generated by the DPU interface card processing input/output (IO) sent by the processor of the host before the software was restarted; and recover the service according to the service information.
For example, the DPU chip in the DPU interface card may be configured to perform the service recovery method described in the first aspect or any implementation of the first aspect.
In a fifth aspect, the present application provides a data processing unit (DPU) chip applied in a DPU interface card, where the DPU interface card is coupled to a host. The DPU chip includes an acquisition circuit and a processing circuit. The acquisition circuit is configured to obtain, after the software of the DPU interface card is restarted, service information stored in the memory of the host, where the service information is information generated by the processing circuit processing input/output (IO) sent by the processor of the host before the software was restarted. The processing circuit is configured to recover the service according to the service information.
For example, the acquisition circuit and the processing circuit cooperate with each other and may be configured to perform the service recovery method described in the first aspect or any implementation of the first aspect.
On the basis of the implementations provided in the foregoing aspects, the present application may further combine them to provide more implementations.
Description of Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some of the embodiments recorded in the present application, and a person of ordinary skill in the art can derive other drawings from them.
FIG. 1 is a schematic diagram of the architecture of an exemplary DPU interface card provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of a service recovery method provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a data processing unit (DPU) apparatus provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of the hardware structure of a computing device provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a DPU interface card provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a DPU chip provided by an embodiment of the present application.
Detailed Description
The solutions in the embodiments provided by the present application are described below with reference to the drawings of the present application.
The terms "first", "second", "third" and the like in the specification, the claims, and the above drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; they are merely the way in which the embodiments of the present application distinguish objects that have the same attribute.
As an offload engine for the CPU, the DPU interface card usually buffers service data in its local memory while assisting the CPU in processing services. When the memory of the DPU interface card fails, or when software in the DPU interface card (such as the operating system) undergoes a version upgrade, the corresponding software of the DPU interface card restarts, and the data in the memory of the DPU interface card is lost. For example, in practice some memory areas of the DPU interface card will inevitably fail. To repair the memory fault, the software of the DPU needs to be restarted, so the service data buffered in those memory areas may be lost, and the DPU interface card has difficulty continuing to process the service because the service data is missing. In addition, when the operating system or other software in the DPU interface card needs a version upgrade, the service data stored in the memory of the DPU interface card is also lost because the operating system of the DPU interface card is restarted, which affects the processing of the service by the DPU interface card.
On this basis, an embodiment of the present application provides a service recovery method for recovering a service interrupted on the DPU interface card. In a specific implementation, before the software of the DPU interface card is restarted, the service information generated by the DPU interface card while processing the service is stored in advance in the memory of the host. After the DPU interface card restarts the software (for example, when a memory fault or a software version upgrade triggers the DPU interface card to restart the operating system), it can obtain the service information from the memory of the host and use it to resume processing the interrupted service. In this way, even if the DPU interface card undergoes a software restart for the reasons described above and the service data stored in its memory is lost, the DPU interface card can still use the host memory to recover the service quickly, without relying on its local memory, which reduces the impact on the service.
For example, the above service recovery method may be applied to the DPU interface card 100 shown in FIG. 1. As shown in FIG. 1, the DPU interface card 100 is coupled to the host 200 through a peripheral component interconnect express (PCIe) bus or another bus. The DPU interface card 100 includes a printed circuit board 1011, an interface 1012, a DPU chip 101, and software 1013, and the interface 1012 and the DPU chip 101 are mounted on the printed circuit board 1011. Specifically, the interface 1012 may be a PCIe interface. The software 1013 may be the operating system of the DPU interface card 100, which includes a system service component unit 102, a soft reset unit 103, a memory supervision unit 104, and a microkernel unit 105. Further, the DPU interface card 100 may also include a micro reset unit 106 and the like. In a specific implementation, the software shown in FIG. 1 may be located in a memory of the DPU interface card; the software shown in FIG. 1 may also be embedded software or the like, which is not limited in this embodiment of the present invention.
The DPU chip 101 is configured to control the DPU interface card 100 to provide storage services for the host 200, that is, to process storage-type services such as non-volatile memory express (NVMe), virtiofs (see https://virtio-fs.gitlab.io/ for details), and virtio_scsi (see https://www.ovirt.org/develop/release-management/features/storage/virtio-scsi.html for details) services. Alternatively, the DPU chip 101 may control the DPU interface card 100 to provide computing services for the host 200, that is, to process computing-type services. After software such as the operating system of the DPU interface card 100 has finished restarting, the DPU chip 101 can control the DPU interface card 100 to recover the interrupted service (such as the above storage-type or computing-type services).
The system service component unit 102 includes multiple service components in the kernel of the operating system. These service components can use the memory in the DPU interface card 100 to provide services for the operating system in the DPU interface card 100; examples are the drivers component, the file system component, the memory management component, and the network protocol component shown in FIG. 1. The drivers component is used to drive data communication between the DPU interface card 100 and the host 200 and may include a driver framework and a set of concrete service drivers. The file system component is used to provide file system services, such as storing, reading, and managing data in the form of files. The memory management component is used to provide memory management services, such as allocating, reclaiming, and isolating memory areas. The network protocol component is used to provide network protocol services, such as the Hypertext Transfer Protocol (HTTP).
The soft reset unit 103 is configured to reset the hardware units in the DPU interface card 100 and to restart software such as the operating system of the DPU interface card 100.
The memory supervision unit 104 is configured to perform fault monitoring, fault repair, fault isolation, redundant area replacement, and the like on the memory in the DPU interface card 100, so that the memory in the DPU interface card 100 has reliability, availability, and serviceability (RAS).
The microkernel unit 105 is configured to manage the resources in the DPU interface card 100 and to split the service components of the kernel so that they can be restarted individually.
The micro reset unit 106 is configured to use the microkernel architecture to control the individual restart of service components in the kernel of the operating system.
Normally, the DPU interface card 100 assists the host 200 in processing services. When the memory supervision unit 104 detects that the memory in the DPU interface card 100 has failed, and the memory fault interrupts a service and causes the software of the DPU interface card 100 to restart, the DPU interface card 100 can use the service information stored in the memory of the host 200 to recover the service. For example, the memory supervision unit 104 may isolate the faulty memory location or replace the failed unit, and trigger the soft reset unit 103 to reset the hardware units in the DPU interface card 100 and restart the operating system in the DPU interface card 100. The microkernel unit 105 then initializes the restarted hardware units and restarts each service component in the system service component unit 102. After the soft reset unit 103 has completed the restart of the operating system, the DPU chip 101 obtains the service information stored in the memory area 201 of the host 200 and uses it to recover the service of the DPU interface card 100.
Further, the scope of the memory fault in the DPU interface card 100 may be small, for example involving only the network protocol component in the kernel of the operating system that uses the failed part of the memory. In this case, the memory supervision unit 104 may isolate the faulty memory location and trigger the micro reset unit 106 to execute a micro reset procedure, that is, to restart the service components that use the failed part of the memory. For example, the micro reset unit 106 may restart the network protocol component (the remaining unaffected service components need not be restarted) to restore the normal network communication function of the DPU interface card 100.
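The ordering of the two recovery paths described above (soft reset versus micro reset) can be sketched as follows. The function `recovery_steps` and the rule used here to decide that a fault is "small" (fewer components affected than exist in total) are assumptions for illustration; the application leaves the exact criterion open.

```python
def recovery_steps(affected, all_components):
    """Return the ordered list of actions for a memory fault.

    affected: set of service components using the failed memory.
    all_components: list of all kernel service components.
    """
    steps = ["isolate faulty memory"]
    # Assumed criterion: the fault scope is small if only some components are affected.
    small_scope = 0 < len(affected) < len(all_components)
    if small_scope:
        # Micro reset: restart only the affected service components.
        steps += [f"micro reset: restart {c}" for c in sorted(affected)]
    else:
        # Soft reset: reset hardware, restart the OS, then recover the service.
        steps.append("soft reset: reset hardware units")
        steps.append("restart operating system")
        steps.append("microkernel: initialize hardware units")
        steps += [f"restart {c}" for c in all_components]
        steps.append("read service information from host memory area 201")
        steps.append("resume interrupted service")
    return steps
```

In the micro reset branch the unaffected components keep running, which is why no step for reading service information from the host is listed there in this sketch.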
It is worth noting that FIG. 1 uses coupling between the DPU interface card 100 and the host 200 through a PCIe bus as an example. In practice, the DPU interface card 100 may also be coupled to the host 200 in other ways, which is not limited in this embodiment. In addition, the architecture of the DPU interface card 100 shown in FIG. 1 is only an example. In practice, the DPU interface card 100 may adopt other architectures; for example, the DPU interface card 100 may include additional types of service components.
Next, various non-limiting specific implementations of the service recovery process are described in detail.
FIG. 2 is a schematic flowchart of a service recovery method in an embodiment of the present application. The method may be applied to the DPU interface card 100 shown in FIG. 1 or to other applicable DPU interface cards. The following description uses application to the DPU interface card 100 shown in FIG. 1 as an example. The service recovery method shown in FIG. 2 may specifically include:
S201: The DPU interface card 100 applies to the host 200 for a memory area 201 and obtains a first address identifier of the allocated memory area 201.
For example, the first address identifier may be the first address of the memory area 201, or other identification information that indicates the memory area 201, such as its last address, which is not limited in this embodiment.
In this embodiment, the DPU interface card 100 may apply to the host 200 for a memory area in advance, so that the allocated memory area can later be used to store service information related to the services processed by the DPU interface card 100. As an implementation example, the DPU chip 101 in the DPU interface card 100 may send the host 200 a request for a memory area. In response to the request, the host 200 determines a memory area 201 of a preset size from the available memory, allocates it to the DPU interface card 100, and returns the first address identifier of the memory area 201 to the DPU chip 101. In practice, it is also possible that, after the DPU interface card 100 establishes a communication connection with the host 200, the host 200 actively allocates the memory area 201 for the DPU interface card 100 and sends the corresponding first address identifier to the DPU interface card 100.
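A minimal sketch of the request and response in step S201 follows, assuming the first address identifier is the first address (offset) of the allocated region. The class `Host`, the method `allocate_region` and the simple bump allocation scheme are hypothetical; the application does not describe how the host chooses the region.

```python
class Host:
    """Hypothetical host that hands out fixed-size memory areas."""

    def __init__(self, memory_size, region_size):
        self.memory = bytearray(memory_size)  # stands in for host memory
        self.region_size = region_size        # assumed preset size of a memory area
        self._next_free = 0                   # next unallocated offset

    def allocate_region(self):
        """Allocate one memory area and return its first address identifier."""
        start = self._next_free
        if start + self.region_size > len(self.memory):
            raise MemoryError("no available memory area of the preset size")
        self._next_free += self.region_size
        return start
```

A DPU-side caller would keep the returned value and use it as the base address for later writes of service information.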
S202: The DPU interface card 100 stores the first address identifier in a target storage area.
As an implementation example, the DPU interface card 100 may be configured with a target storage area in which stored data is not lost when the DPU interface card 100 restarts software such as the operating system. For example, the DPU interface card 100 may be configured with a complex programmable logic device (CPLD), and the DPU interface card 100 may store the first address identifier in a logic block of the CPLD (that is, the above target storage area). In practice, the target storage area may be implemented by a non-volatile memory, such as at least one of an electrically alterable read-only memory (EAROM), an electrically erasable programmable read-only memory (EEPROM), and a flash memory. Alternatively, the target storage area may be implemented by a volatile memory, such as at least one of a static random access memory (SRAM) and a dynamic random access memory (DRAM).
In other possible implementation examples, the target storage area may be deployed outside the DPU interface card 100. For example, the DPU interface card 100 may be connected to an external non-volatile memory or volatile memory, so that the DPU interface card 100 can write the obtained first address identifier into that external memory.
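The role of the target storage area in step S202 can be sketched as follows: the first address identifier is kept in storage that a software restart does not clear, while the local working memory is cleared. The classes `TargetStorage` and `DpuCard` are hypothetical stand-ins, and modeling a restart as clearing a dictionary is a simplification.

```python
class TargetStorage:
    """Stand-in for a storage area that survives a software restart
    (for example a CPLD logic block or an external memory)."""

    def __init__(self):
        self._value = None

    def write(self, value):
        self._value = value

    def read(self):
        return self._value


class DpuCard:
    """Hypothetical DPU interface card with volatile local state."""

    def __init__(self, target):
        self.target = target
        self.local_memory = {}  # lost when the software restarts

    def save_first_address(self, addr):
        self.local_memory["first_addr"] = addr
        self.target.write(addr)  # persisted copy

    def restart_software(self):
        self.local_memory.clear()  # models loss of local data

    def first_address(self):
        return self.target.read()
```

After `restart_software()`, only the copy in the target storage area remains, which is what allows the card to locate the host memory area again.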
It should be noted that this embodiment does not limit the execution order of step S202 and step S203. For example, in other embodiments, the DPU interface card 100 may execute step S203 first and then step S202, or execute the two steps at the same time.
S203: While assisting the host 200 in processing a service, the DPU interface card 100 stores, according to the first address identifier, the service information generated by processing the service in the memory area 201 of the host 200, where the service information is information generated by the DPU interface card 100 processing the IO sent by the processor of the host 200.
After the configuration of the DPU interface card 100 is complete, the DPU interface card 100 can start assisting the host 200 in processing one or more services. Taking one service as an example, the processor in the host 200 may place the input/output (IO) corresponding to the service in the send queue for storage (the processor may send one or more IOs). The DPU interface card 100 can then read the IO from the send queue of the host 200 and parse and execute it, and the data obtained from parsing and executing the IO may be buffered in the memory of the DPU interface card 100. At the same time, the DPU interface card 100 may store the IO-related information generated while executing the IO in the memory area 201 of the host 200. This IO-related information is the service information and includes, for example, the IO execution stage and the key state of the IO execution. In this way, even if the service data stored in the memory of the DPU interface card 100 is lost, the DPU interface card 100 can later read service information such as the IO execution stage and the key state of the IO execution from the memory area 201 and use it to continue executing the IO, thereby recovering the service. This also avoids re-executing the IO from the beginning and reduces the service recovery latency.
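The following sketch shows one way the service information from step S203 might be written to and read back from the host memory area. The JSON encoding, the 2-byte length prefix, the constant `REGION_SIZE` and the field names `io_id`, `stage` and `bytes_done` are all assumptions for illustration; the application only states that the IO execution stage and key state are stored.

```python
import json

REGION_SIZE = 256  # assumed preset size of memory area 201, in bytes


def save_service_info(host_memory, first_addr, info):
    """Write service information into the memory area at first_addr."""
    payload = json.dumps(info).encode()
    if len(payload) + 2 > REGION_SIZE:
        raise ValueError("service information does not fit in the memory area")
    # 2-byte little-endian length prefix, followed by the payload.
    host_memory[first_addr:first_addr + 2] = len(payload).to_bytes(2, "little")
    host_memory[first_addr + 2:first_addr + 2 + len(payload)] = payload


def load_service_info(host_memory, first_addr):
    """Read service information back; return None if nothing was stored."""
    length = int.from_bytes(host_memory[first_addr:first_addr + 2], "little")
    if length == 0:
        return None
    raw = bytes(host_memory[first_addr + 2:first_addr + 2 + length])
    return json.loads(raw.decode())
```

After a software restart, the card would call `load_service_info` with the first address identifier it kept in the target storage area and continue the IO from the recorded stage instead of starting over.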
In a further possible implementation, the DPU interface card 100 may store at least part of the service information generated during IO execution in the memory area 201, to reduce the resources the DPU interface card 100 consumes in processing the service. For example, in the initial stage of executing an IO, the DPU interface card 100 need not store the current execution stage and the key state of the IO execution in the memory of the host 200. Correspondingly, if the service later needs to be recovered based on this IO, the DPU interface card 100 can re-execute the IO. Because the DPU interface card 100 had not started, or had only just started, executing the IO before software such as the operating system was restarted, re-executing the IO during service recovery has little effect on the recovery latency. When the execution of the IO reaches a middle or late stage, the DPU interface card 100 may store information such as the key state and the execution stage of the IO in the memory area 201 of the host 200. If the service later needs to be recovered based on this IO, the DPU interface card 100 can continue executing the IO according to the information saved in the host 200 without re-executing it, which reduces the service recovery latency.
Alternatively, the DPU interface card 100 may determine, according to the IO size, whether to store the service information generated by executing the IO in the memory area 201. For example, when the size of the IO read by the DPU interface card 100 from the send queue does not exceed a preset threshold, the DPU interface card 100 need not send the service information generated by processing the IO to the memory of the host 200 for storage while executing the IO. In this case, even if the DPU interface card 100 recovers the service by re-executing the IO, the cost is relatively small. When the size of the IO read from the send queue exceeds the preset threshold, the DPU interface card 100 may store service information such as the key state and the execution stage of the IO in the memory of the host 200, to avoid re-executing the IO and to reduce the service recovery latency. In practice, the DPU interface card 100 may also consider the IO size, the IO execution progress, and other factors together when determining whether to send the service information generated by executing the IO to the memory of the host 200 for storage.
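A combined policy of the kind mentioned above can be sketched as a single predicate. The thresholds `SIZE_THRESHOLD` and `PROGRESS_THRESHOLD`, and the specific rule that both conditions must hold, are assumptions for illustration; the application only says that IO size and execution progress may be considered together.

```python
SIZE_THRESHOLD = 64 * 1024   # assumed preset IO size threshold, in bytes
PROGRESS_THRESHOLD = 0.5     # assumed boundary between the initial and later stages


def should_store_service_info(io_size, progress):
    """Decide whether to store service information in host memory.

    io_size: size of the IO in bytes.
    progress: fraction of the IO already executed, from 0.0 to 1.0.
    """
    if io_size <= SIZE_THRESHOLD:
        # Small IO: re-executing it after a restart is cheap.
        return False
    # Large IO: store only once execution has passed the initial stage.
    return progress >= PROGRESS_THRESHOLD
```

Skipping the store for small or barely started IOs trades a slightly longer recovery in rare cases for fewer writes to host memory during normal operation.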
In practice, while executing the IO, the DPU interface card 100 may also send the IO execution results obtained so far to the memory of the host 200 for storage. In this way, when processing of the IO is interrupted because the operating system of the DPU interface card 100 restarts, the DPU interface card 100 can continue processing the IO from the point of interruption according to the IO execution results and the above service information stored in the memory of the host 200, which further reduces the service recovery latency.
S204: When the condition for restarting the software is met, the DPU interface card 100 restarts the software.
In this embodiment, the software in the DPU restarts when the restart condition is met, and the restart affects how the DPU interface card 100 processes or recovers services. For example, the software may be the operating system in the DPU interface card 100 or other software. For ease of understanding, the following description uses the operating system as an example of the software.
In practice, the DPU interface card 100 may restart the operating system in certain scenarios. As some examples, the conditions for restarting the operating system may include the following:
Example 1: A fault is detected in the memory of the DPU interface card 100.
In a specific implementation, the DPU interface card 100 may detect, in real time or periodically, whether its memory has failed, for example by detecting that a failure of at least one row, column, or bank of the memory has caused uncorrectable errors (uncorrected errors, UCE) during data access (or by detecting another type of fault), and report the location of the memory fault to the memory supervision unit 104. Based on the fault location, the memory supervision unit 104 may isolate the faulty part of the memory or replace the failed part, and trigger the soft reset unit 103 to perform a soft reset. The soft reset unit 103 may then reset the hardware units in the DPU interface card 100 (such as the DPU chip 101) and restart the operating system of the DPU interface card 100.
The soft reset unit 103 may reset all hardware units in the DPU interface card 100, in which case the PCIe link between the DPU interface card 100 and the host 200 is disconnected. In another implementation, the soft reset unit 103 may reset the hardware units other than the PCIe core. Because the PCIe core, which is used to establish the PCIe link with the host, is not reset, it stays connected to the host 200, and the PCIe link between the DPU interface card 100 and the host 200 is not disconnected.
After the soft reset unit 103 has reset the hardware units, the microkernel unit 105 may initialize them and restart the service components in the system service component unit 102, thereby starting the system services of the kernel, such as the kernel driver service, file system service, memory management service, and network protocol service.
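The soft reset flow described above (reset the hardware units, optionally sparing the PCIe core, then initialize them and start the system services) can be sketched as an ordered list of steps. The unit names in `HARDWARE_UNITS` are hypothetical placeholders; only the PCIe core and the four service names come from the description.

```python
from typing import List

# Hypothetical hardware unit names, for illustration only.
HARDWARE_UNITS = ["dpu_chip", "memory_controller", "pcie_core"]

# System services named in the description.
SYSTEM_SERVICES = ["kernel_driver", "file_system",
                   "memory_management", "network_protocol"]


def units_to_reset(keep_pcie_link: bool) -> List[str]:
    """Select the hardware units to reset; skip the PCIe core to keep the link up."""
    if keep_pcie_link:
        return [u for u in HARDWARE_UNITS if u != "pcie_core"]
    return list(HARDWARE_UNITS)


def soft_reset_steps(keep_pcie_link: bool) -> List[str]:
    """Return the ordered steps: reset units, initialize them, start services."""
    units = units_to_reset(keep_pcie_link)
    steps = [f"reset:{u}" for u in units]            # soft reset unit
    steps += [f"init:{u}" for u in units]            # microkernel unit
    steps += [f"start:{s}" for s in SYSTEM_SERVICES]  # service components
    return steps
```

Calling `soft_reset_steps(keep_pcie_link=True)` never touches `pcie_core`, which models the variant in which the host-facing link stays connected during the restart.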
Note that Example 1 uses a detected memory fault as the trigger for restarting the operating system. In practice, when another fault that requires an operating system restart is detected in the DPU interface card 100 (for example, an application runtime error in the DPU interface card 100), the DPU interface card 100 may likewise be triggered to restart the operating system.
Example 2: An upgrade of the previous version of the operating system of the DPU interface card 100 has completed.
Specifically, the host 200 may generate an upgrade instruction for the previous version of the operating system of the DPU interface card 100 and send it to the DPU interface card 100, which then performs the operating system upgrade procedure according to the received instruction. For example, based on the upgrade instruction, the DPU interface card 100 may read the new version of the operating system from the host 200 and replace the previous version with it; after determining that the upgrade is complete, the DPU interface card 100 may start and run the new version of the operating system.
The host 200 may issue upgrade instructions periodically, so that the operating system of the DPU interface card 100 is updated periodically; or the host 200 may generate an upgrade instruction in response to a user's upgrade operation on the operating system of the DPU interface card 100 and deliver it to the DPU interface card 100.
Example 3: An instruction to restart the operating system is received.
Specifically, the host 200 may generate a restart instruction in response to a user's restart operation on the operating system of the DPU interface card 100 and send it to the DPU interface card 100, which restarts and runs the operating system after receiving the instruction.
The above ways of triggering the DPU interface card 100 to restart the operating system are only examples. In other embodiments, the DPU interface card 100 may restart the operating system when other conditions are met; for example, a runtime error of the operating system may automatically trigger its restart. Alternatively, when the software to be restarted is software other than the operating system, the DPU interface card 100 may restart that software in a similar manner. This embodiment imposes no limitation in this regard.
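The restart conditions listed in Examples 1 to 3, plus the runtime-error case, can be summarized in a small sketch. The enum, the function name, and in particular the priority order among simultaneous triggers are assumptions made here for illustration; the description does not rank the conditions.

```python
from enum import Enum
from typing import Optional


class RestartReason(Enum):
    MEMORY_FAULT = "memory fault detected"            # Example 1
    UPGRADE_COMPLETE = "OS upgrade completed"         # Example 2
    RESTART_INSTRUCTION = "restart instruction"       # Example 3
    RUNTIME_ERROR = "OS runtime error"                # other possible condition


def restart_reason(memory_fault: bool = False,
                   upgrade_complete: bool = False,
                   restart_instruction: bool = False,
                   runtime_error: bool = False) -> Optional[RestartReason]:
    """Return the first satisfied restart condition, or None if none is met.

    The checking order is an assumption, not something the description specifies.
    """
    checks = [
        (memory_fault, RestartReason.MEMORY_FAULT),
        (upgrade_complete, RestartReason.UPGRADE_COMPLETE),
        (restart_instruction, RestartReason.RESTART_INSTRUCTION),
        (runtime_error, RestartReason.RUNTIME_ERROR),
    ]
    for satisfied, reason in checks:
        if satisfied:
            return reason
    return None
```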
After the DPU interface card 100 restarts the operating system, the service data temporarily held in the memory of the DPU interface card 100 is lost, and the DPU interface card 100 may therefore stop processing services. For this reason, in this embodiment the DPU interface card 100 goes on to perform the following steps to resume the interrupted services.
S205: After restarting the software, the DPU interface card 100 obtains the service information stored in the memory area 201.
In a possible implementation, after the operating system or other software is restarted, the DPU chip 101 in the DPU interface card 100 may obtain a first address identifier from a target storage area. The first address identifier (for example, the start address of the memory area 201) indicates the memory area 201 that the DPU interface card 100 previously requested from the host 200. The DPU chip 101 can then access the memory area 201 of the host according to the first address identifier and read the service information stored there.
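The store-then-read-back path through the first address identifier can be modeled with a toy host memory. Everything here is an illustrative assumption: the `HostMemory` class stands in for host memory reached over PCIe, and the length-prefixed JSON layout is simply one convenient encoding, not a format given in the description.

```python
import json
import struct


class HostMemory:
    """Toy stand-in for host memory that the DPU reaches over the PCIe link."""

    def __init__(self, size: int) -> None:
        self._buf = bytearray(size)

    def write(self, addr: int, data: bytes) -> None:
        if addr < 0 or addr + len(data) > len(self._buf):
            raise ValueError("write outside host memory")
        self._buf[addr:addr + len(data)] = data

    def read(self, addr: int, length: int) -> bytes:
        if addr < 0 or addr + length > len(self._buf):
            raise ValueError("read outside host memory")
        return bytes(self._buf[addr:addr + length])


def store_service_info(host: HostMemory, first_addr: int, info: dict) -> None:
    """Before the restart: write service info as a 4-byte length plus JSON."""
    payload = json.dumps(info).encode("utf-8")
    host.write(first_addr, struct.pack("<I", len(payload)) + payload)


def load_service_info(host: HostMemory, first_addr: int) -> dict:
    """After the restart: read the service info back from the memory area."""
    (length,) = struct.unpack("<I", host.read(first_addr, 4))
    return json.loads(host.read(first_addr + 4, length).decode("utf-8"))
```

In this model the first address identifier would have to be kept somewhere that survives the restart (the "target storage area" of the description); the sketch leaves that as a plain integer passed by the caller.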
S206: The DPU interface card 100 restores the service according to the obtained service information.
For example, the obtained service information may be data generated while the DPU interface card 100 was executing IOs that had not yet completed. The DPU chip 101 may obtain the uncompleted IOs from the host 200 and, based on the current execution stage and key execution state of each IO stored in the memory area 201, continue executing the IO from its current execution stage, thereby resuming service processing. Further, based on the IO execution results stored in the memory area 201, the DPU chip 101 may continue executing the IO from the point of interruption within the current execution stage. For an IO that the DPU interface card 100 had not yet executed, or had only just begun executing, before the operating system restart, the memory area 201 may hold no information about that IO; in this case, the DPU chip 101 may simply re-execute the IO.
In a further possible implementation, the service information stored in the memory area 201 may cover only some of the uncompleted IOs, for example the execution stage of an IO and the execution results for that stage, while for other IOs that the DPU interface card 100 had started but not finished, no related information may have been stored in the memory area 201. Therefore, after obtaining an uncompleted IO from the send queue of the host 200, the DPU interface card 100 may check whether the service information stored in the memory area 201 includes information related to that IO. If such information is found, the DPU interface card 100 may continue executing the IO from the point of interruption based on it; if not, the DPU interface card 100 may re-execute the IO.
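The lookup-then-resume-or-re-execute rule can be sketched as a planning function. The record layout (a dictionary keyed by IO identifier with a `stage` field) and the action labels are hypothetical; the description only says that the stored information may include the execution stage and related results.

```python
from typing import Dict, Iterable, Optional, Tuple


def plan_recovery(unfinished_io_ids: Iterable[int],
                  service_info: Dict[int, dict]
                  ) -> Dict[int, Tuple[str, Optional[str]]]:
    """Decide, per unfinished IO, whether to resume or re-execute.

    service_info maps an IO id to the record read back from the memory area;
    IOs with no record are re-executed from the beginning.
    """
    plan: Dict[int, Tuple[str, Optional[str]]] = {}
    for io_id in unfinished_io_ids:
        record = service_info.get(io_id)
        if record is None:
            plan[io_id] = ("re-execute", None)
        else:
            plan[io_id] = ("resume", record["stage"])
    return plan
```

For instance, with a record for IO 1 only, IO 1 is resumed from its recorded stage and IO 2 is re-executed.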
In this way, even if the software of the DPU interface card 100 restarts, for example because of a memory fault in the DPU interface card 100 or a software upgrade, and services are interrupted as a result, the DPU interface card 100 can quickly restore them using the service information stored in the memory of the host 200, reducing the impact on services.
Further, when resetting its hardware units, the DPU interface card 100 may leave the PCIe core unreset, so that it keeps the PCIe link with the host 200 connected through the PCIe core and the link between the DPU interface card 100 and the host 200 is not disconnected. As a result, while the faulty memory of the DPU interface card 100 is being repaired or its operating system is being upgraded, the host 200 need not be aware of changes in the fault state or upgrade state of the DPU interface card 100, which reduces the impact on the host 200. In addition, the service recovery process of the DPU interface card 100 places few requirements on hardware and operating systems and is compatible with many types of computing devices and operating systems, which makes the solution more widely applicable.
In the above embodiment, when a memory fault occurs, the DPU interface card 100 may directly trigger the soft reset unit 103 to reset the hardware units and restart software such as the operating system. In other possible embodiments, the DPU interface card 100 may instead repair the fault by restarting only some of the service components in the kernel of the operating system. In a possible implementation, when the DPU interface card 100 detects a memory fault, it may further determine whether the faulty memory meets a preset condition. When the faulty memory meets the preset condition, the DPU interface card 100 restarts the service components in the kernel of the operating system that use the faulty memory and determines the IO corresponding to the data stored in the faulty memory; it then restores service operation by re-executing that IO, or by continuing to execute the IO based on the information about that IO stored in the memory area 201. In this way, the DPU interface card 100 can repair the fault without restarting the entire operating system or reconfiguring the DPU interface card 100, which lowers the cost of fault repair. When the faulty memory does not meet the preset condition, the DPU interface card 100 may resume the interrupted service in the manner of the embodiment shown in FIG. 2. Different handling can thus be chosen according to the memory fault condition of the DPU interface card 100, which makes repair of faulty memory more flexible.
As some implementation examples, the preset condition may be that the size of the faulty memory does not exceed a preset size, for example that the number of faulty rows (or columns) in the memory does not exceed a preset number of rows (or columns). In this case, the faulty part of the memory has relatively little impact on the DPU interface card 100, so the DPU interface card 100 can repair the fault without restarting the entire operating system.
Alternatively, the preset condition may be that the system component using the faulty memory is a preset system component. When the faulty memory affects such a specific system component, the DPU interface card 100 may isolate or replace the faulty part of the memory and restart that system component to repair the fault.
Alternatively, the preset condition may be that the number of system components using the faulty memory does not exceed a preset number. In this case, the faulty memory affects only a small number of service components and leaves the others unaffected, so the DPU interface card 100 only needs to restart the affected service components, rather than the entire operating system (or other software) and all service components, to repair the fault. In practice, the preset condition may also be another condition; this embodiment imposes no limitation in this regard.
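The three alternative preset conditions above can each be written as a predicate over a fault description, with the recovery path chosen by whichever predicate is configured. This is a sketch under stated assumptions: the `MemoryFault` type, the numeric limits, and the set of preset components are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class MemoryFault:
    """Illustrative description of a detected memory fault."""
    faulty_rows: int
    components: FrozenSet[str]  # service components using the faulty memory


# Hypothetical preset values; the description only calls them "preset".
MAX_FAULTY_ROWS = 4
PRESET_COMPONENTS = frozenset({"file_system", "network_protocol"})
MAX_COMPONENTS = 2


def small_fault(fault: MemoryFault) -> bool:
    """Condition 1: the faulty region does not exceed a preset size."""
    return fault.faulty_rows <= MAX_FAULTY_ROWS


def preset_component(fault: MemoryFault) -> bool:
    """Condition 2: only preset system components use the faulty memory."""
    return bool(fault.components) and fault.components <= PRESET_COMPONENTS


def few_components(fault: MemoryFault) -> bool:
    """Condition 3: the number of affected components is within a preset limit."""
    return len(fault.components) <= MAX_COMPONENTS


def recovery_action(fault: MemoryFault,
                    condition: Callable[[MemoryFault], bool] = small_fault) -> str:
    """Restart only the affected components if the condition holds, else the OS."""
    return "restart_components" if condition(fault) else "restart_os"
```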
In addition, in practical scenarios the DPU interface card 100 may be configured in advance so that it can communicate normally with the host 200. For example, the DPU interface card 100 and the host 200 may be preconfigured with a common data communication format, communication protocol version, command parsing rules, and so on, and the DPU interface card 100 may be configured with a send queue (SQ) and a completion queue (CQ) in the host memory. The send queue stores at least one IO that the processor in the host 200 sends to the DPU interface card 100 for service processing, and the completion queue stores the execution results that the DPU interface card 100 returns for those IOs. Because the DPU interface card 100 may lose its original configuration after restarting the software, in a further possible implementation the DPU interface card 100 may, after obtaining the first address identifier, also store the configuration information used to configure the DPU interface card 100 in the memory area 201 according to the first address identifier. The DPU interface card 100 may be configured manually by a technician in advance, in which case the DPU interface card 100 generates a corresponding configuration file based on the technician's configuration operations and sends it to the memory area 201. Alternatively, after the DPU interface card 100 establishes a communication connection with the host 200, the host 200 may generate the configuration file, use it to configure the DPU interface card 100 automatically, and write it into the memory area 201. This embodiment imposes no limitation in this regard.
In this way, when the software is restarted (for example, because of a memory fault in the DPU interface card 100 or a software version upgrade) and the earlier configuration of the DPU interface card 100 is lost, the DPU interface card 100 can obtain the configuration information from the memory area 201 and use it to reconfigure itself, for example its communication data format and instruction parsing rules, to ensure normal communication with the host 200. The configuration information also includes a second address identifier of the send queue and a third address identifier of the completion queue in the host 200. After being reconfigured, the DPU interface card 100 can therefore access the send queue of the host 200 according to the second address identifier and obtain from it the IOs that the DPU interface card 100 has not finished executing; these IOs were delivered to the send queue by the processor in the host 200. Correspondingly, the DPU interface card 100 can send the final result of executing an IO to the completion queue of the host 200 according to the third address identifier, so that processing of the interrupted service is resumed.
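Restoring the configuration and locating the queues through the second and third address identifiers can be sketched as follows. Host memory is modeled as a dictionary from address to object, and an IO is treated as unfinished when the completion queue holds no entry for it; both the model and that criterion are assumptions made here, as are the `DpuConfig` field names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DpuConfig:
    """Illustrative configuration kept in the memory area of the host."""
    data_format: str
    protocol_version: str
    sq_addr: int  # second address identifier (send queue)
    cq_addr: int  # third address identifier (completion queue)


def pending_ios(host_memory: Dict[int, object], region_addr: int) -> List[int]:
    """After a restart: reload the config, then list IOs with no CQ entry."""
    config = host_memory[region_addr]
    send_queue = host_memory[config.sq_addr]
    completion_queue = host_memory[config.cq_addr]
    done = {entry["io_id"] for entry in completion_queue}
    return [io["io_id"] for io in send_queue if io["io_id"] not in done]


def post_result(host_memory: Dict[int, object], config: DpuConfig,
                io_id: int, status: str) -> None:
    """Write the final result of an IO to the completion queue."""
    host_memory[config.cq_addr].append({"io_id": io_id, "status": status})
```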
The service recovery method provided by the embodiments of this application has been described above with reference to FIG. 1 and FIG. 2. The functions of the data processing unit (DPU) device provided by the embodiments of this application, and the computing device that implements the data processing unit, are described next with reference to the accompanying drawings.
FIG. 3 is a schematic structural diagram of a data processing unit (DPU) device. The DPU device 300 shown in FIG. 3 is coupled to a host (not shown in FIG. 3) and includes:
an obtaining module 301, configured to obtain, after the software of the DPU device 300 is restarted, service information stored in the memory of the host, where the service information is information generated by the DPU device 300, before the software restart, in processing input/output (IO) sent by the processor of the host; and
a recovery module 302, configured to restore the service according to the service information.
Optionally, the restarted software may be, for example, an operating system, a component in the kernel of an operating system, or software other than an operating system.
In a possible implementation, the DPU device 300 is coupled to the host over a peripheral component interconnect express (PCIe) bus, and the PCIe link between the DPU device 300 and the host is not disconnected while the software is restarting.
In a possible implementation, the software restart of the DPU device 300 is triggered by a fault in the DPU device 300 or by an upgrade of a previous version of the software of the DPU device 300.
In a possible implementation, the DPU device 300 further includes a startup module 303, configured to restart the software when the memory of the DPU device 300 fails.
In a possible implementation, the startup module 303 is configured to restart the operating system of the DPU device 300 when the memory of the DPU device 300 fails and the faulty memory does not meet a preset condition.
In a possible implementation, the startup module 303 is further configured to restart, when the memory of the DPU device 300 fails and the faulty memory meets the preset condition, the service components in the kernel of the operating system of the DPU device 300 that use the faulty memory.
In a possible implementation, the obtaining module 301 is configured to:
obtain a first address identifier, where the first address identifier identifies a memory area in the host, and the memory area stores the service information; and
access the memory area according to the first address identifier to obtain the service information.
In a possible implementation, the DPU device 300 further includes:
a requesting module 304, configured to request the memory area from the host before the software is restarted and obtain the first address identifier of the memory area; and
a storage module 305, configured to store the service information in the memory area according to the first address identifier.
In a possible implementation, the DPU device 300 may also obtain configuration information from the memory area, where the configuration information is used to configure the DPU device 300 and includes a second address identifier of a send queue and a third address identifier of a completion queue in the memory of the host; the send queue stores the IO, and the completion queue stores the result of the DPU device 300 executing the IO.
Because the DPU device 300 shown in FIG. 3 can implement the method shown in FIG. 2, for its specific implementation and technical effects, refer to the related descriptions in the foregoing embodiments; details are not repeated here.
The DPU device 300 shown in FIG. 3 may be implemented by an application-specific integrated circuit, by a general-purpose CPU together with an application-specific integrated circuit, by software, or by a combination of software and hardware; the embodiments of the present invention impose no limitation in this regard.
FIG. 4 shows a computing device. As shown in FIG. 4, the computing device 400 includes a host 401 and a DPU interface card 402, which are coupled through a bus 403. The bus 403 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. Buses can be classified as address buses, data buses, control buses, and so on. For ease of illustration, the bus is drawn as a single thick line in FIG. 4, but this does not mean that there is only one bus or one type of bus.
The host 401 includes a memory 4011 and a processor 4012, which may be coupled through a bus 4013.
The bus 4013 may be a PCI bus, an EISA bus, or the like. Buses can be classified as address buses, data buses, control buses, and so on. For ease of illustration, the bus is drawn as a single thick line in FIG. 4, but this does not mean that there is only one bus or one type of bus.
The processor 4012 may be any one or more of processors such as a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
The memory 4011 may be implemented by a storage device that includes volatile memory, such as random access memory (RAM). The storage device may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
The DPU interface card 402 may be used to implement the method performed by the DPU interface card 100 in the embodiment shown in FIG. 2.
The computing device 400 may be a server, a storage array, or a distributed storage system.
In addition, an embodiment of this application provides a DPU interface card. FIG. 5 is a schematic structural diagram of a DPU interface card. As shown in FIG. 5, the DPU interface card 500 includes a printed circuit board 501, an interface 502, and a DPU chip 503, and communicates with the host through the interface 502. The interface 502 and the DPU chip 503 are mounted on the printed circuit board 501. The interface 502 and the DPU chip 503 may communicate through traces on the printed circuit board, through a cable, or through a bus, or they may be integrated together, for example by being packaged in a single chip. The DPU interface card 500 is used to implement the service recovery method performed by the DPU interface card 100 in the embodiment shown in FIG. 2. For the specific implementation of the printed circuit board 501, the interface 502, and the DPU chip 503, refer to the printed circuit board 1011, the interface 1012, and the DPU chip 101 in the foregoing embodiments; details are not repeated here.
An embodiment of this application further provides a DPU chip. FIG. 6 is a schematic structural diagram of a DPU chip. As shown in FIG. 6, the DPU chip 600 is used in a DPU interface card coupled to a host (not shown in FIG. 6), such as the DPU interface card 100 in the foregoing embodiments. The DPU chip 600 includes an obtaining circuit 601 and a processing circuit 602. The obtaining circuit 601 implements the data obtaining function of the DPU chip 600, such as obtaining the service information stored in the memory of the host, where the service information is information generated by the processing circuit 602, before the software on the DPU interface card is restarted, in processing input/output (IO) sent by the processor of the host. The processing circuit 602 implements the data processing function of the DPU chip 600, such as restoring the service according to the service information obtained by the obtaining circuit 601. In a specific implementation, the DPU chip 600 may be an application-specific integrated circuit (ASIC).
The obtaining circuit 601 and the processing circuit 602 cooperate to implement the service recovery method performed by the DPU chip 101 of the DPU interface card 100 in the embodiment shown in FIG. 2.
An embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium may be any usable medium that a computing device can store data on, or a data storage device, such as a data center, that contains one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state drive). The computer-readable storage medium includes instructions that instruct a computing device to perform the above service recovery method.
An embodiment of this application further provides a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computing device, the procedures or functions according to the embodiments of this application are generated in whole or in part.
The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, or data center to another website, computer, or data center in a wired manner (for example, over coaxial cable, optical fiber, or a digital subscriber line (DSL)) or in a wireless manner (for example, infrared, radio, or microwave).
The computer program product may be a software installation package. When any one of the foregoing service recovery methods needs to be used, the computer program product may be downloaded and executed on a computing device.
The descriptions of the procedures or structures corresponding to the foregoing drawings each have their own emphasis. For a part that is not described in detail in one procedure or structure, refer to the related descriptions of the other procedures or structures.
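As a further illustration, the choice of restart scope on a memory fault of the DPU interface card (restarting only the affected service component when the failed memory satisfies a preset condition, and otherwise restarting the operating system of the DPU interface card) can be sketched as follows. This is a minimal sketch under stated assumptions: the names `RestartScope` and `choose_restart_scope` are hypothetical, and the preset condition is passed in as a plain boolean because its content is not specified in this section.

```python
from enum import Enum


class RestartScope(Enum):
    NONE = "no restart"
    SERVICE_COMPONENT = "service component that uses the failed memory"
    OPERATING_SYSTEM = "operating system of the DPU interface card"


def choose_restart_scope(memory_failed: bool, meets_preset_condition: bool) -> RestartScope:
    """Pick what to restart; the preset condition itself is an assumption-free input."""
    if not memory_failed:
        return RestartScope.NONE
    if meets_preset_condition:
        # Narrow restart: only the kernel service component using the failed memory.
        return RestartScope.SERVICE_COMPONENT
    # Otherwise fall back to restarting the whole operating system.
    return RestartScope.OPERATING_SYSTEM
```

In either restart case, the service information held in host memory remains available for recovery once the restarted software is running again.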

Claims (19)

  1. A service recovery method, characterized in that the method comprises:
    after software of a data processing unit (DPU) interface card is restarted, acquiring, by the DPU interface card, service information stored in a memory of a host, wherein the service information is information generated by the DPU interface card in processing a request sent by a processor of the host before the software is restarted;
    recovering, by the DPU interface card, a service according to the service information.
  2. The method according to claim 1, characterized in that the DPU interface card is coupled to the host through a Peripheral Component Interconnect Express (PCIe) bus, and the PCIe link between the DPU interface card and the host is not disconnected while the software is being restarted.
  3. The method according to claim 1 or 2, characterized in that the restart of the software of the DPU interface card is triggered by a fault of the DPU interface card, or by an upgrade of a previous version of the software of the DPU interface card.
  4. The method according to claim 3, characterized in that the restarting of the software by the DPU interface card comprises:
    when a memory of the DPU interface card fails, restarting, by the DPU interface card, the software.
  5. The method according to claim 4, characterized in that the restarting, by the DPU interface card, of the software when the memory of the DPU interface card fails comprises:
    when the memory of the DPU interface card fails and the failed memory does not satisfy a preset condition, restarting, by the DPU interface card, an operating system of the DPU interface card.
  6. The method according to claim 4, characterized in that the method further comprises:
    when the memory of the DPU interface card fails and the failed memory satisfies a preset condition, restarting, by the DPU interface card, a service component that uses the failed memory in a kernel of an operating system of the DPU interface card.
  7. The method according to any one of claims 1 to 6, characterized in that the acquiring, by the DPU interface card, of the service information stored in the memory of the host comprises:
    acquiring, by the DPU interface card, a first address identifier, wherein the first address identifier identifies a memory region in the host, and the memory region stores the service information;
    accessing, by the DPU interface card, the memory region according to the first address identifier to obtain the service information.
  8. The method according to claim 7, characterized in that, before the software is restarted, the method further comprises:
    requesting, by the DPU interface card, the memory region from the host, and acquiring the first address identifier of the memory region;
    storing, by the DPU interface card, the service information in the memory region according to the first address identifier.
  9. A data processing unit (DPU) apparatus, characterized in that the DPU apparatus comprises:
    an acquisition module, configured to acquire service information stored in a memory of a host after software of the DPU apparatus is restarted;
    a recovery module, configured to recover a service according to the service information, wherein the service information is information generated by the DPU apparatus in processing a request sent by a processor of the host before the software is restarted.
  10. The DPU apparatus according to claim 9, characterized in that the DPU apparatus is coupled to the host through a Peripheral Component Interconnect Express (PCIe) bus, and the PCIe link between the DPU apparatus and the host is not disconnected while the software is being restarted.
  11. The DPU apparatus according to claim 9 or 10, characterized in that the restart of the software of the DPU apparatus is triggered by a fault of the DPU apparatus, or by an upgrade of a previous version of the software of the DPU apparatus.
  12. The DPU apparatus according to claim 11, characterized in that the DPU apparatus further comprises a startup module, configured to restart the software when a memory of the DPU apparatus fails.
  13. The DPU apparatus according to claim 12, characterized in that the startup module is configured to restart an operating system of the DPU apparatus when the memory of the DPU apparatus fails and the failed memory does not satisfy a preset condition.
  14. The DPU apparatus according to claim 12, characterized in that the startup module is further configured to, when the memory of the DPU apparatus fails and the failed memory satisfies a preset condition, restart a service component that uses the failed memory in a kernel of an operating system of the DPU apparatus.
  15. The DPU apparatus according to any one of claims 9 to 14, characterized in that the acquisition module is configured to:
    acquire a first address identifier, wherein the first address identifier identifies a memory region in the host, and the memory region stores the service information;
    access the memory region according to the first address identifier to obtain the service information.
  16. The DPU apparatus according to claim 15, characterized in that the DPU apparatus further comprises:
    a request module, configured to, before the software is restarted, request the memory region from the host and acquire the first address identifier of the memory region;
    a storage module, configured to store the service information in the memory region according to the first address identifier.
  17. A computing device, characterized in that the computing device comprises a host and a data processing unit (DPU) interface card, and the host comprises a memory and a processor;
    the DPU interface card is configured to, after software of the DPU interface card is restarted, acquire service information stored in the memory of the host and recover a service according to the service information, wherein the service information is information generated by the DPU interface card in processing a request sent by the processor of the host before the software is restarted.
  18. A data processing unit (DPU) interface card, characterized by comprising a printed circuit board, an interface, and a DPU chip, wherein the DPU interface card communicates with a host through the interface, the interface and the DPU chip are mounted on the printed circuit board, and the DPU chip is configured to, after software of the DPU interface card is restarted, acquire service information stored in a memory of the host and recover a service according to the service information, wherein the service information is information generated by the DPU chip in processing a request sent by a processor of the host before the software is restarted.
  19. A data processing unit (DPU) chip, characterized in that the DPU chip is applied to a DPU interface card, and the DPU interface card is coupled to a host; the DPU chip comprises an acquisition circuit and a processing circuit;
    wherein the acquisition circuit is configured to acquire service information stored in a memory of the host after software of the DPU interface card is restarted;
    the processing circuit is configured to recover a service according to the service information, wherein the service information is information generated by the processing circuit in processing a request sent by a processor of the host before the software is restarted.
PCT/CN2022/139182 2021-12-16 2022-12-15 Service recovery method, data processing unit and related device WO2023109880A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202111540861.5 2021-12-16
CN202111540861 2021-12-16
CN202210269274.5 2022-03-18
CN202210269274.5A CN116266150A (en) 2021-12-16 2022-03-18 Service recovery method, data processing unit and related equipment

Publications (1)

Publication Number Publication Date
WO2023109880A1 true WO2023109880A1 (en) 2023-06-22

Family

ID=86744086

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/139182 WO2023109880A1 (en) 2021-12-16 2022-12-15 Service recovery method, data processing unit and related device

Country Status (2)

Country Link
CN (1) CN116266150A (en)
WO (1) WO2023109880A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116795605B (en) * 2023-08-23 2023-12-12 珠海星云智联科技有限公司 Automatic recovery system and method for abnormality of peripheral device interconnection extension equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140146660A1 (en) * 2011-08-16 2014-05-29 Hangzhou H3C Technologies Co., Ltd. Restarting a line card
US20180167448A1 (en) * 2016-12-13 2018-06-14 International Business Machines Corporation Self-Recoverable Multitenant Distributed Clustered Systems
CN111078465A (en) * 2019-11-08 2020-04-28 苏州浪潮智能科技有限公司 Data recovery method and device and computer readable storage medium
CN113722147A (en) * 2020-05-26 2021-11-30 华为技术有限公司 Method for keeping service connection and related equipment

Also Published As

Publication number Publication date
CN116266150A (en) 2023-06-20

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22906625

Country of ref document: EP

Kind code of ref document: A1