WO2023186143A1 - A data processing method, host, and related device - Google Patents

A data processing method, host, and related device

Info

Publication number
WO2023186143A1
WO2023186143A1 · PCT/CN2023/085690 · CN2023085690W
Authority
WO
WIPO (PCT)
Prior art keywords
host
data
chip
peripheral
memory
Prior art date
Application number
PCT/CN2023/085690
Other languages
English (en)
French (fr)
Inventor
刘鸿彬
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2023186143A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14 Handling requests for interconnection or transfer
    • G06F 13/20 Handling requests for interconnection or transfer for access to input/output bus
    • G06F 13/28 Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/38 Information transfer, e.g. on bus
    • G06F 13/42 Bus transfer protocol, e.g. handshake; Synchronisation
    • G06F 13/4282 Bus transfer protocol, e.g. handshake; Synchronisation on a serial bus, e.g. I2C bus, SPI bus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/163 Interprocessor communication
    • G06F 15/167 Interprocessor communication using a common memory, e.g. mailbox

Definitions

  • the present application relates to the field of computers, and in particular, to a data processing method, a host, and related devices.
  • DMA: direct memory access
  • the upstream and downstream bandwidth of a peripheral chip is usually limited by the DMA data bandwidth. For example, if the bandwidth of the peripheral chip is 200MB/s but the DMA data bandwidth is 100MB/s, the actual bandwidth of the peripheral chip can only reach 100MB/s, which severely limits the peripheral chip's bandwidth and wastes the chip's processing power.
  • This application provides a data processing method, host and related devices to solve the problem of limited peripheral chip bandwidth and wasted chip processing capability.
  • a data processing method is provided.
  • the method is applied to a computing device.
  • the computing device includes a host, a memory and a peripheral chip.
  • the host, the memory and the peripheral chip are coupled through a bus.
  • the method includes the following steps: the host obtains a data processing request.
  • the data processing request includes downstream data.
  • the downstream data is used to indicate the data to be sent to the peripheral chip by the host.
  • the host stores the downstream data into the memory.
  • the host copies the downstream data to the peripheral chip using direct memory access DMA.
  • the host stores the downstream data in the data processing request into the memory, and then copies the downstream data to the peripheral chip through DMA.
  • the peripheral chip no longer needs to process the downstream data, so the data bandwidth of the peripheral chip can be fully utilized, and the data processing bandwidth of the peripheral chip is no longer limited by the DMA bandwidth.
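  • The sketch below illustrates the host-side downlink path just described: store the downstream data into host memory, then let the host's DMA engine copy it to the chip memory. It is a minimal sketch, assuming made-up register addresses, a made-up descriptor layout, and the hypothetical helper name host_send_downlink; none of these come from the patent.

```c
/* Minimal sketch of the host-side downlink path. All physical addresses,
 * register locations, and field layouts are illustrative placeholders. */
#include <stdint.h>
#include <string.h>

#define HOST_BUF_PHYS  0x80000000UL   /* placeholder: buffer in host memory */
#define CHIP_MEM_PHYS  0xC0000000UL   /* placeholder: window into chip memory */

struct dma_desc {
    uint64_t src;   /* source physical address (host memory) */
    uint64_t dst;   /* destination physical address (chip memory) */
    uint32_t len;   /* transfer length in bytes */
    uint32_t ctrl;  /* control flags, e.g. a "valid" bit */
};

/* Pretend MMIO registers of the host's DMA module (in the RC). */
static volatile struct dma_desc *dma_ring = (void *)0xF0000000UL;
static volatile uint32_t *dma_doorbell   = (void *)0xF0001000UL;

static void host_send_downlink(void *host_buf, const void *req_data, uint32_t len)
{
    /* Step 1: store the downstream data from the request into host memory. */
    memcpy(host_buf, req_data, len);

    /* Step 2: build a descriptor mapping host memory to chip memory. */
    dma_ring[0].src  = HOST_BUF_PHYS;
    dma_ring[0].dst  = CHIP_MEM_PHYS;
    dma_ring[0].len  = len;
    dma_ring[0].ctrl = 1;   /* mark descriptor valid */

    /* Step 3: ring the doorbell; the host's DMA engine performs the copy,
     * so the peripheral chip never has to issue a DMA read. */
    *dma_doorbell = 1;
}
```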
  • the host may include at least one general-purpose processor, such as a CPU, an NPU, or a combination of a CPU and a hardware chip.
  • the above-mentioned hardware chip is an application-specific integrated circuit (Application-Specific Integrated Circuit, ASIC), a programmable logic device (Programmable Logic Device, PLD), or a combination thereof.
  • the above-mentioned PLD is a complex programmable logic device (Complex Programmable Logic Device, CPLD), a field-programmable gate array (Field-Programmable Gate Array, FPGA), a general array logic (Generic Array Logic, GAL) or any combination thereof.
  • the memory is the memory of the host.
  • the memory can be a volatile memory, such as a random access memory (RAM), a dynamic RAM (DRAM), a static RAM (SRAM), a synchronous dynamic RAM (SDRAM), a double data rate RAM (DDR), a cache, and so on.
  • the memory can also include a combination of the above types, which is not limited in this application.
  • the peripheral chip includes one or more of a peripheral component interconnect express PCIe device chip, a memory card chip, a network card chip, a redundant array of independent disks RAID chip, and an accelerator card chip, where the accelerator card includes one or more of a graphics processor GPU, a data processing unit DPU, and a neural network processor NPU.
  • the bus includes one or more of the peripheral component interconnect express PCIe bus, the unified bus UB, the compute express link CXL bus, the cache coherent interconnect for accelerators CCIX bus, and the GenZ bus.
  • the data processing method provided by this application can be applied to a single-chip scenario and a multi-chip stacking scenario.
  • the single-chip scenario refers to a scenario where a single peripheral chip is coupled to the host through a bus.
  • the multi-chip stacking scenario refers to a scenario in which multiple peripheral chips are coupled to the host through a bus.
  • when a peripheral chip processes multiple data streams in high-concurrency, multi-data-stream scenarios, the data streams easily interfere with each other. Therefore, in this type of scenario, multiple single chips with weaker processing capabilities are stacked into one device or module, and each single chip processes one data stream, thereby achieving physical isolation in high-concurrency, multi-data-stream scenarios.
  • in the multi-chip stacking scenario, although multiple peripheral chips are installed in the pass-through card, each peripheral chip is coupled with the host through the bus.
  • the computing device includes a pass-through card, multiple peripheral chips are provided in the pass-through card, and the peripheral chips are coupled with the host through a bus in a pass-through manner.
  • the computing device includes a plug-in card with a PCIe switch (switch, SW).
  • multiple peripheral chips are set in the plug-in card with the PCIe switch, and the peripheral chips are coupled to the host through the PCIe switch of that plug-in card.
  • the plug-in card can also be called a SW card.
  • below, the scenario in which the peripheral chips are uniformly installed in the pass-through card and coupled with the host through the bus in pass-through mode is called the pass-through scenario.
  • the scenario in which multiple peripheral chips are installed in a plug-in card with a PCIe switch and coupled to the host through the PCIe switch of that plug-in card is called the SW scenario.
  • the peripheral chip includes a DMA module
  • the method may also include the following steps: the peripheral chip obtains uplink data in the chip memory of the peripheral chip, where the uplink data indicates the data to be sent by the peripheral chip to the host, and the peripheral chip copies the uplink data to the memory in DMA mode.
  • for example, the bandwidth of each peripheral chip can reach 100MB/s. Since the peripheral chip processes only DMA write operations and does not need to process DMA read operations, the upstream bandwidth of the peripheral chip is 100MB/s; the DMA read operation is implemented by the RC 120 of the host 100, so the downlink bandwidth of the peripheral chip 200 is also 100MB/s.
  • in this way, the uplink and downlink bandwidth of the peripheral chip can both reach 100MB/s, and the data processing bandwidth of the chipset can reach the theoretical value of 200MB/s, so that the processing capabilities of the peripheral chips can be maximized even in multi-chip stacking scenarios, and the bandwidth of high-concurrency, multi-data-stream processing is no longer limited by the DMA bandwidth.
  • the peripheral chip copies the upstream data in the chip memory to the memory in DMA mode through the DMA module in the peripheral chip, while the downstream data is handled by the host's DMA module, so the peripheral chip no longer needs to process the downstream data; the data bandwidth of the peripheral chip can thus be fully utilized, and its data processing bandwidth is no longer limited by the DMA bandwidth.
  • the host stores the downstream data in the data processing request into the memory, and then copies the downstream data to the peripheral chip through DMA, so the peripheral chip no longer needs to process the downstream data, the data bandwidth of the peripheral chip can be fully utilized, and the data processing bandwidth of the peripheral chip is no longer limited by the DMA bandwidth.
  • the root complex (RC) of the host supports the DMA function.
  • the root complex of the host may include a DMA module.
  • the DMA module may be a hardware module capable of realizing DMA functions, and may specifically include a DMA controller, a register, and so on.
  • the DMA module can be a DMA hardware unit integrated within the host.
  • the DMA module can also be a logic circuit external to the host. The logic circuit can implement the DMA function. This application does not specifically limit this.
  • the driver of the DMA module can be an open source kernel device driver provided by the CPU manufacturer (not limited to ARM, X86, Tianchi and other CPU types).
  • DMA is a mature reading and writing technology.
  • current servers are equipped with DMA hardware regardless of architecture. This application develops a driver for the server's existing DMA hardware, enables the DMA hardware, and maps the DMA address space configuration to the hardware, enabling it to DMA the downstream data in the memory to the chip memory, so that the peripheral chip no longer needs to process the downstream data or perform DMA read operations, thereby reducing the processing pressure of the peripheral chip and improving the data processing bandwidth of the peripheral chip.
  • the above implementation enables the DMA module already present in the host to process downlink data, thereby reducing the processing pressure of the peripheral chip without deploying additional hardware resources; the solution provided by this application can be realized simply by upgrading the DMA module driver in the host, so the solution has high feasibility and high reproducibility.
  • before the host obtains the data processing request, the method further includes the following steps: the host determines the mapping relationship between the physical address of the chip memory and the physical address of the memory, and copies the downstream data to the peripheral chip in direct memory access DMA mode based on the mapping relationship.
  • the mapping relationship between the addresses of the chip memory and the host's memory can be determined by configuring the host's DMA module.
  • the configuration process may include device enumeration, driver initialization, and device configuration, where device enumeration refers to enumerating all peripheral chips coupled to the host through the bus to obtain the topology information of each peripheral chip.
  • Driver initialization refers to initializing the driver of the host's DMA module and determining the channel information of the host's DMA module.
  • Device configuration refers to configuring the address of the DMA module, and determining the mapping relationship between the chip memory and the memory address based on the above topology information and channel information.
  • the topology information is the bus topology generated when the peripheral chip and the host are coupled through the bus.
  • the topology information is used to describe the topology of the device system composed of the peripheral chips. Specifically, it can be a data structure linked list, such as a PCI device tree.
  • the topology information may also include the identity information of each peripheral chip, such as the device number (device_id) and manufacturer identification (vendor_id) of the peripheral chip, the bus device function (BDF) code of the PCI device, etc., which is not specifically limited in this application.
  • the channel information may include the number of channels of the DMA module, memory space information occupied by each channel, structure assignment information within the channel, etc., which is not specifically limited in this application.
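  • The sketch below shows one way the mapping relationship between chip-memory addresses and host-memory addresses could be recorded per DMA channel. It is a minimal sketch; the structure names and fields are assumptions for illustration, not taken from the patent.

```c
/* Illustrative record of the address mapping determined during device
 * configuration: a window of host memory paired with a window of chip memory. */
#include <stdint.h>

struct addr_map_entry {
    uint64_t host_phys;  /* physical address in host memory */
    uint64_t chip_phys;  /* physical address in chip memory */
    uint64_t size;       /* size of the mapped window */
};

struct dma_channel_cfg {
    int                   channel_id;
    struct addr_map_entry map;       /* mapping used to fill DMA descriptors */
};

/* Translate a host address into the chip-memory address it maps to. */
static uint64_t host_to_chip(const struct dma_channel_cfg *c, uint64_t host_addr)
{
    return c->map.chip_phys + (host_addr - c->map.host_phys);
}
```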
  • the device enumeration process can be as follows: the host can use the depth first search (DFS) algorithm, starting from the host's RC, to find the peripheral chips and bridges connected to the RC, and allocate BDF numbers to the found peripheral chips and bridges. Then, it reads the base address register (BAR) space, performs mapping and access testing on the BAR space, and allocates PCI resources to each found peripheral chip and bridge. After the resource allocation is completed, the above topology information, namely the PCI device tree, is obtained.
  • after obtaining the topology information of the peripheral chips, the RC can determine the driver corresponding to each peripheral chip through device scanning, for example, the network card driver corresponding to a network card, the sound card driver corresponding to a sound card, and so on.
  • the topology information of the peripheral chip may include the identity information of the peripheral chip.
  • by matching the identity information, the driver corresponding to each peripheral chip is determined. It should be understood that since there are many types of peripheral chips, there are correspondingly many drivers; therefore, determining the driver corresponding to each peripheral chip by matching the identity information after obtaining the topology information avoids problems such as initialization failure or write failure caused by driver mismatch when the driver is subsequently initialized.
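  • As an illustration of the DFS-style enumeration described above, the sketch below walks a bus tree from the RC and prints each device's BDF and identity information. The node structure and fan-out limit are assumptions for illustration only.

```c
/* Minimal sketch of depth-first device enumeration from the RC. */
#include <stdint.h>
#include <stdio.h>

struct pci_node {
    uint8_t  bus, dev, fn;          /* BDF number assigned during enumeration */
    uint16_t vendor_id, device_id;  /* identity information */
    struct pci_node *children[8];   /* devices behind this bridge, if any */
    int nchildren;
};

static void enumerate_dfs(const struct pci_node *node, int depth)
{
    printf("%*s%02x:%02x.%x vendor=%04x device=%04x\n", depth * 2, "",
           node->bus, node->dev, node->fn, node->vendor_id, node->device_id);
    /* Recurse into subordinate buses first (depth-first), which is what
     * yields BDF assignment in DFS order. */
    for (int i = 0; i < node->nchildren; i++)
        enumerate_dfs(node->children[i], depth + 1);
}
```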
  • when copying the downstream data, the host's DMA module can first apply for a data channel, where the data channel includes a descriptor and the descriptor includes a source address and a destination address; the host then copies the downstream data to the peripheral chip in direct memory access DMA mode through the data channel.
  • the DMA descriptor can be moved to the physical ring of the DMA hardware to enable the DMA module to transmit data according to the DMA descriptor.
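  • The sketch below shows one plausible shape of "moving a descriptor to the physical ring": software fills the next ring slot and advances a hardware tail register. Ring size, register layout, and the helper name ring_submit are assumptions, not details from the patent.

```c
/* Minimal sketch of submitting a DMA descriptor into a hardware ring. */
#include <stdint.h>

#define RING_SIZE 64

struct dma_desc {
    uint64_t src, dst;  /* source and destination physical addresses */
    uint32_t len, ctrl;
};

struct dma_ring {
    struct dma_desc desc[RING_SIZE];
    uint32_t head;                /* next free slot, written by software */
    volatile uint32_t *tail_reg;  /* hardware doorbell/tail register */
};

static void ring_submit(struct dma_ring *r, uint64_t src, uint64_t dst, uint32_t len)
{
    uint32_t slot = r->head % RING_SIZE;
    r->desc[slot] = (struct dma_desc){ .src = src, .dst = dst,
                                       .len = len, .ctrl = 1 };
    r->head++;
    *r->tail_reg = r->head;  /* tell the engine a new descriptor is ready */
}
```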
  • the data channel includes the data channel extended by the PCIe switch.
  • for example, the number of data channels of the pass-through card is 2, while the PCIe switch can expand the endpoint (EP) port of 1 peripheral chip into 2, so the number of data channels of the SW card is 4, and the additional data channels are the data channels expanded by the PCIe switch.
  • the peripheral chip can also configure the DMA module on the peripheral chip, so that the DMA module can write the upstream data in the chip memory to the host.
  • when the peripheral chip configures its DMA module, it can first determine the mapping relationship between the chip memory and the memory address; in this way, when the peripheral chip processes uplink data, the DMA module can write the upstream data in the chip memory into the memory in DMA mode according to the stored mapping relationship.
  • the detailed step process of configuring the DMA module by the peripheral chip can be referred to the process of configuring the DMA module in the host by the host in the foregoing content, and will not be repeated here.
  • the above implementation develops a driver for the server's existing DMA hardware, enables the DMA hardware, and maps the DMA address space configuration to the hardware, so that the host can write the downstream data in the memory to the chip memory; the peripheral chip no longer needs to process downlink data or perform DMA read operations, thereby reducing the processing pressure of the peripheral chip and improving the data processing bandwidth of the peripheral chip.
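  • For symmetry with the host-side sketch above, the sketch below shows the peripheral-side uplink path: the chip's own DMA module performs a DMA write of uplink data from chip memory into host memory. Register names and the helper chip_send_uplink are illustrative assumptions.

```c
/* Minimal sketch of the peripheral-side uplink path (DMA write only). */
#include <stdint.h>

struct chip_dma {
    volatile uint64_t *src_reg;  /* chip-memory source address */
    volatile uint64_t *dst_reg;  /* host-memory destination address */
    volatile uint32_t *len_reg;
    volatile uint32_t *go_reg;   /* start bit */
};

static void chip_send_uplink(struct chip_dma *dma,
                             uint64_t chip_src, uint64_t host_dst, uint32_t len)
{
    *dma->src_reg = chip_src;  /* uplink data sits in the chip memory */
    *dma->dst_reg = host_dst;  /* destination chosen from the stored mapping */
    *dma->len_reg = len;
    *dma->go_reg  = 1;         /* DMA write toward the host's memory */
}
```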
  • a host is provided.
  • the host is used in a computing device.
  • the computing device includes a host, a memory and a peripheral chip.
  • the host, the memory and the peripheral chip are coupled through a bus.
  • the host includes: an acquisition unit for acquiring a data processing request, where the data processing request includes downstream data and the downstream data indicates the data to be sent by the host to the peripheral chip; a storage unit for storing the downstream data into the memory; and a direct memory access DMA unit for copying the downstream data to the peripheral chip in direct memory access DMA mode.
  • the host stores the downstream data in the data processing request into the memory, and then copies the downstream data to the peripheral chip through DMA.
  • the peripheral chip no longer needs to process the downstream data, so the data bandwidth of the peripheral chip can be fully utilized, and the data processing bandwidth of the peripheral chip is no longer limited by the DMA bandwidth.
  • the root complex of the host supports DMA functionality.
  • the host includes a determining unit, which is used to determine the mapping relationship between the physical address of the chip memory and the physical address of the memory before the acquisition unit obtains the data processing request; the DMA unit is used to copy the downstream data to the peripheral chip in direct memory access DMA mode based on the mapping relationship.
  • the DMA unit is used to obtain the source address of the downlink data and determine the destination address corresponding to the source address according to the mapping relationship.
  • the DMA unit is used to apply for a data channel, where the data channel includes a descriptor and the descriptor includes a source address and a destination address; the DMA unit is used to copy the downstream data to the peripheral chip through the data channel in direct memory access DMA mode.
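  • The sketch below renders the unit decomposition just listed as a plain data structure: three hooks for the acquisition, storage, and DMA units. The type and field names are hypothetical, chosen only to mirror the wording of this aspect.

```c
/* Illustrative decomposition of the host into functional units. */
#include <stdint.h>
#include <stddef.h>

struct data_request {
    const void *downstream;  /* data the host will send to the peripheral chip */
    size_t      len;
};

struct host_units {
    /* acquisition unit: obtain a data processing request */
    int (*acquire)(struct data_request *out);
    /* storage unit: store the downstream data into host memory */
    int (*store)(const struct data_request *req);
    /* DMA unit: copy downstream data to the peripheral chip over a channel */
    int (*dma_copy)(uint64_t src, uint64_t dst, size_t len);
};
```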
  • the computing device includes a pass-through card, a plurality of peripheral chips are provided in the pass-through card, and the peripheral chips are coupled with the host through a bus in a pass-through manner.
  • the computing device includes a plug-in card with a PCIe switch, multiple peripheral chips are provided in the plug-in card, and the peripheral chips are coupled to the host through the PCIe switch of the plug-in card.
  • the data channel includes a data channel extended by the PCIe switch.
  • the bus includes one or more of the peripheral component interconnect express PCIe bus, the unified bus UB, the compute express link CXL bus, the cache coherent interconnect for accelerators CCIX bus, and the GenZ bus.
  • the peripheral chip includes one or more of a peripheral component interconnect express PCIe device chip, a memory card chip, a network card chip, a redundant array of independent disks RAID chip, and an accelerator card chip, where the accelerator card includes one or more of a graphics processor GPU, a data processing unit DPU, and a neural network processor NPU.
  • a processor is provided.
  • the processor is installed in a computing device.
  • the computing device includes a processor, a memory and a peripheral chip.
  • the processor, the memory and the peripheral chip are coupled through a bus.
  • the processor is used to execute the operating steps of the host in the method described in the first aspect.
  • in a fourth aspect, a computing device is provided.
  • the computing device includes a host, a memory and a peripheral chip.
  • the host, the memory and the peripheral chip are coupled through a bus.
  • the host is used to implement the operating steps of the host in the method described in the first aspect.
  • the peripheral chip is used to implement the operation steps of the peripheral chip in the method described in the first aspect.
  • a readable storage medium is provided, which stores instructions; when the instructions are run on a host, the host is caused to execute the method described in the first aspect.
  • Figure 1 is a schematic structural diagram of a data processing system provided by this application;
  • Figure 2 is a schematic structural diagram of another data processing system provided by this application;
  • Figure 3 is a schematic flow chart of the steps of a data processing method provided by this application;
  • Figure 4 is a schematic flow chart of the steps of DMA driver initialization in a data processing method provided by this application;
  • Figure 5 is a schematic flow chart of the steps of device configuration in a data processing method provided by this application;
  • Figure 6 is a schematic flow chart of the steps of a data processing method provided by this application in the SW scenario;
  • Figure 7 is a schematic structural diagram of a host provided by this application.
  • Single chip refers to the data exchange scenario between the processor and a single peripheral chip, and multi-chip stacking refers to the data exchange scenario between the processor and a device or module formed by stacking multiple single chips. It should be understood that in high-concurrency, multi-data-stream scenarios, when a peripheral chip processes multiple data streams, the data streams easily interfere with each other. Therefore, in this type of scenario, multiple single chips with weaker processing capabilities are stacked into one device or module, and each single chip processes one data stream, thereby achieving physical isolation in high-concurrency, multi-data-stream scenarios.
  • the peripheral chip processes the DMA write operation and writes the data in the peripheral chip memory to the host's memory.
  • the peripheral chip processes the DMA read operation to read the data from the host's memory to the peripheral chip memory.
  • DMA technology is a high-speed data transmission technology that realizes data exchange between the processor and peripheral chips through DMA hardware, thereby reducing the processing pressure of the processor and improving the efficiency of data transmission.
  • the host 100 is coupled with the chip of each node device (end point, EP) (i.e., the above-mentioned peripheral chip) through the root complex (root complex, RC) 120, where the RC is used to convert the host's access transactions into access transactions on the PCIe bus.
  • the PCIe bus exchanges information or transmits data in the form of messages. Therefore, the RC is responsible for generating corresponding messages according to the CPU's access transactions and transmitting them to the downstream peripheral chips; likewise, the RC is responsible for receiving the messages reported by the downstream peripheral chips and forwarding the information or data to the CPU according to the message content.
  • the upstream and downstream bandwidth of the peripheral chip is usually limited by the DMA data bandwidth. For example, if the bandwidth of the peripheral chip is 200MB/s but the DMA data bandwidth is 100MB/s, the actual bandwidth of the peripheral chip can only reach 100MB/s, which severely limits the bandwidth of the peripheral chip and wastes the processing power of the chip.
  • the uplink and downlink bandwidth after multiple peripheral chips are stacked will also be limited by the DMA data bandwidth, so that the data processing capabilities of each peripheral chip cannot be fully utilized, resulting in the peripheral chip processing capabilities being wasted.
  • for example, the processing capacity of a PCIe chip formed by stacking two peripheral chips can reach 200MB/s. However, because the DMA data bandwidth is limited, the processing power of each peripheral chip has to be frequency-divided, for example into an uplink bandwidth of 50MB/s and a downlink bandwidth of 50MB/s, so the stacked PCIe chip can ultimately reach only an upstream bandwidth of 100MB/s and a downstream bandwidth of 100MB/s. This is equivalent to the peripheral chip being able to use only half of the data bandwidth when reading data and only half when writing data, which wastes the processing power of the peripheral chip.
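  • A worked version of the arithmetic above, under the numbers given in the text (100MB/s of DMA bandwidth per chip, two chips stacked); the program itself is purely illustrative.

```c
/* Compare the frequency-divided conventional scheme with the scheme of this
 * application, using the figures quoted in the text. */
#include <stdio.h>

int main(void)
{
    const double dma_bw = 100.0;  /* MB/s of DMA data bandwidth per chip */
    const int chips = 2;

    /* Conventional: each chip's DMA handles both reads and writes, so its
     * bandwidth is split into 50MB/s uplink + 50MB/s downlink. */
    double conv_up = chips * (dma_bw / 2);  /* 100 MB/s upstream total */
    double conv_dn = chips * (dma_bw / 2);  /* 100 MB/s downstream total */

    /* This application: the chip's DMA performs only writes (uplink) and the
     * host RC's DMA performs the reads (downlink), so each direction gets
     * the full per-chip bandwidth. */
    double new_up = chips * dma_bw;  /* 200 MB/s upstream total */
    double new_dn = chips * dma_bw;  /* 200 MB/s downstream total */

    printf("conventional: %.0f up / %.0f down (MB/s)\n", conv_up, conv_dn);
    printf("this scheme:  %.0f up / %.0f down (MB/s)\n", new_up, new_dn);
    return 0;
}
```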
  • in summary, the upstream and downstream bandwidth of the peripheral chip is usually limited by the data bandwidth of DMA, so that in both single-chip and multi-chip stacking scenarios the data bandwidth of the chip cannot be fully utilized, which wastes the processing power of the peripheral chip and results in low uplink and downlink bandwidth.
  • to solve the above problem, this application provides a data processing system that transfers the processing of downlink data to the DMA module of the host: the host's DMA module stores the downlink data into the memory and then copies the downlink data to the peripheral chip in DMA form, so the peripheral chip does not need to process the downlink data, thereby reducing the processing pressure on the peripheral chip and improving the uplink and downlink bandwidth of the peripheral chip.
  • Figure 1 is a schematic structural diagram of a data processing system provided by the present application.
  • the data processing system 1000 includes a host (host) 100, a peripheral chip 200, a memory 400 and a chip memory 240, where the host 100, the peripheral chip 200, the memory 400 and the chip memory 240 are coupled through the bus 300, and the number of peripheral chips 200 may be one or more.
  • the data processing system 1000 can be deployed on a computing device.
  • the computing device can be a physical server; it can also be a storage device, such as a storage array or a storage server; the computing device can also be an edge server, which is not specifically limited in this application.
  • the host 100 may include at least one general-purpose processor, such as a CPU, an NPU, or a combination of a CPU and a hardware chip.
  • the above-mentioned hardware chip is an application-specific integrated circuit (Application-Specific Integrated Circuit, ASIC), a programmable logic device (Programmable Logic Device, PLD), or a combination thereof.
  • the above-mentioned PLD is a complex programmable logic device (Complex Programmable Logic Device, CPLD), a field-programmable gate array (Field-Programmable Gate Array, FPGA), a general array logic (Generic Array Logic, GAL) or any combination thereof.
  • the peripheral chip 200 may be a system on chip (SoC).
  • the peripheral chip 200 may be any chip that can be coupled to the host 100 through the bus 300, such as a sound card, a network interface card (NIC), a universal serial bus (USB) card, an integrated drive electronics (IDE) interface card, a redundant arrays of independent disks (RAID) card, a video capture card, etc., which is not specifically limited in this application.
  • the number of peripheral chips coupled to the host 100 through the bus may be one.
  • the number of peripheral chips coupled to the host 100 through a bus may be multiple.
  • for example, if the peripheral chip is a PCIe device chip, multiple peripheral chips can be stacked to form a PCIe card.
  • if the peripheral chip is a network card chip, multiple peripheral chips can be stacked to form a stacked network card.
  • if the peripheral chip is a disk chip, multiple peripheral chips can be stacked to form a redundant array of independent disks (RAID).
  • if the peripheral chip is a processor chip, such as a central processing unit (CPU), a graphics processing unit (GPU), a data processing unit (DPU), a neural network processing unit (NPU) or another special-purpose processor chip, multiple peripheral chips can be stacked to form an acceleration component.
  • the host 100 can use the processor to process the main business system, and use the acceleration component to process other systems, such as neural network training systems, image rendering systems, etc., which are not specifically limited in this application.
  • one or more peripheral chips 200 can be installed in a pass-through card, and the peripheral chips are coupled with the host 100 through the bus 300 in a pass-through manner.
  • for example, peripheral chip 1 and peripheral chip 2 can be installed in the pass-through card, where peripheral chip 1 is coupled to the host 100 through the bus, and peripheral chip 2 is coupled to the host 100 through the bus.
  • the pass-through card can be the part selected by the dotted box in Figure 1; this combination is referred to as the "pass-through scenario" below.
  • the memory 400 is the memory of the host 100.
  • the memory 400 can specifically be a volatile memory (volatile memory), such as a random access memory (RAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), a synchronous dynamic random access memory (SDRAM), a double data rate synchronous dynamic random access memory (DDR), a cache, etc.
  • the chip memory 240 is the memory of the peripheral chip 200; it can be a memory stick or memory particles plugged into the peripheral chip interface. Specifically, it can be a volatile memory, such as RAM, DRAM, SRAM, SDRAM, DDR, cache, etc. The chip memory 240 may also include a combination of the above categories, which is not limited by this application.
  • the bus 300 may be a peripheral component interconnect express (PCIe) bus, an extended industry standard architecture (EISA) bus, a unified bus (Ubus or UB), a compute express link (CXL) bus, a cache coherent interconnect for accelerators (CCIX) bus, a GenZ (Generation Z) bus, etc., which is not specifically limited in this application.
  • the host 100 and the peripheral chip 200 can be further divided into multiple unit modules.
  • Figure 1 is an exemplary division method.
  • the host 100 can include a memory controller 110 and an RC 120, where the memory controller 110 and the RC 120 are coupled through the system bus 130.
  • the peripheral chip 200 may include a node device (end point, EP) port 210 and a peripheral chip memory controller 220, where the EP 210 and the peripheral chip memory controller 220 are coupled through a system bus 230, and the system bus 230 can also be coupled with the chip memory 240.
  • the host 100 and the peripheral chip 200 can also include more unit modules.
  • the host 100 can also include a communication interface, a power supply, etc.
  • the peripheral chip 200 can also include a communication interface, a power supply, etc., which are not specifically limited in this application.
  • the memory controller 110 and the peripheral chip memory controller 220 can be hardware chips with processing functions.
  • the above hardware chips are application-specific integrated circuits (Application-Specific Integrated Circuit, ASIC), programmable logic devices (Programmable Logic Device, PLD), or a combination thereof.
  • the above-mentioned PLD is a complex programmable logic device (CPLD), a field-programmable gate array (Field-Programmable Gate Array, FPGA), a general array logic (Generic Array Logic, GAL), or any combination thereof.
  • the memory controller 110 can execute various types of digital storage instructions, such as software or firmware programs stored in the memory 400, enabling the host 100 to provide a wide variety of services.
  • the system bus 130 and the system bus 230 may be a PCIe bus, an EISA bus, a UB bus, a CXL bus, a CCIX bus, a GenZ bus, etc., which are not specifically limited in this application.
  • the RC 120 is connected to the EP 210 of the peripheral chip 200 via the bus 300, where the RC 120 is used to convert the access transactions of the host 100 into access transactions on the PCIe bus. It should be understood that the PCIe bus exchanges information or transmits data in the form of messages; therefore, the RC 120 is responsible for generating corresponding messages according to the CPU's access transactions, or for processing received messages and forwarding the information or data in the message contents to the host 100.
  • the host 100 can obtain the first data request, where the first data request includes downlink data and the downlink data indicates the data to be sent by the host 100 to the peripheral chip 200; the host 100 stores the downlink data into the memory 400, and then copies the downlink data to the peripheral chip 200 in a DMA manner.
  • the peripheral chip 200 can obtain the uplink data in the chip memory 240, where the uplink data indicates the data to be sent by the peripheral chip 200 to the host 100, and then copies the uplink data to the memory 400 of the host 100 in a DMA manner.
  • when the host 100 copies the downlink data to the peripheral chip 200 in DMA mode, it can first send the copied downlink data to the peripheral chip memory controller 220 through DMA technology, and then the peripheral chip memory controller 220 stores the downlink data into the chip memory 240.
  • when the peripheral chip 200 copies the uplink data to the host 100 in DMA mode, the copied uplink data can first be sent to the memory controller 110 through DMA technology, and then the memory controller 110 stores the uplink data into the memory 400.
  • in this way, the downstream data in the memory 400 is no longer read by the peripheral chip 200 through DMA, but is written into the chip memory 240 by the host 100 through DMA technology. Since the peripheral chip 200 no longer processes DMA read operations, not only can the processing pressure of the peripheral chip 200 be reduced, but the data processing bandwidth of the peripheral chip 200 does not need to be frequency-divided, so the data processing bandwidth of the peripheral chip 200 can be improved.
  • for example, the bandwidth of each peripheral chip 200 in Figure 1 can reach 100MB/s. Since the peripheral chip 200 only needs to process DMA write operations and does not need to process DMA read operations, that is, the peripheral chip 200 only needs to write the upstream data in the chip memory 240 into the memory 400 of the host 100 through DMA, the upstream bandwidth of the peripheral chip is 100MB/s; the DMA read operation is implemented by the RC 120 of the host 100, so the downlink bandwidth of the peripheral chip 200 is 100MB/s.
  • in this way, the uplink and downlink bandwidths of the two peripheral chips 200 can both reach 100MB/s, the data processing bandwidth of the chipset can reach the theoretical value of 200MB/s, the processing capabilities of the peripheral chips are maximized in multi-chip stacking scenarios, and the bandwidth of high-concurrency, multi-data-stream processing is no longer limited by the DMA bandwidth.
  • the RC 120 may include a DMA module 121, and the host 100 may copy downlink data to the peripheral chip through the DMA module 121.
  • the DMA module 121 may be a hardware module capable of realizing the DMA function of the host 100, which may specifically include a DMA controller, a register, etc., and the DMA module 121 realizes the function of writing data in the memory 400 to the chip memory 240.
  • the driver of the DMA module 121 can be installed by the RC 120 before the data in the memory 400 is written to the chip memory 240 through DMA technology, and the driver of the DMA module 121 can be an open source kernel device driver provided by the CPU manufacturer (not limited to ARM, X86, Tianchi and other CPU types).
  • the host can configure the DMA module 121 so that the DMA module 121 can write the downstream data in the memory 400 to the chip memory 240 .
  • the DMA module 121 can be a DMA hardware unit integrated inside the host 100 as shown in Figure 1. In some embodiments, the DMA module 121 can also be deployed outside the host 100.
  • the host 100 is a CPU chip, and the DMA module 121 may be a DMA hardware unit integrated within the CPU, or may be a logic circuit outside the CPU.
  • the logic circuit can implement the DMA function, which is not specifically limited in this application.
  • DMA is a mature reading and writing technology.
  • current servers are equipped with DMA hardware regardless of architecture. This application develops a driver for the server's existing DMA hardware, enables the DMA hardware, and maps the DMA address space configuration to the hardware, enabling the function of writing the downstream data in the memory 400 to the chip memory 240, so that the peripheral chip 200 no longer needs to process the downstream data or perform DMA read operations, thereby reducing the processing pressure of the peripheral chip 200 and increasing the data processing bandwidth of the peripheral chip 200.
  • when configuring the DMA module 121, the RC 120 determines the mapping relationship between the addresses of the chip memory 240 and the memory 400. In this way, when the RC 120 obtains a data processing request, the DMA module 121 can, according to the mapping relationship, write the downstream data in the memory 400 into the chip memory 240 of the peripheral chip 200 in a DMA manner.
  • specifically, the RC 120 can determine the mapping relationship between the addresses of the chip memory 240 and the host's memory 400 by configuring the DMA module 121 of the host. The configuration process may include device enumeration, driver initialization and device configuration, where device enumeration refers to performing device enumeration on the peripheral chips 200 to obtain the topology information of each peripheral chip 200.
  • Driver initialization refers to initializing the driver of the DMA module 121 of the host 100 and determining the channel information of the DMA module 121 of the host 100 .
  • Device configuration refers to configuring the address of the DMA module 121, and determining the mapping relationship between the chip memory and the memory address based on the above topology information and channel information.
  • the topology information is the bus topology generated when the peripheral chip 200 and the host 100 are coupled through the bus; it is used to describe the topology of the device system composed of the peripheral chips 200, and can specifically be a data structure linked list, such as a PCI device tree.
  • the topology information may also include the identity information of each peripheral chip 200, such as the device number (device_id), manufacturer identification (vendor_id), and bus device function (BDF) code of the PCI device of the peripheral chip 200.
  • the channel information may include the number of channels of the DMA module 121, memory space information occupied by each channel, structure assignment information in the channel, etc., which is not specifically limited in this application.
  • the device enumeration process can be as follows: the RC 120 can use a depth first search (DFS) algorithm, starting from the RC 120, to search for the peripheral chips 200 and bridges (bridge) connected to the RC 120, and allocate BDF numbers to the found peripheral chips 200 and bridges. Then, it reads the base address register (BAR) space, performs mapping and access testing on the BAR space, and allocates PCI resources to each found peripheral chip and bridge. After the resource allocation is completed, the above topology information, which is the PCI device tree, is obtained.
  • after obtaining the topology information of the peripheral chips 200, the RC 120 can determine the driver corresponding to each peripheral chip 200 through device scanning, for example, the network card driver corresponding to a network card, the sound card driver corresponding to a sound card, and so on.
  • the topology information of the peripheral chip 200 may include the identity information of the peripheral chip 200.
  • the identity information of the peripheral chip 200 is matched to determine the driver corresponding to each peripheral chip 200.
  • for example, the vendor_id and device_id of the peripheral chip 200 can be matched with the vendor_id and device_id registered by the driver.
  • the driver that matches successfully is the driver corresponding to the peripheral chip 200. It should be understood that since there are many types of peripheral chips 200, there are correspondingly many drivers; therefore, after obtaining the topology information of the peripheral chips 200, determining the driver corresponding to each peripheral chip 200 by matching the identity information avoids problems such as initialization failure or write failure caused by driver mismatch when the driver is initialized later.
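  • The sketch below illustrates the vendor_id/device_id matching just described: a chip's identity information is compared against the ID pairs each driver registered. The table layout and the function name match_driver are assumptions for illustration.

```c
/* Minimal sketch of driver matching by (vendor_id, device_id). */
#include <stdint.h>
#include <stddef.h>

struct id_pair { uint16_t vendor_id, device_id; };

struct driver {
    const char           *name;
    const struct id_pair *ids;   /* ID pairs this driver registered */
    size_t                nids;
};

static const struct driver *match_driver(const struct driver *drivers, size_t n,
                                         uint16_t vendor_id, uint16_t device_id)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < drivers[i].nids; j++)
            if (drivers[i].ids[j].vendor_id == vendor_id &&
                drivers[i].ids[j].device_id == device_id)
                return &drivers[i];  /* match found: this chip's driver */
    return NULL;  /* no match: avoid initializing a mismatched driver */
}
```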
  • specifically, the DMA module 121 can first apply for a DMA channel and configure a DMA descriptor for the DMA channel according to the above mapping relationship, where the DMA descriptor includes the source address and destination address of the data; the DMA descriptor is then moved to the physical ring of the DMA hardware, enabling the DMA module 121 to transmit data according to the DMA descriptor.
  • when there are multiple DMA modules 121, the RC 120 can determine the DMA module corresponding to each application when configuring the multiple DMA modules 121 according to the above process. When the RC 120 initiates a data write request, it can first determine the DMA module corresponding to the application and then use that DMA module to apply for a DMA channel; the details will not be repeated here.
  • the peripheral chip 200 can include a DMA module 211. The peripheral chip 200 can obtain the uplink data in the chip memory 240, where the uplink data indicates the data to be sent by the peripheral chip to the host 100, and then copy the uplink data to the memory 400 in a DMA manner.
  • the peripheral chip 200 can also configure the DMA module 211 so that the DMA module 211 can write the uplink data in the chip memory 240 into the memory 400 of the host 100 .
  • when the peripheral chip 200 configures the DMA module 211, it can first determine the mapping relationship between the addresses of the chip memory 240 and the memory 400. In this way, when the peripheral chip 200 initiates a data write request, the DMA module 211 can, according to the mapping relationship, write the data in the chip memory 240 into the memory 400 in a DMA manner.
  • the detailed step process of configuring the DMA module 211 by the peripheral chip 200 can be referred to the process of configuring the DMA module 121 by the RC 120 in the above content, and will not be repeated here.
  • the data processing system may also include a computing device including a plug-in card with a PCIe switch (SW), and the one or more peripheral chips 200 may be disposed in the plug-in card with the PCIe switch; the peripheral chip 200 is coupled to the host through the PCIe switch of the plug-in card. This application scenario can be called the SW scenario.
  • Figure 2 shows another data processing system 1001 provided by this application, where Figure 2 shows the data processing system 1001 in the above-mentioned PCIe switch scenario (SW scenario for short), and Figure 1 shows the pass-through scenario.
  • the data processing system 1001 includes a host 100, a memory 400 and a PCIe switch 510, where one or more peripheral chips 200 can be installed in the plug-in card with the PCIe switch 510.
  • the peripheral chip 200 is coupled with the host 100 through the PCIe switch 510.
  • the plug-in card can be called a SW card, such as the SW card 500 shown in Figure 2.
  • the two peripheral chips 200 are coupled to the host 100 through the PCIe switch 510.
  • the PCIe switch 510 is used to provide expansion or aggregation capabilities, allowing more peripheral chips 200 to be connected to one PCIe interface of the host.
  • in this way, the host 100 can be coupled with more peripheral chips 200 through the bus, and each peripheral chip 200 can have more data channels, thereby increasing the data processing bandwidth of stacked cards in multi-chip stacking scenarios.
  • for example, in the pass-through scenario shown in Figure 1, only two EPs 210 are coupled to the host 100 through the bus, while in the SW scenario shown in Figure 2, four EPs 210 can be coupled to the host through the PCIe switch, making the bandwidth of the entire data processing system 1001 higher.
  • the host 100 can use the DMA module 121 to write the downstream data in the memory 400 into the chip memory 240 of the peripheral chip 200 through DMA technology.
  • the peripheral chip 200 can use the DMA module 211 to write the uplink data in the chip memory 240 into the host's memory 400 using DMA technology.
  • the specific implementation method can be referred to the embodiment in Figure 1 and will not be repeated here.
  • the host 100 can first apply for a data channel.
  • the data channel can include a descriptor.
  • the descriptor carries the source address and destination address of the downlink data.
  • the host 100 can copy the downlink data to the peripheral chip 200 in DMA mode through the data channel, where the above data channel may include a data channel extended by the PCIe switch 510.
  • for example, the PCIe switch 510 in Figure 2 has expanded one DMA data channel. Of course, the PCIe switch 510 can also expand more DMA data channels for use by more peripheral chips 200. In addition, each EP 210 in the peripheral chip 200 in Figure 2 is deployed with one DMA module 211.
  • in some embodiments, multiple EPs 210 in the peripheral chip 200 can share one DMA module 211, or one EP 210 can have multiple DMA modules 211, which is not specifically limited in this application.
  • it should be noted that in solutions where the DMA function is deployed on the PCIe switch, the PCIe switch needs to provide both DMA read and DMA write functions and to process both uplink and downlink data, so it needs higher DMA processing capabilities and needs to be adapted to the processing capabilities of the host 100 and the peripheral chip 200. With the technical solution provided by this application, the processing of downlink data is transferred to the DMA module 121 of the host 100 and the processing of uplink data to the DMA module 211 of the peripheral chip 200; since the PCIe switch 510 does not need to perform DMA read and write operations, the DMA requirement on the PCIe switch 510 hardware is reduced, and it may not even need the DMA function.
  • in this way, the selection range of the PCIe switch 510 is increased; at the same time, developers do not need to separately develop and maintain DMA driver code for the PCIe switch 510, which reduces the development and maintenance costs of the PCIe switch 510.
  • in summary, the host stores the downstream data in the data processing request into the memory and then copies the downstream data to the peripheral chip in DMA mode through the DMA module, while the peripheral chip obtains the upstream data in the chip memory and copies the uplink data to the host memory in DMA mode, so that the entire bandwidth of the peripheral chip can be used to process the uplink data and there is no need to process the downlink data. Handing the downlink data to the host's DMA module for processing not only allows the data bandwidth of peripheral chips to be fully utilized in the single-chip scenario, so that the data processing bandwidth of a single peripheral chip is no longer limited by the DMA bandwidth, but also allows the data processing bandwidth of pass-through cards and SW cards to be fully utilized, so that the bandwidth of high-concurrency, multi-data-stream processing is no longer limited.
  • Figure 3 is a data processing method provided by this application. This method can be applied to the data processing system 1000 and the data processing system 1001 shown in Figure 1 or 2.
  • the data processing system 1000 or the data processing system 1001 can be deployed in a computing device.
  • the computing device may include a host 100, a memory 400 and a peripheral chip 200.
  • the method may include the following steps:
  • Step S310: The host 100 obtains a data processing request.
  • the data processing request includes downlink data.
  • the downlink data is used to indicate the data to be sent by the host 100 to the peripheral chip 200. This step may be implemented by the memory controller 110 in Figure 1 or Figure 2.
  • Step S320: The host 100 stores the downlink data into the memory 400. This step may be implemented by the memory controller 110 in Figure 1 or Figure 2.
  • Step S330: The host 100 copies the downlink data to the peripheral chip 200 in a direct memory access DMA manner. This step can be implemented by the DMA module 121 in Figure 1 or Figure 2.
  • the root complex RC 120 of the host 100 supports the DMA function.
  • the host 100 writes data into the peripheral chip 200 through the DMA technology.
  • the DMA technology please refer to the aforementioned embodiments of Figures 1 and 2. The description will not be repeated here.
  • the host 100 copies the downstream data to the peripheral chip 200 in direct memory access DMA mode, so that when the peripheral chip 200 needs to read the data in the host's memory 400, it no longer needs to perform a DMA read operation, thereby reducing the processing pressure of the peripheral chip 200 and increasing the data processing bandwidth of the peripheral chip 200.
  • the host 100 and the peripheral chip 200 are coupled through a bus 300, where the bus 300 includes one or more of the peripheral component interconnect express PCIe bus, the unified bus UB, the compute express link CXL bus, the cache coherent interconnect for accelerators CCIX bus, and the GenZ bus.
  • the peripheral chip includes one or more of a peripheral component interconnect express PCIe device chip, a memory card chip, a network card chip, a redundant array of independent disks RAID chip, and an accelerator card chip, where the accelerator card includes one or more of a graphics processor GPU, a data processing unit DPU, and a neural network processor NPU.
  • the peripheral chip 200 includes a DMA module 211.
  • the peripheral chip 200 can obtain uplink data in the chip memory 240, where the uplink data indicates the data to be sent by the peripheral chip 200 to the host 100, and the peripheral chip 200 copies the uplink data to the memory 400 in DMA mode through the DMA module 211.
  • before the host 100 copies the downstream data in the host's memory 400 to the peripheral chip 200 using direct memory access DMA, the following step may also be included: the host 100 determines the mapping relationship between the physical address of the chip memory and the host memory address. In this way, the host can write the data in the host's memory into the chip memory of the peripheral chip according to the mapping relationship.
  • the host may include an RC, and the RC is connected to the EP port of the chip through a bus.
  • RC is used to convert the access transaction of the host into the access transaction on the PCIe bus.
  • the PCIe bus exchanges information or transmits data in the form of messages. Therefore, the RC is responsible for generating corresponding messages according to the CPU's access transactions, or for processing received messages and forwarding the information or data in the message content to the processor.
  • RC can be used to determine the mapping relationship between the addresses of the chip memory and the host's memory.
  • RC can determine the mapping relationship between the address of the chip memory and the memory of the host by configuring the DMA module of the host.
  • the configuration process may include device enumeration, driver initialization and device configuration, where device enumeration refers to performing device enumeration on the peripheral chips to obtain the topology information of each peripheral chip.
  • Driver initialization refers to initializing the driver of the host's DMA module and determining the channel information of the host's DMA module.
  • Device configuration refers to configuring the address of the DMA module, and determining the mapping relationship between the chip memory and the memory address based on the above topology information and channel information.
  • the topology information is the bus topology generated when multiple peripheral chips are coupled with the host through the bus.
  • the topology information is used to describe the topology of the device system composed of multiple peripheral chips; specifically, it can be a data structure linked list, such as a PCI device tree.
  • the topology information may also include the identity information of each peripheral chip, such as the peripheral chip's device_id, vendor_id, BDF code, etc., which is not specifically limited in this application.
  • the channel information may include the number of channels of the DMA module, memory space information occupied by each channel, structure assignment information within the channel, etc., which is not specifically limited in this application.
  • the specific process by which the RC performs device enumeration on the peripheral chips and obtains the topology information of each peripheral chip can be as follows: the host's RC can use the DFS algorithm, starting from the RC, to find the peripheral chips and bridges (bridge) connected to the RC, and allocate BDF numbers to the found peripheral chips and bridges. Then, it reads the BAR space, performs mapping and access testing on the BAR space, and allocates PCI resources to each found peripheral chip and bridge. After the resource allocation is completed, the above topology information, which is the PCI device tree, is obtained.
  • the RC determines the driver corresponding to each peripheral chip 200 through device scanning, for example, the network card driver corresponding to a network card, the sound card driver corresponding to a sound card, and so on.
  • the topology information of the peripheral chip 200 may include the identity information of the peripheral chip 200.
  • the driver corresponding to the peripheral chip 200 is determined by matching identity information. It should be understood that since the number of types of peripheral chips 200 differs, the number of corresponding drivers also differs. Therefore, after obtaining the topology information of the peripheral chips 200, the driver corresponding to each peripheral chip 200 can be determined by matching the identity information. This avoids problems such as initialization failure or write failure caused by driver mismatch when the driver is initialized later.
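  • A minimal sketch of this matching rule (the driver table and field names below are assumptions for illustration, not from the disclosure): a driver is selected only when both its registered vendor_id and device_id match the enumerated chip.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical registration record: each driver registers the IDs it supports. */
struct drv_id { uint16_t vendor_id, device_id; };
struct driver {
    const char *name;
    const struct drv_id *ids;
    size_t nids;
};

/* Return the first registered driver whose vendor_id and device_id
 * both match the enumerated chip, mirroring the rule in the text. */
const struct driver *match_driver(const struct driver *tbl, size_t n,
                                  uint16_t vendor_id, uint16_t device_id)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < tbl[i].nids; j++)
            if (tbl[i].ids[j].vendor_id == vendor_id &&
                tbl[i].ids[j].device_id == device_id)
                return &tbl[i];
    return NULL; /* no match: skip init rather than risk a write failure */
}
```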
  • FIG. 4 is a schematic flow chart of the steps of DMA driver initialization provided by this application.
  • the RC initializes the driver of the host's DMA module based on the topology information of each device; the specific process of determining the channel information of the host's DMA module can be as follows:
  • the DMA module here may be the DMA module 121 in the embodiments of Figures 1 and 2.
  • the identity information of the DMA module may be the BDF number of the DMA module, to facilitate locating the DMA module and collecting its statistics and status in subsequent processing.
  • the BDF number of the DMA module can be recorded in the log space. It should be understood that the device enumeration process before step S410 will not enumerate the DMA module, because the DMA module is a DMA hardware device in the host, so the identity information of the DMA module needs to be obtained through step S410.
  • After the DMA module identity information is obtained in step S410, the DMA driver data pointer can be set to a private device pointer. It should be understood that the device pointer at DMA driver initialization is a public pointer; after the DMA module pointer is set to private, the DMA module can be dedicated to the multiple peripheral chips obtained by device enumeration before step S410.
  • the DMA module can be configured for PCIe through the set functions in the DMA driver code or the kernel's system functions.
  • the configuration content includes configuring memory addresses for the DMA module's space and other PCIe-related parameters, which is not specifically limited in this application.
  • the channel information may include the number of available channels of the DMA module; corresponding memory space is then requested for each channel. For example, if the structure size of each channel is A and the number of channels is B, then a memory space of size A × B can be requested in step S430. It should be understood that the above example is for illustration and is not specifically limited in this application.
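  • A minimal sketch of this A × B allocation, assuming a hypothetical dma_chan struct as the per-channel structure (the fields are invented for illustration):

```c
#include <stdlib.h>

/* Hypothetical per-channel structure of size A. */
struct dma_chan {
    void *ring;      /* descriptor ring backing this channel */
    unsigned depth;  /* ring depth */
    int state;       /* channel state, filled in by the assignment step */
};

/* Request one contiguous region of size A x B for B channels (step S430). */
struct dma_chan *alloc_channels(unsigned nchan /* B */)
{
    /* calloc zero-fills, so the later structure-assignment initialization
     * starts from a known state */
    return calloc(nchan, sizeof(struct dma_chan)); /* A x B bytes */
}
```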
  • the channel information can also include the assignment of each channel structure.
  • the assignment initialization operation can be determined according to the actual business environment, and is not specifically limited in this application.
  • the DMA module can be switch-enabled so that its DMA function is turned on. This can include configuring the status of the DMA module, configuring the transceiver mode, etc., and can also include the configuration of other DMA-related functions, which are not exemplified one by one here.
  • the number of DMA modules in the host is usually one or more.
  • when there are multiple DMA modules, each DMA module can be configured according to the description of the above steps S410 to S430 and their optional steps, which is not repeated here.
  • Figure 5 is a schematic flow chart of the steps of device configuration provided by this application. As shown in Figure 5, the specific steps of configuring the address of the DMA module and determining the mapping relationship between the chip memory and the memory address based on the above topology information and channel information can be as follows:
  • Step S510 Obtain the physical address information of each peripheral chip and the memory address information of the host.
  • the physical address information can be the physical starting address of the peripheral chip storing data.
  • the memory address information of the host refers to the storage space allocated for each peripheral chip during the aforementioned device enumeration process, which can include the starting address and length of the host's memory.
  • the memory address information of the host can specifically be the BAR2 address and length corresponding to each peripheral chip.
  • Step S520 Obtain the queue information of the queue corresponding to each peripheral chip.
  • It should be understood that in this example, data is transmitted by means of transceiver queues, so step S520 obtains the queue information of each queue.
  • In other implementations, data can also be transmitted in other ways, such as packets, in which case the corresponding information can be obtained according to the data transmission mode; examples are not given one by one here.
  • the queue information of the queue corresponding to the peripheral chip includes the queue's transceiver pointer, queue resource information, memory space information corresponding to the queue, etc., which is not specifically limited in this application.
  • step S520 may also associate each queue with a host thread. Simply put, if thread A is associated with queue A, then the data A processed by thread A can be sent through queue A.
  • Step S530 Determine the channel corresponding to each queue based on the channel information.
  • each channel is associated with a group of transceiver queues and bound to a group of transceiver threads.
  • Each peripheral chip can correspond to a transceiver thread, thereby determining the mapping relationship between chip memory and memory addresses.
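  • Taking steps S510 to S530 together, the result can be pictured as a binding table; the sketch below is a hedged illustration only, and every type and field name is an assumption rather than part of the disclosure:

```c
#include <stdint.h>

/* Hypothetical binding produced by steps S510-S530: one DMA channel is
 * associated with a group of transceiver queues and bound to a thread. */
struct chan_binding {
    unsigned chan_id;        /* from the DMA module's channel information (S530) */
    int      queue_ids[4];   /* group of transceiver queues on this channel (S520) */
    unsigned long thread_id; /* transceiver thread bound to this channel */
    uint64_t chip_phys;      /* physical start address of the chip's data (S510) */
    uint64_t host_bar2;      /* host-side BAR2 address for this chip (S510) */
    uint64_t len;            /* length of the mapped window */
};

/* A table of such bindings is, in effect, the mapping relationship between
 * chip memory and host memory addresses that later lookups consult. */
```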
  • when the host writes the data in the host's memory to the peripheral chip memory, it can determine the destination address for the source address based on the source address of the memory data and the mapping relationship, and then apply for a data channel.
  • the data channel includes a descriptor, which includes the above-mentioned source address and destination address, so that the host can write data into the chip memory of the peripheral chip through this data channel.
  • the above data channel may be a DMA data channel
  • the above descriptor may be a DMA descriptor.
  • the host can first determine, through the DMA driver, the DMA device to be used for this data write, then write the source address and destination address of the data into the DMA descriptor, then apply for a DMA channel through the descriptor, and associate the DMA channel with the transceiver thread corresponding to the data.
  • by steps S510 to S530 the data has already been associated with a transceiver thread and a transceiver queue; associating it with the DMA channel at this point ensures that the DMA module uses that DMA channel to transmit the data's transceiver queue.
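  • A hedged C sketch of preparing such a descriptor and requesting a channel follows; dma_desc, map_to_chip, request_channel, bind_channel_to_thread and submit are illustrative names, not a real driver API:

```c
#include <stdint.h>

/* Illustrative DMA descriptor carrying the source and destination addresses. */
struct dma_desc {
    uint64_t src;    /* source address in host memory */
    uint64_t dst;    /* destination address in chip memory (via the mapping) */
    uint32_t len;    /* transfer length in bytes */
    uint32_t flags;
};

extern uint64_t map_to_chip(uint64_t host_src);        /* mapping lookup, assumed */
extern int request_channel(const struct dma_desc *d);  /* returns a channel id */
extern void bind_channel_to_thread(int chan, unsigned long tid);
extern int submit(int chan, const struct dma_desc *d); /* kick off the DMA write */

/* Downlink write: host memory -> peripheral chip memory, no chip-side read. */
int dma_write_downlink(uint64_t src, uint32_t len, unsigned long tid)
{
    struct dma_desc d = {
        .src = src,
        .dst = map_to_chip(src),   /* destination derived from the mapping */
        .len = len,
    };
    int chan = request_channel(&d);
    if (chan < 0)
        return chan;
    bind_channel_to_thread(chan, tid); /* keep queue, thread and channel aligned */
    return submit(chan, &d);
}
```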
  • the peripheral chip can also configure the DMA module of the peripheral chip, so that the DMA module of the peripheral chip can write the data in the chip memory into the memory of the host.
  • when the peripheral chip configures its DMA module, it can first determine the mapping relationship between the chip memory and the memory address. In this way, when the peripheral chip initiates a data write request, the DMA module of the peripheral chip can write the data in the chip memory into the memory by DMA according to the stored mapping relationship.
  • the detailed step process of configuring the DMA module by the peripheral chip can be referred to the process of configuring the DMA module 121 by the RC 120 in the aforementioned content, which will not be repeated here.
  • the above-mentioned computing device may also include a pass-through card.
  • a plurality of peripheral chips 200 are disposed in the pass-through card.
  • the peripheral chips 200 are coupled with the host 100 through the bus 300 in a pass-through manner.
  • the system shown in FIG. 1 is the data processing system in the pass-through scenario.
  • the computing device includes a plug-in card with a PCIe switch, a plurality of peripheral chips are disposed in the plug-in card with the PCIe switch, and the peripheral chips are coupled to the host through the PCIe switch of the plug-in card with the PCIe switch.
  • the PCIe switch please refer to the description of the PCIe switch 510 in the embodiment of FIG. 2, and the details will not be repeated here.
  • the host 100 can use the DMA module 121 to write the downstream data in the memory 400 into the chip memory 240 of the peripheral chip 200 through DMA technology.
  • after the peripheral chip 200 configures the DMA module 211, the peripheral chip 200 can use the DMA module 211 to write the uplink data in the chip memory 240 into the host's memory 400 using DMA technology.
  • the specific implementation method can be referred to the embodiment in Figure 1 and will not be repeated here.
  • the host 100 can apply for a data channel first.
  • the data channel can include a descriptor, which carries the source address and destination address of the downlink data.
  • the host 100 can copy the downlink data to the peripheral chip 200 in DMA mode through the data channel, where the above-mentioned data channel may include a data channel extended by the PCIe switch 510.
  • Figure 6 is a schematic step flow diagram of a data processing method in the SW scenario provided by this application, wherein the step flow shown in Figure 6 is the step flow of data interaction between the host 100 and the EP11 in peripheral chip 1 in the data processing system shown in Figure 2.
  • the method may include the following steps:
  • Step 1 The memory controller 110 obtains a data processing request.
  • the data processing request includes downstream data.
  • the downstream data is used to indicate the data to be sent by the host 100 to the peripheral chip 200. For details of this step, please refer to step S310 in the embodiment of FIG. 3, which is not repeated here.
  • Step 2 The memory controller 110 stores the downlink data into the memory 400.
  • Step 3 The memory controller 110 determines the destination address for the source address of the downlink data based on the source address and the mapping relationship, and sends data channel application information to the DMA module 121.
  • the application information includes the source address and destination address of the downlink data.
  • Step 4 The DMA module 121 applies for a data channel, and copies the downstream data to the peripheral chip memory controller 220 in a DMA manner through the data channel.
  • the data channel includes a descriptor, and the descriptor includes the source address and destination address of the above-mentioned downlink data.
  • the above data channel may be a DMA data channel
  • the above descriptor may be a DMA descriptor.
  • the data channel applied for in step 4 may be the data channel extended by the PCIe switch 510.
  • Step 5 The peripheral chip memory controller 220 stores the downstream data in the chip memory 1 .
  • steps 1 to 5 are about the processing process of downlink data.
  • the processing process of uplink data will be explained below in combination with steps 6 to 9.
  • Step 6 The peripheral chip memory controller 220 obtains the uplink data in the chip memory 240.
  • the uplink data is used to indicate the data sent by the peripheral chip 200 to the host 100.
  • Step 7 The peripheral chip memory controller 220 sends data channel application information to the DMA module 211.
  • the application information includes the source address and destination address of the uplink data.
  • Step 8 The DMA module 211 applies for a data channel, through which the upstream data is copied to the memory controller 110 in a DMA manner.
  • the data channel includes a descriptor, which includes the source address and destination address of the upstream data.
  • Step 9 The memory controller 110 stores the uplink data in the memory 400.
  • the processing of downlink data is handed over to the DMA module 121 of the host 100, and the uplink data is processed by the DMA module 211 of the peripheral chip 200.
  • since the PCIe switch 510 does not need to perform DMA read and write operations, the DMA requirements on the PCIe switch 510 hardware are reduced; it may even have no DMA function at all and only needs to provide interface expansion, so that users do not need to consider the processing capability of the PCIe switch 510 when choosing one.
  • the optional range of the PCIe switch 510 increases.
  • developers do not need to separately develop and maintain DMA driver code for the PCIe switch 510, and the development and maintenance costs of the PCIe switch 510 are reduced.
  • the step flow when other EPs in peripheral chip 1, such as EP12, interact with the host 100, and the step flow when other peripheral chips, such as peripheral chip 2, interact with the host 100, are similar to steps 1 to 9 in Figure 6 and are not repeated here.
  • the data processing method in the pass-through scenario is similar to that shown in Figure 6, but in the pass-through scenario, the data channels used by the DMA module 121 of the host 100 and the DMA module 211 of the peripheral chip 200 do not include the extended data channels of the PCIe switch 510.
  • the examples will not be repeated.
  • the host stores the downstream data in the data processing request into the memory, and then copies the downstream data to the peripheral chip in DMA mode through the DMA module, and the peripheral chip obtains the upstream data in the chip memory.
  • the peripheral chip copies the uplink data to the host memory in DMA mode, so that the entire bandwidth of the peripheral chip can be used to process the uplink data, and there is no need to process the downlink data.
  • the downlink data is handed over to the host's DMA module for processing, so that in the single-chip scenario the data bandwidth of peripheral chips can be fully utilized.
  • the data processing bandwidth of a single peripheral chip is no longer limited by the DMA bandwidth.
  • the data processing bandwidth of pass-through cards and SW cards can also be fully utilized, so that the bandwidth of high-concurrency, multi-data-stream processing is no longer limited.
  • FIG. 7 is a schematic structural diagram of a host provided by the present application.
  • the host can be the host 100 in Figures 1 to 6.
  • the host can be applied to the data processing system shown in Figure 1 or Figure 2.
  • the data processing system can be deployed in a computing device, which includes a host 100, a memory 400, and a peripheral chip 200.
  • the host 100, the memory 400, and the peripheral chip 200 are coupled through a bus 300.
  • the host 100 may include an acquisition unit 710 , a storage unit 720 , a DMA unit 730 and a determination unit 740 .
  • the acquisition unit 710 is used to obtain the data processing request.
  • the data processing request includes downlink data.
  • the downlink data is used to indicate the data to be sent by the host to the peripheral chip;
  • the storage unit 720 is used by the host to store the downlink data into the memory;
  • the DMA unit 730 is used to copy downstream data to the peripheral chip using direct memory access DMA.
  • the host 100 in this embodiment of the present application can be implemented by a central processing unit (CPU), an application-specific integrated circuit (ASIC), or a programmable logic device (PLD).
  • the above PLD can be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • the root complex of host 100 supports DMA functionality.
  • the host includes a determining unit 740, which is used to determine the mapping relationship between the physical address of the chip memory and the physical address of the memory before the obtaining unit 710 obtains the data processing request; the DMA unit 730 is used to copy the downstream data to the peripheral chip using direct memory access DMA based on the mapping relationship.
  • the DMA unit 730 is used to obtain the source address of the downlink data, and determine the destination address of the source address according to the mapping relationship; the DMA unit 730 is used to apply for a data channel, the data channel includes a descriptor, and the descriptor includes the source address. and destination address; DMA unit 730, used to copy the downstream data to the peripheral chip in the direct memory access DMA mode through the data channel.
  • the computing device includes a pass-through card, a plurality of peripheral chips are disposed in the pass-through card, and the peripheral chips are coupled with the host through a bus in a pass-through manner.
  • the computing device includes a plug-in card with a PCIe switch, a plurality of peripheral chips are disposed in the plug-in card with the PCIe switch, and the peripheral chips are coupled to the host through the PCIe switch of the plug-in card with the PCIe switch.
  • the data lane includes a PCIe switch extended data lane.
  • the bus includes one or more of the Peripheral Component Interconnect Express (PCIe) bus, the unified bus (UB), the compute express link (CXL) bus, the cache coherent interconnect for accelerators (CCIX) bus, and the GenZ bus.
  • the peripheral chip includes one or more of a Peripheral Component Interconnect Express (PCIe) device chip, a memory card chip, a network card chip, a redundant array of independent disks (RAID) chip, and an accelerator card chip, where the accelerator card includes one or more of a graphics processing unit (GPU), a data processing unit (DPU), and a neural-network processing unit (NPU).
  • the host 100 may correspond to performing the methods described in the embodiments of the present application, and the above and other operations and/or functions of the various units in the host 100 are respectively intended to implement the corresponding processes of the methods shown in FIGS. 1 to 6, which are not repeated here for the sake of brevity.
  • the host provided by this application stores the downlink data in the data processing request into the memory, and then copies the downlink data to the peripheral chip in DMA mode through the DMA module, so that the entire bandwidth of the peripheral chip can be used to process the uplink data and it is no longer necessary to process downlink data.
  • the downlink data is handed over to the DMA module of the host for processing, so that in the single-chip scenario the data bandwidth of the peripheral chip can be fully utilized.
  • the data processing bandwidth of a single peripheral chip is no longer limited by the DMA bandwidth.
  • the data processing bandwidth of the pass-through card and SW card can also be fully utilized, so that the bandwidth of high-concurrency, multi-data-stream processing is no longer limited.
  • Embodiments of the present application provide a computer-readable storage medium. The computer-readable storage medium stores computer instructions; when the computer instructions are run on a computer, the computer is caused to execute the data processing method described in the above method embodiments.
  • Embodiments of the present application provide a computer program product containing instructions, including a computer program or instructions.
  • when the computer program or instructions are run on a computer, the computer is caused to execute the data processing method described in the above method embodiments.
  • Embodiments of the present application provide a processor, which can be installed on a computing device.
  • the computing device includes a processor, a memory, and a peripheral chip.
  • the processor, the memory, and the peripheral chip are coupled through a bus.
  • the processor can be the host 100 in the embodiments of Figures 1 and 2
  • the memory can be the memory 400 in the embodiments of Figures 1 and 2
  • the peripheral chip can be the peripheral chip 200 in the embodiments of Figures 1 and 2
  • the bus can be the bus 300 in the embodiments of Figures 1 and 2.
  • the processor can implement the corresponding processes of the host 100 in the various methods in Figures 1 to 6; for the sake of simplicity, details will not be described here.
  • Embodiments of the present application provide a computing device.
  • the computing device includes a host, memory and peripheral chips.
  • the host, memory and peripheral chips are coupled through a bus.
  • the host implements the corresponding processes of the host 100 in each method in Figures 1 to 6.
  • the peripheral chip implements the corresponding processes of the peripheral chip 200 in each method in Figures 1 to 6.
  • the data processing system 1000 shown in Figure 1 or the data processing system 1001 shown in Figure 2 can be deployed in the computing device; for the sake of brevity, no further details are given here.
  • the above embodiments are implemented in whole or in part by software, hardware, firmware or any other combination.
  • the above-described embodiments are implemented in whole or in part in the form of a computer program product.
  • a computer program product includes at least one computer instruction.
  • the computer is a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus.
  • Computer instructions are stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, computer instructions are transmitted from a website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, fiber optic cable, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
  • Computer-readable storage media are any usable media that can be accessed by a computer, or data storage nodes such as servers and data centers that contain at least one collection of usable media.
  • the usable media are magnetic media (for example, floppy disk, hard disk, tape), optical media (for example, high-density digital video disc (DVD)), or semiconductor media.
  • the semiconductor medium is a solid-state drive (SSD).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Bus Control (AREA)

Abstract

A data processing method, a host, and related devices. The method is applied to a computing device that includes a host, a memory, and a peripheral chip, where the host, the memory, and the peripheral chip are coupled through a bus. The method includes the following steps: the host obtains a data processing request, where the data processing request includes downlink data, and the downlink data indicates the data to be sent by the host to the peripheral chip; the host stores the downlink data into the memory; and the host copies the downlink data to the peripheral chip by direct memory access (DMA). In this way, the peripheral chip no longer needs to process downlink data, which is instead handled by the host, thereby reducing the processing pressure on the peripheral chip and increasing its processing bandwidth.

Description

Data processing method, host and related devices
This application claims priority to Chinese Patent Application No. 202210336383.4, filed with the China National Intellectual Property Administration on March 31, 2022 and entitled "Data processing method, host and related devices", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of computers, and in particular, to a data processing method, a host, and related devices.
Background
A large amount of data is usually exchanged between a processor and a peripheral chip. Typically, high-speed data exchange between the processor and the peripheral chip is implemented through direct memory access (DMA) technology: the peripheral chip performs DMA write operations to write data from the peripheral chip's memory into the host's memory, and performs DMA read operations to read data from the host's memory into the peripheral chip's memory.
However, the uplink and downlink bandwidth of the peripheral chip is usually limited by the DMA data bandwidth. For example, if the bandwidth of the peripheral chip is 200 MB/s but the DMA data bandwidth is 100 MB/s, the actual bandwidth of the peripheral chip can only reach 100 MB/s, which severely limits the peripheral chip's bandwidth and wastes its processing capability.
Summary
This application provides a data processing method, a host, and related devices, to solve the problem that the bandwidth of peripheral chips is limited and chip processing capability is wasted.
According to a first aspect, a data processing method is provided. The method is applied to a computing device that includes a host, a memory, and a peripheral chip, where the host, the memory, and the peripheral chip are coupled through a bus. The method includes the following steps: the host obtains a data processing request, where the data processing request includes downlink data, and the downlink data indicates the data to be sent by the host to the peripheral chip; the host stores the downlink data into the memory; and the host copies the downlink data to the peripheral chip by direct memory access (DMA).
By implementing the method described in the first aspect, the host stores the downlink data in the data processing request into the memory and then copies the downlink data to the peripheral chip by DMA. The peripheral chip no longer needs to process the downlink data, so the data bandwidth of the peripheral chip can be fully utilized, and the data processing bandwidth of the peripheral chip is no longer limited by the DMA bandwidth.
In specific implementations, the host may include at least one general-purpose processor, for example a CPU, an NPU, or a combination of a CPU and a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The memory is the host's memory, and may specifically be a volatile memory, for example random access memory (RAM), dynamic RAM (DRAM), static RAM (SRAM), synchronous dynamic RAM (SDRAM), double data rate RAM (DDR), cache, and so on. The memory may also include a combination of the above types, which is not limited in this application.
The peripheral chip includes one or more of a peripheral component interconnect express (PCIe) device chip, a memory card chip, a network card chip, a redundant array of independent disks (RAID) chip, and an accelerator card chip, where the accelerator card includes one or more of a graphics processing unit (GPU), a data processing unit (DPU), and a neural-network processing unit (NPU).
The bus includes one or more of a PCIe bus, a unified bus (UB), a compute express link (CXL) bus, a cache coherent interconnect for accelerators (CCIX) bus, and a GenZ bus.
In a possible implementation, the data processing method provided in this application can be applied to single-chip scenarios and multi-chip stacking scenarios, where a single-chip scenario refers to a scenario in which a single peripheral chip is coupled with the host through the bus, and a multi-chip stacking scenario refers to a scenario in which multiple peripheral chips are coupled with the host through the bus.
It should be understood that in high-concurrency, multi-data-stream scenarios, multiple data streams processed by one peripheral chip easily interfere with one another. In such scenarios, multiple single chips with weaker processing capability are therefore stacked into one device or module, with each single chip processing one data stream, thereby achieving physical isolation in high-concurrency, multi-data-stream scenarios. In the multi-chip stacking scenario, although the multiple peripheral chips are disposed in a pass-through card, each peripheral chip is coupled with the host through the bus.
Optionally, in the multi-chip stacking scenario, the computing device includes a pass-through card, multiple peripheral chips are disposed in the pass-through card, and the peripheral chips are coupled with the host through the bus in a pass-through manner.
Optionally, in the multi-chip stacking scenario, the computing device includes a plug-in card with a PCIe switch (SW). Multiple peripheral chips are disposed in the plug-in card with the PCIe switch, and the peripheral chips are coupled with the host through the PCIe switch of the plug-in card. Such a plug-in card may also be called an SW card.
It should be noted that, for ease of distinguishing the application scenarios of this application, the scenario in which multiple peripheral chips are disposed in a pass-through card and coupled with the host through the bus in a pass-through manner is hereinafter called the pass-through scenario, and the scenario in which multiple peripheral chips are disposed in a plug-in card with a PCIe switch and coupled with the host through the PCIe switch of the plug-in card is called the SW scenario.
In the above implementation, when the data processing bandwidth of a single peripheral chip is no longer limited by the DMA bandwidth, the data processing bandwidth of pass-through cards and SW cards stacked from multiple peripheral chips can also be fully utilized, so that the bandwidth for high-concurrency, multi-data-stream processing is no longer limited.
In a possible implementation, the peripheral chip includes a DMA module, and the method may further include the following steps: the peripheral chip obtains uplink data in the chip memory of the peripheral chip, where the uplink data indicates the data to be sent by the peripheral chip to the host; and the peripheral chip copies the uplink data to the memory by DMA.
For example, assume the processing capability of each peripheral chip can reach 100 MB/s. Since the peripheral chip performs DMA write operations and does not need to perform DMA read operations, the uplink bandwidth of the peripheral chip is 100 MB/s; DMA read operations are implemented by the RC 120 of the host 100, so the downlink bandwidth of the peripheral chip 200 is 100 MB/s. In this way, not only can both the uplink and downlink bandwidth of the peripheral chip be maximized in the single-chip scenario, but in the multi-chip stacking scenario the uplink and downlink bandwidth of both peripheral chips 200 can each reach 100 MB/s, and the data processing bandwidth of the chipset can reach the theoretical value of 200 MB/s. The processing capability of the peripheral chips can thus also be maximized in the multi-chip stacking scenario, and the bandwidth for high-concurrency, multi-data-stream processing is no longer limited by the DMA bandwidth.
In the above implementation, the peripheral chip copies the uplink data in the chip memory to the memory by DMA through the DMA module inside the peripheral chip, while the downlink data is handled by the host's DMA module, so the peripheral chip no longer needs to process downlink data. As a result, the data bandwidth of the peripheral chip can be fully utilized, and the data processing bandwidth of the peripheral chip is no longer limited by the DMA bandwidth.
In a possible implementation, the root complex (RC) of the host supports the DMA function. Specifically, the root complex of the host may include a DMA module, which may be a hardware module capable of implementing the DMA function and may specifically include a DMA controller, registers, and so on. The DMA module may be a DMA hardware unit integrated inside the host; in some embodiments, the DMA module may also be a logic circuit outside the host that can implement the DMA function, which is not specifically limited in this application. The driver of the DMA module may be an open-source kernel device driver provided by the CPU vendor (not limited to ARM, X86, Tianchi, or other CPU types).
It should be understood that DMA is a mature read/write technology, and current servers of any architecture are equipped with DMA hardware. This application develops drivers for the server's existing DMA hardware to enable it, and configures the DMA address space mapping in combination with the hardware, so that the downlink data in the memory can be transferred to the chip memory by DMA. The peripheral chip then no longer needs to process downlink data or perform DMA read operations, which reduces the processing pressure on the peripheral chip and increases its data processing bandwidth.
In the above implementation, the existing DMA module in the host is enabled so that it can process downlink data, thereby reducing the processing pressure on the peripheral chip. No additional hardware resources need to be deployed; the solution provided in this application can be implemented simply by upgrading the driver of the DMA module in the host, so the solution is highly feasible and easy to reproduce.
In a possible implementation, before the host obtains the data processing request, the method further includes the following steps: the host determines the mapping relationship between the physical addresses of the chip memory and the physical addresses of the memory, and the host copies the downlink data to the peripheral chip by DMA according to the mapping relationship.
Optionally, before the host obtains the data processing request, the mapping relationship between the addresses of the chip memory and the host's memory may be determined by configuring the host's DMA module. The configuration process may include device enumeration, driver initialization, and device configuration, where device enumeration refers to enumerating all peripheral chips coupled with the host through the bus to obtain the topology information of each peripheral chip; driver initialization refers to initializing the driver of the host's DMA module and determining the channel information of the host's DMA module; and device configuration refers to configuring the addresses of the DMA module and determining the mapping relationship between the addresses of the chip memory and the memory based on the above topology information and channel information.
The topology information is the bus topology generated when the peripheral chips are coupled with the host through the bus, and is used to describe the topology of the device system formed by the peripheral chips; it may specifically be a linked list of data structures, such as a PCI device tree. The topology information may also include the identity information of each peripheral chip, such as the peripheral chip's device ID (device_id), vendor ID (vendor_id), and the bus-device-function (BDF) code of the PCI device, which is not specifically limited in this application. The channel information may include the number of channels of the DMA module, the memory space occupied by each channel, the structure assignment information within each channel, and so on, which is not specifically limited in this application.
Optionally, the device enumeration process may be as follows: the host may use a depth-first search (DFS) algorithm to search, starting from the host's RC, for the peripheral chips and bridges connected to the RC, and assign BDF numbers to the peripheral chips and bridges found. Then, the base address register (BAR) space is read, mapping and access tests are performed on the BAR space, and PCI resources are allocated to each peripheral chip and bridge found. After resource allocation is completed, the above topology information, namely the PCI device tree, is obtained.
In specific implementations, after the device enumeration process and before driver initialization, the RC may, after obtaining the topology information of the peripheral chips, determine the driver corresponding to each peripheral chip through device scanning—for example, determining the network card driver corresponding to a network card, the sound card driver corresponding to a sound card, and so on. Specifically, as described above, the topology information of a peripheral chip may include the identity information of the peripheral chip, and the driver corresponding to each peripheral chip is determined by matching the identity information of the peripheral chip against the identity information registered by the drivers. For example, the vendor_id and device_id of the peripheral chip may be matched against the vendor_id and device_id registered by a driver, and when both are consistent, the driver corresponding to the peripheral chip is determined. It should be understood that since peripheral chips come in different numbers of types, the numbers of corresponding drivers also differ; determining the driver corresponding to each peripheral chip by identity matching after obtaining the topology information can therefore avoid problems such as initialization failures or write failures caused by driver mismatches when the drivers are initialized later.
It can be understood that after the RC configures the DMA module, when the host obtains a data processing request, the host's DMA module may first apply for a data channel. The data channel includes a descriptor, and the descriptor includes a source address and a destination address; the host copies the downlink data to the peripheral chip by direct memory access (DMA) through the data channel. Specifically, the DMA descriptor may be moved to the physical ring of the DMA hardware, enabling the DMA module to transfer data according to the DMA descriptor.
It should be noted that in the above multi-chip stacking scenario, if the peripheral chips are disposed in an SW card, the data channels include the data channels extended by the PCIe switch. For example, if 2 peripheral chips are disposed in a pass-through card, the number of data channels of the pass-through card is 2; if 2 peripheral chips are disposed in an SW card and the PCIe switch can extend the endpoint (EP) port of one peripheral chip into 2, the number of data channels of the SW card is 4, and the additional data channels are the data channels extended by the PCIe switch. The above example is for illustration and is not specifically limited in this application.
Similarly, the peripheral chip can also configure the DMA module on the peripheral chip so that the DMA module can write the uplink data in the chip memory into the host. When configuring its DMA module, the peripheral chip may first determine the mapping relationship between the addresses of the chip memory and the memory, so that when the peripheral chip processes uplink data, the DMA module can write the uplink data in the chip memory into the memory by DMA according to the stored mapping relationship. For the detailed steps of configuring the DMA module by the peripheral chip, refer to the process in which the host configures the DMA module in the host described above, which is not repeated here.
In the above implementation, drivers are developed for the server's existing DMA hardware to enable it, and the DMA address space mapping is configured in combination with the hardware, so that the downlink data in the memory can be written into the chip memory. The peripheral chip then no longer needs to process downlink data or perform DMA read operations, which reduces the processing pressure on the peripheral chip and increases its data processing bandwidth.
According to a second aspect, a host is provided. The host is applied to a computing device, the computing device includes the host, a memory, and a peripheral chip, and the host, the memory, and the peripheral chip are coupled through a bus. The host includes: an obtaining unit, configured to obtain a data processing request, where the data processing request includes downlink data, and the downlink data indicates the data to be sent by the host to the peripheral chip; a storage unit, configured for the host to store the downlink data into the memory; and a direct memory access (DMA) unit, configured to copy the downlink data to the peripheral chip by DMA.
By implementing the host described in the second aspect, the host stores the downlink data in the data processing request into the memory and then copies the downlink data to the peripheral chip by DMA. The peripheral chip no longer needs to process the downlink data, so the data bandwidth of the peripheral chip can be fully utilized, and the data processing bandwidth of the peripheral chip is no longer limited by the DMA bandwidth.
In a possible implementation, the root complex of the host supports the DMA function.
In a possible implementation, the host includes a determining unit, configured to determine, before the obtaining unit obtains the data processing request, the mapping relationship between the physical addresses of the chip memory and the physical addresses of the memory; and the DMA unit is configured to copy the downlink data to the peripheral chip by DMA according to the mapping relationship.
In a possible implementation, the DMA unit is configured to obtain the source address of the downlink data and determine the destination address for the source address according to the mapping relationship; the DMA unit is configured to apply for a data channel, where the data channel includes a descriptor, and the descriptor includes the source address and the destination address; and the DMA unit is configured to copy the downlink data to the peripheral chip by DMA through the data channel.
In a possible implementation, the computing device includes a pass-through card, multiple peripheral chips are disposed in the pass-through card, and the peripheral chips are coupled with the host through the bus in a pass-through manner.
In a possible implementation, the computing device includes a plug-in card with a PCIe switch, multiple peripheral chips are disposed in the plug-in card with the PCIe switch, and the peripheral chips are coupled with the host through the PCIe switch of the plug-in card.
In a possible implementation, the data channels include the data channels extended by the PCIe switch.
In a possible implementation, the bus includes one or more of a PCIe bus, a unified bus (UB), a compute express link (CXL) bus, a cache coherent interconnect for accelerators (CCIX) bus, and a GenZ bus.
In a possible implementation, the peripheral chip includes one or more of a PCIe device chip, a memory card chip, a network card chip, a RAID chip, and an accelerator card chip, where the accelerator card includes one or more of a GPU, a DPU, and an NPU.
According to a third aspect, a processor is provided. The processor is disposed in a computing device, the computing device includes the processor, a memory, and a peripheral chip, and the processor, the memory, and the peripheral chip are coupled through a bus. The processor is configured to perform the operation steps of the host in the method described in the first aspect.
According to a fourth aspect, a computing device is provided. The computing device includes a host, a memory, and a peripheral chip, and the host, the memory, and the peripheral chip are coupled through a bus. The host is configured to implement the operation steps of the host in the method described in the first aspect, and the peripheral chip is configured to implement the operation steps of the peripheral chip in the method described in the first aspect.
According to a fifth aspect, a readable storage medium is provided. The readable storage medium stores instructions that, when run on a host, cause the host to perform the method described in the first aspect.
On the basis of the implementations provided in the above aspects, this application may be further combined to provide more implementations.
Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of a data processing system provided in this application;
FIG. 2 is a schematic structural diagram of another data processing system provided in this application;
FIG. 3 is a schematic flowchart of the steps of a data processing method provided in this application;
FIG. 4 is a schematic flowchart of the steps of DMA driver initialization in a data processing method provided in this application;
FIG. 5 is a schematic flowchart of the steps of device configuration in a data processing method provided in this application;
FIG. 6 is a schematic flowchart of the steps of a data processing method in the SW scenario provided in this application;
FIG. 7 is a schematic structural diagram of a host provided in this application.
Detailed Description of Embodiments
First, the "single-chip" and "multi-chip stacking" application scenarios involved in this application are described.
Single chip refers to a data exchange scenario between a processor and a single peripheral chip, while multi-chip stacking refers to a scenario in which multiple single chips are stacked into one device or module, and the processor exchanges data with that device or module. It should be understood that in high-concurrency, multi-data-stream scenarios, multiple data streams processed by one peripheral chip easily interfere with one another. In such scenarios, multiple single chips with weaker processing capability are therefore stacked into one device or module, with each single chip processing one data stream, thereby achieving physical isolation in high-concurrency, multi-data-stream scenarios.
In single-chip and multi-chip stacking scenarios, a large amount of data is exchanged between the processor and the peripheral chips. Typically, high-speed data exchange between the processor and the peripheral chip is implemented through direct memory access (DMA) technology: the peripheral chip performs DMA write operations to write data from the peripheral chip's memory into the host's memory, and performs DMA read operations to read data from the host's memory into the peripheral chip's memory. DMA is a high-speed data transfer technology in which the data exchange between the processor and the peripheral chip is carried out by DMA hardware, thereby reducing the processing pressure on the processor and improving data transfer efficiency.
Taking a PCIe chip as an example, the host 100 is coupled through a root complex (RC) 110 with the chip of each end point (EP) (that is, the above peripheral chip) via the bus, where the RC is used to convert the processor's access transactions into access transactions on the PCIe bus. It should be understood that the PCIe bus exchanges information or transfers data in the form of packets; the RC is therefore responsible for generating corresponding packets according to the CPU's access transactions and transmitting them to the downstream peripheral chips, and likewise for receiving packets reported by the downstream peripheral chips and forwarding the information or data to the CPU according to the packet contents.
However, in the single-chip scenario, the uplink and downlink bandwidth of the peripheral chip is usually limited by the DMA data bandwidth. For example, if the bandwidth of the peripheral chip is 200 MB/s but the DMA data bandwidth is 100 MB/s, the actual bandwidth of the peripheral chip can only reach 100 MB/s, which severely limits the peripheral chip's bandwidth and wastes its processing capability.
In the multi-chip stacking scenario, the uplink and downlink bandwidth of the stacked peripheral chips is likewise limited by the DMA data bandwidth, so the data processing capability of each peripheral chip cannot be fully utilized and is wasted. For example, when multiple peripheral chips are stacked, assume the single-chip processing capability of each peripheral chip can reach 100 MB/s; theoretically, the processing capability of a PCIe chip stacked from two peripheral chips can reach 200 MB/s. However, since each peripheral chip needs to perform both DMA reads and DMA writes, its processing capability must be divided, for example 50 MB/s uplink and 50 MB/s downlink. The final achievable data bandwidth of the PCIe chip stacked from two peripheral chips is thus 100 MB/s uplink and 100 MB/s downlink, which means the peripheral chip can use only half of its data bandwidth when reading data and only half when writing data, wasting the peripheral chip's processing capability.
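Purely as an illustrative aside (not part of the original disclosure), the bandwidth accounting behind this example can be written out for $n$ stacked chips with per-chip capability $B$:

$$
B_{\text{up}} = B_{\text{down}} = \frac{B}{2}
\;\Rightarrow\;
n \cdot \frac{B}{2} = 2 \cdot \frac{100\ \text{MB/s}}{2} = 100\ \text{MB/s per direction},
$$

whereas if the host's DMA module takes over the downlink (DMA read) path, each chip keeps its full $B$ for the uplink and the two-chip stack reaches the theoretical $n \cdot B = 200$ MB/s aggregate.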
In summary, since high-speed data exchange between the processor and the peripheral chip requires DMA technology, the uplink and downlink bandwidth of the peripheral chip is usually limited by the DMA data bandwidth. As a result, in both single-chip and multi-chip stacking scenarios, the data bandwidth of the peripheral chip cannot be fully utilized, the peripheral chip's processing capability is wasted, and its uplink and downlink bandwidth is low.
To solve the above problem of low uplink and downlink bandwidth of peripheral chips in single-chip and multi-chip stacking scenarios, this application provides a data processing system. In this system, the downlink data is handed over to the host's DMA module for processing: the host's DMA module stores the downlink data into the memory and then copies the downlink data to the peripheral chip by DMA. The peripheral chip does not need to process the downlink data, which reduces the processing pressure on the peripheral chip and increases its uplink and downlink bandwidth.
The technical solutions in the embodiments of the present invention are described below with reference to the accompanying drawings in the embodiments of the present invention.
As shown in FIG. 1, FIG. 1 is a schematic structural diagram of a data processing system provided in this application. The data processing system 1000 includes a host 100, peripheral chips 200, a memory 400, and chip memories 240, where the host 100, the peripheral chips 200, the memory 400, and the chip memories 240 are coupled through a bus 300, and the number of peripheral chips 200 may be one or more.
The data processing system 1000 may be deployed on a computing device. The computing device may be a physical server, such as an X86 or ARM server, and may specifically be a single physical server or a node in a server cluster; the computing device may also be another storage device with storage functionality, such as a storage array or a storage server; the computing device may also be an edge server, which is not specifically limited in this application.
The host 100 may include at least one general-purpose processor, for example a CPU, an NPU, or a combination of a CPU and a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The peripheral chip 200 may be a system on chip (SoC). In specific implementations, the peripheral chip 200 may be any chip that the host 100 can couple with through the bus 300, such as a sound card, a network interface card (NIC), a universal serial bus (USB) card, an integrated drive electronics (IDE) interface card, a redundant arrays of independent disks (RAID) card, a video capture card, and so on, which is not specifically limited in this application.
It should be noted that in the single-chip scenario, the number of peripheral chips coupled with the host 100 through the bus may be 1, and in the multi-chip stacking scenario, the number of peripheral chips coupled with the host 100 through the bus may be multiple. For example, in the multi-chip stacking scenario, if the peripheral chips are PCIe device chips, multiple peripheral chips can be stacked into a PCIe card; if the peripheral chips are network card chips, multiple peripheral chips can be stacked into a stacked network card; if the peripheral chips are disk chips, multiple peripheral chips can be stacked into a redundant array of independent disks (RAID); and if the peripheral chips are processor chips, such as dedicated processor chips like a central processing unit (CPU), a graphics processing unit (GPU), a data processing unit (DPU), or a neural-network processing unit (NPU), multiple peripheral chips can be stacked into an acceleration component. The host 100 may use its processor to run the main service system and use the acceleration component to run other systems, such as a neural network training system or an image rendering system, which is not specifically limited in this application.
Optionally, in the multi-chip stacking scenario, one or more peripheral chips 200 may be disposed in a pass-through card, and the peripheral chips are coupled with the host 100 through the bus 300 in a pass-through manner. Taking FIG. 1 as an example, peripheral chip 1 and peripheral chip 2 may be disposed in the pass-through card, peripheral chip 1 is coupled with the host 100 through the bus, and peripheral chip 2 is coupled with the host 100 through the bus; the pass-through card may be the part outlined by the dashed box in FIG. 1. This combination is hereinafter called the "pass-through scenario".
The memory 400 is the memory of the host 100, and may specifically be a volatile memory, for example random access memory (RAM), dynamic RAM (DRAM), static RAM (SRAM), synchronous dynamic RAM (SDRAM), double data rate RAM (DDR), cache, and so on. The memory 400 may also include a combination of the above types, which is not limited in this application.
The chip memory 240 is the memory of the peripheral chip 200, and may be a memory module or memory chips plugged into the peripheral chip's interface. It may specifically be a volatile memory, such as RAM, DRAM, SRAM, SDRAM, DDR, cache, and so on. The chip memory 240 may also include a combination of the above types, which is not limited in this application.
The bus 300 may be a peripheral component interconnect express (PCIe) bus, an extended industry standard architecture (EISA) bus, a unified bus (Ubus or UB), a compute express link (CXL) bus, a cache coherent interconnect for accelerators (CCIX) bus, a GenZ bus, or the like, which is not specifically limited in this application.
It should be noted that, for better understanding of this application, the solution of this application is described below uniformly taking the chip 200 as a PCIe chip and the bus 300 as a PCIe bus.
Further, the host 100 and the peripheral chip 200 can be further divided into multiple unit modules, and FIG. 1 shows an exemplary division. As shown in FIG. 1, the host 100 may include a memory controller 110 and an RC 120, where the memory controller 110 and the RC 120 are coupled through a system bus 130. The peripheral chip 200 may include an end point (EP) port 210 and a peripheral chip memory controller 220, where the EP 210 and the peripheral chip memory controller 220 are coupled through a system bus 230, and the system bus 230 may also be coupled with the chip memory 240. Moreover, the host 100 and the peripheral chip 200 may include more unit modules; for example, the host 100 may also include a communication interface, a power supply, and so on, and the peripheral chip 200 may also include a communication interface, a power supply, and so on, which is not specifically limited in this application.
The memory controller 110 and the peripheral chip memory controller 220 may be hardware chips with processing functions. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof, and the PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. The memory controller 110 can execute various types of digital storage instructions, such as software or firmware programs stored in the memory 400, enabling the host 100 to provide a wide variety of services.
The system bus 130 and the system bus 230 may be a PCIe bus, an EISA bus, a UB bus, a CXL bus, a CCIX bus, a GenZ bus, or the like, which is not specifically limited in this application.
The RC 120 is connected to the EP 210 of the peripheral chip 200 through the bus 300, where the RC 120 is used to convert the access transactions of the host 100 into access transactions on the PCIe bus. It should be understood that the PCIe bus exchanges information or transfers data in the form of packets; the RC 120 is therefore responsible for generating corresponding packets according to the CPU's access transactions, or for processing received packets and forwarding the information or data they carry to the host 100.
In this embodiment of the application, the host 100 may obtain a first data request, where the first data request includes downlink data indicating the data to be sent by the host 100 to the peripheral chip 200. The host 100 stores the downlink data into the memory 400 and then copies the downlink data to the peripheral chip 200 by DMA. Similarly, the peripheral chip 200 may obtain uplink data in the chip memory 240, where the uplink data indicates the data to be sent by the peripheral chip 200 to the host 100, and then copy the uplink data to the memory 400 of the host 100 by DMA.
In specific implementations, when the host 100 copies the downlink data to the peripheral chip 200 by DMA, the copied downlink data may first be sent to the peripheral chip memory controller 220 via DMA technology, and the peripheral chip memory controller 220 then stores the downlink data into the chip memory 240. Similarly, when the peripheral chip 200 copies the uplink data to the host 100 by DMA, the copied uplink data may first be sent to the memory controller 110 via DMA technology, and the memory controller 110 then stores the uplink data into the memory 400.
It can be understood that in the solution provided in this application, the downlink data in the memory 400 is no longer read by the peripheral chip 200 via DMA, but is instead written into the chip memory 240 by the host 100 using DMA technology. In this way, the peripheral chip 200 no longer handles DMA read operations, which not only reduces the processing pressure on the peripheral chip 200, but also means its data processing bandwidth no longer needs to be divided, so the data processing bandwidth of the peripheral chip 200 is improved.
For example, assume the single-chip processing capability of each peripheral chip 200 in FIG. 1 can reach 100 MB/s. Since the peripheral chip 200 only needs to handle DMA write operations and does not need to handle DMA read operations—in other words, the peripheral chip 200 only needs to write the uplink data in the chip memory 240 into the memory 400 of the host 100 via DMA—the uplink bandwidth of the peripheral chip is 100 MB/s. DMA read operations are implemented by the RC 120 of the host 100, so the downlink bandwidth of the peripheral chip 200 is 100 MB/s. In this way, not only can both the uplink and downlink bandwidth of the peripheral chip be maximized in the single-chip scenario, but in the multi-chip stacking scenario the uplink and downlink bandwidth of both peripheral chips 200 can each reach 100 MB/s, and the data processing bandwidth of the chipset can reach the theoretical value of 200 MB/s. The processing capability of the peripheral chips can thus also be maximized in the multi-chip stacking scenario, and the bandwidth for high-concurrency, multi-data-stream processing is no longer limited by the DMA bandwidth.
In specific implementations, the RC 120 may include a DMA module 121, through which the host 100 can copy the downlink data to the peripheral chip. The DMA module 121 may be a hardware module capable of implementing the DMA function of the host 100, and may specifically include a DMA controller, registers, and so on; the function of writing data from the memory 400 into the chip memory 240 is implemented through the DMA module 121. The DMA module 121 may be installed by the RC 120 before data in the memory 400 is written into the chip memory 240 via DMA technology, and the driver of the DMA module 121 may be an open-source kernel device driver provided by the CPU vendor (not limited to ARM, X86, Tianchi, or other CPU types). The host can configure the DMA module 121 so that the DMA module 121 can write the downlink data in the memory 400 into the chip memory 240.
It should be noted that the DMA module 121 may be a DMA hardware unit integrated inside the host 100 as shown in FIG. 1. In some embodiments, the DMA module 121 may also be deployed outside the host 100; for example, if the host 100 is a CPU chip, the DMA module 121 may be a DMA hardware unit integrated inside the CPU, or a logic circuit outside the CPU that implements the DMA function, which is not specifically limited in this application.
It should be understood that DMA is a mature read/write technology, and current servers of any architecture are equipped with DMA hardware. This application develops drivers for the server's existing DMA hardware to enable it, and configures the DMA address space mapping in combination with the hardware, so that the downlink data in the memory 400 can be written into the chip memory 240. The peripheral chip 200 then no longer needs to process downlink data or perform DMA read operations, which reduces the processing pressure on the peripheral chip 200 and increases its data processing bandwidth.
In an embodiment, when configuring the DMA module 121, the RC 120 determines the mapping relationship between the addresses of the chip memory 240 and the memory 400. In this way, when the RC 120 obtains a data processing request, the DMA module 121 can write the downlink data in the memory 400 into the chip memory 240 of the peripheral chip 200 by DMA according to the stored mapping relationship.
Optionally, the RC 120 may determine the mapping relationship between the addresses of the chip memory 240 and the host's memory 400 by configuring the host's DMA module 121. The configuration process may include device enumeration, driver initialization, and device configuration, where device enumeration refers to enumerating the peripheral chips 200 to obtain the topology information of each peripheral chip 200; driver initialization refers to initializing the driver of the DMA module 121 of the host 100 and determining the channel information of the DMA module 121 of the host 100; and device configuration refers to configuring the addresses of the DMA module 121 and determining the mapping relationship between the addresses of the chip memory and the memory based on the above topology information and channel information.
The topology information is the bus topology generated when the peripheral chips 200 are coupled with the host 100 through the bus, and is used to describe the topology of the device system formed by the peripheral chips 200; it may specifically be a linked list of data structures, such as a PCI device tree. The topology information may also include the identity information of each peripheral chip 200, such as the peripheral chip 200's device ID (device_id), vendor ID (vendor_id), and the bus-device-function (BDF) code of the PCI device, which is not specifically limited in this application. The channel information may include the number of channels of the DMA module 121, the memory space occupied by each channel, the structure assignment information within each channel, and so on, which is not specifically limited in this application.
Optionally, the device enumeration process may be as follows: the RC 120 may use a depth-first search (DFS) algorithm to search, starting from the RC 120, for the peripheral chips 200 and bridges connected to the RC 120, and assign BDF numbers to the peripheral chips 200 and bridges found. Then, the base address register (BAR) space is read, mapping and access tests are performed on the BAR space, and PCI resources are allocated to each peripheral chip 200 and bridge found. After resource allocation is completed, the above topology information, namely the PCI device tree, is obtained.
In specific implementations, after the device enumeration process and before driver initialization, the RC 120 may, after obtaining the topology information of the peripheral chips 200, determine the driver corresponding to each peripheral chip 200 through device scanning—for example, determining the network card driver corresponding to a network card, the sound card driver corresponding to a sound card, and so on. Specifically, as described above, the topology information of a peripheral chip 200 may include its identity information, and the driver corresponding to each peripheral chip 200 is determined by matching the identity information of the peripheral chip 200 against the identity information registered by the drivers; for example, the vendor_id and device_id of the peripheral chip 200 may be matched against the vendor_id and device_id registered by a driver, and when both are consistent, the driver corresponding to the peripheral chip 200 is determined. It should be understood that since peripheral chips 200 come in different numbers of types, the numbers of corresponding drivers also differ; determining the driver corresponding to each peripheral chip 200 by identity matching after obtaining the topology information can therefore avoid problems such as initialization failures or write failures caused by driver mismatches when the drivers are initialized later.
In a possible implementation, after the RC 120 configures the DMA module 121, when the host 100 obtains a data processing request, the DMA module 121 may first apply for a DMA channel and configure a DMA descriptor for the DMA channel according to the above mapping relationship, where the DMA descriptor includes the source address and destination address of the data. The DMA descriptor is then moved to the physical ring of the DMA hardware, enabling the DMA module 121 to transfer data according to the DMA descriptor.
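By way of a hedged illustration only (the ring layout and doorbell register below are invented, not taken from the disclosure; real hardware defines its own formats), moving a descriptor onto a DMA hardware physical ring typically looks like:

```c
#include <stdint.h>

/* Invented descriptor and ring layout for illustration. */
struct dma_desc { uint64_t src, dst; uint32_t len, ctrl; };

struct dma_ring {
    struct dma_desc *slots;      /* physical ring shared with the DMA hardware */
    uint32_t depth;              /* number of slots, power of two */
    uint32_t head;               /* next slot the driver fills */
    volatile uint32_t *doorbell; /* MMIO register telling hardware new work exists */
};

/* Copy the configured descriptor into the ring and ring the doorbell,
 * enabling the DMA module to transfer data according to the descriptor. */
static int ring_submit(struct dma_ring *r, const struct dma_desc *d)
{
    uint32_t slot = r->head & (r->depth - 1);
    r->slots[slot] = *d;          /* "move the descriptor to the physical ring" */
    __sync_synchronize();         /* make the descriptor visible before the kick */
    r->head++;
    *r->doorbell = r->head;       /* hardware consumes up to the new head */
    return 0;
}
```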
It should be noted that if there are multiple DMA modules 121 in the host 100, when the RC 120 configures the multiple DMA modules 121 according to the above process, it can determine the DMA module corresponding to each application. When the RC 120 initiates a data write request, it can first determine the DMA module corresponding to the application and then use that DMA module to apply for a DMA channel, which is not elaborated here.
Similarly, the peripheral chip 200 may include a DMA module 211. The peripheral chip 200 may obtain the uplink data in the chip memory 240, where the uplink data indicates the data to be sent by the peripheral chip to the host 100, and then copy the uplink data to the memory 400 by DMA.
In specific implementations, the peripheral chip 200 can also configure the DMA module 211 so that the DMA module 211 can write the uplink data in the chip memory 240 into the memory 400 of the host 100. When configuring the DMA module 211, the peripheral chip 200 may first determine the mapping relationship between the addresses of the chip memory 240 and the memory 400, so that when the peripheral chip 200 initiates a data write request, the DMA module 211 can write the data in the chip memory 240 into the memory 400 by DMA according to the stored mapping relationship. For the detailed steps of configuring the DMA module 211 by the peripheral chip 200, refer to the process in which the RC 120 configures the DMA module 121 described above, which is not repeated here.
In an embodiment, the computing device in which the data processing system resides may include a plug-in card with a PCIe switch. The above one or more peripheral chips 200 may be disposed in the plug-in card with the PCIe switch (SW), and the peripheral chips 200 are coupled with the host through the PCIe switch of the plug-in card. This application scenario may be called the SW scenario.
By way of example, as shown in FIG. 2, FIG. 2 shows another data processing system 1001 provided in this application, where FIG. 2 is the data processing system 1001 in the above PCIe switch scenario (SW scenario for short) and FIG. 1 is the data processing system 1000 in the pass-through scenario. The data processing system 1001 includes a host 100, a memory 400, and a PCIe switch 510.
The PCIe switch 510 is provided on a plug-in card, and one or more peripheral chips 200 may be disposed in the plug-in card of the PCIe switch 510. The peripheral chips 200 are coupled with the host 100 through the PCIe switch 510, and the plug-in card may be called an SW card, for example the SW card 500 shown in FIG. 2. It should be noted that in the example shown in FIG. 2, 2 peripheral chips 200 are coupled with the host 100 through the PCIe switch 510; in specific implementations there may be more or fewer peripheral chips 200, and this application does not specifically limit the number of peripheral chips.
The PCIe switch 510 is used to provide expansion or aggregation capability, allowing more peripheral chips 200 to connect to one PCIe interface of the host. The host 100 can be coupled with more peripheral chips 200 through the bus, and each peripheral chip 200 can have more data channels, thereby increasing the data processing bandwidth of the stacked card in the multi-chip stacking scenario. For example, in the pass-through scenario shown in FIG. 1 only 2 EPs 210 are coupled with the host 100 through the bus, whereas in the SW scenario shown in FIG. 2 there can be 4 EPs 210 coupled with the host through the PCIe switch, giving the whole data processing system 1001 higher bandwidth.
In this embodiment of the application, after the host 100 configures the DMA module 121 through the RC 120, the host 100 can use the DMA module 121 to write the downlink data in the memory 400 into the chip memory 240 of the peripheral chip 200 using DMA technology; after the peripheral chip 200 configures the DMA module 211, the peripheral chip 200 can use the DMA module 211 to write the uplink data in the chip memory 240 into the memory 400 of the host using DMA technology. For specific implementations, refer to the embodiment of FIG. 1, which is not repeated here.
In specific implementations, the host 100 may first apply for a data channel, where the data channel may include a descriptor carrying the source address and destination address of the downlink data. The host 100 can copy the downlink data to the peripheral chip 200 by DMA through the data channel, where the data channel may include a data channel extended by the PCIe switch 510.
It should be noted that the PCIe switch 510 in FIG. 2 extends 1 DMA data channel; in some embodiments, the PCIe switch 510 may extend more DMA data channels for more peripheral chips 200 to use. Moreover, one DMA module 211 is deployed in each EP 210 of the peripheral chip 200; in specific implementations, multiple EPs 210 in the peripheral chip 200 may share one DMA module 211, or there may be multiple DMA modules 211 in one EP 210, which is not specifically limited in this application.
It can be understood that without the solution provided in this application, the PCIe switch would need to deploy both DMA read and DMA write functions and process both uplink and downlink data, which requires the PCIe switch to have high DMA processing capability and to match the processing capabilities of the host 100 and the peripheral chips 200. With the technical solution provided in this application, the processing of downlink data is handed over to the DMA module 121 of the host 100, and the uplink data is handled by the DMA module 211 of the peripheral chip 200. Since the PCIe switch 510 does not need to perform DMA read/write operations, the DMA requirements on the PCIe switch 510 hardware are reduced—it may even have no DMA function at all and only needs to provide interface expansion. Users therefore do not need to consider the processing capability of the PCIe switch 510 when selecting one, and the range of selectable PCIe switches 510 increases; at the same time, developers do not need to separately develop and maintain DMA driver code for the PCIe switch 510, reducing its development and maintenance costs.
In summary, in the data processing system provided in this application, the host stores the downlink data in the data processing request into the memory and then copies the downlink data to the peripheral chip by DMA through the DMA module, while the peripheral chip obtains the uplink data in the chip memory and copies the uplink data to the host memory by DMA. As a result, the entire bandwidth of the peripheral chip can be used to process uplink data, downlink data no longer needs to be processed by the peripheral chip, and the downlink data is handed over to the host's DMA module. This not only allows the data bandwidth of the peripheral chip to be fully utilized in the single-chip scenario, so that the data processing bandwidth of a single peripheral chip is no longer limited by the DMA bandwidth, but also allows the data processing bandwidth of pass-through cards and SW cards to be fully utilized in the multi-chip stacking scenario, so that the bandwidth for high-concurrency, multi-data-stream processing is no longer limited.
FIG. 3 shows a data processing method provided in this application. The method can be applied to the data processing system 1000 or the data processing system 1001 shown in FIG. 1 or FIG. 2, and the data processing system 1000 or 1001 may be deployed on a computing device. The computing device may include a host 100, a memory 400, and a peripheral chip 200. As shown in FIG. 3, the method may include the following steps:
Step S310: The host 100 obtains a data processing request, where the data processing request includes downlink data indicating the data to be sent by the host 100 to the peripheral chip 200. This step may be implemented by the memory controller 110 in FIG. 1 or FIG. 2.
Step S320: The host 100 stores the downlink data into the memory 400. This step may be implemented by the memory controller 110 in FIG. 1 or FIG. 2.
Step S330: The host 100 copies the downlink data to the peripheral chip 200 by direct memory access (DMA). This step may be implemented by the DMA module 121 in FIG. 1 or FIG. 2.
In an embodiment, the root complex RC 120 of the host 100 supports the DMA function, and the host 100 writes data into the peripheral chip 200 using DMA technology; for a description of DMA technology, refer to the descriptions in the embodiments of FIG. 1 and FIG. 2, which are not repeated here. It can be understood that because the host 100 copies the downlink data to the peripheral chip 200 by DMA, the peripheral chip 200 no longer needs to perform DMA read operations when it needs to read the data in the host's memory 400, which reduces the processing pressure on the peripheral chip 200 and increases its data processing bandwidth.
In an embodiment, the host 100 and the peripheral chip 200 are coupled through the bus 300, where the bus 300 includes one or more of a PCIe bus, a UB bus, a CXL bus, a CCIX bus, and a GenZ bus. The peripheral chip includes one or more of a PCIe device chip, a memory card chip, a network card chip, a RAID chip, and an accelerator card chip, where the accelerator card includes one or more of a GPU, a DPU, and an NPU. For descriptions of the forms of the host 100, the bus 300, and the peripheral chip 200, refer to the descriptions in the embodiments of FIG. 1 and FIG. 2, which are not repeated here.
In an embodiment, the peripheral chip 200 includes a DMA module 211. The peripheral chip 200 may obtain the uplink data in the chip memory 240, where the uplink data indicates the data to be sent by the peripheral chip 200 to the host 100, and copy the uplink data to the memory 400 by DMA through the DMA module 211.
In an embodiment, before the host 100 copies the downlink data in the host's memory 400 to the peripheral chip 200 by direct memory access (DMA), the method may further include the following step: the host 100 determines the mapping relationship between the physical addresses of the chip memory and the host's memory addresses. In this way, the host can write the data in the host's memory into the chip memory of the peripheral chip according to the mapping relationship.
In specific implementations, the host may include an RC, and the RC is connected to the EP port of the chip through the bus. The RC is used to convert the host's access transactions into access transactions on the PCIe bus. It should be understood that the PCIe bus exchanges information or transfers data in the form of packets; the RC is therefore responsible for generating corresponding packets according to the CPU's access transactions, or for processing received packets and forwarding the information or data they carry to the processor. In this embodiment of the application, the RC may be used to determine the mapping relationship between the addresses of the chip memory and the host's memory.
Optionally, the RC may determine the mapping relationship between the addresses of the chip memory and the host's memory by configuring the host's DMA module. The configuration process may include device enumeration, driver initialization, and device configuration, where device enumeration refers to enumerating the peripheral chips to obtain the topology information of each peripheral chip; driver initialization refers to initializing the driver of the host's DMA module and determining the channel information of the host's DMA module; and device configuration refers to configuring the addresses of the DMA module and determining the mapping relationship between the addresses of the chip memory and the memory based on the above topology information and channel information.
The topology information is the bus topology generated when the multiple peripheral chips are coupled with the host through the bus, and is used to describe the topology of the device system formed by the multiple peripheral chips; it may specifically be a linked list of data structures, such as a PCI device tree. The topology information may also include the identity information of each peripheral chip, such as the peripheral chip's device_id, vendor_id, BDF code, and so on, which is not specifically limited in this application. The channel information may include the number of channels of the DMA module, the memory space occupied by each channel, the structure assignment information within each channel, and so on, which is not specifically limited in this application.
In specific implementations, the specific process in which the RC enumerates the peripheral chips and obtains the topology information of each peripheral chip may be as follows: the host's RC may use a DFS algorithm to search, starting from the RC, for the peripheral chips and bridges connected to the RC, and assign BDF numbers to the peripheral chips and bridges found. Then, the BAR space is read, mapping and access tests are performed on the BAR space, and PCI resources are allocated to each peripheral chip and bridge found. After resource allocation is completed, the above topology information, namely the PCI device tree, is obtained.
In specific implementations, after the RC enumerates the peripheral chips and before driver initialization, the driver corresponding to each peripheral chip 200 is determined through device scanning—for example, determining the network card driver corresponding to a network card, the sound card driver corresponding to a sound card, and so on. Specifically, as described above, the topology information of the peripheral chip 200 may include its identity information, and the driver corresponding to each peripheral chip 200 is determined by matching the identity information of the peripheral chip 200 against the identity information registered by the drivers; for example, the vendor_id and device_id of the peripheral chip 200 may be matched against the vendor_id and device_id registered by a driver, and when both are consistent, the driver corresponding to the peripheral chip 200 is determined. It should be understood that since peripheral chips 200 come in different numbers of types, the numbers of corresponding drivers also differ; determining the driver corresponding to each peripheral chip 200 by identity matching after obtaining the topology information can therefore avoid problems such as initialization failures or write failures caused by driver mismatches when the drivers are initialized later.
It should be understood that after performing device enumeration and device scanning on the peripheral chips, the RC also needs to initialize the driver of the DMA module. As shown in FIG. 4, FIG. 4 is a schematic flowchart of the steps of DMA driver initialization provided in this application. The specific process in which the RC initializes the driver of the host's DMA module according to the topology information of each device and determines the channel information of the host's DMA module may be as follows:
S410: Obtain the identity information of the DMA module.
The DMA module here may be the DMA module 121 in the embodiments of FIG. 1 and FIG. 2.
In specific implementations, the identity information of the DMA module may be the BDF number of the DMA module, to facilitate locating the DMA module and collecting its statistics and status in subsequent processing. In specific implementations, the BDF number of the DMA module may be recorded in the log space. It should be understood that the device enumeration process before step S410 does not enumerate the DMA module, because the DMA module is a DMA hardware device inside the host, so the identity information of the DMA module needs to be obtained through step S410.
Optionally, after the identity information of the DMA module is obtained, the DMA driver data pointer may be set as a private device pointer. It should be understood that the device pointer at DMA driver initialization is a public pointer; after the DMA module's pointer is set to private, the DMA module can be dedicated to the multiple peripheral chips obtained by the device enumeration before step S410.
S420: Perform PCIe configuration on the DMA module.
In specific implementations, PCIe configuration can be performed on the DMA module through the set functions in the DMA driver code or the kernel's system functions. Specifically, the configuration content includes configuring memory addresses for the DMA module's space and other PCIe-related configuration, which is not specifically limited in this application.
S430: Obtain the channel information of the DMA module.
In specific implementations, the channel information may include the number of available channels of the DMA module, and corresponding memory space is then requested for each channel. For example, if the structure size of each channel is A and the number of channels is B, a memory space of size A × B can be requested in step S430. It should be understood that the above example is for illustration and is not specifically limited in this application.
Optionally, the channel information may also include the assignment of each channel's structure. Simply put, after the corresponding memory is requested for each channel, the structures in the channel can be assigned valid values as an assignment initialization operation; the values here can be determined according to the actual service environment, and are not specifically limited in this application.
S440: The DMA module can be switch-enabled so that its DMA function is turned on. This may specifically include configuring the status of the DMA module, configuring the transmit/receive mode, and so on, and may also include the configuration of other DMA-related functions, which are not exemplified one by one here.
It should be noted that the number of DMA modules in the host is usually one or more. When there are multiple DMA modules, each DMA module can be configured according to the descriptions of the above steps S410 to S430 and their optional steps, which is not repeated here.
As a possible implementation, after the RC performs device enumeration, device scanning, and driver initialization on the peripheral chips, device configuration is also needed. As shown in FIG. 5, FIG. 5 is a schematic flowchart of the steps of device configuration provided in this application. As shown in FIG. 5, the specific steps of configuring the addresses of the DMA module and determining the mapping relationship between the addresses of the chip memory and the memory based on the above topology information and channel information may be as follows:
Step S510: Obtain the physical address information of each peripheral chip and the memory address information of the host.
In specific implementations, the physical address information may be the physical start address at which the peripheral chip stores data. The memory address information of the host refers to the storage space allocated for each peripheral chip during the aforementioned device enumeration process, and may include the start address and length of the host's memory; specifically, it may be the BAR2 address and length corresponding to each peripheral chip.
Step S520: Obtain the queue information of the queue corresponding to each peripheral chip.
It should be understood that in the example described in the embodiment of FIG. 5, data is transmitted by means of transmit/receive queues, so step S520 obtains the queue information of each queue. In other implementations, data may also be transmitted in other ways, such as packets, and step S520 can obtain the corresponding information according to the data transmission mode, which is not exemplified one by one here.
In specific implementations, the queue information of the queue corresponding to a peripheral chip may include the queue's transmit/receive pointers, queue resource information, the memory space corresponding to the queue, and so on, which is not specifically limited in this application. Optionally, step S520 may also associate each queue with a host thread. Simply put, if thread A is associated with queue A, the data A processed by thread A can be sent through queue A.
Step S530: Determine the channel corresponding to each queue based on the channel information.
In specific implementations, each channel is associated with a group of transmit/receive queues and bound to a group of transmit/receive threads, and each peripheral chip can correspond to one transmit/receive thread, thereby determining the mapping relationship between the addresses of the chip memory and the memory.
In an embodiment, when the host writes the data in the host's memory into the peripheral chip's memory, it can determine the destination address for the source address of the memory data according to the source address and the mapping relationship, and then apply for a data channel, where the data channel includes a descriptor containing the above source address and destination address. In this way, the host can write the data into the chip memory of the peripheral chip through this data channel. In specific implementations, the above data channel may be a DMA data channel, and the above descriptor may be a DMA descriptor.
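As an illustrative sketch (the names are invented, not from the disclosure), determining the destination address for a source address via the stored mapping could be a simple range lookup:

```c
#include <stdint.h>
#include <stddef.h>

/* One mapping entry: a host-memory window and the chip-memory window
 * (e.g., a BAR2 region) it corresponds to. */
struct map_entry {
    uint64_t host_base;  /* start of the host memory range */
    uint64_t chip_base;  /* start of the corresponding chip memory range */
    uint64_t len;
};

/* Translate a source address in host memory to the destination address
 * in chip memory using the mapping determined during device configuration. */
int lookup_dest(const struct map_entry *map, size_t n,
                uint64_t src, uint64_t *dst)
{
    for (size_t i = 0; i < n; i++) {
        if (src >= map[i].host_base && src < map[i].host_base + map[i].len) {
            *dst = map[i].chip_base + (src - map[i].host_base);
            return 0;
        }
    }
    return -1; /* source address not covered by the mapping */
}
```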
In specific implementations, if there are multiple DMA modules in the host, the host may first determine, through the DMA driver, the DMA device to be used for this data write, then write the source address and destination address of the data into a DMA descriptor, then apply for a DMA channel through the descriptor, and associate the DMA channel with the transmit/receive thread corresponding to the data. By steps S510 to S530 the data has already been associated with a transmit/receive thread and a transmit/receive queue; associating it with the DMA channel at this point ensures that the DMA module uses this DMA channel to transmit the data's transmit/receive queue.
Similarly, the peripheral chip can also configure the peripheral chip's DMA module so that it can write the data in the chip memory into the host's memory. When configuring its DMA module, the peripheral chip may first determine the mapping relationship between the addresses of the chip memory and the memory, so that when the peripheral chip initiates a data write request, the peripheral chip's DMA module can write the data in the chip memory into the memory by DMA according to the stored mapping relationship. For the detailed steps of configuring the DMA module by the peripheral chip, refer to the process in which the RC 120 configures the DMA module 121 described above, which is not repeated here.
In an embodiment, the above computing device may further include a pass-through card. Multiple peripheral chips 200 are disposed in the pass-through card, and the peripheral chips 200 are coupled with the host 100 through the bus 300 in a pass-through manner; for example, the data processing system shown in FIG. 1 is the data processing system in the pass-through scenario.
In an embodiment, the computing device includes a plug-in card with a PCIe switch. Multiple peripheral chips are disposed in the plug-in card with the PCIe switch, and the peripheral chips are coupled with the host through the PCIe switch of the plug-in card. For a description of the PCIe switch, refer to the description of the PCIe switch 510 in the embodiment of FIG. 2, which is not repeated here.
In specific implementations, after the host 100 configures the DMA module 121 through the RC 120, the host 100 can use the DMA module 121 to write the downlink data in the memory 400 into the chip memory 240 of the peripheral chip 200 using DMA technology; after the peripheral chip 200 configures the DMA module 211, the peripheral chip 200 can use the DMA module 211 to write the uplink data in the chip memory 240 into the memory 400 of the host using DMA technology. For specific implementations, refer to the embodiment of FIG. 1, which is not repeated here.
Optionally, the host 100 may first apply for a data channel, where the data channel may include a descriptor carrying the source address and destination address of the downlink data. The host 100 can copy the downlink data to the peripheral chip 200 by DMA through the data channel, where the data channel may include a data channel extended by the PCIe switch 510.
For better understanding of this application, as shown in FIG. 6, FIG. 6 is a schematic flowchart of the steps of a data processing method in the SW scenario provided in this application, where the step flow shown in FIG. 6 is the step flow of data interaction between the host 100 and the EP11 in peripheral chip 1 in the data processing system shown in FIG. 2. As shown in FIG. 6, the method may include the following steps:
Step 1. The memory controller 110 obtains a data processing request, where the data processing request includes downlink data indicating the data to be sent by the host 100 to the peripheral chip 200. For details of this step, refer to step S310 in the embodiment of FIG. 3, which is not repeated here.
Step 2. The memory controller 110 stores the downlink data into the memory 400.
Step 3. The memory controller 110 determines the destination address for the source address of the downlink data based on the source address and the mapping relationship, and sends data channel application information to the DMA module 121, where the application information includes the source address and destination address of the downlink data.
Step 4. The DMA module 121 applies for a data channel and copies the downlink data to the peripheral chip memory controller 220 by DMA through the data channel. The data channel includes a descriptor, and the descriptor includes the source address and destination address of the above downlink data.
In specific implementations, the above data channel may be a DMA data channel, and the above descriptor may be a DMA descriptor. In the SW scenario shown in FIG. 2, the data channel applied for in step 4 may be a data channel extended by the PCIe switch 510.
Step 5. The peripheral chip memory controller 220 stores the downlink data in chip memory 1.
The above steps 1 to 5 concern the processing of downlink data; the processing of uplink data is explained below with reference to steps 6 to 9.
Step 6. The peripheral chip memory controller 220 obtains the uplink data in the chip memory 240, where the uplink data indicates the data to be sent by the peripheral chip 200 to the host 100.
Step 7. The peripheral chip memory controller 220 sends data channel application information to the DMA module 211, where the application information includes the source address and destination address of the uplink data.
Step 8. The DMA module 211 applies for a data channel and copies the above uplink data to the memory controller 110 by DMA through the data channel. The data channel includes a descriptor, and the descriptor includes the source address and destination address of the above uplink data.
Step 9. The memory controller 110 stores the uplink data in the memory 400.
It can be understood that in this application the processing of downlink data is handed over to the DMA module 121 of the host 100, and the uplink data is handled by the DMA module 211 of the peripheral chip 200. Since the PCIe switch 510 does not need to perform DMA read/write operations, the DMA requirements on the PCIe switch 510 hardware are reduced—it may even have no DMA function at all and only needs to provide interface expansion. Users therefore do not need to consider the processing capability of the PCIe switch 510 when selecting one, and the range of selectable PCIe switches 510 increases; at the same time, developers do not need to separately develop and maintain DMA driver code for the PCIe switch 510, reducing its development and maintenance costs.
It should be noted that the step flow when other EPs in peripheral chip 1, such as EP12, interact with the host 100, and the step flow when other peripheral chips, such as peripheral chip 2, interact with the host 100, are similar to steps 1 to 9 in FIG. 6 and are not repeated here. The data processing method in the pass-through scenario is similar to that shown in FIG. 6, except that in the pass-through scenario the data channels used by the DMA module 121 of the host 100 and the DMA module 211 of the peripheral chip 200 do not include the data channels extended by the PCIe switch 510; the examples are not repeated here.
In summary, in the data processing system provided in this application, the host stores the downlink data in the data processing request into the memory and then copies the downlink data to the peripheral chip by DMA through the DMA module, while the peripheral chip obtains the uplink data in the chip memory and copies the uplink data to the host memory by DMA. As a result, the entire bandwidth of the peripheral chip can be used to process uplink data, downlink data no longer needs to be processed by the peripheral chip, and the downlink data is handed over to the host's DMA module. This not only allows the data bandwidth of the peripheral chip to be fully utilized in the single-chip scenario, so that the data processing bandwidth of a single peripheral chip is no longer limited by the DMA bandwidth, but also allows the data processing bandwidth of pass-through cards and SW cards to be fully utilized in the multi-chip stacking scenario, so that the bandwidth for high-concurrency, multi-data-stream processing is no longer limited.
The foregoing describes, with reference to FIG. 1 to FIG. 6, a data processing method provided in this application. Next, a host and a data processing system provided in this application are introduced with reference to FIG. 7, FIG. 1, and FIG. 2.
FIG. 7 is a schematic structural diagram of a host provided in this application. The host may be the host 100 in FIG. 1 to FIG. 6 and may be applied to the data processing system shown in FIG. 1 or FIG. 2. The data processing system may be deployed on a computing device that includes a host 100, a memory 400, and a peripheral chip 200, where the host 100, the memory 400, and the peripheral chip 200 are coupled through a bus 300. As shown in FIG. 7, the host 100 may include an obtaining unit 710, a storage unit 720, a DMA unit 730, and a determining unit 740.
The obtaining unit 710 is configured to obtain a data processing request, where the data processing request includes downlink data indicating the data to be sent by the host to the peripheral chip; the storage unit 720 is configured for the host to store the downlink data into the memory; and the DMA unit 730 is configured to copy the downlink data to the peripheral chip by direct memory access (DMA).
It should be understood that the host 100 in this embodiment of the application may be implemented by a central processing unit (CPU), an application-specific integrated circuit (ASIC), or a programmable logic device (PLD), where the above PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. When the data processing methods shown in FIG. 3 to FIG. 5 are implemented by software, the host 100 and its modules may also be software modules.
In an embodiment, the root complex of the host 100 supports the DMA function.
In an embodiment, the host includes a determining unit 740, configured to determine, before the obtaining unit 710 obtains the data processing request, the mapping relationship between the physical addresses of the chip memory and the physical addresses of the memory; the DMA unit 730 is configured to copy the downlink data to the peripheral chip by DMA according to the mapping relationship.
In an embodiment, the DMA unit 730 is configured to obtain the source address of the downlink data and determine the destination address for the source address according to the mapping relationship; the DMA unit 730 is configured to apply for a data channel, where the data channel includes a descriptor, and the descriptor includes the source address and the destination address; and the DMA unit 730 is configured to copy the downlink data to the peripheral chip by DMA through the data channel.
In an embodiment, the computing device includes a pass-through card, multiple peripheral chips are disposed in the pass-through card, and the peripheral chips are coupled with the host through the bus in a pass-through manner. For details, refer to the description of the pass-through scenario in the embodiment of FIG. 1, which is not repeated here.
In an embodiment, the computing device includes a plug-in card with a PCIe switch, multiple peripheral chips are disposed in the plug-in card with the PCIe switch, and the peripheral chips are coupled with the host through the PCIe switch of the plug-in card. For details, refer to the description of the SW scenario in the embodiment of FIG. 2, which is not repeated here.
In an embodiment, the data channels include the data channels extended by the PCIe switch.
In an embodiment, the bus includes one or more of a PCIe bus, a unified bus (UB), a compute express link (CXL) bus, a cache coherent interconnect for accelerators (CCIX) bus, and a GenZ bus.
In an embodiment, the peripheral chip includes one or more of a PCIe device chip, a memory card chip, a network card chip, a redundant array of independent disks (RAID) chip, and an accelerator card chip, where the accelerator card includes one or more of a graphics processing unit (GPU), a data processing unit (DPU), and a neural-network processing unit (NPU).
The host 100 according to this embodiment of the application may correspond to performing the methods described in the embodiments of this application, and the above and other operations and/or functions of the units in the host 100 are respectively intended to implement the corresponding flows of the methods in FIG. 1 to FIG. 6; for brevity, details are not repeated here.
In summary, the host provided in this application stores the downlink data in the data processing request into the memory and then copies the downlink data to the peripheral chip by DMA through the DMA module, so that the entire bandwidth of the peripheral chip can be used to process uplink data and downlink data no longer needs to be processed by the peripheral chip; the downlink data is handed over to the host's DMA module. This not only allows the data bandwidth of the peripheral chip to be fully utilized in the single-chip scenario, so that the data processing bandwidth of a single peripheral chip is no longer limited by the DMA bandwidth, but also allows the data processing bandwidth of pass-through cards and SW cards to be fully utilized in the multi-chip stacking scenario, so that the bandwidth for high-concurrency, multi-data-stream processing is no longer limited.
An embodiment of this application provides a computer-readable storage medium. Computer instructions are stored in the computer-readable storage medium; when the computer instructions are run on a computer, the computer is caused to perform the data processing method described in the above method embodiments.
An embodiment of this application provides a computer program product containing instructions, including a computer program or instructions; when the computer program or instructions are run on a computer, the computer is caused to perform the data processing method described in the above method embodiments.
An embodiment of this application provides a processor, which may be disposed in a computing device. The computing device includes the processor, a memory, and a peripheral chip, where the processor, the memory, and the peripheral chip are coupled through a bus. The processor may be the host 100 in the embodiments of FIG. 1 and FIG. 2, the memory may be the memory 400 in the embodiments of FIG. 1 and FIG. 2, the peripheral chip may be the peripheral chip 200 in the embodiments of FIG. 1 and FIG. 2, and the bus may be the bus 300 in the embodiments of FIG. 1 and FIG. 2. The processor can implement the corresponding flows of the host 100 in the methods in FIG. 1 to FIG. 6; for brevity, details are not repeated here.
An embodiment of this application provides a computing device. The computing device includes a host, a memory, and a peripheral chip coupled through a bus. The host implements the corresponding flows of the host 100 in the methods in FIG. 1 to FIG. 6, and the peripheral chip implements the corresponding flows of the peripheral chip 200 in the methods in FIG. 1 to FIG. 6. The data processing system 1000 shown in FIG. 1 or the data processing system 1001 shown in FIG. 2 may be deployed in the computing device; for brevity, details are not repeated here.
The above embodiments are implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, the above embodiments are implemented in whole or in part in the form of a computer program product. The computer program product includes at least one computer instruction. When the computer program instructions are loaded or executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer is a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions are stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions are transmitted from a website, computer, server, or data center to another website, computer, server, or data center by wired (for example, coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (for example, infrared, radio, microwave) means. The computer-readable storage medium is any usable medium accessible by a computer, or a data storage node such as a server or data center containing at least one collection of usable media. The usable medium is a magnetic medium (for example, floppy disk, hard disk, magnetic tape), an optical medium (for example, a high-density digital video disc (DVD)), or a semiconductor medium. The semiconductor medium is an SSD.
The above are merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or replacements within the technical scope disclosed by the present invention, and these modifications or replacements shall all fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (13)

  1. A data processing method, wherein the method is applied to a computing device, the computing device comprises a host, a memory and a peripheral chip, the host, the memory and the peripheral chip are coupled through a bus, and the method comprises:
    obtaining, by the host, a data processing request, wherein the data processing request comprises downlink data, and the downlink data is used to indicate data to be sent by the host to the peripheral chip;
    storing, by the host, the downlink data into the memory;
    copying, by the host, the downlink data to the peripheral chip by direct memory access (DMA).
  2. The method according to claim 1, wherein a root complex of the host supports a DMA function.
  3. The method according to claim 1 or 2, wherein the peripheral chip comprises a DMA module, and the method further comprises:
    obtaining, by the peripheral chip, uplink data in a chip memory of the peripheral chip, wherein the uplink data is used to indicate data to be sent by the peripheral chip to the host;
    copying, by the peripheral chip, the uplink data to the memory by DMA.
  4. The method according to claim 3, wherein before the host obtains the data processing request, the method further comprises:
    determining, by the host, a mapping relationship between physical addresses of the chip memory and physical addresses of the memory;
    and the copying, by the host, the downlink data to the peripheral chip by direct memory access (DMA) comprises:
    copying, by the host, the downlink data to the peripheral chip by direct memory access (DMA) according to the mapping relationship.
  5. The method according to claim 4, wherein the copying, by the host, the downlink data to the peripheral chip by direct memory access (DMA) according to the mapping relationship comprises:
    obtaining, by the host, a source address of the downlink data, and determining a destination address for the source address according to the mapping relationship;
    applying, by the host, for a data channel, wherein the data channel comprises a descriptor, and the descriptor comprises the source address and the destination address;
    copying, by the host, the downlink data to the peripheral chip by direct memory access (DMA) through the data channel.
  6. The method according to any one of claims 1 to 3, wherein the computing device comprises a pass-through card, a plurality of the peripheral chips are disposed in the pass-through card, and the peripheral chips are coupled with the host through the bus in a pass-through manner.
  7. The method according to any one of claims 1 to 6, wherein the computing device comprises a plug-in card with a PCIe switch, a plurality of the peripheral chips are disposed in the plug-in card with the PCIe switch, and the peripheral chips are coupled with the host through the PCIe switch of the plug-in card with the PCIe switch.
  8. The method according to claim 7, wherein the data channel comprises a data channel extended by the PCIe switch.
  9. The method according to any one of claims 1 to 8, wherein the bus comprises one or more of a peripheral component interconnect express (PCIe) bus, a unified bus (UB), a compute express link (CXL) bus, a cache coherent interconnect for accelerators (CCIX) bus, and a GenZ bus.
  10. The method according to any one of claims 1 to 9, wherein the peripheral chip comprises one or more of a peripheral component interconnect express (PCIe) device chip, a memory card chip, a network card chip, a redundant array of independent disks (RAID) chip, and an accelerator card chip, wherein the accelerator card comprises one or more of a graphics processing unit (GPU), a data processing unit (DPU), and a neural-network processing unit (NPU).
  11. A host, wherein the host is applied to a computing device, the computing device comprises the host, a memory and a peripheral chip, the host, the memory and the peripheral chip are coupled through a bus, and the host comprises:
    an obtaining unit, configured to obtain a data processing request, wherein the data processing request comprises downlink data, and the downlink data is used to indicate data to be sent by the host to the peripheral chip;
    a storage unit, configured for the host to store the downlink data into the memory;
    a direct memory access (DMA) unit, configured to copy the downlink data to the peripheral chip by direct memory access (DMA).
  12. A processor, wherein the processor is disposed in a computing device, the computing device comprises the processor, a memory and a peripheral chip, the processor, the memory and the peripheral chip are coupled through a bus, and the processor is configured to implement the operation steps of the method according to any one of claims 1 to 10.
  13. A computing device, wherein the computing device comprises a host, a memory and a peripheral chip, the host, the memory and the peripheral chip are coupled through a bus, and the host and the peripheral chip are respectively configured to implement the functions of the operation steps of the method according to any one of claims 1 to 10.
PCT/CN2023/085690 2022-03-31 2023-03-31 Data processing method, host and related device WO2023186143A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210336383.4A 2022-03-31 2022-03-31 Data processing method, host and related device
CN202210336383.4 2022-03-31

Publications (1)

Publication Number Publication Date
WO2023186143A1 true WO2023186143A1 (zh) 2023-10-05

Family

ID=88199444

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/085690 WO2023186143A1 (zh) Data processing method, host and related device 2022-03-31 2023-03-31

Country Status (2)

Country Link
CN (1) CN116932451A (zh)
WO (1) WO2023186143A1 (zh)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1961300A (zh) * 2004-06-30 2007-05-09 Intel Corporation Apparatus and method for high performance volatile disk drive memory access using an integrated DMA engine
US20070162650A1 (en) * 2005-12-13 2007-07-12 Arm Limited Distributed direct memory access provision within a data processing system
CN101149717A (zh) * 2007-11-16 2008-03-26 VIA Technologies, Inc. Computer system and direct memory access transfer method
US20130282971A1 (en) * 2012-04-24 2013-10-24 Hon Hai Precision Industry Co., Ltd. Computing system and data transmission method
CN107993183A (zh) * 2017-11-24 2018-05-04 Baofeng Group Co., Ltd. Image processing apparatus and method, terminal, and server
CN112000598A (zh) * 2020-07-10 2020-11-27 Shenzhen Zhixing Technology Co., Ltd. Processor for federated learning, heterogeneous processing system, and private data transmission method
CN114253883A (zh) * 2021-12-03 2022-03-29 Fiberhome Telecommunication Technologies Co., Ltd. Endpoint device access method and system, and endpoint device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117632043A (zh) * 2024-01-25 2024-03-01 Beijing Superstring Academy of Memory Technology CXL memory module, control chip, data processing method, medium and system
CN117632043B (zh) 2024-05-28 Beijing Superstring Academy of Memory Technology CXL memory module, control chip, data processing method, medium and system

Also Published As

Publication number Publication date
CN116932451A (zh) 2023-10-24

Similar Documents

Publication Publication Date Title
US20200278880A1 (en) Method, apparatus, and system for accessing storage device
US11929927B2 (en) Network interface for data transport in heterogeneous computing environments
US7664909B2 (en) Method and apparatus for a shared I/O serial ATA controller
US8291141B2 (en) Mechanism to flexibly support multiple device numbers on point-to-point interconnect upstream ports
JP3033935B2 (ja) Interface method to adapter hardware
US11829309B2 (en) Data forwarding chip and server
WO2020259418A1 (zh) 2020-12-30 Data access method, network interface card, and server
WO2019233322A1 (zh) 2019-12-12 Resource pool management method and apparatus, resource pool control unit, and communication device
US20220191153A1 (en) Packet Forwarding Method, Computer Device, and Intermediate Device
AU2015402888B2 (en) Computer device and method for reading/writing data by computer device
WO2023174146A1 (zh) 2023-09-21 Offload card namespace management and input/output request processing system and method
WO2023186143A1 (zh) 2023-10-05 Data processing method, host and related device
WO2020219810A1 (en) Intra-device notational data movement system
WO2016101856A1 (zh) 2016-06-30 Data access method and apparatus
US11029847B2 (en) Method and system for shared direct access storage
US10747615B2 (en) Method and apparatus for non-volatile memory array improvement using a command aggregation circuit
US11601515B2 (en) System and method to offload point to multipoint transmissions
CN117971135B (zh) Storage device access method and apparatus, storage medium, and electronic device
WO2023051248A1 (zh) 2023-04-06 Data access system and method, and related device
US11281612B2 (en) Switch-based inter-device notational data movement system
US20230350824A1 (en) Peripheral component interconnect express device and operating method thereof
CN113297111B (zh) Artificial intelligence chip and operation method thereof
CN113986457A (zh) Remote input/output device interrupt mapping apparatus and method
CN115858434A (zh) Computing device and request processing method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23778507

Country of ref document: EP

Kind code of ref document: A1