WO2023134588A1 - Computing system, method, apparatus and acceleration device - Google Patents

Computing system, method, apparatus and acceleration device

Info

Publication number
WO2023134588A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
memory
acceleration device
host device
acceleration
Prior art date
Application number
PCT/CN2023/071094
Other languages
English (en)
French (fr)
Inventor
刘昊程
朱琦
崔宝龙
汪海疆
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2023134588A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00: Digital computers in general; Data processing equipment in general
    • G06F15/16: Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163: Interprocessor communication
    • G06F12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02: Addressing or allocation; Relocation
    • G06F12/06: Addressing a physical block of locations, e.g. base addressing, module addressing, memory dedication
    • G06F12/0646: Configuration or reconfiguration
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005: Allocation of resources to service a request
    • G06F9/5011: Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G06F9/5016: Allocation of resources to service a request, the resource being the memory
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the computing field, and in particular to a computing system, method, device and acceleration device.
  • Heterogeneous computing technology refers to computing using a system composed of multiple computing devices, and the multiple computing devices are different in instruction sets and architectures. Heterogeneous computing technology has become the focus of current research because it can cost-effectively obtain high-performance computing capabilities, enable abundant computing resources, and improve business end-to-end performance.
  • when a computing device performs accelerated processing of a task based on heterogeneous computing technology, the computing device generally sends the data required by the task to another computing device communicating with it, and that other computing device processes the pending task.
  • the computing device that sends the data required by the task to be processed can be considered as a host device, and the computing device that processes the task to be processed can be considered as an acceleration device.
  • the data required for the task to be processed is stored in a storage device communicating with the host device, so the data is transferred multiple times, from the storage device to the host device and then to the acceleration device, resulting in a large amount of data transmission in the heterogeneous computing process. Therefore, how to reduce the amount of data transmission during heterogeneous computing is an urgent problem to be solved.
  • Embodiments of the present application provide a computing system, method, device, and acceleration device, which can solve the problem of a large amount of data transmission in a heterogeneous computing process.
  • in a first aspect, a computing system is provided, which includes a host device and an acceleration device.
  • the host device is communicatively connected to the acceleration device, and the acceleration device is coupled to a memory, and the memory stores the first data required by the service to be processed.
  • the host device is configured to send calling information to the acceleration device, where the calling information is used to indicate the storage address of the first data.
  • the acceleration device is configured to receive the calling information sent by the host device, and obtain the first data from the memory according to the storage address.
  • the acceleration device is further configured to execute tasks to be processed based on the first data to obtain processing results.
  • the memory that stores the first data required by the service to be processed is coupled to the acceleration device, and the acceleration device can obtain the first data from the memory according to the calling information and execute the pending task based on the first data.
  • the host device informs the acceleration device of the storage address, so that the acceleration device can directly obtain the data from the memory according to the storage address and process it. This avoids the host device obtaining the data from the memory and then transmitting it to the acceleration device, which reduces the amount of data transmission in the computing system and thus solves the problem of a large data transmission volume in heterogeneous computing.
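The call-by-address flow described above can be sketched as follows. This is an illustrative model only; all class and field names (`AcceleratorMemory`, `handle_call`, the `address` key, the stand-in computation) are hypothetical, since the application defines no concrete API.

```python
class AcceleratorMemory:
    """Memory coupled to the acceleration device, addressed by integer keys."""
    def __init__(self):
        self._cells = {}

    def write(self, address, data):
        self._cells[address] = data

    def read(self, address):
        return self._cells[address]


class AccelerationDevice:
    def __init__(self, memory):
        self.memory = memory

    def handle_call(self, call_info):
        # The calling information carries only the storage address, not the data.
        data = self.memory.read(call_info["address"])
        # Execute the pending task (here: a stand-in computation).
        return sum(data)


class HostDevice:
    def __init__(self, device):
        self.device = device

    def invoke(self, address):
        # The host never touches the data itself; it only names its location.
        return self.device.handle_call({"address": address})


mem = AcceleratorMemory()
mem.write(0x1000, [1, 2, 3, 4])   # first data, already resident with the device
host = HostDevice(AccelerationDevice(mem))
print(host.invoke(0x1000))        # -> 10
```

The point of the sketch is that the host-to-device message is constant-sized regardless of how large the first data is.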
  • the service to be processed includes N consecutive operations, where N is an integer greater than 1.
  • the acceleration device is further configured to execute the i+1th operation according to the processing result of the i-th operation in the N operations, and obtain the processing result of the i+1-th operation.
  • i is an integer, and 1 ≤ i ≤ N-1.
  • the processing result of the Nth operation is the processing result.
  • the processing result of the i-th operation is stored in the memory or the acceleration device.
  • the acceleration device can quickly read the processing result of the i-th operation when executing the i+1-th operation, thereby improving computing efficiency.
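The chained execution of N consecutive operations can be sketched as below, with the processing result of each operation kept in storage local to the acceleration device rather than round-tripped through the host. The names (`run_chain`, `scratch`) are illustrative assumptions.

```python
def run_chain(first_data, operations, scratch):
    """scratch models memory coupled to the acceleration device."""
    scratch["op0"] = first_data
    for i, op in enumerate(operations, start=1):
        # Read the previous result locally instead of fetching it via the host.
        scratch[f"op{i}"] = op(scratch[f"op{i-1}"])
    return scratch[f"op{len(operations)}"]   # result of the Nth operation


scratch = {}
result = run_chain([1, 2, 3],
                   [lambda xs: [x * 2 for x in xs],   # operation 1
                    lambda xs: sum(xs)],              # operation 2 (N = 2)
                   scratch)
print(result)  # -> 12
```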
  • the host device is further configured to obtain proxy information, and generate call information according to the proxy information.
  • the proxy information includes: the virtual storage address of the first data.
  • the virtual storage address of the first data can be used to call the acceleration device to execute the task to be processed, reducing the amount of data transmission and implementation difficulty.
  • the proxy information is stored in the host device. In this way, when the host device needs to determine the calling information, it can quickly read the proxy information to determine the calling information, thereby improving calling efficiency.
  • the first data is stored in a memory object, where the memory object is a segment of physical address storage space provided by the memory.
  • the acceleration device is further configured to send the virtual storage address of the first data to the host device.
  • the host device is further configured to receive the virtual storage address of the first data sent by the acceleration device. In this way, the host device can determine proxy information according to the virtual storage address of the first data.
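The virtual-address handshake described above might be modelled as follows: the acceleration device stores the first data, reports its virtual storage address to the host, and the host records that address as proxy information from which calling information is later generated. All names and the address scheme are hypothetical.

```python
class Accelerator:
    def __init__(self):
        self._objects = {}        # virtual address -> memory object
        self._next_va = 0x8000

    def store(self, data):
        va = self._next_va
        self._next_va += 0x100
        self._objects[va] = data
        return va                 # virtual storage address, sent to the host

    def fetch(self, va):
        return self._objects[va]


class Host:
    def __init__(self):
        self.proxy_info = {}      # kept locally for fast call generation

    def register(self, name, virtual_address):
        self.proxy_info[name] = virtual_address

    def make_call_info(self, name):
        return {"address": self.proxy_info[name]}


acc = Accelerator()
va = acc.store([5, 6, 7])          # first data lives with the accelerator
host = Host()
host.register("first_data", va)    # host keeps only proxy information
call = host.make_call_info("first_data")
print(acc.fetch(call["address"]))  # -> [5, 6, 7]
```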
  • in a second aspect, a computing method is provided.
  • the method is executed by the acceleration device, where the acceleration device is communicatively connected to the host device, the acceleration device is coupled with a memory, and the memory stores the first data required by the service to be processed.
  • the computing method includes: receiving the calling information sent by the host device, where the calling information is used to indicate the storage address of the first data; obtaining the first data from the memory according to the storage address; and executing the task to be processed based on the first data to obtain a processing result.
  • the service to be processed includes N consecutive operations, where N is an integer greater than 1. Executing the task to be processed based on the first data to obtain a processing result includes: executing the (i+1)-th operation according to the processing result of the i-th operation among the N operations, to obtain the processing result of the (i+1)-th operation, where i is an integer and 1 ≤ i ≤ N-1; the processing result of the N-th operation is the processing result.
  • the processing result of the i-th operation is stored in a memory or an acceleration device.
  • the first data is stored in a memory object, where the memory object is a segment of physical address storage space provided by the memory.
  • the method in the second aspect further includes: sending the virtual storage address of the first data to the host device.
  • in a third aspect, a computing method is provided.
  • the method is executed by a host device, and the host device communicates with the acceleration device, and the acceleration device is coupled with a memory, and the memory stores first data required by services to be processed.
  • the method includes: sending call information to the acceleration device, where the call information is used to indicate the storage address of the first data.
  • the service to be processed includes N consecutive operations, where N is an integer greater than 1. Executing the task to be processed based on the first data to obtain a processing result includes: executing the (i+1)-th operation according to the processing result of the i-th operation among the N operations, to obtain the processing result of the (i+1)-th operation, where i is an integer and 1 ≤ i ≤ N-1; the processing result of the N-th operation is the processing result.
  • the processing result of the i-th operation is stored in a memory or an acceleration device.
  • the method described in the third aspect further includes: acquiring proxy information, where the proxy information includes: a virtual storage address of the first data; and generating call information according to the proxy information.
  • the proxy information is stored in the host device.
  • the first data is stored in a memory object, where the memory object is a segment of physical address storage space provided by the memory.
  • the method of the third aspect further includes: receiving the virtual storage address of the first data sent by the acceleration device.
  • the present application provides an acceleration device, where the acceleration device includes a memory and at least one processor; the memory is used to store a set of computer instructions, and when the processor executes the set of computer instructions, the operation steps of the method in the second aspect or any possible implementation of the second aspect are implemented.
  • the present application provides a host device, where the host device includes a memory and at least one processor; the memory is used to store a set of computer instructions, and when the processor executes the set of computer instructions, the operation steps of the method in the third aspect or any possible implementation of the third aspect are implemented.
  • the present application provides a computer-readable storage medium in which computer programs or instructions are stored; when the computer programs or instructions are executed, the operation steps of the methods described in the above aspects or their possible implementations are implemented.
  • the present application provides a computer program product, which includes instructions; when the computer program product runs on a management node or a processor, the management node or the processor executes the instructions to implement the operation steps of the method described in any of the above aspects or their possible implementations.
  • the present application provides a chip, including a memory and a processor; the memory is used to store computer instructions, and the processor is used to call and run the computer instructions from the memory, so as to implement the operation steps of the method in any of the above aspects or their possible implementations.
  • FIG. 1 is a first schematic diagram of the architecture of a computing system provided by an embodiment of the present application
  • FIG. 2 is a second schematic diagram of the architecture of a computing system provided by an embodiment of the present application.
  • FIG. 3 is a third schematic diagram of the architecture of a computing system provided by an embodiment of the present application.
  • FIG. 4 is a first schematic flowchart of a computing method provided in an embodiment of the present application.
  • FIG. 5 is a detailed flowchart of S430 provided in the embodiment of the present application.
  • FIG. 6 is a schematic diagram of a proxy class-delegate class mapping relationship provided by an embodiment of the present application.
  • FIG. 7 is a second schematic flowchart of a computing method provided in an embodiment of the present application.
  • FIG. 8 is a third schematic flowchart of a computing method provided in an embodiment of the present application.
  • FIG. 9 is a first schematic diagram of application acceleration provided by an embodiment of the present application.
  • FIG. 10 is a second schematic diagram of application acceleration provided by an embodiment of the present application.
  • Near-data computing refers to arranging data computing and processing actions on the computing device closest to the storage location of the data.
  • the distance between a data storage location and a computing device can be defined as: the sum of the time when the data is transmitted to the computing device and the time when the computing device processes the data.
  • the computing device that is closest to the storage location of certain data may be the computing device that acquires and processes the data the fastest among multiple computing devices.
  • Using near-data computing technology can significantly reduce unnecessary data copying and improve the efficiency of data computing and processing.
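The distance notion defined above (distance = transfer time + processing time, with the "closest" device being the one minimizing that sum) can be illustrated with a minimal sketch. The bandwidth and throughput figures below are invented purely for illustration.

```python
def distance(data_bytes, device):
    # Sum of the time to transmit the data to the device
    # and the time for the device to process it.
    transfer = data_bytes / device["bandwidth_Bps"]
    compute = data_bytes / device["throughput_Bps"]
    return transfer + compute


def closest_device(data_bytes, devices):
    return min(devices, key=lambda d: distance(data_bytes, d))


devices = [
    {"name": "host_cpu",   "bandwidth_Bps": 1e9,  "throughput_Bps": 5e9},
    {"name": "accel_card", "bandwidth_Bps": 16e9, "throughput_Bps": 2e9},
]
# For 100 MB of data, the card's fast link outweighs its slower compute.
print(closest_device(1e8, devices)["name"])  # -> accel_card
```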
  • the memory object, in the embodiments of the present application, may refer to an address space for storing a set of data.
  • a memory object may also be called a real object, a delegate object, or the like.
  • resource objects, persistent objects, or connection objects may be stored in the memory object, which is not limited.
  • the resource object may refer to a file handle (file handle, FD) for reading and writing a storage entity;
  • the persistent object may refer to data shared by two consecutive operations, where the two operations need to process at least one common set of data;
  • the connection object may refer to a network connection operation, etc.
  • the proxy object in the embodiment of the present application, may refer to an object associated with a memory object, which can control the access and service actions of the memory object.
  • a heterogeneous proxy object, in the embodiments of the present application, may refer to the case where the memory object and the proxy object belong to computing devices of different vendors. For example, the proxy object resides on the host device, and the memory object resides on the acceleration device.
  • proxying, in the embodiments of the present application, may refer to entrusting a memory object to a proxy through programming, so as to control the access and service actions of the memory object.
  • a proxy object of the memory object can be generated in a programmatic manner, and the access and service actions of the memory object can be controlled through the proxy object.
  • Serialization refers to the process of converting an object into a serial form that can be transmitted.
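A minimal example of serialization in the sense defined above, using Python's `pickle` purely for illustration (the application does not prescribe any serialization format):

```python
import pickle

# An object describing a call, converted into a transmittable byte sequence.
obj = {"op": "matmul", "address": 0x1000}
wire = pickle.dumps(obj)            # serialized form suitable for transmission
restored = pickle.loads(wire)       # deserialized back into an object
print(restored == obj)              # -> True
```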
  • the proxy class is a collection of proxy objects with common attributes and behaviors, that is, the template of proxy objects, which can generate proxy objects.
  • the delegate class is a collection of delegate objects with common attributes and behaviors, that is, the template of delegate objects, which can generate delegate objects.
  • the proxy class can preprocess the message for the delegate class, filter the message and forward the message, and perform subsequent processing after the message is executed by the delegate class.
  • the proxy class does not implement specific services, but uses the delegate class to complete the service, and encapsulates the execution results.
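The proxy/delegate division of labour described above can be sketched as follows: the proxy preprocesses and filters messages, forwards them to the delegate, which performs the actual service, and then encapsulates the execution result. The classes below are illustrative stand-ins, not the patent's implementation.

```python
class Delegate:
    def service(self, payload):
        return payload * 2                  # the actual service

class Proxy:
    def __init__(self, delegate):
        self._delegate = delegate

    def service(self, payload):
        # Preprocess / filter the message before forwarding it.
        if not isinstance(payload, int):
            raise TypeError("filtered: payload must be an int")
        raw = self._delegate.service(payload)    # forward to the delegate
        return {"status": "ok", "result": raw}   # encapsulate the result


proxy = Proxy(Delegate())
print(proxy.service(21))  # -> {'status': 'ok', 'result': 42}
```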
  • FIG. 1 is a first schematic diagram of the architecture of a computing system provided by an embodiment of the present application.
  • the computing system 100 includes a host device 110 and an acceleration device 120 .
  • the host device 110 and the acceleration device 120 will be introduced respectively below with reference to the accompanying drawings.
  • Host device 110 is a computing device.
  • the host device 110 may be a server, a personal computer, a mobile phone, a tablet computer, a smart car, or other devices, etc., which is not limited thereto.
  • the host device 110 may include a processor 111 and a communication interface 113, and the processor 111 and the communication interface 113 are coupled to each other.
  • the host device 110 further includes a memory 112 .
  • the processor 111 , the memory 112 and the communication interface 113 are coupled to each other.
  • the embodiment of the present application does not limit the specific implementation of the mutual coupling between the processor 111, the memory 112 and the communication interface 113.
  • the processor 111, the memory 112 and the communication interface 113 may be connected to each other through a bus 114, which is represented by a thick line in FIG. 1; this connection manner between components is only for schematic illustration and is not limited thereto.
  • the bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is shown in FIG. 1, but this does not mean that there is only one bus or one type of bus.
  • the communication interface 113 is used to communicate with other devices or a communication network.
  • the communication interface 113 may be a transceiver or an input/output (input/output, I/O) interface.
  • the I/O interface can be used to communicate with devices located outside the host device 110 .
  • an external device inputs data to the host device 110 through the I/O interface; after the host device 110 processes the input data, it may send an output result of the data processing to the external device through the I/O interface.
  • the processor 111 is the computing core and control core of the host device 110, and it may be a central processing unit (central processing unit, CPU), or other specific integrated circuits.
  • the processor 111 can also be another general-purpose processor, a digital signal processor (digital signal processor, DSP), an application-specific integrated circuit (application specific integrated circuit, ASIC), a field-programmable gate array (field programmable gate array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the host device 110 may also include multiple processors.
  • the processor 111 may include one or more processor cores.
  • the processor 111 may be connected to the memory 112 through a double data rate (DDR) bus or other types of buses.
  • the memory 112 is the main memory of the host device 110 .
  • the memory 112 is usually used to store various software running in the operating system, instructions to be executed by the processor 111 or input data required by the processor 111 to execute instructions, data generated after the processor 111 executes instructions, and the like. To improve the access speed of the processor 111, the memory 112 needs to have a fast access speed.
  • a dynamic random access memory (dynamic random access memory, DRAM) is usually used as the memory 112 .
  • the memory 112 may also be other random access memories, such as static random access memory (static random access memory, SRAM) and the like.
  • the memory 112 may also be a read-only memory (read only memory, ROM).
  • the read-only memory for example, it may be programmable read-only memory (programmable read only memory, PROM), erasable programmable read-only memory (erasable programmable read only memory, EPROM) and the like. This embodiment does not limit the quantity and type of the memory 112 .
  • the computing system 100 may also be provided with a data storage system 130.
  • the data storage system 130 may be located outside the host device 110 (as shown in FIG. 1) and exchange data with the host device 110 through a network.
  • the data storage system 130 may also be located inside the host device and exchange data with the processor 111 through the PCIe bus 114; in this case, the data storage system 130 functions as a hard disk.
  • Acceleration device 120 is a computing device.
  • the acceleration device 120 may be a server, a personal computer, a mobile phone, a tablet computer, a smart car, or other devices, etc., which is not limited thereto.
  • the acceleration device 120 may include a processor 121 and a communication interface 123, and the processor 121 and the communication interface 123 are coupled to each other.
  • the acceleration device 120 also includes a memory 122 .
  • the processor 121 , the memory 122 and the communication interface 123 are coupled to each other.
  • the memory 122 may be located inside the acceleration device 120 (as shown in FIG. 1 ), and exchange data with the processor 121 through the bus. At this time, the memory 122 acts as a hard disk.
  • the memory 122 may be located outside the acceleration device 120, and exchange data with the acceleration device 120 through a network.
  • the acceleration device 120 can be used to execute tasks to be processed, such as performing matrix calculations, graphics operations, network data interaction, disk read and write, etc.
  • the acceleration device 120 can be implemented by one or more processors.
  • the processor can be any one of a CPU, a graphics processing unit (graphics processing unit, GPU), a neural-network processing unit (neural-network processing unit, NPU), a tensor processing unit (tensor processing unit, TPU), an FPGA, or an ASIC.
  • GPU also known as display core, visual processor, and display chip, is a microprocessor that specializes in image computing on personal computers, workstations, game consoles, and some mobile devices (such as tablets, smartphones, etc.).
  • the NPU simulates human neurons and synapses at the circuit level, and uses deep learning instruction sets to directly process large-scale neurons and synapses. One instruction completes the processing of a group of neurons.
  • an ASIC is an integrated circuit product tailored to a single purpose.
  • the processor 121, the memory 122 and the communication interface 123 are similar to the above-mentioned processor 111, memory 112 and communication interface 113; for a more detailed description of the processor 121, the memory 122 and the communication interface 123, refer to the descriptions of the processor 111, the memory 112 and the communication interface 113 above, which will not be repeated here.
  • the memory 122 can be a non-volatile memory (non-volatile memory, NVM), such as a ROM, a flash memory or a solid state disk (solid state disk, SSD); in this way, the memory 122 can be used to store larger data, such as the data required by the service to be processed.
  • in an existing acceleration device, a buffer is usually set, which can only store a small amount of data temporarily and cannot retain it when the device stops working. Therefore, an existing acceleration device cannot store the data required by the service to be processed, whereas the acceleration device 120 provided in the embodiments of the present application can use the memory 122 to store that data.
  • the acceleration device 120 can be inserted into a card slot on the motherboard of the host device 110, and exchange data with the processor 111 through the bus 114.
  • the bus 114 can be a PCIe bus, or a bus of the compute express link (compute express link, CXL) protocol, the universal serial bus (universal serial bus, USB) protocol or another protocol, so as to support data transmission between the acceleration device 120 and the host device 110.
  • FIG. 2 is a second schematic diagram of a computing system architecture provided by an embodiment of the present application.
  • the acceleration device 120 is not inserted into a card slot on the motherboard of the host device 110 , but is a device independent of the host device 110 .
  • the host device 110 may be connected to the acceleration device 120 through a wired network such as a network cable, or through a wireless network such as wireless fidelity (Wi-Fi) or Bluetooth.
  • the coupling of the acceleration device and the memory may refer to: the acceleration device includes a memory, or the acceleration device is connected to a memory.
  • the host device 110 may be configured to send calling information to the acceleration device 120 to call the acceleration device 120 to process pending tasks.
  • the acceleration device 120 can be configured to execute pending tasks according to the calling information sent by the host device 110 , and feed back the obtained processing results to the host device 110 .
  • the combination of the host device 110 and the acceleration device 120 can implement the calculation method provided by the embodiment of the present application, so as to solve the problem of a large amount of data transmission in the heterogeneous calculation process.
  • the computing system 100 also includes a client device 130, through which a user can input data to the host device 110. For example, the client device 130 inputs data to the host device 110 through the communication interface 113; after the host device 110 processes the input data, it sends the output result of the data processing to the client device 130 through the communication interface 113.
  • the client device 130 may be a terminal device, including but not limited to a personal computer, a server, a mobile phone, a tablet computer, or a smart car.
  • the types of the processors in the host device 110 and the processors in the acceleration device 120 may be the same or different, which is not limited. That is to say, in the embodiment of the present application, regardless of whether the instruction sets and architectures of the host device 110 and the acceleration device 120 are consistent, the two can cooperate with each other to implement the computing method provided in the embodiment of the present application.
  • the operating systems running on the host device 110 and the acceleration device 120 may be the same or different, which is not limited.
  • the operating system running in the host device 110 is an operating system with a first security level, and the operating system running in the acceleration device 120 is an operating system with a second security level, where the first security level is lower than the second security level. In this way, the host device 110 can offload computation to an acceleration device with a higher security level, so as to improve data security.
  • FIG. 3 is a third schematic diagram of a computing system architecture provided by an embodiment of the present application. Please refer to FIG. 3 .
  • the computing system 300 includes: a host device 310 and an acceleration device 320 .
  • the manner of the communication connection between the host device 310 and the acceleration device 320 is not limited.
  • the computing system 300 shown in FIG. 3 may be implemented by using hardware circuits, or may be implemented by combining hardware circuits with software, so as to realize corresponding functions.
  • the computing system 300 shown in FIG. 3 may represent a software architecture that can run on the computing system 100 shown in FIG. 1, and each module/unit can run on the host device 110 or the acceleration device 120 shown in FIG. 1.
  • the host device 310 can also be called a computing device.
  • the host device 310 includes: a runtime module 311 , a compiling module 312 , and an offload execution engine 313 .
  • the acceleration device 320 may also be referred to as an acceleration device.
  • the acceleration device 320 includes: a communication module 321 and a processing module 322 .
  • the above modules/engines/units can be implemented in the form of instructions executed by hardware.
  • the runtime module 311, the compiling module 312, and the offload execution engine 313 may be executed by the processor 111 of the host device 110 in the form of instructions, so as to realize corresponding functions.
  • the communication module 321 and the processing module 322 may be executed by the processor 121 of the acceleration device 120 in the form of instructions, so as to implement corresponding functions.
  • the computing system 300 shown in FIG. 3 is used as an example of a software architecture below, and the functions of the structures included in each module/engine in the computing system 300 will be described in detail.
  • the compilation module 312 is represented as computer instructions, can be executed by a processor, and can realize the compiling function.
  • the compiling module 312 may include: a proxy generating unit 312A.
  • the proxy generation unit 312A can generate the proxy class structure in a compiled manner.
  • the proxy class structure may include proxy functions and proxy information.
  • the proxy information can be expressed as address data, which is used to indicate the address of the memory object, that is to say, the proxy information can be a pointer.
  • the proxy function can be represented as a set of computer instructions, which can implement logical encapsulation for preprocessing, filtering, and delivering messages of the delegate class, as well as conversion and result encapsulation between the services of the proxy class and the delegate class.
  • the proxy generating unit 312A includes a proxy class generator and a proxy function generator.
  • the proxy class generator is used to generate the proxy class structure of a certain delegate class, which can include two sets of data: one set can be used to record proxy information, and the other set can be used to record proxy functions. That is to say, at this point the two sets of data in the structure of the proxy class have not yet recorded the proxy function and proxy information of the proxy class.
  • the proxy function generator is used to generate proxy functions in the proxy class structure, such as functions for pushing information to heterogeneous devices. In this way, by using the proxy class generator and the proxy function generator, the structure of the proxy class of a certain delegate class can be generated.
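  • For illustration only, the two generators described above can be sketched in Python; all names here (make_proxy_class, push, and so on) are assumptions and do not come from the application:

```python
def make_proxy_class(delegate_cls, push):
    """Sketch of the proxy class generator plus the proxy function
    generator. The generated class holds two sets of data: a slot for
    proxy information and one forwarding (proxy) function per public
    function of the delegate class."""
    class Proxy:
        def __init__(self, proxy_info=None):
            # first set of data: proxy information (e.g. a memory-object address)
            self.proxy_info = proxy_info
    # second set of data: proxy functions that push calls to the device
    for name, fn in vars(delegate_cls).items():
        if callable(fn) and not name.startswith("_"):
            def proxy_fn(self, *args, _name=name):
                # push the call instead of executing the delegate locally
                return push(self.proxy_info, _name, args)
            setattr(Proxy, name, proxy_fn)
    Proxy.__name__ = delegate_cls.__name__ + "Proxy"
    return Proxy
```

  At compile time the two sets of data are still empty; the proxy information is only filled in later, during the mapping process described below.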
  • the runtime module 311 represents computer instructions, can be executed by a processor, and can realize the function of executing a set of operations (or called logic) and offloading execution actions during runtime.
  • the runtime module 311 may include: an execution unit 311A and an identification unit 311B.
  • the identifying unit 311B is configured to identify the object currently running on the host device 310.
  • the identifying unit 311B includes a proxy object recognizer, which can identify whether the object currently running on the host device 310 is a proxy object.
  • the execution unit 311A is used to control the access to the memory object through the proxy object.
  • the execution unit 311A includes a push execution logic unit, which can convert the operation of the host device 310 on the proxy object into push execution logic, generate call information, and trigger the offload execution engine 313 to send the call information to the acceleration device 320 , thereby controlling access to memory objects in the acceleration device 320 .
  • the offload execution engine 313 is represented as a computer instruction, can be executed by a processor, and can realize the function of sending information to other computing devices.
  • the offload execution engine 313 may send invocation information to the acceleration device 320 according to the trigger of the push execution logic unit.
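  • A minimal sketch of this runtime path (proxy object recognizer, push execution logic, offload execution engine); the dictionary layout of the call information is an assumption for illustration:

```python
def is_proxy_object(obj):
    """Proxy object recognizer: a proxy object carries proxy information."""
    return hasattr(obj, "proxy_info")

def run(obj, op, args, offload_send, local_call):
    """Push execution logic: an operation on a proxy object is converted
    into call information and handed to the offload execution engine
    (offload_send); an ordinary object is executed locally."""
    if is_proxy_object(obj):
        call_info = {"address": obj.proxy_info, "function": op, "args": args}
        return offload_send(call_info)
    return local_call(obj, op, args)
```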
  • the communication module 321 is used to realize the function of exchanging data with other devices.
  • the communication module 321 may receive the call information sent by the host device 310 .
  • the processing module 322 is used to realize the function of executing tasks to be processed. Exemplarily, the processing module 322 may execute pending tasks according to the calling information sent by the host device 310 .
  • the computing method provided by the embodiment of the present application can be implemented to solve the problem of large data transmission volume in the heterogeneous computing process.
  • a more detailed description of each module/engine/unit in the computing system 300 can be obtained directly by referring to the relevant descriptions in the embodiments shown in FIG. 4 , FIG. 5 , FIG. 7 , and FIG. 8 .
  • FIG. 4 is a schematic flow chart of a calculation method provided by the present application.
  • the calculation method can be applied to the above-mentioned computing system, and can be executed by the host device and the acceleration device in the above-mentioned computing system.
  • the host device 310 may include various structures in the host device 110 shown in FIG. 1
  • the acceleration device 320 may include various structures in the acceleration device 120 shown in FIG. 1 .
  • the acceleration device 320 is coupled with a memory 322, and the memory 322 may be disposed outside or inside the acceleration device 320, and the memory 322 disposed inside the acceleration device 320 in FIG. 4 is only used as a possible example.
  • the first data required by the service to be processed is stored in the memory 322 .
  • the calculation method provided in this embodiment includes the following steps S410-S430.
  • the host device 310 sends invocation information to the acceleration device 320.
  • the acceleration device 320 receives the calling information sent by the host device 310 .
  • the call information is used to indicate the storage address of the first data.
  • the calling information may include a physical storage address or a virtual storage address of the first data.
  • the service to be processed may be a big data service, an artificial intelligence service, a cloud computing service, or any other service.
  • the calculations involved may include matrix calculations, graphics operations, network data interaction, disk read and write, etc., which are not limited.
  • the data to realize the service to be processed may include one or more groups.
  • the first data is data that needs to be processed by the acceleration device 320 in the set or multiple sets of data.
  • the first data may include input data and calculation instructions (or called functions), and the computing device may process the service to be processed by processing the input data and the calculation instructions.
  • the first data required by the business to be processed includes multiple matrices and matrix calculation functions, and the computing device can use the matrix calculation functions to perform operations on the multiple matrices to obtain processing results.
  • the acceleration device 320 obtains the first data from the memory 322 according to the storage address.
  • the acceleration device 320 can convert the virtual storage address of the first data into the physical storage address of the first data, and then read the first data from the memory 322 according to the physical storage address.
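  • A toy sketch of this address translation step; the page size, the dict-based page table, and the byte-slice read are illustrative assumptions only:

```python
PAGE_SIZE = 4096  # illustrative page size

def read_first_data(call_info, page_table, memory, length):
    """Translate the virtual storage address carried in the call
    information into a physical address, then read the first data
    from memory at that physical address."""
    vaddr = call_info["address"]
    page, offset = divmod(vaddr, PAGE_SIZE)
    paddr = page_table[page] * PAGE_SIZE + offset  # virtual -> physical
    return memory[paddr:paddr + length]
```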
  • the memory 322 is disposed inside the acceleration device 320 , and the acceleration device 320 can read the first data stored in the memory 322 through the bus according to the storage address.
  • the acceleration device 320 may read the first data stored in the memory 322 through a network.
  • the acceleration device 320 executes the task to be processed based on the first data, and obtains a processing result.
  • the acceleration device can first use the matrix sum operation function to calculate the sum of matrix 1, matrix 2, and matrix 3 (denoted as matrix 4), and then use the matrix inverse operation function to calculate the inverse matrix of matrix 4, and matrix 4⁻¹ is the processing result.
  • the processing result is a result obtained by processing the first data.
  • when the first data is part of the data for realizing the service to be processed, the processing result can be an intermediate result of the task to be processed; when the first data is all of the data for realizing the service to be processed, the processing result can be the final result of the task to be processed.
  • the service to be processed may include N consecutive operations, where N is an integer greater than 1.
  • FIG. 5 is a detailed flowchart of S430 provided by the embodiment of the present application.
  • S430 shown in FIG. 4 may include:
  • the acceleration device 320 executes the (i+1)-th operation according to the processing result of the i-th operation in the N operations, and obtains the processing result of the (i+1)-th operation.
  • i is an integer, 1 ≤ i ≤ N-1, and the processing result of the N-th operation is the processing result.
  • the memory 322 is set inside the acceleration device 320 , and the processing result of the i-th operation may be stored in the memory 322 .
  • the processing result of the i-th operation may be stored in the memory 322 or a memory inside the acceleration device 320, which is not limited.
  • operation 1 is to use the matrix sum operation function to calculate the sum of matrix 1, matrix 2, and matrix 3 (denoted as matrix 4);
  • operation 2 is to use the matrix inverse operation function to obtain the inverse matrix of matrix 4 (matrix 4⁻¹);
  • operation 3 is to use the matrix sum operation function to obtain the sum of matrix 4⁻¹ and matrix 1 (denoted as matrix 5).
  • Matrix 4 and matrix 4⁻¹ are intermediate processing results, and matrix 5 is the processing result.
  • Both matrix 4 and matrix 4⁻¹ can be stored in the memory 322 .
  • Matrix 5 may also be stored in the memory 322 .
  • the continuous N operations in the service to be processed are all processed in the acceleration device 320 . That is to say, the intermediate processing results of multiple operations in the service to be processed can be prevented from being repeatedly transmitted between the host device and the acceleration device, thereby reducing the amount of data transmission and improving computing efficiency.
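  • The chained execution of S431 can be sketched as follows; scalar operations stand in for the matrix operations, and the dict standing in for device memory is an assumption:

```python
def run_chain(first_input, operations, device_memory):
    """Execute N consecutive operations on the acceleration device. Each
    intermediate result is kept in device memory; only the result of the
    N-th operation is returned to the host."""
    result = first_input
    for i, op in enumerate(operations, start=1):
        result = op(result)
        device_memory[f"op{i}"] = result  # intermediate result stays on device
    return result
```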
  • the host device 310 informs the acceleration device 320 of the storage address, so that the acceleration device 320 can directly obtain the data from the memory 322 according to the storage address and process it, preventing the host device 310 from obtaining the data from the memory 322 and then transmitting it to the acceleration device 320 . That is, the multiple transmission of data along the path storage device - host device - acceleration device can be avoided, which can reduce the amount of data transmission in the computing system, save bandwidth resources, and improve the efficiency of offloading and pushing, thereby solving the problem of a large amount of data transmission in the process of heterogeneous computing.
  • the instruction structure of the first data required by the business to be processed can also be retained, avoiding large-scale refactoring of codes, and improving development efficiency.
  • when the above S410 is implemented by a software module/engine/unit included in the host device 310 or the acceleration device 320, it may be executed by the offload execution engine 313 and the communication module 321 shown in FIG. 3 .
  • the offload execution engine 313 sends calling information to the acceleration device 320
  • the communication module 321 receives the calling information sent by the host device 310 .
  • when the above S420-S430 are implemented by software modules/engines/units included in the acceleration device 320, they may be executed by the processing module 322 shown in FIG. 3 .
  • S420, S430, and S431 can all be executed by the processing module 322 .
  • the above-mentioned first data is stored in a memory object, and the memory object is a segment of physical address storage space provided by the memory 322 .
  • the first data may be a resource object, a persistent object, or a connection object. That is to say, in the method embodiment shown in FIG. 4, resource objects, persistent objects, and connection objects can be processed, so as to support heterogeneous computing of resource objects, persistent objects, and connection objects, avoiding code refactoring and saving costs.
  • the method shown in FIG. 4 further includes:
  • the acceleration device 320 sends the processing result to the host device 310.
  • the host device 310 receives the processing result sent by the acceleration device 320 .
  • the host device 310 may also include a communication module. In this case, S440 may be executed by the communication module 321 and the communication module in the host device 310 .
  • the communication module 321 sends the processing result to the host device 310
  • the communication module in the host device 310 receives the processing result sent by the acceleration device 320 .
  • the method shown in FIG. 4 further includes:
  • the host device 310 acquires proxy information, and generates call information according to the proxy information.
  • the proxy information includes: the virtual storage address of the first data.
  • the proxy information is stored in the host device 310 .
  • the method shown in FIG. 4 further includes:
  • the acceleration device 320 sends the virtual storage address of the first data to the host device 310.
  • the host device 310 receives the virtual storage address of the first data sent by the acceleration device 320 . In this way, the host device can determine proxy information according to the virtual storage address of the first data.
  • the way for the host device 310 to obtain proxy information may include: the host device 310 receives the virtual storage address of the first data sent by the acceleration device 320, and determines the proxy information according to the virtual storage address of the first data.
  • when S401 and S402 are implemented by software modules/engines/units included in the host device 310 or the acceleration device 320, they may be executed by the runtime module 311, the communication module 321, and the communication module in the host device 310 shown in FIG. 3 .
  • S401 is executed by the runtime module 311;
  • the communication module 321 is used to send the virtual storage address of the first data to the host device 310;
  • the communication module in the host device 310 is used to receive the virtual storage address of the first data sent by the acceleration device 320 .
  • the host device 310 and the acceleration device 320 can be combined to implement the compiling process and the mapping process.
  • the compiling process may generate the proxy class structure in a compiling manner
  • the mapping process may write proxy information in the proxy class structure and determine the proxy object.
  • S401 and S402 may be included in the mapping process.
  • S410-S440 may be regarded as an execution process.
  • the compilation process includes the following steps 1.1 to 1.4.
  • Step 1.1: the host device 310 obtains the service data and judges, according to the identifier in the service data, whether the acceleration device 320 needs to be called to execute the first object in the service data. If so, perform step 1.2; otherwise, perform step 1.4.
  • the service data refers to the data to realize the service to be processed, and the relevant description of the service to be processed can be implemented by referring to the above method, and will not be repeated here.
  • the business data may include one or more first objects, and the first objects here may refer to objects in the programming field, that is, to abstract objective things into computer instructions and/or data.
  • the first object here may be a resource object, a persistent object, a connection object, etc., which is not limited.
  • the identifier in the business data can be used to indicate whether the first object in the business data needs a proxy. If the identifier indicates that the first object needs a proxy, the first object needs to call the acceleration device for processing; if the identifier indicates that the first object does not need a proxy, the first object does not need to call the acceleration device for processing.
  • the identifier in the service data can be preset, for example, can be set by the user, and the user can add the identifier in the preset position of the service data to indicate whether the first object in the first service needs an agent.
  • the preset location may be the data corresponding to the part of the service offloading process in the service data.
  • step 1.2 the host device 310 determines the delegate class corresponding to the first object, and generates the structure of the proxy class corresponding to the delegate class.
  • step 1.3 the host device 310 generates the proxy function in the structure of the proxy class.
  • the host device 310 generates a proxy class structure of the proxy class corresponding to the first object. The structure may include two sets of data: the first set may be used to record proxy information, and the second set may be used to record proxy functions. At step 1.2, the two sets of data in the structure of the proxy class have not yet recorded the proxy function and proxy information of the proxy class.
  • the host device 310 may generate a proxy function in the second set of data, where the proxy function is used to indicate a function in the delegate class.
  • Figure 6 is a schematic diagram of the mapping relationship between a proxy class and a proxy class provided by the embodiment of the present application.
  • the second set of data in the proxy class includes proxy function 1 and proxy function 2
  • the proxy class includes the function A, function B and data A
  • proxy function 1 indicates function A
  • proxy function 2 indicates function B
  • the first set of data in the proxy class is used to store proxy information
  • the proxy information can be used to indicate the storage address of a memory object.
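  • The FIG. 6 mapping can be written down as a small table; all names here are illustrative assumptions:

```python
# Delegate class: functions A and B plus data A (FIG. 6, right side).
delegate_class = {
    "functions": {"A": lambda: "ran A", "B": lambda: "ran B"},
    "data": {"A": 42},
}

# Proxy class (FIG. 6, left side): the first set of data stores proxy
# information (a memory-object address); the second set maps each proxy
# function to the delegate function it indicates.
proxy_class = {
    "proxy_info": None,
    "proxy_functions": {"proxy_fn_1": "A",   # proxy function 1 -> function A
                        "proxy_fn_2": "B"},  # proxy function 2 -> function B
}

def resolve(proxy, name):
    """Follow a proxy function to the delegate function it indicates."""
    return delegate_class["functions"][proxy["proxy_functions"][name]]
```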
  • Step 1.4 the host device 310 generates an executable program according to the business data.
  • the proxy class-delegate class mapping relationship shown in FIG. 6 is generated.
  • the proxy object can be generated through the proxy class, and the delegate object (that is, the memory object) can be generated through the delegate class. That is to say, according to the above proxy class-delegate class mapping relationship, a proxy object associated with a memory object can be generated.
  • when steps 1.1 to 1.4 are implemented by the software modules/engines/units included in the host device 310, they can be executed by the compiling module 312 shown in FIG. 3 .
  • step 1.2 is executed by the proxy class generator of the proxy generating unit 312A in the compiling module 312 ; and step 1.3 is executed by the proxy function generator of the proxy generating unit 312A in the compiling module 312 .
  • FIG. 7 is a schematic flow diagram II of a calculation method provided by an embodiment of the present application.
  • the calculation method includes the following steps S701 to S709.
  • the host device 310 sends service data to the acceleration device 320.
  • the acceleration device 320 receives the service data sent by the host device 310 .
  • the acceleration device 320 determines the second object according to the service data.
  • the second object is similar to the first object, and related descriptions may refer to the above-mentioned description of the first object, which will not be repeated here.
  • the acceleration device 320 determines whether the second object is a proxy object.
  • the acceleration device 320 generates object information.
  • the object information may be used to indicate the second object.
  • the acceleration device 320 may serialize the second object to obtain a transmission sequence, which is object information.
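  • Serialization into a transmission sequence can be sketched as follows; JSON is just one possible wire format, chosen here as an assumption:

```python
import json

def serialize_object(obj):
    """Turn the second object into a transmission sequence
    (the object information)."""
    return json.dumps(obj).encode("utf-8")

def deserialize_object(seq):
    """Recover the object from the transmission sequence on the
    receiving side."""
    return json.loads(seq.decode("utf-8"))
```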
  • the acceleration device 320 sends the object information to the host device 310.
  • the host device 310 receives the object information sent by the acceleration device 320 .
  • the acceleration device 320 stores the second object as the first data in the memory.
  • the acceleration device 320 acquires the storage address of the first data.
  • the acceleration device 320 sends the storage address of the first data to the host device 310.
  • the host device 310 receives the storage address of the first data sent by the acceleration device 320 .
  • the host device 310 determines proxy information of the second object according to the storage address of the first data.
  • the host device 310 can use the proxy class to generate the structure of the second object, and then write the storage address of the first data into the first set of data in the structure of the second object as proxy information.
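  • Writing the proxy information can be sketched as follows; the dict layout of the proxy class structure is an assumption for illustration:

```python
def build_proxy_object(proxy_class, storage_address):
    """Generate the structure of the second object from the proxy class
    and write the storage address received from the acceleration device
    into the first set of data as proxy information."""
    return {
        "proxy_info": storage_address,  # first set of data, now filled in
        "proxy_functions": dict(proxy_class["proxy_functions"]),  # second set copied
    }
```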
  • S708 corresponds to the above S402
  • S709 corresponds to the above S401
  • related descriptions of S707 to S709 may also refer to the above embodiments.
  • the method may further include: the host device 310 determines an acceleration device among the multiple acceleration devices as the acceleration device 320 .
  • the host device 310 may determine, among the multiple acceleration devices, the acceleration device closest to the memory of the host device 310 as the acceleration device 320 . In this way, unnecessary data copying can be significantly reduced, data computing and processing efficiency can be improved, and near-data computing for calls can be realized.
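  • Device selection can be sketched as a nearest-device search; the hop-count distance metric used here is an assumption:

```python
def pick_acceleration_device(devices, distance_to_memory):
    """Among multiple acceleration devices, pick the one closest to the
    memory that holds the data (near-data computing)."""
    return min(devices, key=distance_to_memory)
```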
  • Through the above process, a proxy object associated with the memory object can be generated. Subsequently, the host device 310 can implement a call to the memory object in the acceleration device 320 according to the proxy information and the proxy function in the proxy object; the detailed implementation follows the procedure below.
  • when S701-S709 are implemented by software modules/engines/units included in the host device 310 or the acceleration device 320, they may be executed by the runtime module 311, the communication module 321, the processing module 322, and the communication module in the host device 310 shown in FIG. 3 .
  • S701, S705, and S708 are executed by the communication module in the host device 310 and the communication module 321;
  • S702-S704, S706, and S707 are executed by the processing module 322;
  • S709 is executed by the runtime module 311.
  • FIG. 8 is the third schematic flow chart of a calculation method provided by the embodiment of the present application.
  • the calculation method includes the following steps S801-S808.
  • the host device 310 executes a service to be processed based on the service data, and acquires a third object of the service to be processed during execution.
  • the relevant description of service data can be implemented with reference to the above method, and will not be repeated here.
  • the third object is similar to the first object, and related descriptions may refer to the above-mentioned description of the first object, which will not be repeated here.
  • the manner in which the host device 310 acquires the third object may be, when executing a certain function in the service data, determining the object corresponding to the function as the third object.
  • the host device 310 determines whether the third object is a proxy object.
  • if so, it can be determined that the third object is the proxy object; otherwise, it can be determined that the third object is not a proxy object.
  • the host device 310 acquires proxy information of the third object.
  • the host device 310 generates call information according to the proxy information.
  • the calling information may include the virtual storage address of the first data and function indication information.
  • the function indication information is used to instruct the acceleration device 320 to execute the function in the memory object.
  • the function indication information may include proxy function 1, and the function indication information may be used to instruct the acceleration device 320 to execute function A in the memory object.
  • the call information can call the acceleration device 320 to execute a function in a certain memory object, refine call granularity, and improve call efficiency.
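  • The device-side handling of such fine-grained call information can be sketched as follows; the object store and the dict layout are assumptions:

```python
def handle_call(call_info, object_store):
    """Locate the memory object by its storage address, then execute only
    the function named by the function indication information."""
    memory_object = object_store[call_info["address"]]
    fn = getattr(memory_object, call_info["function"])
    return fn(*call_info.get("args", ()))
```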
  • the host device 310 sends invoking information to the acceleration device 320 .
  • the acceleration device 320 receives the calling information sent by the host device 310 .
  • the acceleration device 320 acquires the first data from the memory according to the storage address.
  • the acceleration device 320 executes the task to be processed based on the first data, and obtains a processing result.
  • the acceleration device 320 sends the processing result to the host device 310.
  • the host device 310 receives the processing result sent by the acceleration device 320 .
  • when the host device 310, while executing the service to be processed, encounters an object that needs to be processed by invoking the acceleration device 320, it can call the acceleration device 320 through the call information. This process does not interfere with the processing of the task to be processed on the host device 310 side, and the actions performed on the acceleration device 320 may be unified with the actions performed on the host device 310 .
  • the unification of the control flow of the host device 310 and the acceleration device 320 can be realized, which can not only reduce the amount of data transmission, but also reduce the fragmentation degree of programming.
  • when S801-S808 are implemented by software modules/engines/units included in the host device 310 or the acceleration device 320, they may be executed by the runtime module 311, the offload execution engine 313, the communication module 321, the processing module 322, and the communication module in the host device 310 .
  • S808 is executed by the communication module and the communication module 321 in the host device 310;
  • S806 and S807 are executed by the processing module 322;
  • S801-S804 are executed by the runtime module 311;
  • S805 is executed by the offload execution engine 313.
  • Example 1 please refer to FIG. 9 , which is a schematic diagram of application acceleration provided by an embodiment of the present application.
  • An application runs on the host device 310, and the application includes storage read and write operations and network input and output operations, and the application can be used to implement services to be processed.
  • the host device 310 also includes a storage file handle proxy object and a network input and output operation proxy object.
  • An acceleration process runs in the acceleration device 320, and the acceleration process includes storing file handles and network input and output operations.
  • the host device 310 can, through the storage file handle proxy object and the network input and output operation proxy object, call the acceleration device 320 to perform the storage file handle and network input and output operations, which can reduce the amount of data transmission in the computing system, save bandwidth resources, and improve the efficiency of offloading and pushing.
  • the host device 310 operates the proxy objects instead of the storage file handle and the network input and output operations directly. Resource objects, persistent objects, and connection objects cannot be serialized and returned to the host device, so without proxies the related processing either could not be offloaded or would need to be refactored and modified. That is to say, with proxy objects, support for heterogeneous computing of resource objects, persistent objects, and connection objects can be realized, avoiding code refactoring and saving costs.
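  • Example 1 can be sketched as follows; the acceleration device is simulated with an in-memory dict, and all class and method names are illustrative assumptions:

```python
class FakeDevice:
    """Simulated acceleration device: the real file handles live here."""
    def __init__(self):
        self.files = {}

    def open(self, name):
        self.files[name] = b""
        return name  # only a handle id crosses back to the host

    def call(self, handle_id, op, data):
        if op == "write":
            self.files[handle_id] += data
            return len(data)
        return self.files[handle_id]  # op == "read"

class FileHandleProxy:
    """Host-side proxy: the non-serializable resource never leaves the device;
    reads and writes are pushed to where the real handle lives."""
    def __init__(self, handle_id, device):
        self.proxy_info = handle_id  # proxy information identifies the handle
        self.device = device

    def write(self, data):
        return self.device.call(self.proxy_info, "write", data)

    def read(self):
        return self.device.call(self.proxy_info, "read", None)
```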
  • the acceleration device 320 shown in FIG. 9 may be a smart SSD or a smart network card.
  • FIG. 10 is a second schematic diagram of application acceleration provided by the embodiment of the present application.
  • An application runs on the host device 310, the application includes data processing operation 1 and data processing operation 2, and the application can be used to realize the service to be processed.
  • data processing operation 1 and data processing operation 2 are continuous operations, that is, data processing operation 1 and data processing operation 2 have a time sequence dependency relationship, for example, the processing result of data processing operation 1 is the result of data processing operation 2 Input data.
  • a data processing operation proxy object is also included in the host device 310 .
  • An acceleration process runs in the acceleration device 320 , and the acceleration process includes data processing operation 1 and data processing operation 2 .
  • the host device 310 can call the acceleration device 320 to execute the data processing operation 1 and the data processing operation 2 through the data processing operation proxy object.
  • the processing result returned by the acceleration device 320 to the host device 310 is the processing result of data processing operation 2, and the processing result of data processing operation 1 is no longer fed back to the host device 310, thereby avoiding the intermediate processing results of multiple operations in the service to be processed being repeatedly transmitted between the host device and the acceleration device, reducing the amount of data transmission and improving computing efficiency.
  • the calculation method provided by the embodiment of the present application can reduce the sharing and use of memory pages between the host device and the acceleration device, improve the use efficiency of unified memory addressing, and save bandwidth between the host device and the acceleration device.
  • the calculation method provided by the embodiment of the present application can reduce the direct memory access (DMA) data volume between the host device and the acceleration device, improve the offloading efficiency, and save bandwidth between the host device and heterogeneous devices.
  • the calculation method provided by the embodiment of the present application can reduce the network data transmission between the host device and the acceleration device, reduce the transmission of object information on which the control flow depends, improve the offloading efficiency, and save bandwidth between the host device and the acceleration device.
  • the host device and the acceleration device include hardware structures and/or software modules corresponding to each function.
  • the present application can be implemented in the form of hardware or a combination of hardware and computer software with reference to the units and method steps of the examples described in the embodiments disclosed in the present application. Whether a certain function is executed by hardware or by computer software driving the hardware depends on the specific application scenario and design constraints of the technical solution.
  • the host device 310 in FIG. 3 is a computing device
  • the acceleration device 320 is an acceleration device.
  • a possible example is given to illustrate the computing device and the accelerating device.
  • the computing device can be used to implement the functions of the host device 310 in the above method embodiments, and the acceleration device can be used to realize the functions of the acceleration device 320 in the above method embodiments, so the computing device combined with the acceleration device can realize the beneficial effects of the above method embodiments.
  • the computing device may be the host device 310 shown in FIG. 3 , or the host device shown in FIGS. 1-2 , or a module applied to the host device (such as chip).
  • the acceleration device may be the acceleration device 320 shown in FIG. 3 , or the acceleration device shown in FIGS. 1 to 2 , or a module (such as a chip) applied to the acceleration device.
  • the computing device includes a runtime module 311 , a compilation module 312 and an offload execution engine 313 .
  • the acceleration device includes a communication module 321 and a processing module 322 .
  • the computing device is used to implement the functions of the host device in the method embodiment shown in FIG. 4 above, and the acceleration device is used to realize the functions of the acceleration device in the method embodiment shown in FIG. 4 above.
  • the specific process of the calculation device and the acceleration device for implementing the above calculation method includes the following content 1 to content 3.
  • the offload execution engine 313 is configured to send invocation information to the acceleration device 320 .
  • the communication module 321 is configured to receive the calling information sent by the offload execution engine 313 .
  • the processing module 322 is configured to obtain the first data from the memory 322 according to the storage address.
  • the processing module 322 is further configured to execute the task to be processed based on the first data, and obtain a processing result. Regarding how to execute the task to be processed based on the first data and obtain the processing result, reference may be made to the above S431, which will not be repeated here.
  • the host device informs the acceleration device of the storage address, so that the acceleration device can obtain data from the memory directly according to the storage address and process it, avoiding the host device obtaining data from the memory and then transmitting it to the acceleration device. This reduces the amount of data transmission in the computing system, saves bandwidth resources, and improves offloading and push efficiency, thereby solving the problem of the large volume of data transmission in heterogeneous computing.
  • the instruction structure of the first data required by the service to be processed can also be preserved, avoiding large-scale refactoring of code and improving development efficiency.
  • the device may further implement other steps in the above calculation method, specifically refer to the above steps shown in FIG. 5 , FIG. 7 , and FIG. 8 , which will not be repeated here.
  • the computing device or acceleration device in the embodiments of the present application can be implemented by a CPU, an ASIC, or a programmable logic device (PLD), and the above-mentioned PLD can be a complex programmable logic device (CPLD), an FPGA, generic array logic (GAL), or any combination thereof.
  • when the computing device or acceleration device implements the computing method shown in any one of Fig. 4, Fig. 5, Fig. 7, and Fig. 8 through software, the computing device or acceleration device and its modules may also be software modules.
  • taking the computing system 100 shown in FIG. 1 as an example, when the host device 110 is used to implement the methods shown in FIG. 4, FIG. 5, FIG. 7, and FIG. 8, the processor 111 and the communication interface 113 are used to execute the above-mentioned functions of the host device.
  • the processor 111, the communication interface 113, and the memory 112 can also cooperate to implement each operation step in the computing method executed by the host device.
  • the host device 110 may also execute the functions of the computing device shown in FIG. 3 , which will not be repeated here.
  • the memory 112 in the host device 110 can be used to store software programs and modules, such as program instructions/modules corresponding to the calculation method provided in the embodiment of the present application.
  • the processor 111 in the host device 110 executes various functional applications and data processing by executing software programs and modules stored in the memory 112 .
  • the communication interface 113 in the host device 110 can be used for signaling or data communication with other devices. In this application, the host device 110 may have multiple communication interfaces 113 .
  • the host device 110 in the embodiment of the present application may correspond to the computing device in the embodiment of the present application, and may correspond to the implementation of the method shown in Figure 4, Figure 5, Figure 7, and Figure 8 of the embodiment of the present application.
  • the corresponding subject of each unit/module in the host device 110 can realize the corresponding flow of each method in FIG. 4 , FIG. 5 , FIG. 7 , and FIG. 8 .
  • the processor 121 and the communication interface 123 are used to execute the functions of the acceleration device described above.
  • the processor 121, the communication interface 123, and the memory 122 may also cooperate to implement each operation step in the calculation method performed by the acceleration device.
  • the acceleration device 120 may also perform the functions of the acceleration device shown in FIG. 3 , which will not be described in detail here.
  • the memory 122 in the acceleration device 120 can be used to store software programs and modules, such as program instructions/modules corresponding to the calculation method provided by the embodiment of the present application.
  • the processor 121 in the acceleration device 120 executes various functional applications and data processing by executing software programs and modules stored in the memory 122 .
  • the communication interface 123 in the acceleration device 120 can be used for signaling or data communication with other devices. In this application, the acceleration device 120 may have multiple communication interfaces 123 .
  • the acceleration device 120 in the embodiment of the present application may correspond to the acceleration device in the embodiment of the present application, and may correspond to the implementation of the methods shown in Figure 4, Figure 5, Figure 7, and Figure 8 in the embodiment of the present application.
  • the corresponding subject of each unit/module in the acceleration device 120 can realize the corresponding process of each method in FIG. 4 , FIG. 5 , FIG. 7 , and FIG. 8 .
  • processor in the embodiments of the present application may be a CPU, NPU or GPU, and may also be other general-purpose processors, DSP, ASIC, FPGA or other programmable logic devices, transistor logic devices, hardware components or other random combination.
  • a general-purpose processor can be a microprocessor, or any conventional processor.
  • the method steps in the embodiments of the present application may be implemented by means of hardware, or may be implemented by means of a processor executing software instructions.
  • Software instructions can be composed of corresponding software modules, and software modules can be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art.
  • An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium.
  • the storage medium may also be a component of the processor.
  • the processor and storage medium can be located in the ASIC.
  • the ASIC can be located in a network device or a terminal device.
  • the processor and the storage medium may also exist in the network device or the terminal device as discrete components.
  • the processor may include, but is not limited to, at least one of the following computing devices that run software: a CPU, microprocessor, digital signal processor (DSP), microcontroller unit (MCU), or artificial intelligence processor.
  • each computing device may include one or more cores for executing software instructions to perform operations or processing.
  • the processor can be built into a SoC (system on a chip) or ASIC, or it can be an independent semiconductor chip.
  • the processor may further include necessary hardware accelerators, such as FPGAs, PLDs, or logic circuits for implementing dedicated logic operations.
  • the hardware can be any one or any combination of a CPU, microprocessor, DSP, MCU, artificial intelligence processor, ASIC, SoC, FPGA, PLD, dedicated digital circuit, hardware accelerator, or non-integrated discrete device, and may run necessary software, or operate without depending on software, to execute the above method flow.
  • the above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware or other arbitrary combinations.
  • the above-described embodiments may be implemented in whole or in part in the form of computer program products.
  • the computer program product includes one or more computer instructions.
  • when the computer program instructions are loaded or executed on the computer, the processes or functions described in the embodiments of the present application will be generated in whole or in part.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in one computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means.
  • wired may refer to coaxial cable, optical fiber, or digital subscriber line (DSL), etc.
  • wireless may refer to infrared, radio, or microwave.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that includes one or more sets of available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media.
  • the semiconductor medium may be a solid state drive (SSD).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Advance Control (AREA)

Abstract

A computing system, method, apparatus, and acceleration device are provided. The computing system (100) includes a host device (110) and an acceleration device (120). The host device (110) is communicatively connected to the acceleration device (120), the acceleration device is coupled to a memory (112), and the memory (112) stores first data required by a service to be processed. The host device (110) is configured to send invocation information to the acceleration device (120), where the invocation information indicates a storage address of the first data. The acceleration device (120) is configured to receive the invocation information sent by the host device (110), obtain the first data from the memory according to the storage address, and execute the task to be processed based on the first data to obtain a processing result. The host device (110) informs the acceleration device (120) of the storage address, and the acceleration device (120) obtains the data from the memory (112) directly according to the storage address, avoiding the host device (110) fetching the data from the memory (112) itself, which can reduce the amount of data transmission.

Description

Computing system, method, apparatus, and acceleration device
This application claims priority to Chinese Patent Application No. 202210032978.0, filed with the China National Intellectual Property Administration on January 12, 2022 and entitled "Computing system, method, apparatus, and acceleration device", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of computing, and in particular to a computing system, method, apparatus, and acceleration device.
Background
Heterogeneous computing refers to computing with a system composed of multiple computing devices that differ in instruction set and architecture. Because it can cost-effectively provide high-performance computing capability, enable rich computing resources, and improve end-to-end service performance, heterogeneous computing has become a focus of current research.
Typically, when one computing device uses heterogeneous computing to accelerate a task to be processed, it sends the data required by the task to another computing device with which it communicates, and that other device processes the task. The computing device that sends the data required by the task can be regarded as the host device, and the computing device that processes the task can be regarded as the acceleration device.
However, the data required by the task is stored in a storage device that communicates with the host device, so the data must go through multiple transfers, from storage device to host device to acceleration device, resulting in a large volume of data transmission during heterogeneous computing. How to reduce the volume of data transmission during heterogeneous computing is therefore a problem that urgently needs to be solved.
Summary
Embodiments of this application provide a computing system, method, apparatus, and acceleration device, which can solve the problem of the large volume of data transmission during heterogeneous computing.
To achieve the above objective, this application adopts the following technical solutions:
According to a first aspect, a computing system is provided. The computing system includes a host device and an acceleration device. The host device is communicatively connected to the acceleration device, the acceleration device is coupled to a memory, and the memory stores first data required by a service to be processed. The host device is configured to send invocation information to the acceleration device, where the invocation information indicates a storage address of the first data. The acceleration device is configured to receive the invocation information sent by the host device and obtain the first data from the memory according to the storage address. The acceleration device is further configured to execute the task to be processed based on the first data to obtain a processing result.
In the computing system of the first aspect, the memory storing the first data required by the service to be processed is coupled to the acceleration device, so the acceleration device can obtain the first data from the memory according to the invocation information and execute the task to be processed based on that data. In other words, the host device informs the acceleration device of the storage address so that the acceleration device can fetch the data from the memory directly according to the storage address and process it, instead of the host device fetching the data from the memory and then transmitting it to the acceleration device. This reduces the amount of data transmission in the computing system and thereby solves the problem of the large volume of data transmission during heterogeneous computing.
In an optional implementation, the service to be processed includes N consecutive operations, where N is an integer greater than 1. The acceleration device is further configured to execute the (i+1)-th operation according to the processing result of the i-th operation among the N operations to obtain the processing result of the (i+1)-th operation, where i is an integer and 1≤i≤N-1, and the processing result of the N-th operation is the processing result. In this way, the intermediate processing results of the multiple operations of the service do not have to be transmitted back and forth between the host device and the acceleration device, reducing the amount of data transmission and improving computing efficiency.
In another optional implementation, the processing result of the i-th operation is stored in the memory or in the acceleration device. Thus, when executing the (i+1)-th operation, the acceleration device can quickly read the processing result of the i-th operation, improving computing efficiency.
In another optional implementation, the host device is further configured to obtain proxy information and generate the invocation information according to the proxy information, where the proxy information contains a virtual storage address of the first data. In other words, the virtual storage address of the first data can be used to invoke the acceleration device to execute the task to be processed, reducing the amount of data transmission and the difficulty of implementation.
In another optional implementation, the proxy information is stored in the host device. Thus, when the host device needs to determine the invocation information, it can quickly read the proxy information, improving invocation efficiency.
In another optional implementation, the first data is stored in a memory object, and the memory object is a segment of physical address space provided by the memory.
In another optional implementation, the acceleration device is further configured to send the virtual storage address of the first data to the host device. Correspondingly, the host device is further configured to receive the virtual storage address of the first data sent by the acceleration device. The host device can then determine the proxy information according to the virtual storage address of the first data.
According to a second aspect, a computing method is provided. The method is performed by an acceleration device; the acceleration device is communicatively connected to a host device, the acceleration device is coupled to a memory, and the memory stores first data required by a service to be processed. The computing method includes: receiving invocation information sent by the host device, where the invocation information indicates a storage address of the first data; obtaining the first data from the memory according to the storage address; and executing the task to be processed based on the first data to obtain a processing result.
In another optional implementation, the service to be processed includes N consecutive operations, N being an integer greater than 1; executing the task to be processed based on the first data to obtain a processing result includes: executing the (i+1)-th operation according to the processing result of the i-th operation among the N operations to obtain the processing result of the (i+1)-th operation, where i is an integer and 1≤i≤N-1, and the processing result of the N-th operation is the processing result.
In another optional implementation, the processing result of the i-th operation is stored in the memory or in the acceleration device.
In another optional implementation, the first data is stored in a memory object, and the memory object is a segment of physical address space provided by the memory.
In another optional implementation, the method of the second aspect further includes: sending the virtual storage address of the first data to the host device.
It should be noted that for the beneficial effects of the computing method of the second aspect, reference may be made to the description of any implementation of the first aspect; details are not repeated here.
According to a third aspect, a computing method is provided. The method is performed by a host device; the host device is communicatively connected to an acceleration device, the acceleration device is coupled to a memory, and the memory stores first data required by a service to be processed. The method includes: sending invocation information to the acceleration device, where the invocation information indicates a storage address of the first data.
In another optional implementation, the service to be processed includes N consecutive operations, N being an integer greater than 1; executing the task to be processed based on the first data to obtain a processing result includes: executing the (i+1)-th operation according to the processing result of the i-th operation among the N operations to obtain the processing result of the (i+1)-th operation, where i is an integer and 1≤i≤N-1, and the processing result of the N-th operation is the processing result.
In another optional implementation, the processing result of the i-th operation is stored in the memory or in the acceleration device.
In another optional implementation, the method of the third aspect further includes: obtaining proxy information, where the proxy information contains a virtual storage address of the first data; and generating the invocation information according to the proxy information.
In another optional implementation, the proxy information is stored in the host device.
In another optional implementation, the first data is stored in a memory object, and the memory object is a segment of physical address space provided by the memory.
In another optional implementation, the method of the third aspect further includes: receiving the virtual storage address of the first data sent by the acceleration device.
It should be noted that for the beneficial effects of the computing method of the third aspect, reference may be made to the description of any implementation of the second aspect; details are not repeated here.
According to a fourth aspect, this application provides an acceleration device. The acceleration device includes a memory and at least one processor. The memory is configured to store a set of computer instructions, and when the processor executes the set of computer instructions, the operation steps of the method in the second aspect or any possible implementation of the second aspect are implemented.
According to a fifth aspect, this application provides a host device. The host device includes a memory and at least one processor. The memory is configured to store a set of computer instructions, and when the processor executes the set of computer instructions, the operation steps of the method in the third aspect or any possible implementation of the third aspect are implemented.
According to a sixth aspect, this application provides a computer-readable storage medium storing a computer program or instructions. When the computer program or instructions are executed, the operation steps of the methods described in the above aspects or their possible implementations are implemented.
According to a seventh aspect, this application provides a computer program product including instructions. When the computer program product runs on a management node or a processor, the management node or the processor executes the instructions to implement the operation steps of the method described in any of the above aspects or any possible implementation thereof.
According to an eighth aspect, this application provides a chip including a memory and a processor. The memory is configured to store computer instructions, and the processor is configured to call and run the computer instructions from the memory to implement the operation steps of the method described in any of the above aspects or any possible implementation thereof.
On the basis of the implementations provided in the above aspects, this application may be further combined to provide more implementations.
Brief Description of the Drawings
FIG. 1 is a first schematic architectural diagram of a computing system according to an embodiment of this application;
FIG. 2 is a second schematic architectural diagram of a computing system according to an embodiment of this application;
FIG. 3 is a third schematic architectural diagram of a computing system according to an embodiment of this application;
FIG. 4 is a first schematic flowchart of a computing method according to an embodiment of this application;
FIG. 5 is a detailed schematic flowchart of S430 according to an embodiment of this application;
FIG. 6 is a schematic diagram of a proxy class to delegate class mapping according to an embodiment of this application;
FIG. 7 is a second schematic flowchart of a computing method according to an embodiment of this application;
FIG. 8 is a third schematic flowchart of a computing method according to an embodiment of this application;
FIG. 9 is a first schematic diagram of application acceleration according to an embodiment of this application;
FIG. 10 is a second schematic diagram of application acceleration according to an embodiment of this application.
Detailed Description of Embodiments
To keep the description of the following embodiments clear and concise, a brief introduction to the technical terms that may be involved is given first.
1. Near-data computing refers to scheduling the computation on a piece of data onto the computing device closest to where that data is stored. The distance between a data item's storage location and a computing device can be defined as the sum of the time to transfer the data to the computing device and the time for the computing device to process the data. In other words, the computing device closest to a data item's storage location may be the one, among multiple computing devices, that can obtain and process that data fastest.
Near-data computing can significantly reduce unnecessary data copies and improve the efficiency of data processing.
2. Memory object: in the embodiments of this application, this may refer to an address space used to store a group of data. In the embodiments of this application, a memory object may also be called a real object, a proxied object, or the like.
In the embodiments of this application, a memory object may store a resource object, a persistent object, a connection object, or the like, without limitation. A resource object may refer to a file handle (FD) for reading and writing a storage entity; a persistent object may refer to two consecutive operations that both need to process at least the same group of data; a connection object may refer to a network connection operation or the like.
3. Proxy object: in the embodiments of this application, this may refer to an object associated with a memory object that can control access to, and the service actions of, that memory object.
4. Heterogeneous proxy object: in the embodiments of this application, this may refer to the case where the memory object and the proxy object belong to different computing devices. For example, the proxy object is located on the host device while the memory object is located on the acceleration device.
5. Proxying: in the embodiments of this application, this may refer to delegating a memory object programmatically in order to control access to, and the service actions of, that memory object. For example, a proxy object of the memory object can be generated programmatically, and the access and service actions of the memory object can be controlled through that proxy object.
6. Serialization is the process of converting an object into a sequence form that can be transmitted.
7. Proxy class: a collection of proxy objects with common attributes and behavior, i.e. a template from which proxy objects can be generated. Delegate class: a collection of proxied objects with common attributes and behavior, i.e. a template from which proxied objects can be generated. A proxy class can pre-process, filter, and forward messages for its delegate class, and perform follow-up processing after a message has been executed by the delegate class. The proxy class does not implement the actual service; instead it uses the delegate class to complete the service and wraps the execution result.
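The proxy-class/delegate-class relationship in term 7 can be sketched as a small program. The class and method names below (such as `DelegateMatrixStore`, `MatrixStoreProxy`, `read`) are hypothetical illustrations, and an in-process reference stands in for the cross-device storage address used in the actual scheme:

```python
class DelegateMatrixStore:
    """Delegate class: holds the real data (the 'memory object')."""
    def __init__(self, data):
        self._data = data          # data that would live in accelerator-side memory

    def read(self):
        return self._data


class MatrixStoreProxy:
    """Proxy class: controls access to the delegate, adding pre/post processing."""
    def __init__(self, delegate):
        # In the patent's scheme this field would hold an address, not a reference.
        self._delegate = delegate
        self.calls = 0

    def read(self):
        self.calls += 1                    # pre-processing: record/filter the message
        result = self._delegate.read()     # forward the message to the delegate
        return list(result)                # post-processing: wrap the result


store = DelegateMatrixStore((1, 2, 3))
proxy = MatrixStoreProxy(store)
print(proxy.read())  # [1, 2, 3]
```

The caller only ever touches the proxy; the delegate can therefore live anywhere the proxy's stored reference (or, in the heterogeneous case, address) can reach.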
The implementation of the embodiments of this application is described in detail below with reference to the accompanying drawings.
FIG. 1 is a first schematic architectural diagram of a computing system according to an embodiment of this application. Referring to FIG. 1, the computing system 100 includes a host device 110 and an acceleration device 120, which are communicatively connected.
The host device 110 and the acceleration device 120 are introduced separately below with reference to the drawings.
The host device 110 is a computing device. For example, the host device 110 may be a server, a personal computer, a mobile phone, a tablet, a smart vehicle, or another device, without limitation. The host device 110 may include a processor 111 and a communication interface 113, which are coupled to each other.
Optionally, the host device 110 further includes a memory 112, and the processor 111, the memory 112, and the communication interface 113 are coupled to one another. The embodiments of this application do not limit the specific manner of coupling; FIG. 1 shows connection through a bus 114 as one example. The bus 114 is represented by a thick line in FIG. 1, and the connections between other components are merely illustrative and not limiting. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 1, but this does not mean there is only one bus or one type of bus.
The communication interface 113 is used to communicate with other devices or communication networks. The communication interface 113 may be a transceiver or an input/output (I/O) interface. Taking an I/O interface as an example, the I/O interface may be used to communicate with devices located outside the host device 110: an external device inputs data to the host device 110 through the I/O interface, and after processing the input data, the host device 110 may send the output result back to the external device through the I/O interface.
The processor 111 is the computing and control core of the host device 110. It may be a central processing unit (CPU) or another specific integrated circuit. The processor 111 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. In practice, the host device 110 may also include multiple processors, and the processor 111 may include one or more processor cores. An operating system and other software programs are installed on the processor 111, so that the processor 111 can access the memory 112 and various peripheral component interconnect (PCI)-express (PCIe) devices. The processor 111 may be connected to the memory 112 through a double data rate (DDR) bus or another type of bus.
The memory 112 is the main memory of the host device 110. The memory 112 is generally used to store the various running software of the operating system, the instructions executed by the processor 111, the input data required for running those instructions, the data generated after running them, and so on. To increase the access speed of the processor 111, the memory 112 needs to be fast. In a conventional computer device, dynamic random access memory (DRAM) is usually used as the memory 112. Besides DRAM, the memory 112 may also be another random access memory, such as static random access memory (SRAM), or a read-only memory (ROM), which may for example be a programmable read-only memory (PROM) or an erasable programmable read-only memory (EPROM). This embodiment does not limit the number or type of memories 112.
Optionally, for persistent storage of data, the computing system 100 may also be provided with a data storage system 130. The data storage system 130 may be located outside the host device 110 (as shown in FIG. 1) and exchange data with the host device 110 over a network. Optionally, the data storage system 130 may also be located inside the host and exchange data with the processor 111 through the PCIe bus 114, in which case the data storage system 130 appears as a hard disk.
The acceleration device 120 is a computing device. For example, the acceleration device 120 may be a server, a personal computer, a mobile phone, a tablet, a smart vehicle, or another device, without limitation. The acceleration device 120 may include a processor 121 and a communication interface 123, which are coupled to each other.
The acceleration device 120 further includes a memory 122, and the processor 121, the memory 122, and the communication interface 123 are coupled to one another. In practice, the memory 122 may be located inside the acceleration device 120 (as shown in FIG. 1) and exchange data with the processor 121 through a bus, in which case the memory 122 appears as a hard disk; or the memory 122 may be located outside the acceleration device 120 and exchange data with it over a network.
The acceleration device 120 may be used to execute tasks to be processed, such as matrix computation, graphics operations, network data exchange, and disk reads and writes, and may be implemented by one or more processors. The processor may be any of a CPU, a graphics processing unit (GPU), a neural-network processing unit (NPU), a tensor processing unit (TPU), an FPGA, or an ASIC. A GPU, also called a display core, visual processor, or display chip, is a microprocessor specialized in image operations on personal computers, workstations, game consoles, and some mobile devices (such as tablets and smartphones). An NPU simulates human neurons and synapses at the circuit level and directly processes large numbers of neurons and synapses with a deep-learning instruction set, completing the processing of a group of neurons in one instruction. An ASIC is an integrated circuit product suited to a single specific purpose.
The processor 121, memory 122, and communication interface 123 are similar to the processor 111, memory 112, and communication interface 113 described above; for a more detailed description of the former, refer to the description of the latter, which is not repeated here.
Notably, the memory 122 may be a non-volatile memory (NVM), such as a ROM, a flash memory, or a solid state drive (SSD), so that the memory 122 can store larger data, such as the data required by a service to be processed. In existing acceleration devices, by contrast, there is usually only a buffer, which can temporarily hold a small amount of data and cannot retain it when the device stops working. Existing acceleration devices therefore cannot store the data required by a service to be processed, whereas the acceleration device 120 provided by the embodiments of this application can use the memory 122 to store such data.
In a first possible example, as shown in FIG. 1, the acceleration device 120 may be inserted into a card slot on the motherboard of the host device 110 and exchange data with the processor 111 through the bus 114. In this case, the bus 114 may be a PCIe bus, or a bus of the compute express link (CXL), universal serial bus (USB), or another protocol, to support data transmission between the acceleration device 120 and the host device 110.
In a second possible example, refer to FIG. 2, a second schematic architectural diagram of a computing system according to an embodiment of this application. The difference from the computing system shown in FIG. 1 is that in FIG. 2 the acceleration device 120 is not plugged into a card slot on the motherboard of the host device 110 but is a device independent of the host device 110. In this case, the host device 110 may connect to the acceleration device 120 through a wired network such as a network cable, or through a wireless network such as a wireless hotspot (Wi-Fi) or Bluetooth.
That is to say, in the embodiments of this application, the acceleration device being coupled to a memory may mean that the acceleration device includes the memory, or that a memory is connected to the acceleration device.
In FIG. 1, the host device 110 may be used to send invocation information to the acceleration device 120 to invoke the acceleration device 120 to process a task to be processed. The acceleration device 120 may be used to execute the task according to the invocation information sent by the host device 110 and feed the obtained processing result back to the host device 110. In this process, the host device 110 and the acceleration device 120 together can implement the computing method provided by the embodiments of this application, solving the problem of the large volume of data transmission during heterogeneous computing. For a specific implementation, refer to the description in the method embodiments below; details are not given here.
Optionally, as shown in FIG. 1, the computing system 100 further includes a client device 130, through which a user can input data to the host device 110. For example, the client device 130 inputs data to the host device 110 through the communication interface 113, and after processing the input data, the host device 110 sends the output result back to the client device 130 through the communication interface 113. The client device 130 may be a terminal device, including but not limited to a personal computer, a server, a mobile phone, a tablet, or a smart vehicle.
In the computing system shown in FIG. 1, the types of the processor in the host device 110 and the processor in the acceleration device 120 may be the same or different, without limitation. That is, regardless of whether the instruction sets and architectures of the host device 110 and the acceleration device 120 are the same, the two can cooperate to implement the computing method provided by the embodiments of this application.
In the computing system shown in FIG. 1, the operating systems running on the host device 110 and the acceleration device 120 may be the same or different, without limitation. For example, the operating system running on the host device 110 is an operating system of a first security level and the operating system running on the acceleration device 120 is an operating system of a second security level, where the first security level is lower than the second; the host device 110 can then place computation in the acceleration device with the higher security level for processing, improving data security.
FIG. 3 is a third schematic architectural diagram of a computing system according to an embodiment of this application. Referring to FIG. 3, the computing system 300 includes a host device 310 and an acceleration device 320, which are communicatively connected.
In some possible embodiments, the computing system 300 shown in FIG. 3 may be implemented by hardware circuits, or by hardware circuits combined with software, to realize the corresponding functions.
In other possible embodiments, the computing system 300 shown in FIG. 3 may represent a software architecture that can run on the computing system 100 shown in FIG. 1; that is, the modules/engines/units in the computing system 300 may run on the host device 110 or the acceleration device 120 shown in FIG. 1.
Notably, when the host device 310 is implemented by multiple software modules/engines/units, the host device 310 may also be called a computing apparatus. As shown in FIG. 3, the host device 310 includes a runtime module 311, a compilation module 312, and an offload execution engine 313. The acceleration device 320 may likewise be called an acceleration apparatus; as shown in FIG. 3, the acceleration device 320 includes a communication module 321 and a processing module 322.
It can be understood that the above modules/engines/units may be implemented by hardware in the form of instructions. For example, the runtime module 311, compilation module 312, and offload execution engine 313 may be executed as instructions by the processor 111 of the host device 110 to realize the corresponding functions. Similarly, the communication module 321 and processing module 322 may be executed as instructions by the processor 121 of the acceleration device 120 to realize the corresponding functions.
Taking the computing system 300 shown in FIG. 3 as an example of a software architecture, the functions of the structures included in each module/engine in the computing system 300 are described in detail below.
The compilation module 312 takes the form of computer instructions that can be executed by a processor to implement the function of compiling instructions. The compilation module 312 may include a proxy generation unit 312A, which can generate the structure of a proxy class at compile time. In the embodiments of this application, the structure of a proxy class may include proxy functions and proxy information. The proxy information may take the form of address data indicating the address of a memory object; that is, the proxy information may be a pointer. A proxy function may take the form of a group of computer instructions that logically encapsulate message pre-processing, filtering, and passing for the delegate class, and that implement the conversion between the services of the proxy class and the delegate class as well as result wrapping.
For example, the proxy generation unit 312A includes a proxy class generator and a proxy function generator. The proxy class generator is used to generate the structure of the proxy class of a delegate class; that structure may include two groups of data, one for recording proxy information and the other for recording proxy functions. That is, at this point the two groups of data in the proxy class structure do not yet record the proxy class's proxy functions and proxy information. The proxy function generator is used to generate proxy functions in the proxy class structure, such as a function that pushes information to a heterogeneous device. In this way, the proxy class generator and the proxy function generator together can generate the structure of the proxy class of a delegate class.
The runtime module 311 takes the form of computer instructions that can be executed by a processor to perform a group of operations (or logic) at runtime and to perform offload execution actions. The runtime module 311 may include an execution unit 311A and an identification unit 311B. The identification unit 311B is used to identify the objects currently run by the host device 310; for example, the identification unit 311B includes a proxy object identifier that can identify whether an object currently run by the host device 310 is a proxy object. The execution unit 311A is used to control access to memory objects through proxy objects; for example, the execution unit 311A includes a push execution logic unit that can convert the host device 310's operations on a proxy object into push execution logic, generate invocation information, and trigger the offload execution engine 313 to send the invocation information to the acceleration device 320, thereby controlling access to the memory object in the acceleration device 320.
The offload execution engine 313 takes the form of computer instructions that can be executed by a processor to implement the function of sending information to other computing devices. For example, the offload execution engine 313 may send invocation information to the acceleration device 320 when triggered by the push execution logic unit.
The communication module 321 is used to exchange data with other devices. For example, the communication module 321 may receive the invocation information sent by the host device 310.
The processing module 322 is used to implement the function of executing the task to be processed. For example, the processing module 322 may execute the task according to the invocation information sent by the host device 310.
The modules/engines of the computing system 300 together can implement the computing method provided by the embodiments of this application, solving the problem of the large volume of data transmission during heterogeneous computing. More detailed descriptions of the modules/engines/units of the computing system 300 can be obtained directly from the related descriptions in the embodiments shown in FIG. 4, FIG. 5, FIG. 7, and FIG. 8 below, and are not repeated here.
Having introduced the computing system provided by this application, the computing method provided by the embodiments of this application is described below with reference to the drawings.
Refer to FIG. 4, a first schematic flowchart of a computing method provided by this application. The computing method may be applied to the above computing system and may be executed by the host device and the acceleration device in it. Taking the computing system 100 shown in FIG. 1 as an example, in FIG. 4 the host device 310 may include the structures of the host device 110 shown in FIG. 1, and the acceleration device 320 may include the structures of the acceleration device 120 shown in FIG. 1. Moreover, the acceleration device 320 is coupled to a memory 322, which may be located outside or inside the acceleration device 320; FIG. 4 shows the memory 322 inside the acceleration device 320 only as one possible example. The memory 322 stores first data required by the service to be processed.
As shown in FIG. 4, the computing method provided by this embodiment includes the following steps S410 to S430.
S410: The host device 310 sends invocation information to the acceleration device 320. Correspondingly, the acceleration device 320 receives the invocation information sent by the host device 310.
The invocation information indicates the storage address of the first data. For example, the invocation information may include the physical storage address or the virtual storage address of the first data.
In the embodiments of this application, the service to be processed may be a big data service, an artificial intelligence service, a cloud computing service, or any other service. The computation involved in these services may include matrix computation, graphics operations, network data exchange, disk reads and writes, and so on, without limitation.
The data implementing the service to be processed may include one or more groups. The first data is the data among these one or more groups that needs to be processed by the acceleration device 320. For example, the first data may include input data and computing instructions (also called functions); a computing device can process the service by processing the input data with the computing instructions. Taking matrix computation as the computation involved in the service as an example, the first data required by the service includes multiple matrices and matrix computation functions, and the computing device can use the matrix computation functions to operate on the matrices and obtain a processing result.
S420: The acceleration device 320 obtains the first data from the memory 322 according to the storage address.
When the invocation information carries the virtual storage address of the first data, the acceleration device 320 may convert the virtual storage address into the physical storage address of the first data and then read the first data from the memory 322 according to the physical storage address.
In the embodiment shown in FIG. 4, the memory 322 is located inside the acceleration device 320, and the acceleration device 320 may read the first data stored in the memory 322 through a bus according to the storage address.
In embodiments where the memory 322 is located outside the acceleration device 320, the acceleration device 320 may read the first data stored in the memory 322 over a network.
S430: The acceleration device 320 executes the task to be processed based on the first data to obtain a processing result.
For example, if the first data includes matrix 1, matrix 2, and matrix 3 together with a matrix inversion function and a matrix summation function, and the task requires computing (matrix 1 + matrix 2 + matrix 3)^-1, the acceleration device may first use the summation function to compute the sum of matrices 1, 2, and 3 (denoted matrix 4), and then use the inversion function to compute the inverse of matrix 4; matrix 4^-1 is the processing result.
The processing result is the result obtained by processing the first data. When the first data is only part of the data implementing the service, the processing result may be an intermediate or final result of the task; when the first data is all of that data, the processing result may be the final result of the task.
Optionally, the service to be processed may include N consecutive operations, where N is an integer greater than 1. In this case, refer to FIG. 5, a detailed schematic flowchart of S430 provided by an embodiment of this application; S430 shown in FIG. 4 may include:
S431: The acceleration device 320 executes the (i+1)-th operation according to the processing result of the i-th operation among the N operations, obtaining the processing result of the (i+1)-th operation.
Here i is an integer, 1≤i≤N-1, and the processing result of the N-th operation is the processing result.
Optionally, in the embodiment shown in FIG. 4, where the memory 322 is inside the acceleration device 320, the processing result of the i-th operation may be stored in the memory 322.
In embodiments where the memory 322 is outside the acceleration device 320, the processing result of the i-th operation may be stored in the memory 322 or in a memory inside the acceleration device 320, without limitation.
For example, suppose the service includes three consecutive operations: operation 1 uses the matrix summation function to compute the sum of matrices 1, 2, and 3 (denoted matrix 4); operation 2 uses the matrix inversion function to compute the inverse of matrix 4 (matrix 4^-1); operation 3 uses the summation function to compute the sum of matrix 4^-1 and matrix 1 (denoted matrix 5). Then matrix 4 and matrix 4^-1 are intermediate processing results and matrix 5 is the processing result. Matrix 4 and matrix 4^-1 may both be stored in the memory 322, and matrix 5 may also be stored in the memory 322.
In S431, the N consecutive operations of the service are all processed on the acceleration device 320; that is, the intermediate processing results of the multiple operations of the service do not have to be transmitted back and forth between the host device and the acceleration device, which reduces the amount of data transmission and improves computing efficiency.
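The chaining of S431 can be sketched as follows. This is a minimal illustration, not the actual accelerator implementation: the function and operation names are hypothetical, and simple arithmetic stands in for the matrix example above.

```python
def run_on_accelerator(first_data, operations):
    """Execute N consecutive operations entirely on the accelerator side (S431).

    Each operation consumes the previous operation's result; intermediate
    results stay in accelerator-attached memory, and only the final result
    is returned to the host."""
    result = first_data
    for op in operations:
        result = op(result)  # the intermediate result never crosses to the host
    return result


# Hypothetical 3-operation pipeline mirroring the structure of the example:
# op1 aggregates, op2 transforms the aggregate, op3 combines it with a constant.
ops = [sum, lambda x: -x, lambda x: x + 10]
print(run_on_accelerator([1, 2, 3], ops))  # 4
```

Only the final value (the result of the N-th operation) is what S440 would send back to the host.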
Based on S410 to S430, the host device 310 informs the acceleration device 320 of the storage address, so that the acceleration device 320 can obtain the data from the memory 322 directly according to the storage address and process it. This avoids the host device 310 fetching the data from the memory 322 and then transmitting it to the acceleration device 320; that is, it avoids the data going through multiple transfers from storage device to host device to acceleration device. This reduces the amount of data transmission in the computing system, saves bandwidth resources, and improves offload-push efficiency, thereby solving the problem of the large volume of data transmission during heterogeneous computing. In addition, executing S410 to S430 also preserves the instruction structure of the first data required by the service, avoiding large-scale code refactoring and improving development efficiency.
Notably, when S410 above is implemented by software modules/engines/units included in the host device 310 or the acceleration device 320, it may be performed by the offload execution engine 313 and the communication module 321 shown in FIG. 3. For example, the offload execution engine 313 sends the invocation information to the acceleration device 320, and the communication module 321 receives the invocation information sent by the host device 310.
When S420 and S430 above are implemented by software modules/engines/units included in the acceleration device 320, they may be performed by the processing module 322 shown in FIG. 3. For example, S420, S430, and S431 may all be performed by the processing module 322.
Optionally, the first data is stored in a memory object, which is a segment of physical address space provided by the memory 322. The first data may be a resource object, a persistent object, or a connection object; that is, in the method embodiment shown in FIG. 4, the acceleration device 320 can be invoked via proxying to process resource objects, persistent objects, connection objects, and the like, thereby supporting heterogeneous computing for such objects, avoiding code refactoring, and saving cost.
Optionally, after S430, the method shown in FIG. 4 further includes:
S440: The acceleration device 320 sends the processing result to the host device 310. Correspondingly, the host device 310 receives the processing result sent by the acceleration device 320.
Notably, the host device 310 may also include a communication module. When S440 above is implemented by software modules/engines/units included in the host device 310 or the acceleration device 320, it may be performed by the communication module 321 shown in FIG. 3 and the communication module in the host device 310. For example, the communication module 321 sends the processing result to the host device 310, and the communication module in the host device 310 receives it.
Optionally, before S410, the method shown in FIG. 4 further includes:
S401: The host device 310 obtains proxy information and generates the invocation information according to the proxy information.
The proxy information contains the virtual storage address of the first data. Optionally, the proxy information is stored in the host device 310.
Optionally, before S410, the method shown in FIG. 4 further includes:
S402: The acceleration device 320 sends the virtual storage address of the first data to the host device 310. Correspondingly, the host device 310 receives the virtual storage address of the first data sent by the acceleration device 320. The host device can then determine the proxy information according to the virtual storage address of the first data.
Combining S401 and S402, the manner in which the host device 310 obtains the proxy information may include: the host device 310 receives the virtual storage address of the first data sent by the acceleration device 320 and determines the proxy information according to that virtual storage address.
Notably, when S401 and S402 above are implemented by software modules/engines/units included in the host device 310 or the acceleration device 320, they may be performed by the runtime module 311 and the communication module 321 shown in FIG. 3 together with the communication module in the host device 310. For example, S401 is performed by the runtime module 311; the communication module 321 sends the virtual storage address of the first data to the host device 310; and the communication module in the host device 310 receives it.
On the basis of the above method embodiments, the host device 310 and the acceleration device 320 can together implement a compilation process and a mapping process. The compilation process can generate the structure of a proxy class at compile time; the mapping process can write proxy information into the proxy class structure and determine the proxy object. S401 and S402 may be included in the mapping process, and S410 to S440 may be regarded as the execution process.
The detailed implementations of the compilation process, the mapping process, and the execution process are introduced below in turn.
1. Compilation process, including the following steps 1.1 to 1.4.
Step 1.1: The host device 310 obtains service data and, according to an identifier in the service data, determines whether the acceleration device 320 needs to be invoked to execute a first object in the service data. If the acceleration device 320 needs to be invoked to execute the first object, step 1.2 is executed; otherwise, step 1.4 is executed.
Service data is the data implementing the service to be processed; for a description of the service, refer to the above method embodiments, not repeated here.
The service data may include one or more first objects. A first object here may refer to an object in the programming sense, i.e. an abstraction of a real-world thing into computer instructions and/or data. For example, a first object may be a resource object, a persistent object, or a connection object, without limitation.
The identifier in the service data may indicate whether a first object in the service data needs to be proxied. If the identifier indicates that the first object needs to be proxied, the first object needs to be processed by invoking the acceleration device; if it indicates that the first object does not need to be proxied, the first object does not need to be processed by invoking the acceleration device.
The identifier in the service data may be preset, for example by a user, who can add the identifier at a preset position in the service data to indicate whether a first object in the service needs to be proxied. The preset position may be the data corresponding to the service-offload flow portion of the service data.
Step 1.2: The host device 310 determines the delegate class corresponding to the first object and generates the structure of the proxy class corresponding to that delegate class.
Step 1.3: The host device 310 generates the proxy functions in the proxy class structure.
For example, the host device 310 generates the structure of the proxy class of the delegate class corresponding to the first object. The structure may include two groups of data: the first group may record proxy information and the second may record proxy functions. In step 1.2, these two groups of data do not yet record the proxy class's proxy functions and proxy information. The host device 310 may generate, in the second group of data, proxy functions that point to functions in the delegate class.
For example, FIG. 6 is a schematic diagram of a proxy class to delegate class mapping provided by an embodiment of this application. Referring to FIG. 6, the second group of data in the proxy class includes proxy function 1 and proxy function 2; the delegate class includes function A, function B, and data A; proxy function 1 points to function A and proxy function 2 points to function B. The first group of data in the proxy class stores the proxy information, which may indicate the storage address of a memory object.
Step 1.4: The host device 310 generates an executable program from the service data.
Steps 1.1 to 1.4 produce the proxy class to delegate class mapping shown in FIG. 6. A proxy object can be generated from the proxy class, and a proxied object (i.e. a memory object) can be generated from the delegate class. That is, according to the above mapping, a proxy object associated with a memory object can be generated; the detailed implementation is given in the mapping process below.
Notably, when steps 1.1 to 1.4 above are implemented by software modules/engines/units included in the host device 310, they may be performed by the compilation module 312 shown in FIG. 3. For example, steps 1.1 to 1.4 may be performed by the compilation module 312, with step 1.2 performed by the proxy class generator of the proxy generation unit 312A and step 1.3 performed by the proxy function generator of the proxy generation unit 312A.
2. Mapping process. Refer to FIG. 7, a second schematic flowchart of a computing method provided by an embodiment of this application. The computing method includes the following steps S701 to S709.
S701: The host device 310 sends the service data to the acceleration device 320. Correspondingly, the acceleration device 320 receives the service data sent by the host device 310.
For a description of the service data, refer to the above method embodiments, not repeated here.
S702: The acceleration device 320 determines a second object from the service data.
The second object is similar to the first object; for a description, refer to the description of the first object above, not repeated here.
S703: The acceleration device 320 determines whether the second object is a proxied object.
When the second object is not a proxied object, S704 is executed; when the second object is a proxied object, S706 is executed.
Whether the object is a proxied object can be determined from the second object's return information. For example, when the return information of the second object indicates a proxy of the second object, the object is a proxied object; when the return information indicates the second object itself, the object is not a proxied object.
S704: The acceleration device 320 generates object information.
The object information may indicate the second object. For example, the acceleration device 320 may serialize the second object to obtain a transmission sequence, and that transmission sequence is the object information.
S705: The acceleration device 320 sends the object information to the host device 310. Correspondingly, the host device 310 receives the object information sent by the acceleration device 320.
S706: The acceleration device 320 stores the second object in the memory as the first data.
S707: The acceleration device 320 obtains the storage address of the first data.
S708: The acceleration device 320 sends the storage address of the first data to the host device 310. Correspondingly, the host device 310 receives the storage address of the first data sent by the acceleration device 320.
S709: The host device 310 determines the proxy information of the second object according to the storage address of the first data.
For example, assuming the host device 310 has already generated the second object's proxy class through the above compilation process, in S709 the host device 310 can use that proxy class to generate the second object's structure and then write the storage address of the first data into the first group of data of that structure as the proxy information.
S708 corresponds to S402 above and S709 corresponds to S401 above; for S707 to S709, the above embodiments may also be referred to.
Before S701, the method may further include: the host device 310 determines one acceleration device among multiple acceleration devices as the acceleration device 320. For example, the host device 310 may determine, among the multiple acceleration devices, the one closest to the host device 310's memory as the acceleration device 320. This can significantly reduce unnecessary data copies, improve data processing efficiency, and realize near-data computing for the invocation.
In S701 to S709, a proxy object associated with a memory object can be generated. The host device 310 can then invoke the memory object in the acceleration device 320 through the proxy information and proxy functions in the proxy object; the detailed implementation is given in the execution process below.
Notably, when S701 to S709 above are implemented by software modules/engines/units included in the host device 310 or the acceleration device 320, they may be performed by the runtime module 311, communication module 321, and processing module 322 shown in FIG. 3 together with the communication module in the host device 310. For example, S701, S705, and S708 are performed by the host device 310's communication module and the communication module 321; S702 to S704, S706, and S707 are performed by the processing module 322; and S709 is performed by the runtime module 311.
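The serialization in S704 can be sketched with Python's standard `pickle` module; this is only an illustration of turning an object into a transmission sequence and back, and the function names are hypothetical:

```python
import pickle


def make_object_info(obj):
    """S704: serialize a non-proxied object into a transmission sequence."""
    return pickle.dumps(obj)


def restore_object(blob):
    """Receiving side: rebuild the object from the transmission sequence."""
    return pickle.loads(blob)


second_object = {"name": "result", "values": [1, 2, 3]}
blob = make_object_info(second_object)  # bytes suitable for sending over the link (S705)
assert restore_object(blob) == second_object
```

Objects such as file handles or live network connections cannot be serialized this way, which is exactly why the S706 branch stores them accelerator-side as memory objects and returns only an address.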
3. Execution process. Refer to FIG. 8, a third schematic flowchart of a computing method provided by an embodiment of this application. The computing method includes the following steps S801 to S808.
S801: The host device 310 executes the service to be processed based on the service data and obtains a third object of the service during execution.
For a description of the service data, refer to the above method embodiments, not repeated here. The third object is similar to the first object; for a description, refer to the description of the first object above.
For example, the host device 310 may obtain the third object by determining, when executing a function in the service data, the object corresponding to that function as the third object.
S802: The host device 310 determines whether the third object is a proxy object.
For example, assuming multiple proxy objects were determined through the above mapping process, if an object identical to the third object exists among them, the third object can be judged to be a proxy object; otherwise, the third object is not a proxy object.
When the third object is a proxy object, S803 is executed; otherwise, execution returns to S801 until the processing of the service is completed.
S803: The host device 310 obtains the proxy information of the third object.
S804: The host device 310 generates the invocation information according to the proxy information.
The invocation information may include the virtual storage address of the first data and function indication information. The function indication information instructs the acceleration device 320 to execute a function in the memory object. For example, referring to FIG. 6, the function indication information may include proxy function 1, in which case it instructs the acceleration device 320 to execute function A in the memory object. In this way, the invocation information can invoke the acceleration device 320 to execute a particular function in a particular memory object, refining the invocation granularity and improving invocation efficiency.
S805: The host device 310 sends the invocation information to the acceleration device 320. Correspondingly, the acceleration device 320 receives the invocation information sent by the host device 310.
S806: The acceleration device 320 obtains the first data from the memory according to the storage address.
S807: The acceleration device 320 executes the task to be processed based on the first data to obtain a processing result.
S808: The acceleration device 320 sends the processing result to the host device 310. Correspondingly, the host device 310 receives the processing result sent by the acceleration device 320.
For S803 to S808, refer to the related descriptions of the above method embodiments (including S410 to S440), not repeated here.
In S801 to S808, when the host device 310 encounters, while executing the service, an object that needs to be processed by the acceleration device 320, it can invoke the acceleration device 320 through the invocation information without interfering with the host device 310's own processing of the task, and the actions executed at the acceleration device 320 can be unified with those executed at the host device 310. In other words, the control flows of the host device 310 and the acceleration device 320 can be unified, which both reduces the amount of data transmission and reduces the fragmentation of programming.
Notably, when S801 to S808 above are implemented by software modules/engines/units included in the host device 310 or the acceleration device 320, they may be performed by the runtime module 311, offload execution engine 313, communication module 321, and processing module 322 shown in FIG. 3 together with the communication module in the host device 310. For example, S808 is performed by the host device 310's communication module and the communication module 321; S806 and S807 are performed by the processing module 322; S801 to S804 are performed by the runtime module 311; and S805 is performed by the offload execution engine 313.
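The round trip of S804 to S807 can be sketched as follows. This is a minimal single-process illustration, not the actual cross-device protocol: the dictionary-based `MEMORY` table, the addresses, and the function names are all hypothetical, with a dict lookup standing in for the virtual-to-physical address translation and memory read:

```python
# Accelerator-side memory objects, keyed by (virtual) storage address.
MEMORY = {
    0x1000: {"data": [3, 1, 2], "functions": {"sort": sorted, "total": sum}},
}


def build_call_info(address, function_name):
    """Host side (S804): pack the storage address and a function indicator."""
    return {"address": address, "function": function_name}


def handle_call_info(call_info):
    """Accelerator side (S806-S807): fetch the memory object and run the named function."""
    obj = MEMORY[call_info["address"]]             # S806: fetch first data by address
    func = obj["functions"][call_info["function"]]
    return func(obj["data"])                       # S807: execute; result returned in S808


print(handle_call_info(build_call_info(0x1000, "sort")))  # [1, 2, 3]
```

Note what crosses the host-accelerator boundary here: only the small `call_info` record and the final result, never the stored data itself, which is the bandwidth saving the embodiments claim.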
On the basis of the above method embodiments, the embodiments are further described below with several examples.
Example 1: Refer to FIG. 9, a first schematic diagram of application acceleration provided by an embodiment of this application.
An application runs on the host device 310; the application includes storage read/write operations and network input/output operations and can be used to implement the service to be processed. The host device 310 also includes a storage-file-handle proxy object and a network-I/O-operation proxy object. An acceleration process runs on the acceleration device 320; the acceleration process includes the storage file handle and the network input/output operations.
Using the method embodiments shown in FIG. 4 to FIG. 8, the host device 310 can, through the storage-file-handle proxy object and the network-I/O-operation proxy object, invoke the acceleration device 320 to execute the storage-file-handle and network I/O operations, which reduces the amount of data transmission in the computing system, saves bandwidth resources, and improves offload-push efficiency. Moreover, in this process the host device 310 operates on proxy objects rather than on the storage file handle and the network I/O operations themselves. This solves the problem that resource objects, persistent objects, connection objects, and the like cannot be serialized and returned to the host device, so that the host device could not retain a flow that is not offloaded for push execution and refactoring modifications would be required. That is, heterogeneous computing support for resource objects, persistent objects, connection objects, and the like can be achieved, avoiding code refactoring and saving cost.
Notably, storage read/write operations usually occur in smart SSD scenarios, and network I/O operations usually occur in smart NIC scenarios. In other words, the acceleration device 320 shown in FIG. 9 may be a smart SSD or a smart NIC.
Example 2: Refer to FIG. 10, a second schematic diagram of application acceleration provided by an embodiment of this application.
An application runs on the host device 310; the application includes data processing operation 1 and data processing operation 2 and can be used to implement the service to be processed. Operations 1 and 2 are consecutive, i.e. there is a temporal dependency between them; for example, the processing result of operation 1 is the input data of operation 2. The host device 310 also includes a data-processing-operation proxy object. An acceleration process runs on the acceleration device 320; the acceleration process includes data processing operations 1 and 2.
Using the method embodiments shown in FIG. 4 to FIG. 8, the host device 310 can invoke the acceleration device 320 through the data-processing-operation proxy object to execute operations 1 and 2. In this way, the processing result that the acceleration device 320 returns to the host device 310 is the result of operation 2, and the result of operation 1 is no longer fed back to the host device 310. This avoids the intermediate results of multiple operations of the service being transmitted back and forth between the host device and the acceleration device, reducing the amount of data transmission and improving computing efficiency.
In the above computing method embodiments, compared with heterogeneous offload using unified memory management, the computing method provided by the embodiments of this application reduces the sharing and use of memory pages between the host device and the acceleration device, improves the utilization efficiency of unified memory addressing, and saves bandwidth between the two. Compared with local heterogeneous offload without USM, the computing method provided by the embodiments of this application reduces the amount of direct memory access (DMA) data between the host device and the acceleration device, improves offload efficiency, and saves bandwidth between the host and heterogeneous devices. Compared with remote push offload, the computing method provided by the embodiments of this application reduces network data transmission between the host device and the acceleration device, reduces the transmission of the object information on which the control flow depends, improves offload efficiency, and saves bandwidth between the host device and the acceleration device.
可以理解的是,为了实现上述实施例中功能,主机设备和加速设备包括了执行各个功能相应的硬件结构和/或软件模块。本领域技术人员应该很容易意识到,结合本申请中所公开的实施例描述的各示例的单元及方法步骤,本申请能够以硬件或硬件和计算机软件相结合的形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用场景和设计约束条件。
示例的,当上述图4、图5、图7、图8所示的方法由软件模块/单元实现时,图3中的主机设备310为计算装置,加速设备320为加速装置,这里结合图3给出一种可能的示例对该计算装置和加速装置进行说明。
计算装置可以用于实现上述方法实施例中主机设备310的功能,加速装置可以用于实现上述方法实施例中加速设备320的功能,因此计算装置结合加速装置能实现上述方法实施例所具备的有益效果。在本申请的实施例中,该计算装置可以是如图3所示的主机设备310,也可以是图1~图2中所示出的主机设备,还可以是应用于主机设备的模块(如芯片)。加速装置可以是如图3所示的加速设备320,也可以是图1~图2中所示出的加速设备,还可以是应用于加速设备的模块(如芯片)。
如图3所示,计算装置包括运行时模块311、编译模块312和卸载执行引擎313。加速装置包括通信模块321和处理模块322。该计算装置用于实现上述图4中所示的方法实施例中主机设备的功能,该加速装置用于实现上述图4中所示的方法实施例中加速设备的功能。在一种可能的示例中,该计算装置和加速装置用于实现上述计算方法的具体过程包括以下内容1~内容3。
1,卸载执行引擎313,用于向加速设备320发送调用信息。通信模块321,用于接收卸载执行引擎313发送的调用信息。
2,处理模块322,用于根据存储地址从存储器322中获取第一数据。
3,处理模块322,还用于基于第一数据执行待处理任务,得到处理结果。关于如何基于第一数据执行待处理任务,得到处理结果,可以参照上述S431,在此不再赘述。
如此,在本实施例中,主机设备将存储地址告知加速设备,便于加速设备直接根据存储地址从存储器中获取数据,并进行处理,避免主机设备从存储器中获取数据,再将数据传输给加速设备,这样能够减少计算系统中的数据传输量,节省带宽资源,提升卸载推送效率,从而解决异构计算过程中的数据传输量大的问题。另外,通过执行上述S410~S430还可以保留待处理业务所需的第一数据的指令结构,避免大规模重构代码,提高开发效率。
可选的,该装置还可以进一步实现上述计算方法中的其他步骤,具体参照上述图5、图7、图8所示出的步骤,在此不再赘述。
应理解的是,本发明本申请实施例的计算装置或加速装置可以通过CPU实现,也可以通过ASIC实现,或可编程逻辑器件(programmable logic device,PLD)实现,上述PLD可以是复杂程序逻辑器件(complex programmable logical device,CPLD)、FPGA、通用阵列逻辑(generic array logic,GAL)或其任意组合。计算装置或加速装置通过软件实现图4、图5、图7、图8中任一所示的计算方法时,计算装置或加速装置及其各个模块也可以为软件模块。
有关上述计算装置或加速装置更详细的描述可以直接参考上述图4、图5、图7、图8所示的实施例中相关描述直接得到,这里不加赘述。
在上述计算方法实施例的基础上,对于本申请实施例提供的计算系统,以图1所示的计算系统100举例,当主机设备110用于实现图4、图5、图7、图8所示的方法时,处理器111和通信接口113用于执行上述主机设备的功能。处理器111、通信接口113和存储器112还可以协同实现主机设备执行的计算方法中的各个操作步骤。主机设备110还可以执行图3所示出的计算装置的功能,此处不予赘述。主机设备110中的存储器112可用于存储软件程序及模块,如本申请实施例所提供的计算方法对应的程序指令/模块。主机设备110中的处理器111通过执行存储在存储器112内的软件程序及模块,从而执行各种功能应用以及数据处理。主机设备110中的通信接口113可用于与其他设备进行信令或数据的通信。在本申请中主机设备110可以具有多个通信接口113。
应理解,本申请实施例的主机设备110可对应于本申请实施例中的计算装置,并可以对应于执行本申请实施例的图4、图5、图7、图8所示方法实施例中的相应主体,并且主机设备110中的各个单元/模块的能够实现图4、图5、图7、图8中的各个方法的相应流程,为了简洁,在 此不再赘述。
在上述计算方法实施例的基础上,对于本申请实施例提供的计算系统,以图1所示的计算系统100举例,当加速设备120用于实现图4、图5、图7、图8所示的方法时,处理器121和通信接口123用于执行上述加速设备的功能。处理器121、通信接口123和存储器122还可以协同实现加速设备执行的计算方法中的各个操作步骤。加速设备120还可以执行图3所示出的加速装置的功能,此处不予赘述。加速设备120中的存储器122可用于存储软件程序及模块,如本申请实施例所提供的计算方法对应的程序指令/模块。加速设备120中的处理器121通过执行存储在存储器122内的软件程序及模块,从而执行各种功能应用以及数据处理。加速设备120中的通信接口123可用于与其他设备进行信令或数据的通信。在本申请中加速设备120可以具有多个通信接口123。
应理解,本申请实施例的加速设备120可对应于本申请实施例中的加速装置,并可以对应于执行本申请实施例的图4、图5、图7、图8所示方法实施例中的相应主体,并且加速设备120中的各个单元/模块的能够实现图4、图5、图7、图8中的各个方法的相应流程,为了简洁,在此不再赘述。
可以理解的是,本申请的实施例中的处理器可以是CPU、NPU或GPU,还可以是其它通用处理器、DSP、ASIC、FPGA或者其它可编程逻辑器件、晶体管逻辑器件,硬件部件或者其任意组合。通用处理器可以是微处理器,也可以是任何常规的处理器。
The method steps in the embodiments of this application may be implemented by hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, and the software modules may be stored in a random access memory (RAM), a flash memory, a read-only memory (ROM), a programmable read-only memory (programmable ROM, PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium well known in the art. An exemplary storage medium is coupled to the processor, so that the processor can read information from, and write information to, the storage medium. Certainly, the storage medium may alternatively be a component of the processor. The processor and the storage medium may reside in an ASIC. In addition, the ASIC may reside in a network device or a terminal device. Certainly, the processor and the storage medium may alternatively exist in the network device or terminal device as discrete components.
One or more of the foregoing modules or units may be implemented by software, by hardware, or by a combination of both. When any of the foregoing modules or units is implemented by software, the software exists in the form of computer program instructions stored in a memory, and a processor may be configured to execute the program instructions to implement the foregoing method procedures. The processor may include, but is not limited to, at least one of the following computing devices that run software: a CPU, a microprocessor, a digital signal processor (DSP), a microcontroller (microcontroller unit, MCU), an artificial intelligence processor, or the like, and each computing device may include one or more cores for executing software instructions to perform computation or processing. The processor may be built into an SoC (system on chip) or an ASIC, or may be an independent semiconductor chip. In addition to the cores for executing software instructions to perform computation or processing, the processor may further include necessary hardware accelerators, such as an FPGA, a PLD, or a logic circuit implementing dedicated logic operations.
When the foregoing modules or units are implemented by hardware, the hardware may be any one or any combination of a CPU, a microprocessor, a DSP, an MCU, an artificial intelligence processor, an ASIC, an SoC, an FPGA, a PLD, a dedicated digital circuit, a hardware accelerator, or a non-integrated discrete device, which may run necessary software, or operate independently of software, to perform the foregoing method procedures.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, the foregoing embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the procedures or functions according to the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless manner. The wired manner may be a coaxial cable, an optical fiber, a digital subscriber line (digital subscriber line, DSL), or the like; the wireless manner may be infrared, radio, microwave, or the like.
The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device such as a server or a data center containing one or more sets of usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium. The semiconductor medium may be a solid state drive (solid state drive, SSD).
The foregoing descriptions are merely specific implementations of this application. Any variation or replacement that a person skilled in the art can readily conceive based on the specific implementations provided in this application shall fall within the protection scope of this application.

Claims (19)

  1. A computing system, comprising a host device and an acceleration device, wherein the host device is communicatively connected to the acceleration device, the acceleration device is coupled to a memory, and the memory stores first data required by a to-be-processed service;
    the host device is configured to send call information to the acceleration device, wherein the call information indicates a storage address of the first data;
    the acceleration device is configured to receive the call information;
    the acceleration device is further configured to obtain the first data from the memory based on the storage address; and
    the acceleration device is further configured to perform the to-be-processed task based on the first data to obtain a processing result.
  2. The system according to claim 1, wherein the to-be-processed service comprises N consecutive operations, and N is an integer greater than 1;
    the acceleration device is further configured to perform an (i+1)-th operation based on a processing result of an i-th operation among the N operations to obtain a processing result of the (i+1)-th operation, wherein i is an integer and 1≤i≤N-1; and
    a processing result of the N-th operation is the processing result.
  3. The system according to claim 2, wherein the processing result of the i-th operation is stored in the memory or the acceleration device.
  4. The system according to any one of claims 1 to 3, wherein the host device is further configured to obtain proxy information, and the proxy information comprises a virtual storage address of the first data; and
    the host device is further configured to generate the call information based on the proxy information.
  5. The system according to claim 4, wherein the proxy information is stored in the host device.
  6. The system according to any one of claims 1 to 5, wherein the first data is stored in a memory object, and the memory object is a segment of physical address storage space provided by the memory.
  7. The system according to any one of claims 4 to 6, wherein the acceleration device is further configured to send the virtual storage address of the first data to the host device; and
    the host device is further configured to receive the virtual storage address of the first data sent by the acceleration device.
  8. A computing method, wherein the method is performed by an acceleration device, the acceleration device is communicatively connected to a host device, the acceleration device is coupled to a memory, and the memory stores first data required by a to-be-processed service, the method comprising:
    receiving call information sent by the host device, wherein the call information indicates a storage address of the first data;
    obtaining the first data from the memory based on the storage address; and
    performing the to-be-processed task based on the first data to obtain a processing result.
  9. The method according to claim 8, wherein the to-be-processed service comprises N consecutive operations, and N is an integer greater than 1;
    the performing the to-be-processed task based on the first data to obtain a processing result comprises:
    performing an (i+1)-th operation based on a processing result of an i-th operation among the N operations to obtain a processing result of the (i+1)-th operation, wherein i is an integer and 1≤i≤N-1; and
    a processing result of the N-th operation is the processing result.
  10. The method according to claim 9, wherein the processing result of the i-th operation is stored in the memory or the acceleration device.
  11. The method according to any one of claims 8 to 10, wherein the first data is stored in a memory object, and the memory object is a segment of physical address storage space provided by the memory.
  12. The method according to any one of claims 8 to 11, wherein the method further comprises:
    sending a virtual storage address of the first data to the host device.
  13. A computing apparatus, wherein the apparatus is applied to an acceleration device, the acceleration device is communicatively connected to a host device, the acceleration device is coupled to a memory, and the memory stores first data required by a to-be-processed service, the apparatus comprising:
    a communication module, configured to receive call information sent by the host device, wherein the call information indicates a storage address of the first data; and
    a processing module, configured to obtain the first data from the memory based on the storage address,
    wherein the processing module is configured to perform the to-be-processed task based on the first data to obtain a processing result.
  14. The apparatus according to claim 13, wherein the to-be-processed service comprises N consecutive operations, and N is an integer greater than 1;
    the processing module is further configured to perform an (i+1)-th operation based on a processing result of an i-th operation among the N operations to obtain a processing result of the (i+1)-th operation, wherein i is an integer and 1≤i≤N-1; and
    a processing result of the N-th operation is the processing result.
  15. The apparatus according to claim 14, wherein the processing result of the i-th operation is stored in the memory or the acceleration device.
  16. The apparatus according to any one of claims 13 to 15, wherein the first data is stored in a memory object, and the memory object is a segment of physical address storage space provided by the memory.
  17. The apparatus according to any one of claims 13 to 16, wherein the communication module is further configured to send a virtual storage address of the first data to the host device.
  18. An acceleration device, wherein the acceleration device comprises at least one processor, the acceleration device is coupled to a memory, and the memory is configured to store a set of computer instructions and first data required by a to-be-processed service; when the processor executes the set of computer instructions, the operation steps of the method according to any one of claims 8 to 12 are performed.
  19. A computer-readable storage medium, wherein the storage medium stores a computer program or instructions; when the computer program or instructions are executed by a computing device or by a storage system in which the computing device is located, the method according to any one of claims 8 to 12 is implemented.
PCT/CN2023/071094 2022-01-12 2023-01-06 Computing system, method, apparatus, and acceleration device WO2023134588A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210032978.0A CN116467245A (zh) 2022-01-12 2022-01-12 Computing system, method, apparatus, and acceleration device
CN202210032978.0 2022-01-12

Publications (1)

Publication Number Publication Date
WO2023134588A1 true WO2023134588A1 (zh) 2023-07-20

Family

ID=87181148

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/071094 WO2023134588A1 (zh) 2022-01-12 2023-01-06 Computing system, method, apparatus, and acceleration device

Country Status (2)

Country Link
CN (1) CN116467245A (zh)
WO (1) WO2023134588A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273331A * 2017-06-30 2017-10-20 山东超越数控电子有限公司 Heterogeneous computing system and method based on a CPU+GPU+FPGA architecture
CN108763299A * 2018-04-19 2018-11-06 贵州师范大学 Large-scale data processing and computing acceleration system
CN109547531A * 2018-10-19 2019-03-29 华为技术有限公司 Data processing method, apparatus, and computing device
US20200326992A1 (en) * 2019-04-12 2020-10-15 Huazhong University Of Science And Technology Acceleration method for fpga-based distributed stream processing system
US20210097221A1 (en) * 2019-09-29 2021-04-01 Huazhong University Of Science And Technology Optimization method for graph processing based on heterogeneous fpga data streams


Also Published As

Publication number Publication date
CN116467245A (zh) 2023-07-21


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23739931

Country of ref document: EP

Kind code of ref document: A1