WO2023092415A1 - Message processing method and apparatus - Google Patents

Message processing method and apparatus

Info

Publication number
WO2023092415A1
WO2023092415A1, PCT/CN2021/133267, CN2021133267W
Authority
WO
WIPO (PCT)
Prior art keywords
processing unit
accelerator
event message
processing
event
Prior art date
Application number
PCT/CN2021/133267
Other languages
English (en)
French (fr)
Inventor
欧阳伟龙
胡粤麟
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to CN202180104290.2A (CN118265973A)
Priority to PCT/CN2021/133267 (WO2023092415A1)
Publication of WO2023092415A1

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 - Digital computers in general; Data processing equipment in general
    • G06F 15/16 - Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/163 - Interprocessor communication

Definitions

  • the embodiments of the present application relate to the field of computer technologies, and in particular, to a message processing method and device.
  • In recent years, the clock frequency of high-performance processors (Central Processing Unit, CPU) has changed little, so performance gains have been slow.
  • Power consumption per square centimeter has risen from more than ten milliwatts to about one watt, which has also reached its limit, constraining further performance improvement.
  • heterogeneous computing tasks depend on the CPU for scheduling, and heterogeneous computing resources need to wait for the CPU to move data, and there is a performance bottleneck in the scheduling and utilization of heterogeneous resources in the data processing system.
  • Embodiments of the present application provide a message processing method and device to improve resource utilization of a data processing system.
  • a message processing method including:
  • the first processing unit processes the first event message to obtain a second event message, where the first event message is received by the first processing unit, or the first event message is generated by the first processing unit based on a processing request of an application;
  • the first processing unit sends the second event message to the second processing unit according to context information, where the context information includes routing information from the first processing unit to the second processing unit, and the context information is generated based on processing requests from said application;
  • the first processing unit is a first engine and the second processing unit is a second accelerator; or the first processing unit is a first accelerator and the second processing unit is a second engine; or the first processing unit is a first engine and the second processing unit is a second engine; or the first processing unit is a first accelerator and the second processing unit is a second accelerator.
  • the present application provides a method, including: the first processing unit processes the first event message to obtain the second event message, where the first event message is received by the first processing unit, or the first event message is generated by the first processing unit based on the processing request of the application; the first processing unit sends the second event message to the second processing unit according to the context information, where the context information includes the routing information from the first processing unit to the second processing unit, and the context information is generated based on the processing request of the application; the first processing unit may be an engine or an accelerator, the second processing unit may also be an engine or an accelerator, and the first processing unit is different from the second processing unit.
  • In the above method, the transmission of event messages between different processing units is realized based on context information. Compared with scheduling the transmission of event messages through a scheduling method (such as using a scheduler for message scheduling), this implementation avoids the performance bottleneck caused by transmission scheduling and can therefore improve system processing performance, as sketched below.
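  • As an illustration only (not the patented implementation), the following C sketch shows one way such context-driven sending could look: the context carries a routing entry toward the next processing unit, and the sender copies the target event queue identifier from that entry into the outgoing event message. All type and field names below are assumptions made for this sketch.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical layouts; the application text does not fix these. */
typedef struct {
    uint32_t target_queue_id;   /* event queue of the next processing unit   */
    uint32_t target_route_id;   /* target routing domain / server (optional) */
} route_entry_t;

typedef struct {
    uint32_t      cid;          /* identifier of the context information     */
    route_entry_t next_hop;     /* routing info: this unit -> next unit      */
    uint32_t      op_config;    /* operation configuration, e.g. bit width   */
} context_t;

typedef struct {
    uint32_t target_queue_id;   /* where the routing network delivers this   */
    uint32_t cid;               /* lets the receiver fetch the same context  */
    uint32_t length;
    uint8_t  payload[256];
} event_msg_t;

/* Stand-in for the routing-network primitive that enqueues a message. */
static void event_queue_push(uint32_t queue_id, const event_msg_t *msg)
{
    printf("deliver %u bytes to event queue %u (CID %u)\n",
           msg->length, queue_id, msg->cid);
}

/* The first processing unit sends the second event message according to the
 * context: the route comes from the context generated for the application's
 * processing request, so no central scheduler is consulted.                 */
static void send_by_context(const context_t *ctx, const void *data, uint32_t len)
{
    event_msg_t msg = {0};
    msg.target_queue_id = ctx->next_hop.target_queue_id;
    msg.cid    = ctx->cid;
    msg.length = len < sizeof msg.payload ? len : (uint32_t)sizeof msg.payload;
    memcpy(msg.payload, data, msg.length);
    event_queue_push(msg.target_queue_id, &msg);
}

int main(void)
{
    context_t ctx = { .cid = 7, .next_hop = { .target_queue_id = 3 } };
    send_by_context(&ctx, "block", 5);
    return 0;
}
```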
  • the first processing unit sends the second event message to the second processing unit according to context information, including:
  • the first processing unit sends the second event message to the event queue corresponding to the second processing unit according to the routing information
  • the second processing unit acquires the second event message from the event queue.
  • messages are transmitted between different processing units based on the event queue.
  • a thread can send data that needs to be processed by an accelerator to the accelerator's event queue through an event message, so that the event message is processed by the corresponding accelerator. This reduces the degree of coupling between the thread and the accelerator, which in turn improves the flexibility of resource allocation and the resource utilization of the data processing process.
  • the second event message includes a target event queue identifier, where the target event queue identifier is a queue identifier of an event queue corresponding to the second processing unit.
  • the "target message queue identifier" can be added to the message according to the context information, so as to realize the routing transmission of the message based on the event queue. Compared with the traditional bus, it can realize data communication between dynamically scheduled computing resources, forwarding The efficiency is higher, and the resource utilization rate of the data processing process is further improved.
  • the routing information further includes a target routing field, where the target routing field is used to indicate a target server, the target server is different from the source server, and the source server is where the first processing unit resides. server.
  • the routing information also includes a target routing field, which is used to indicate the target server, so that the target server may be different from the source server.
  • the method can form a communication link in a cross-routing domain manner, can build a cross-routing domain communication link network, and has better scheduling flexibility and scalability.
  • the second processing unit is a second accelerator; the first processing unit sends the second event message to the second processing unit according to context information, including:
  • the first processing unit sends the second event message to the event queue corresponding to the accelerator pool according to the routing information, where the accelerator pool includes multiple accelerators of the same type; the second accelerator is determined from the multiple accelerators according to the states of the multiple accelerators;
  • the second event message is sent to the second accelerator. In this way, event messages are sent to the accelerator through the pool's event queue, providing a resource scheduling mechanism for shared accelerators that can improve system processing performance, as sketched below.
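  • Purely as a sketch of the shared-pool mechanism described above (the real arbitration is hardware and is not specified here), an event dispatcher might pick the next idle accelerator in round-robin fashion before forwarding the message from the pool queue. Names and the pool size are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

#define POOL_SIZE 4   /* assumed pool size, for illustration */

typedef struct {
    bool   idle[POOL_SIZE];   /* idle state reported by each accelerator   */
    size_t rr_cursor;         /* round-robin position for fair arbitration */
} accel_pool_t;

/* Pick one idle accelerator from the pool, or return -1 so the event
 * message stays in the pool event queue until an accelerator frees up. */
static int pool_dispatch(accel_pool_t *pool)
{
    for (size_t i = 0; i < POOL_SIZE; i++) {
        size_t idx = (pool->rr_cursor + i) % POOL_SIZE;
        if (pool->idle[idx]) {
            pool->idle[idx] = false;               /* gate the circuit to it */
            pool->rr_cursor = (idx + 1) % POOL_SIZE;
            return (int)idx;
        }
    }
    return -1;
}
```

  • The caller would then open the connection between the pool queue and the selected accelerator and transfer the event message to it.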
  • before the first processing unit receives the first event message, the method further includes:
  • the computing resource includes the first processing unit and the second processing unit
  • the context information is generated according to the processing request of the application program.
  • the first processing unit or the second processing unit is selected from multiple processing units based on status information of the multiple processing units when the processing request of the application program is received, where the status information of a processing unit includes network topology performance.
  • In the above method, hardware status information is obtained, and the optimal hardware is allocated according to the current hardware status, so that the allocated computing resources are more reasonable.
  • the hardware status information includes network topology performance; the optimal hardware may be the hardware with the best current performance, or the hardware with the best-matching performance.
  • the method can trigger real-time dynamic scheduling of resources based on events corresponding to the received processing requests, thereby avoiding waste of resources and further improving system performance.
  • the at least two threads are loaded to run on at least two engines, wherein different threads run on different engines.
  • the determining that the processing request includes at least two tasks includes:
  • acquiring the semantics of the processing request, where the semantics of the processing request include at least two task semantics, and determining a corresponding task according to each of the at least two task semantics.
  • multiple tasks belonging to the processing request can be constructed based on the semantics of the processing request, and different tasks have different task semantics.
  • Computing tasks can be dynamically created according to real-time events, and a complex computing task can be efficiently split into multiple simple, easy-to-implement tasks, reducing resource waste.
  • the method also includes:
  • the first thread being one of the at least two threads
  • the method can stop threads or shut down corresponding hardware according to needs, can realize near-zero standby power consumption, and ensure low power consumption of the message processing method.
  • the processing request is used to request acquisition of target data, and the target data is stored in the memory of the second server;
  • the computing resource for executing the processing request further includes a third processing unit and a fourth processing unit;
  • the at least two engines include the first processing unit, the second processing unit, and the third processing unit;
  • the fourth processing unit is an accelerator;
  • the first event message and the second event message include the identifier of the target data; the first processing unit and the second processing unit are located in the first server, and the third processing unit and the fourth processing unit are located in the second server; the context further includes routing information from the second processing unit to the third processing unit, and from the third processing unit to the fourth processing unit;
  • the method further includes:
  • the second processing unit encapsulates the second event message to generate a third event message
  • the second processing unit sends the third event message to the third processing unit located in the second server according to the context;
  • the third processing unit decapsulates the third event message to obtain a fourth event message, and sends the fourth event message to the fourth processing unit according to the context;
  • the fourth processing unit obtains the identifier of the target data from the received fourth event message, acquires the target data from the memory of the second server according to the identifier of the target data, and obtains a fifth event message according to the target data; the fifth event message is used to send the target data to the first server.
  • a method for obtaining the target data stored in the shared memory is provided.
  • the corresponding memory address is obtained through the identifier of the target data, and the target data is obtained from the shared memory according to the memory address.
  • This method avoids the large memory footprint of global page-sharing approaches, further improving the resource utilization of the data processing process.
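  • The cross-server hops described above could be pictured roughly as follows; field names, layouts, and the lookup helper are assumptions for illustration, not the claimed format. The second processing unit wraps the event message with a target routing field naming the second server, the third unit strips that wrapper, and the fourth unit resolves the target-data identifier into a read of the second server's memory.

```c
#include <stdint.h>

typedef struct {
    uint32_t target_queue_id;
    uint32_t data_id;             /* identifier of the target data */
    uint32_t length;
    uint8_t  payload[128];
} event_msg_t;

typedef struct {
    uint32_t    target_route;     /* target routing field: the second server */
    event_msg_t inner;            /* encapsulated event message              */
} routed_msg_t;

/* Second processing unit: encapsulate the second event message (-> third). */
routed_msg_t encapsulate(const event_msg_t *m2, uint32_t target_route)
{
    routed_msg_t m3 = { .target_route = target_route, .inner = *m2 };
    return m3;
}

/* Third processing unit (on the second server): decapsulate (-> fourth). */
event_msg_t decapsulate(const routed_msg_t *m3)
{
    return m3->inner;
}

/* Fourth processing unit: resolve the data identifier against the second
 * server's memory; a stand-in for the semantic data index lookup.         */
const void *lookup_target_data(uint32_t data_id)
{
    (void)data_id;
    return 0;
}
```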
  • the context information also includes operation configuration information
  • the first processing unit processes the first event message to obtain a second event message, including:
  • the first processing unit processes the first event message according to the operation configuration information to obtain a second event message.
  • the context also includes operation configuration information (such as bit width, number of points, etc.), so that the processing unit can process data according to the operation configuration information and automatically trigger the corresponding processing mechanism after receiving an event message. This exploits the high energy efficiency of event-driven processing and improves resource utilization.
  • the first event message and the second event message include an identifier of the context information, and the identifier of the context information is used to acquire the context information.
  • the event message includes the identifier of the context information (CID), which is used to indicate the context information of the application, so that the processing unit can quickly and efficiently obtain the corresponding operation configuration information or routing information, improving the resource utilization of data processing.
  • the second event message includes:
  • the message attribute information field includes event message routing information, and the event message routing information includes a target event queue identifier, and the target event queue identifier is the queue identifier of the event queue corresponding to the second processing unit;
  • a message length field including the total length information of the second event message
  • the data field includes the payload of the second event message.
  • the data field includes a first event information field
  • the first event information field includes at least one of the following:
  • the routing scope, the identifier of the context information, the identifier of the source message queue or the custom attribute, the routing scope includes at least one routing domain.
  • the data field includes a second event information field
  • the second event information field includes custom information of the application layer.
  • the frame structure of the event message is defined.
  • the frame structure can include: network layer subframe, operating system layer subframe, and application layer subframe from the outermost layer.
  • the frame structure of the event message supports dynamic expansion according to the application scenario, encapsulating event messages in different formats in different scenarios. This allows the solution provided by this application to be flexibly applied to different application scenarios, improves the adaptability of data processing, and improves the efficiency of data forwarding.
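  • To visualize the fields listed above, the second event message could be modeled as nested structures like the ones below. This is only one reading of the described layout (attribute/routing, total length, then the data field with the two event information fields), not a wire-exact definition; sizes are invented.

```c
#include <stdint.h>

/* Message attribute information field: event message routing information. */
typedef struct {
    uint32_t target_queue_id;   /* queue of the second processing unit */
} msg_attr_t;

/* First event information field (network / operating-system level). */
typedef struct {
    uint32_t routing_scope;     /* at least one routing domain         */
    uint32_t cid;               /* identifier of the context           */
    uint32_t source_queue_id;   /* identifier of the source queue      */
    uint32_t custom_attr;       /* custom attribute                    */
} event_info1_t;

/* Second event information field: application-layer custom information. */
typedef struct {
    uint32_t app_key;
    uint32_t app_len;
    uint8_t  app_value[64];
} event_info2_t;

typedef struct {
    msg_attr_t    attr;         /* message attribute information field */
    uint32_t      total_length; /* message length field                */
    event_info1_t info1;        /* carried inside the data field       */
    event_info2_t info2;        /* carried inside the data field       */
    uint8_t       payload[128]; /* payload of the second event message */
} event_frame_t;
```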
  • the method also includes:
  • the resource configuration information of the application program includes the number of engines, and one or more of the accelerator type or the number of accelerators;
  • an accelerator used by the application program is determined, and the accelerator used by the application program includes the first accelerator and/or the second accelerator.
  • the resource configuration information of the application can be obtained according to the processing request received, and the accelerator and engine used by the application can be determined.
  • the resource configuration information includes but not limited to the number of engines, the type of accelerator, and the number of accelerators.
  • the engine and accelerator used by the application program are selected, so as to realize the real-time dynamic allocation that adapts to the resource status in real time, which not only guarantees performance requirements, but also ensures low power consumption.
  • the first processing unit is a first engine; the second processing unit is a second accelerator; and the first processing unit sending the second event message to the event queue corresponding to the second processing unit includes:
  • the first engine executes a first retranslated instruction of the second accelerator to send the second event message to the event queue corresponding to the second accelerator; the first retranslated instruction is obtained by modifying the machine code of the second accelerator according to the identifier of the event queue corresponding to the second accelerator, after the second accelerator is loaded and the identifier of that event queue is assigned to the second accelerator; when the first retranslated instruction is executed, the first engine sends the second event message to the event queue corresponding to the second accelerator.
  • In the above method, event messages are sent through the event queue of the engine. For example, in response to the second accelerator being loaded, the identifier of the second event queue is assigned to the second accelerator; the instruction set of the second accelerator is modified according to the identifier of the second event queue, and when an instruction in the modified instruction set is executed by the first thread on the first engine, the first thread sends the second event message to the second event queue.
  • the identifier of the event queue is used to replace the instruction of the accelerator, so that when different accelerators are continuously added, the micro-engine can be reused without modification; a toy version of this rewrite is sketched below.
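  • The following is only an illustrative retranslation step with invented opcodes (the real instruction encodings are not given in this text): every occurrence of the accelerator-specific instruction is rewritten into a generic "send to event queue" instruction carrying the queue identifier assigned to the newly loaded accelerator, so the micro-engine itself never changes.

```c
#include <stdint.h>
#include <stddef.h>

enum { OP_FP32 = 0x10, OP_SEND_EQ = 0x80 };   /* invented opcodes */

typedef struct {
    uint8_t  opcode;
    uint32_t operand;   /* for OP_SEND_EQ: the target event queue identifier */
} insn_t;

/* When the accelerator program is installed for the first time, rewrite its
 * accelerator-specific instructions into the generic event-queue send form. */
void retranslate(insn_t *code, size_t n, uint8_t accel_opcode, uint32_t eq_id)
{
    for (size_t i = 0; i < n; i++) {
        if (code[i].opcode == accel_opcode) {
            code[i].opcode  = OP_SEND_EQ;
            code[i].operand = eq_id;    /* e.g. EQ-ID1 assigned to fp32 */
        }
    }
}
```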
  • the embodiment of the present application also provides a message processing device, including:
  • a first running module, where the first running module is configured to: process a first event message through the first processing unit to obtain a second event message, where the first event message is received by the first processing unit, or the first event message is generated by the first processing unit based on a processing request of an application;
  • the first processing unit sends the second event message to the second processing unit according to context information, where the context information includes routing information from the first processing unit to the second processing unit, and the context Information is generated based on processing requests from said application;
  • the first processing unit is a first engine and the second processing unit is a second accelerator; or the first processing unit is a first accelerator and the second processing unit is a second engine; or the first processing unit is a first engine and the second processing unit is a second engine; or the first processing unit is a first accelerator and the second processing unit is a second accelerator.
  • an embodiment of the present application provides a message processing device, including a processor and a memory,
  • the memory is used to store executable programs
  • the processor is configured to execute a computer-executable program in a memory, so that the method described in any one of the first aspects is performed.
  • an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer-executable program; when the computer-executable program is invoked by a computer, the computer performs the method described in any one of the first aspects.
  • the embodiment of the present application also provides a chip, including a logic circuit and an input/output interface, where the input/output interface is used to receive code instructions or information, and the logic circuit is used to execute the code instructions or to perform, according to the information, the method described in any one of the first aspects.
  • the embodiment of the present application further provides a data processing system, where the data processing system includes the message processing apparatus as described in the second aspect.
  • the embodiment of the present application also provides a computer program product, where the computer program product includes computer instructions, and when the computer instructions are executed by a computing device, the computing device can perform the method described in any one of the above aspects.
  • FIG. 1 is a schematic structural diagram of a data processing system provided in an embodiment of the present application
  • Fig. 2 is a schematic flow chart of a micro-engine processing the pipeline of instructions provided in the embodiment of the present application;
  • FIG. 3 is a schematic diagram of implementing semantic-driven data sharing provided in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a gating mode of an accelerator pool provided in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a multicast mode of an accelerator pool provided in an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a highly resilient network with multiple routing domains provided in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of an asynchronous interface design of a highly elastic network provided in an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a basic structure of a highly elastic network transmission frame provided in an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a structure of a subframe for highly elastic network transmission provided in an embodiment of the present application.
  • FIG. 10 is a schematic diagram of the composition structure of a highly dynamic operating system provided in the embodiment of the present application.
  • FIG. 11 is a schematic diagram of a design scheme of an edge intelligent computing provided in an embodiment of the present application.
  • FIG. 12 is a schematic flowchart of a message processing method provided in the embodiment of the present application.
  • FIG. 13 is a schematic diagram of computing resource invocation for edge intelligent computing provided in an embodiment of the present application.
  • FIG. 14 is a schematic diagram of a design scheme of a video call provided in the embodiment of the present application.
  • FIG. 15 is a schematic diagram of computing resource invocation for a video call provided in the embodiment of the present application.
  • FIG. 16 is a schematic diagram of a semantically defined shared data mechanism of a supercomputing center provided in an embodiment of the present application.
  • FIG. 17 is a schematic diagram of a design scheme of a supercomputing server provided in an embodiment of the present application.
  • Fig. 18 is a schematic diagram of computing resource invocation of a supercomputing center provided in the embodiment of the present application.
  • FIG. 19 is a schematic structural diagram of a message processing device provided in an embodiment of the present application.
  • FIG. 20 is a schematic structural diagram of a message processing device provided in an embodiment of the present application.
  • FIG. 21 is a schematic structural diagram of a chip provided in an embodiment of the present application.
  • The term "and/or" describes an association relationship and indicates that three kinds of relationships may exist; for example, A and/or B means: A exists alone, A and B exist simultaneously, or B exists alone.
  • The expression "at least one of the following (items)" or similar expressions refers to any combination of these items, including any combination of a single item or plural items.
  • For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where each of a, b, and c may be singular or plural.
  • first and second are used for descriptive purposes only, and cannot be understood as indicating or implying relative importance or implicitly specifying the quantity of indicated technical features. Thus, a feature defined as “first” and “second” may explicitly or implicitly include one or more of these features. In the description of the embodiments of the present application, unless otherwise specified, "plurality” means two or more.
  • words such as "exemplary" or "for example" are used as examples, illustrations, or descriptions. Any embodiment or design scheme described as "exemplary" or "for example" in the embodiments of the present application shall not be interpreted as being more preferred or more advantageous than other embodiments or design schemes. Rather, the use of words such as "exemplary" or "for example" is intended to present related concepts in a concrete manner.
  • the terms "installation", "connection", and "connected" should be understood in a broad sense; for example, it can be a fixed connection, a detachable connection, or an integral connection; it can be a mechanical connection or an electrical connection; and it can be a direct connection or an indirect connection through an intermediary, or the internal communication of two components.
  • Application program refers to a computer program for completing one or more specific tasks. It runs in user mode, can interact with users, and has a visual user interface.
  • Heterogeneous computing is a new computing model that integrates the general-purpose computing power of the CPU and the directional acceleration computing power of professional chips, and finally achieves the unity of performance, power consumption, and flexibility.
  • Accelerator Heterogeneous computing uses different types of processors to handle different types of computing tasks. Common computing units include the CPU, ASIC (Application-Specific Integrated Circuit), GPU (Graphics Processing Unit), NPU (Neural Processing Unit), FPGA (Field Programmable Gate Array), and so on. Accelerators refer to dedicated chips such as the above-mentioned ASIC, GPU, NPU, and FPGA.
  • the CPU is responsible for scheduling and serial tasks with complex logic
  • the accelerator is responsible for tasks with high parallelism to achieve computing acceleration.
  • the fp32 accelerator is an accelerator responsible for fp32 floating-point operations.
  • An event is an operation that can be recognized by the control, such as pressing the OK button, selecting a radio button or check box.
  • Each control has its own identifiable events, such as form loading, single-click, double-click and other events, text change events of edit boxes (text boxes), and so on.
  • the engine mentioned in the embodiment of the present application refers to a convergent computing micro-engine (Convergent Process Engine, XPU), which can also be called a micro-engine.
  • a microengine is a processing unit used to process a pipeline of instructions. Among them, the pipeline is dynamically scalable.
  • the microengine can support computing tasks, processes or threads required for heterogeneous computing such as CPU, GPU, and NPU.
  • Thread A thread is the smallest unit that an operating system can perform operation scheduling. It is included in the process and is the actual operating unit in the process.
  • a thread refers to a single sequential flow of control in a process. Multiple threads can run concurrently in a process, and each thread performs different tasks in parallel. Multiple threads in the same process will share all system resources in the process, such as virtual address space, file descriptors, signal processing, and so on. But multiple threads in the same process have their own call stack, their own register environment, and their own thread local storage.
  • Event queue In the embodiment of the present application, the event queue is a container for storing messages during message transmission.
  • the event queue can be viewed as a linked list of event messages.
  • Network topology performance refers to the link relationship, throughput, available routes, available bandwidth, and delay of the network topology.
  • Network topology refers to the physical layout of various hardware or devices interconnected by transmission media, especially where the hardware is distributed and how cables run through them.
  • Application layer The application layer mainly provides application interfaces for the system.
  • Network layer The network layer is mainly responsible for defining logical addresses and realizing the forwarding process of data from source to destination.
  • In recent years, the clock frequency of high-performance processors has changed little, so performance gains have been slow.
  • Power consumption per square centimeter has risen from more than ten milliwatts to about one watt, which has also reached its limit, constraining further performance improvement.
  • Embodiments of the present application provide a data processing system.
  • the data processing system 100 has five core network elements: a fusion computing micro-engine (Convergent Process Engine, XPU), a semantic-driven data sharing (Semantic-Driven Data Sharing, SDS), Semantic-Driven Accelerator Pool (Semantic-Driven Accelerator, SDA), High-elastic Routing Network (Ultra Elastic Network over Chip, UEN) and High-dynamic Operating System (High-dynamic Operating System, HOS).
  • the highly elastic routing network is used to realize the high-speed interconnection of micro-engines, accelerators and event queues, and supports the horizontal expansion of system performance and capacity; the highly dynamic operating system is used to realize flexible scheduling of resources and allocation of computing tasks.
  • the integrated computing micro-engine may also be referred to as a micro-engine for short, and the micro-engine and accelerator may be referred to as a processing unit.
  • a processing unit may be a microengine or an accelerator.
  • the converged computing micro-engine is a processing unit, which is used to process the instruction pipeline.
  • the pipeline is dynamically scalable.
  • the micro-engine can support computing tasks, processes or threads required for heterogeneous computing such as CPU, GPU (Graphics Processing Unit, image processing unit/accelerator), NPU (Neural Processing Unit, neural network processing unit/accelerator).
  • the micro-engine in the embodiment of this application is similar to a hardened container or thread processor, and can dynamically allocate corresponding micro-engines according to the load requirements of computing tasks in different business scenarios to ensure the computing power required by the business and optimized latency.
  • the micro-engine processes the instruction pipeline, and the specific process can be as follows: after a new accelerator is added, the system assigns a corresponding event queue ID number; if the program corresponding to the new accelerator is installed in the system for the first time, the just-in-time compiler recompiles the program once, replacing the program's machine code with instructions in a common format for sending messages to the event queue.
  • the microengine responds to the accelerator instruction corresponding to the accelerator program and sends the data to be processed to the corresponding event queue.
  • the system assigns the event queue number EQ-ID1 to the fp32 accelerator.
  • If the program corresponding to the fp32 accelerator is installed in the data processing system for the first time, the program corresponding to the fp32 accelerator is recompiled by a just-in-time compiler, and the machine code "fp32 rx, ax, bx" of fp32 is replaced by an instruction in the general format for sending messages to the event queue, as shown in Table 1:
  • After the fp32 program corresponding to the fp32 accelerator shown in Figure 2 is loaded into the micro-engine XPU-ID1, the micro-engine responds to the accelerator instruction corresponding to the fp32 program, sends the data to be processed to the event queue EQ-ID1, waits for the result returned by the event queue EQ-ID1, and writes it back to the register or memory; at this point, one fp32 floating-point operation is completed.
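  • The round trip just described might look like the following on the micro-engine side; eq_send and eq_wait_result stand in for platform primitives that are not defined in this text.

```c
#include <stdint.h>

#define EQ_ID1 1u   /* event queue assigned to the fp32 accelerator */

/* Stand-ins for the event-queue primitives of the micro-engine. */
static void eq_send(uint32_t eq, const float *operands, int n)
{
    (void)eq; (void)operands; (void)n;   /* hand the operands to the queue */
}

static float eq_wait_result(uint32_t eq)
{
    (void)eq;
    return 0.0f;                          /* block until the queue replies  */
}

/* Effect of the retranslated "fp32 rx, ax, bx" when executed on XPU-ID1:
 * send the operands to EQ-ID1, wait for the result, write it back to rx. */
float fp32_via_accelerator(float ax, float bx)
{
    float ops[2] = { ax, bx };
    eq_send(EQ_ID1, ops, 2);
    return eq_wait_result(EQ_ID1);
}
```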
  • Semantic-driven data sharing is used to continuously transmit data and context information through event queues, enabling data sharing across computing resources within a data processing system.
  • the computing resource can be a fusion computing micro-engine, an accelerator, and the like.
  • the context information may also be called context; correspondingly, the identifier of the context information may also be called the identifier of the context, or simply called the identifier of the context.
  • Fig. 3 shows a schematic diagram of implementing semantic-driven data sharing provided by an embodiment of the present application.
  • the context of data sharing is defined through the application layer during the software development process.
  • the first computing resource constructs an event message block according to the semantic configuration instruction, and sends an event message to the event queue of the next second computing resource corresponding to the first computing resource through the event queue of the first computing resource, So that when the event queue of the second computing resource receives the event message, the second computing resource is automatically triggered to process the event message.
  • If there is a next computing resource corresponding to the second computing resource, after the calculation is completed, the second computing resource directly constructs an event message from the processing result and sends the event message through its sending queue to that next computing resource.
  • a data session from the ADC (analog-to-digital converter) through the FFT accelerator to the Framer is created through the application scheduler, so as to obtain the data sharing context;
  • the data session can be decomposed to obtain the semantic configuration instructions of each computing resource related to the context through a mechanism such as a compiler or an acceleration library, such as the semantic configuration instructions of the ADC, FFT accelerator, and framer in Figure 3.
  • The ADC constructs event messages according to the configuration information and sends them to the specified FFT queue through its own event queue; when the event queue of the FFT accelerator receives the event message sent by the ADC's event queue, the FFT accelerator is automatically triggered to calculate the data block in the received event message. After the calculation is completed, it directly constructs an event message block with the calculation result and sends it to the framer through the sending queue; when the framer's event queue receives the event message constructed from the calculation result, the framer is automatically triggered to perform the corresponding protocol analysis on the data block of that event message.
  • If the FFT accelerator needs to do a double-precision calculation, it can also, according to the same mechanism as above, send an event message to the FP32 accelerator to request the double-precision calculation.
  • a thread may also send an event message to an accelerator A for processing; accelerator A generates a new event message according to the processing result and sends it to another accelerator B for processing; after accelerator B finishes processing, it passes the event message to the unit following accelerator B.
  • the data processing system includes a first processing unit and a second processing unit, the first processing unit is a first accelerator, and the second processing unit is a second accelerator; the data processing system processes the message It includes: the first accelerator receives the first event message, the first accelerator processes the first event message to obtain the second event message, and the first accelerator sends the second event message to the second accelerator according to the context information, the context information includes the first Routing information from one accelerator to a second accelerator, the context information is generated based on the processing request of the application program.
  • an application scheduler can also be used to create a data session from the first thread, through a first sub-accelerator Task1_A and a second sub-accelerator Task2_B, to the second thread, so as to obtain a data sharing context CID0 (the context includes the routing information of the event messages).
  • the first sub-accelerator Task1_A can obtain the event message Mes.A_1 (herein referred to as the first event message) sent by the first thread, process the event message Mes.A_1 to obtain the event message Mes.A_2 (which, to distinguish it from the first event message, may be referred to here as the second event message), and send the event message Mes.A_2 to the second sub-accelerator Task2_B according to the routing information in the context (for example, the destination event queue identifier of the event message Mes.A_2 is set to the identifier of the event queue corresponding to the second sub-accelerator Task2_B).
  • the second sub-accelerator Task2_B can receive the event message Mes.A_2, process the event message Mes.A_2 to obtain the event message Mes.A_3, and send the event message Mes.A_3 to the subsequent second thread according to the routing information in the context.
  • the data session persists.
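  • Laid out as data, context CID0 for this session could simply be a per-hop routing list; the queue numbers below are arbitrary placeholders, not values from the text.

```c
#include <stdint.h>

/* Per-hop routing carried by context CID0; queue ids are placeholders. */
typedef struct {
    const char *hop;          /* which units this entry connects */
    uint32_t    next_queue;   /* event queue of the next unit    */
} hop_t;

const hop_t cid0_routes[] = {
    { "first thread  -> Task1_A",       11 },  /* carries Mes.A_1 */
    { "Task1_A       -> Task2_B",       12 },  /* carries Mes.A_2 */
    { "Task2_B       -> second thread", 13 },  /* carries Mes.A_3 */
};
```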
  • the semantic-driven accelerator pool provides a resource scheduling mechanism for accelerators.
  • the converged computing micro-engine or accelerator communicates externally through the event queue to achieve accelerated processing of specific function requests.
  • the specific function corresponding to the FP32 accelerator is "floating point calculation", which communicates externally through the event queue.
  • the system can communicate with the FP32 accelerator through the event queue of the FP32 accelerator, and request the accelerated processing of the floating-point calculation corresponding to the FP32 accelerator in FIG. 4 .
  • a group of accelerators is determined to form a shared accelerator pool, which has a supporting event distributor and accelerator pool event queue.
  • the accelerator pool event queue may be referred to as a pool queue for short.
  • When the system requests acceleration, it can directly send an event message for the request to the pool queue without specifying an accelerator; when there is an event message in the pool queue, the event dispatcher is automatically triggered to select, through round-robin (RR) arbitration according to the idle state of the accelerators, an accelerator in the shared accelerator pool to process the event message. The dispatcher then triggers the gating circuit to open the circuit connection between the pool queue and the accelerator, sends a read event message to the pool queue and the accelerator at the same time, and transfers the event message from the pool queue to the accelerator.
  • When the system requests multiple accelerators of the same type at the same time, it can directly send the request to the pool queue without specifying accelerators; when the pool queue has an event message, the event distributor is automatically triggered to, according to the multicast acceleration request configuration information, detect the corresponding idle accelerators, select multiple accelerators simultaneously, open the circuit connections between the pool queue and the accelerators, send read event messages to the pool queue and the accelerators at the same time, and then transfer the event message from the pool queue to the accelerators simultaneously.
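  • A minimal sketch of the multicast variant, assuming the multicast request configuration is just a bitmask of accelerators in the pool: the dispatcher copies the event message to every selected accelerator that is idle. The mask representation is an assumption for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

#define POOL_SIZE 4

/* Multicast dispatch: return a mask of the accelerators that received a
 * copy of the event message from the pool queue.                        */
uint32_t multicast_dispatch(uint32_t request_mask, const bool idle[POOL_SIZE])
{
    uint32_t delivered = 0;
    for (uint32_t i = 0; i < POOL_SIZE; i++) {
        if ((request_mask & (1u << i)) && idle[i]) {
            delivered |= 1u << i;   /* open the circuit and copy the message */
        }
    }
    return delivered;
}
```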
  • the second processing unit is a second accelerator; the first processing unit sends the second event message to the second processing unit according to the context information, including: the first processing unit sends the second event message to the event queue corresponding to the accelerator pool according to the routing information.
  • the accelerator pool includes multiple accelerators, and the multiple accelerators are of the same type; according to the status of the multiple accelerators, determine the second accelerator from the multiple accelerators; send the second event message to the second accelerator.
  • the data processing system includes a first processing unit and a second processing unit, wherein the second processing unit is a second accelerator.
  • the first processing unit of the data processing system sends the second event message to the second processing unit according to the context information, specifically through the following process: the first processing unit sends the second event message to the second event message according to the routing information included in the context information Send to the event queue corresponding to the accelerator pool.
  • the accelerator pool includes multiple accelerators of the same type, and the multiple accelerators include the second accelerator; the event dispatcher selects the second accelerator from the accelerator pool according to the status of the accelerators in the accelerator pool, and then sends the second event message in the event queue corresponding to the accelerator pool to the second accelerator.
  • For example, the first processing unit of the data processing system can send the event message Info.i to the event queue corresponding to the FP32 pool; the FP32 pool includes at least one accelerator, the at least one accelerator includes FP32 accelerator 1, and the at least one accelerator is of the same type; the event distributor corresponding to the FP32 pool selects FP32 accelerator 1 from the FP32 pool according to the state of the accelerators in the FP32 pool; the event distributor then sends the event message Info.i in the event queue corresponding to the FP32 pool to FP32 accelerator 1.
  • context-based multicast event message processing may be performed.
  • the context can set a multicast mode, and a thread or accelerator can, according to the multicast mode set by the context, start the multicast function through its event queue to copy the event message that needs downstream processing and send it to multiple next-level processing units, where a next-level processing unit can be a thread, an accelerator, or an application/CPU.
  • a highly elastic network provides an interconnection mechanism that can be flexibly scheduled.
  • the highly elastic network can realize the common physical connection infrastructure of multiple converged computing micro-engines and multiple accelerators in a single system-on-chip (SOC), also known as a single routing domain, which is also the unified bearer layer for event messages, micro-engine task management, and accelerator configuration management and control channels; it also realizes the cascading and routing of converged computing micro-engines and accelerators across SOCs, also known as multi-routing domains, as shown in Figure 6.
  • the embodiment of the present application provides a highly elastic network, wherein routers and computing resources can be directly connected, and the computing resources can be integrated computing micro-engines, accelerators, and the like; each computing resource integrates a transceiver connected back-to-back with the transceiver of the router, and synchronous or asynchronous interface designs can be used.
  • the transceiver uses frames or packets to transmit and receive data, and the transceiver can send packets to the router or receive packets from the router.
  • Please refer to Figure 8 for the basic structure of the frame transmitted by the highly elastic network.
  • After the router receives a message, it parses the corresponding frame, extracts the destination port number, looks up the routing table to find the corresponding outbound port, and sends the message to that port; if multiple ports send to one port, fair arbitration is used to send the corresponding packets one by one.
  • the non-extended frames transmitted by the highly elastic network are referred to as "basic frames".
  • the basic frame structure of highly elastic network transmission supports dynamic expansion according to application scenarios to adapt to data formats with different semantics.
  • the frame transmitted by the highly elastic network is defined by an extended KLV (Key-Length-Value) format.
  • The Key field, located at the front of the frame structure, is used to describe the attribute name of this field; it can be of a fixed length or a length agreed upon by the application;
  • the Length field is used to describe the length of the field, which can be a fixed length or can be agreed by the application;
  • The Value field, which follows the Length field, is used to carry the data to be transmitted; its length is specified by the Length field.
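  • A bare-bones walk over such a KLV-encoded frame, assuming one-byte Key and Length fields purely for illustration (the text allows fixed lengths or sizes agreed by the application):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Walk KLV fields: Key, Length, then Length bytes of Value, repeated. */
void klv_walk(const uint8_t *buf, size_t len)
{
    size_t off = 0;
    while (off + 2 <= len) {
        uint8_t key  = buf[off];
        uint8_t vlen = buf[off + 1];
        if (off + 2 + vlen > len)
            break;                        /* truncated frame           */
        printf("key=%u length=%u\n", key, vlen);
        off += 2u + vlen;                 /* advance to the next field */
    }
}
```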
  • FIG. 9 provides a schematic diagram of a subframe format of a highly elastic network provided by an embodiment of the present application.
  • Subframes are defined hierarchically.
  • the bottom layer is the network subframe, above which is the system subframe, and then the application subframe.
  • Each layer can be defined independently, but the subframes are transmitted strictly in the following order: first network subframes, then system subframes, then application subframes.
  • the network subframe and system subframe are predefined, and the application subframe can be agreed upon by the developer or the accelerator during design.
  • system subframes are predefined using the following types:
  • the data field of this subframe is the routing domain ID where the destination is located;
  • the data field of the subframe is the data session ID to which the frame belongs;
  • the data field of the subframe is the ID of the queue that sent the frame, and if the subframe is transmitted across domains, it is also necessary to carry the routing range in the subframe;
  • the data field of the subframe is the data transmitted by the operating system service, for example: configuration data, program image, etc.
  • the operating system can agree on its own “grandson frame", wherein the "grandson frame” can also follow the KLV format, so that the network can participate in frame analysis and improve forwarding efficiency.
  • Key 4 represents the application layer custom subframe.
  • the data field of this subframe is the data shared between applications.
  • applications can agree on their own "grandson frame", which can also follow the KLV format so that the network can participate in frame analysis and improve forwarding efficiency.
  • the highly dynamic operating system provides a resource scheduling and message communication mechanism.
  • The resource scheduling and message communication mechanisms allow application developers and hardware developers to collaborate better in design while remaining decoupled from each other: as long as a semantic consensus is reached, interoperability can be achieved, giving the system on-demand reconstruction for highly dynamic environments and highly dynamic computing capability with on-demand scheduling.
  • FIG. 10 shows a schematic diagram of the composition structure of a highly dynamic operating system.
  • the highly dynamic operating system mainly provides three main services: semantic-driven computing services, semantic-driven data services, and semantic-driven session services.
  • semantic-driven computing services include: acceleration pool management, routing management, just-in-time compilation, and computing management.
  • Acceleration pool management means that the highly dynamic operating system discovers all accelerator pools connected to the hardware, together with their supported semantics and network locations, registers the semantics, location, and quantity of the accelerators, and uses them as input parameters for just-in-time compilation and dynamic routing; it also exposes the semantic accelerator manifest to the application layer, the semantic-driven session service, and the semantic-driven data service.
  • Routing management means that the highly dynamic operating system discovers all connected routing networks and routing domains on the hardware, and establishes a system-wide routing table, including the routing domain list, the routing port list of each routing domain, and the unit type connected to the port (including Accelerators, microengines, routers, etc.) as input parameters for just-in-time compilation and calculation management.
  • the port number of the router to which each accelerator or accelerator pool is connected is also the event queue number or the destination port number of the event message.
  • Just-in-time compilation means that the highly dynamic operating system creates a compilation mapping table from semantic accelerator instructions to event queues according to the semantic accelerator and global routing table of accelerator management and routing management.
  • the format of the compilation mapping table is shown in Table 2.
  • the compilation mapping table is used as a check list for the operating system to determine whether to start just-in-time compilation when computing, managing and loading threads or programs.
  • Table 2:

    | Semantic accelerator instruction | Semantic accelerator/pool name | Event queue number | Data format        |
    | Fp32                             | Floating point calculation     | EQ-ID1             | (ax, bx, cx)       |
    | FFT                              | Fourier transform              | EQ-ID2             | (ax[], bx[], cx[]) |
    | ...                              | ...                            | ...                | ...                |
  • Computing management means that the highly dynamic operating system treats the micro-engine as a thread processor or container and provides the corresponding resource application API (Application Programming Interface) to the application, so that the application can dynamically create threads or tasks and exploit the highly dynamic computing capability of massive multi-thread and multi-task parallel computing; it also exposes the micro-engine's task-creation API to the application layer.
  • semantic-driven data services include: semantic data indexing, data management, memory allocation, and semantic addressing mapping.
  • Semantic data indexing refers to the service, provided by the highly dynamic operating system, of creating a structured shared-memory data index, which replaces the global address table of page + offset addresses and its metadata management and exposes semantic information externally; it is better suited to massive data sharing in scenarios such as many-core architectures, high-performance computing, and supercomputing.
  • Data management means that the highly dynamic operating system provides a data operation interface for "addition, deletion, modification and query" on the above-created memory shared data index, adding data to the above-mentioned index, and subsequent applications can also modify the data.
  • Memory allocation means that after the highly dynamic operating system adds data, it allocates the memory corresponding to the added data locally and associates it with the corresponding index. To improve memory access efficiency, the application layer should make the granularity of semantically shared data blocks as large as possible, which helps take full advantage of semantic data sharing.
  • Semantic addressing mapping means that when the highly dynamic operating system accesses shared data using external general semantics, it converts the external general semantics into the page + offset address form used inside the system to locate the data stored in local memory, as sketched below.
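  • Read that way, semantic addressing mapping is essentially a lookup from a semantic key to a page + offset pair. The following toy table-based version invents the keys, page size, and entries purely for illustration.

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define PAGE_SIZE 4096u   /* assumed page size */

/* Hypothetical record of the semantic data index. */
typedef struct {
    const char *semantic_key;   /* externally visible semantics   */
    uint32_t    page;           /* local placement: page + offset */
    uint32_t    offset;
} sem_entry_t;

static const sem_entry_t sem_index[] = {
    { "fft.result.block0",    9,   0 },
    { "video.frame.current",  7, 256 },
};

/* Semantic addressing mapping: semantics in, local memory address out. */
uint64_t sem_resolve(const char *key)
{
    for (size_t i = 0; i < sizeof sem_index / sizeof sem_index[0]; i++) {
        if (strcmp(sem_index[i].semantic_key, key) == 0)
            return (uint64_t)sem_index[i].page * PAGE_SIZE + sem_index[i].offset;
    }
    return 0;   /* not found */
}
```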
  • The main functions of the semantic-driven session service include: semantic session index, semantic acceleration library, semantic context management, and session performance management.
  • the semantic session index means that the highly dynamic operating system provides an interface for the application layer to create a data session and generates a corresponding index, which is also called a context ID (Context ID, CID).
  • Semantic acceleration library refers to the list of semantic acceleration libraries available to the operating system provided by the highly dynamic operating system, which is used to create multiple acceleration libraries involved in the context, and provides automatically and dynamically allocated acceleration pool services, without requiring the application to participate in specifying specific resources , allowing applications to automatically adapt to highly dynamic computing hardware.
  • Semantic context management means that the highly dynamic operating system provides context-related hardware configuration templates and configuration services for micro-engines, accelerators, and event queues, so that the application layer can flexibly create data sessions with complex logic; high-frequency, repetitive computing tasks handled in software are thus offloaded to the hardware to achieve energy-efficient computing.
  • Session performance management means that the highly dynamic operating system provides performance monitoring services for sessions created by the application layer against the performance requirements specified by the application layer, such as bandwidth, rate, and delay, and actively reports exceptions to the application layer in the event of performance degradation for subsequent optimization and adjustment, such as triggering routing reconstruction.
  • the highly dynamic operating system of the data processing system 100 discovers system hardware resources through semantic-driven computing services.
  • a highly dynamic operating system discovers system hardware resources, such as accelerators, micro-engines, and routing networks, through semantic-driven computing services.
  • the highly dynamic operating system can discover system hardware resources through semantic-driven computing services, create and save the corresponding system hardware resource list, and refresh the system hardware resource list if it detects hardware changes after a restart; otherwise the previous system hardware resource list can be used directly for a quick start.
  • After the data processing system 100 is started, the application layer first creates the required shared memory data through the semantic-driven data service of the highly dynamic operating system of the data processing system 100, and establishes the corresponding semantic data index and the local memory address list of the semantic addressing mapping.
  • the application layer can allocate a micro-engine through the semantic-driven computing service of the highly dynamic operating system of the data processing system 100, and load the code corresponding to the computing task; at the same time, the application layer can also use the The semantic-driven session service of the highly dynamic operating system of the data processing system 100 creates a data session, and exchanges high-frequency computing tasks through multiple semantic accelerators and micro-engines directly through the event queue.
• An embodiment of the present application provides a message processing method and device. The method includes: the first processing unit processes a first event message to obtain a second event message, where the first event message is received by the first processing unit or is generated by the first processing unit based on a processing request of an application program; the first processing unit sends the second event message to the second processing unit according to context information, where the context information includes routing information from the first processing unit to the second processing unit and is generated based on the processing request of the application program. The first processing unit is a first engine and the second processing unit is a second accelerator, or the first processing unit is a first accelerator and the second processing unit is a second engine, or the first processing unit is a first engine and the second processing unit is a second engine, or the first processing unit is a first accelerator and the second processing unit is a second accelerator.
• In this way, the transmission of event messages between different processing units is realized based on context information. Compared with scheduling the transmission of event messages through a scheduling method (for example, using a scheduler for message scheduling), the above implementation can avoid the performance bottleneck caused by transmission scheduling and can therefore improve system processing performance.
  • the message processing method in the embodiment of the present application may be applied to the data processing system 100 shown in FIG. 1 .
• The following description takes the case where the engine is a fusion computing micro-engine as an example. It should be noted that, in the embodiments of the present application, the fusion computing micro-engine may also be referred to as a micro-engine for short.
  • the highly dynamic operating system receives the processing request of the application program, acquires the semantics of the processing request, and determines at least two tasks included in the processing request according to the semantics of the processing request.
  • the tasks included in the processing request have a one-to-one correspondence with the task semantics.
  • the semantics of the processing request include at least two task semantics, and a corresponding task is determined according to each of the at least two task semantics.
  • At least two tasks included in the processing request may be a first task and a second task
  • the first task corresponds to the semantics of the first task
  • the second task corresponds to the semantics of the second task
• The semantics of the processing request include the first task semantics and the second task semantics, where the first task is different from the second task and the first task semantics are different from the second task semantics.
• When, in response to the received processing request of the application program, at least two tasks belonging to that processing request are established, the highly dynamic operating system also responds to the received processing request by determining, according to the resource configuration information of the application program, the computing resources for executing the processing request. The computing resources include at least a first computing resource, a second computing resource and a third computing resource, and the operating system generates a context of the application program, where the context includes at least routing information from the first computing resource to the second computing resource and from the second computing resource to the third computing resource.
• The system can also establish the communication links between the computing resources according to the context and the event queue of each computing resource. It can be understood that the number of computing resources used to execute the processing request can be three, four or more, and the technical solution of the present application places no specific limit on the number of computing resources that can be allocated to execute the processing request.
  • the computing resources used for processing the request are computing resource Resource1, computing resource Resource2, computing resource Resource3, and computing resource Resource4 as an example.
  • the computing resource Resource1 and the computing resource Resource3 may be two different microengines, and the computing resource Resource2 and the computing resource Resource4 may be two different microengines.
• In some embodiments, computing resource Resource1, computing resource Resource2 and computing resource Resource3 can also be three different micro-engines while computing resource Resource4 is an accelerator; in some other embodiments, computing resource Resource1 and computing resource Resource4 can be two different micro-engines while computing resource Resource2 and computing resource Resource3 are two different accelerators.
• The highly dynamic operating system also creates at least two threads corresponding to the at least two tasks and loads the at least two threads to run on at least two engines, where different threads run on different engines and different threads correspond to different tasks.
  • the computing resource Resource1 and the computing resource Resource3 are two different microengines, and the computing resource Resource2 and the computing resource Resource4 can be two different accelerators as an example for illustration.
  • computing resource Resource1, computing resource Resource2, computing resource Resource3, and computing resource Resource4 can be recorded as microengine XPU_A, accelerator SDA_A, microengine XPU_B, and accelerator SDA_B, respectively.
  • the computing resources of the first task may include microengine XPU_A and accelerator SDA_A
  • the computing resources of the second task may include microengine XPU_B and accelerator SDA_B.
• The highly dynamic operating system creates, on the micro-engine XPU_A, a first thread corresponding to the first task, and creates, on the micro-engine XPU_B, a second thread corresponding to the second task.
  • the microengine XPU_A is different from the microengine XPU_B
  • the accelerator SDA_A is different from the accelerator SDA_B
  • the accelerator SDA_A corresponds to the first event queue.
• Each thread, accelerator and application/CPU may have its own corresponding event queue, and a thread or accelerator forwards an event message that needs to be processed downstream to the event queue of the next-level processing unit through its own event queue, where the next-level processing unit can be a thread, an accelerator, or an application/CPU.
• Establishing, in response to the received processing request, two tasks belonging to the processing request of the application program, that is, the first task and the second task, is only an example used to illustrate the message processing method in the embodiment of the present application.
• Multiple tasks belonging to the processing request of the application program can also be established, for example the first task, the second task, ..., the Nth task, together with the thread corresponding to each task.
  • the computing resources used by the first task and the second task are determined according to the resource configuration information of the application program.
  • the computing resources of the first task include the microengine XPU_A and the accelerator SDA_A
• The computing resources of the second task include the micro-engine XPU_B and the accelerator SDA_B. The case where the number of accelerators in the computing resources used by the first task and the second task is one is only used to illustrate the process of determining the computing resources used by the tasks.
  • the computing resources corresponding to at least one task in the multiple tasks include an engine and at least one accelerator;
• The number of accelerators in the computing resources corresponding to a task can be 0, 1, 2 or more. That is, a task belonging to the processing request of the application program does not necessarily use one engine and one accelerator as its computing resources; an individual task can also use only one engine without any accelerator, or use one engine together with multiple accelerators.
  • the resource configuration information is the received parameter sent by the application layer.
• The resource configuration information includes a trigger event. During the startup process of the application program, determining the task corresponding to a processing request of the application program can be achieved in the following manner: in response to the occurrence of the trigger event, the processing request of the application corresponding to the trigger event is determined, and the task corresponding to that processing request is determined.
  • the triggering event is a pre-set event for starting the processing request after the data processing system loads the data processing task software package of the application program.
  • a video call terminal is a typical scenario of edge intelligent computing.
• Video call terminals support artificial intelligence computations such as face recognition and background replacement, which require increasingly high computing power together with low power consumption, and are suitable for scenarios such as mobile office and emergency command.
  • FIG. 11 is a schematic diagram of a design solution of an edge intelligent computing provided in an embodiment of the present application.
  • the video call terminal 1100 is obtained by extending the existing hardware, and it is considered to reuse the existing hardware to the greatest extent.
  • the CPU can fully utilize existing hardware, such as CPUs with x86 architecture, ARM architecture, RISC-V architecture, etc. Compared with the existing hardware, the following extensions are made:
  • PCI-E Peripheral Component Interconnect Express, peripheral component interconnection standard
  • AMBA Advanced Microcontroller Bus Architecture, on-chip bus protocol
• The call software should support capabilities such as a dispatching center, so that threads for audio collection, audio and video codec, network session and the like can be deployed onto the highly dynamic computing hardware;
• Add highly dynamic computing hardware: configure the corresponding micro-engines, routing networks and accelerators (such as FFT transformation, video rendering and DNN network accelerators) and connect them with the corresponding peripherals (video memory, camera, network card, microphone, etc.).
  • the trigger event may be clicking a call button.
  • dynamic resource allocation is performed based on the trigger event of the "click to talk button".
  • the first computing resource is XPU 3 in FIG. 11
  • the second computing resource is signal processing accelerator 1 in FIG. 11
  • the third computing resource is XPU0 in FIG. 11
  • the fourth computing resource is audio accelerator 1 in FIG. 11 .
• When the application starts, the data processing system receives the voice call processing request Voice01 corresponding to the "click the call button" trigger event, responds to the voice call processing request Voice01 of the application program, and obtains the semantics of the voice call processing request Voice01. For example, the semantics of the voice call processing request Voice01 may be "voice session". Assuming that the semantics "voice session" of the voice call processing request Voice01 include the first task semantics "audio collection" and the second task semantics "audio processing", the highly dynamic operating system determines, according to the semantics "voice session" of the voice call processing request Voice01, multiple tasks corresponding to the request, the multiple tasks including at least the first task and the second task. Assuming that the first task is an audio collection task and the second task is an audio processing task, the audio collection task corresponds to the first task semantics "audio collection" and the audio processing task corresponds to the second task semantics "audio processing".
• The above-mentioned audio collection task and audio processing task belong to the voice call processing request Voice01.
  • the embodiments of the present application do not limit the number of task semantics included in the processing request semantics.
• When the semantics of the processing request include N task semantics, the data processing system may accordingly determine the N tasks included in the processing request.
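• As a rough illustration of the mapping from request semantics to tasks described above, the following Python sketch derives one task per task semantics; the dictionary contents are a made-up example, not a defined interface.

```python
# Illustrative mapping from processing-request semantics to task semantics.
semantic_tasks = {"voice session": ["audio collection", "audio processing"]}

def tasks_for(request_semantics: str) -> list:
    """Create one task per task semantics contained in the request semantics."""
    return [f"{s} task" for s in semantic_tasks[request_semantics]]

print(tasks_for("voice session"))  # ['audio collection task', 'audio processing task']
```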
• The computing resources used to execute the voice call processing request Voice01 are determined according to the resource configuration information of the application program; the computing resources include XPU 3, signal processing accelerator 1, XPU 0 and audio accelerator 1 in FIG. 11. The context of the application program is generated, and the context includes routing information from XPU 3 to signal processing accelerator 1, from signal processing accelerator 1 to XPU 0, and from XPU 0 to audio accelerator 1.
• Communication links are established according to the context and the event queue of each computing resource. For example, a first communication link is established between XPU 3 and signal processing accelerator 1, and a second communication link is established between XPU 0 and audio accelerator 1. An audio collection thread for processing the audio collection task is created on XPU 3, and an audio processing thread for processing the audio processing task is created on XPU 0; the audio collection thread corresponds to the audio collection task, and the audio processing thread corresponds to the audio processing task. A possible form of the resulting context is sketched below.
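• For illustration only, the context generated for the Voice01 request could be modeled as an ordered routing chain over the event queues of the allocated computing resources; the Python data structure below is an assumption, not the context format defined by this application.

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    cid: str
    # Ordered routing chain: (processing unit, its event queue), in transfer order.
    route: list = field(default_factory=list)

voice01_ctx = Context(
    cid="CID1",
    route=[
        ("audio collection thread on XPU 3", "event queue 3"),
        ("signal processing accelerator 1", "event queue 4"),
        ("audio processing thread on XPU 0", "event queue 0"),
        ("audio accelerator 1", "event queue 5"),
    ],
)

def next_hop_queue(ctx: Context, current_unit: str) -> str:
    """Event queue of the processing unit that follows current_unit in the chain."""
    units = [u for u, _ in ctx.route]
    return ctx.route[units.index(current_unit) + 1][1]

print(next_hop_queue(voice01_ctx, "signal processing accelerator 1"))  # event queue 0
```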
  • a context identifier can also be set, and the context identifier is used to indicate the context of the application program.
  • the context identifier CID1 may indicate the context of the application program generated by the above-mentioned video call terminal 1100, and the context includes routing information from XPU 3 to signal processing accelerator 1, signal processing accelerator 1 to XPU 0, and XPU 0 to audio accelerator 1.
  • the highly dynamic operating system may determine the computing resources used by the audio collection task and the audio processing task according to the resource configuration information of the application program. For example, it can be determined that the computing resources of the audio collection task include XPU 3 and signal processing accelerator 1 in FIG. 11 , and the computing resources of the audio processing task include XPU 0 and audio accelerator 1 in FIG. 11 .
• The first processing unit or the second processing unit is selected from the plurality of processing units based on the status information of the plurality of processing units at the time the processing request of the application program is received, and the status information of a processing unit includes network topology performance.
  • determining computing resources for executing processing requests is specifically allocating computing resources for processing requests based on hardware state information when processing requests are received, and hardware state information includes network topology performance.
• In this way, the real-time status of the hardware can be considered, and optimal hardware can be allocated to the first task and the second task on the premise of meeting their requirements.
• A hardware state table is established according to the states of all hardware, and the table is automatically updated whenever the state of the hardware changes; when computing resources are assigned to the first task and the second task, the parameters in the hardware state table are consulted.
• The considered hardware state parameters include network topology performance in addition to resource usage.
  • the network topology performance specifically includes the link relationship, throughput, available routes, available bandwidth, and delay of the network topology.
• Allocating computing resources for the audio collection task and the audio processing task may specifically be allocating these computing resources based on the hardware state information at the time the voice call processing request is received, where the hardware state information includes network topology performance.
• The above-mentioned optimally allocated hardware may be the hardware with the best current performance, or the hardware whose performance best matches the task, so as to avoid wasting resources.
  • the hardware state information can be obtained by creating a hardware state list and refreshing it in real time, or by obtaining the hardware state of each hardware when computing resources are configured.
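• A rough sketch of such a hardware state table and a selection rule is given below; the field names and the scoring rule are assumptions for illustration, whereas the actual allocation combines resource usage with the link relationship, throughput, available routes, bandwidth and delay of the network topology as described above.

```python
from dataclasses import dataclass

@dataclass
class HwState:
    name: str
    load: float            # current resource usage, 0.0 - 1.0
    bandwidth_gbps: float  # available bandwidth on the routing network
    delay_us: float        # latency to the requesting unit

hardware_state_table = [
    HwState("XPU 0", load=0.20, bandwidth_gbps=8.0, delay_us=2.0),
    HwState("XPU 3", load=0.05, bandwidth_gbps=6.0, delay_us=1.5),
    HwState("XPU 7", load=0.90, bandwidth_gbps=8.0, delay_us=2.0),
]

def pick_engine(min_bandwidth: float, max_delay: float) -> str:
    """Pick the least-loaded engine whose topology performance meets the task needs."""
    candidates = [h for h in hardware_state_table
                  if h.bandwidth_gbps >= min_bandwidth and h.delay_us <= max_delay]
    return min(candidates, key=lambda h: h.load).name

print(pick_engine(min_bandwidth=4.0, max_delay=3.0))  # -> "XPU 3"
```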
• The process of determining the computing resources corresponding to the audio collection task and the audio processing task may specifically be: when the trigger event "click the call button" occurs, the voice call processing request Voice01 is started.
• An audio collection task and an audio processing task corresponding to the processing request are generated; an audio collection thread for processing the audio collection task is created on XPU 3, and an audio processing thread for processing the audio processing task is created on XPU 0.
  • the computing resources corresponding to the audio collection task include XPU 3 and signal processing accelerator 1
  • the computing resources corresponding to the audio processing task include XPU 0 and audio accelerator 1.
  • the computing resource corresponding to a task may include an engine and an accelerator, or may include an engine and multiple accelerators; some tasks in multiple tasks may also include only an engine.
  • a possible implementation manner is to further include the following steps when the application starts:
  • step A1 resource configuration information of the application is acquired in response to the start of the application.
  • the resource configuration information includes engine quantity, accelerator type and accelerator quantity.
  • the accelerator pool Pool1 includes 10 signal processing accelerators
  • the accelerator pool Pool2 includes 10 audio accelerators
  • the total number of microengines is 20.
• The obtained resource configuration information of the application program includes: the engine type is micro-engine, the number of micro-engines is "2", the accelerator types are "signal processing accelerator" and "audio accelerator", the number of accelerators corresponding to the accelerator type "signal processing accelerator" is "1", and the number of accelerators corresponding to "audio accelerator" is "1".
  • step A2 the engine used by the application is selected according to the resource configuration information and the load of the candidate engine.
  • the selected engine includes the first engine and/or the second engine.
• The selected micro-engines include micro-engine XPU 3 and micro-engine XPU 0, where micro-engine XPU 3 is different from micro-engine XPU 0.
• The selection of the engine used by the application may be to select a specified number of micro-engines from the candidate engines in ascending order of load rate; alternatively, a specified number of micro-engines meeting a load requirement may be selected from the candidate engines, where the load requirement can be derived from the resource configuration information.
  • Step A3 according to the resource configuration information, select the accelerator used by the application, and the selected accelerator includes the first accelerator and/or the second accelerator.
  • the accelerator pool corresponding to “signal processing accelerator” is accelerator pool Pool1
  • the accelerator pool corresponding to “audio accelerator” is accelerator pool Pool2.
  • the accelerator used by the application program selected from the accelerator pool Pool1 includes the signal processing accelerator 1
  • the accelerator used by the application program selected from the accelerator pool Pool2 includes the audio accelerator 1, wherein the signal processing accelerator 1 is different from the audio accelerator 1.
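• Steps A1 to A3 can be illustrated with the following Python sketch; the accelerator pool contents and load figures are invented for the example and are not part of the claimed method.

```python
def select_engines(candidates: dict, count: int) -> list:
    """Pick `count` micro-engines in ascending order of load rate (step A2)."""
    return sorted(candidates, key=candidates.get)[:count]

def select_accelerators(pools: dict, wanted: dict) -> dict:
    """Pick the requested number of accelerators per type from each pool (step A3)."""
    return {t: pools[t][:n] for t, n in wanted.items()}

engine_load = {"XPU 0": 0.10, "XPU 1": 0.70, "XPU 2": 0.40, "XPU 3": 0.05}
pools = {
    "signal processing accelerator": [f"signal processing accelerator {i}" for i in range(1, 11)],
    "audio accelerator": [f"audio accelerator {i}" for i in range(1, 11)],
}
# Resource configuration information acquired in step A1 (example values).
resource_config = {"engines": 2,
                   "accelerators": {"signal processing accelerator": 1, "audio accelerator": 1}}

print(select_engines(engine_load, resource_config["engines"]))        # ['XPU 3', 'XPU 0']
print(select_accelerators(pools, resource_config["accelerators"]))
```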
  • a possible implementation is to establish a first communication link between the XPU 3 and the signal processing accelerator 1, specifically to establish a communication link between the XPU 3 and the event queue 4, and the event queue 4 corresponds to the signal processing accelerator 1 .
  • the audio acquisition thread running on the XPU 3 can send the event message Mes.1 to the event queue 4, and the signal processing accelerator 1 can obtain the event message Mes.1 from the event queue 4.
  • establishing a second communication link between XPU 0 and audio accelerator 1 may specifically be establishing a communication link between XPU 0 and event queue 5, and event queue 5 corresponds to audio accelerator 1.
  • the audio processing thread running on XPU 0 can send the event message Mes.3 to the event queue 5, and the audio accelerator 1 can obtain the event message Mes.3 from the event queue 5.
• In this way, the audio collection thread can send the event message Mes.1 to event queue 4.
• The re-translated instructions of the signal processing accelerator 1 are obtained by loading the signal processing accelerator 1, assigning the identifier of event queue 4 to the signal processing accelerator 1, and modifying the machine code of the signal processing accelerator 1 according to the identifier of event queue 4;
  • the audio collection thread sends an event message to the event queue 4 .
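• The role of an event queue as the hand-off point between the audio collection thread and the signal processing accelerator 1 can be illustrated with the following sketch, in which Python's queue.Queue stands in for the hardware event queue 4; this is purely an illustrative assumption.

```python
import queue

# Illustrative stand-in for the hardware event queue consumed by
# signal processing accelerator 1 (queue numbers follow the example above).
event_queue_4 = queue.Queue()

# The audio collection thread pushes Mes.1 into event queue 4 ...
event_queue_4.put({"cid": "CID1", "name": "Mes.1", "payload": "ADC samples"})

# ... and the signal processing accelerator 1 pulls it from that queue.
mes1 = event_queue_4.get()
print("signal processing accelerator 1 received", mes1["name"])
```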
  • this application defines a new event message information format, that is, system information transmitted on a highly elastic network through the event queue in FIG. 1 .
  • the event message of the data processing system adopts the subframe format of the highly elastic network shown in FIG.
  • the message attribute information field is used to carry event message routing information, and the event message routing information includes a target event queue identifier.
  • the target event queue identifier can be the identifier of signal processing accelerator 1 event queue 4;
• the network layer message length field is used to carry the total length information of the event message Mes.1;
  • the network layer data field is used to carry the payload of the event message Mes.1.
  • the network layer data domain includes the operating system layer event information domain
• The operating system layer event information domain includes at least one of the following: routing scope, context identifier, source message queue identifier or custom attribute, where the routing scope includes at least one routing domain.
  • the predefinition of the system subframe can adopt the following types:
  • the data field of this subframe is the routing domain ID where the destination is located;
  • the data field of the subframe is the data session ID to which the frame belongs;
  • the data field of the subframe is the ID of the queue that sent the frame, and if the subframe is transmitted across domains, it is also necessary to carry the routing range in the subframe;
  • the data field of the subframe is the data transmitted by the operating system service, for example: configuration data, program image, etc.
  • the network layer data domain includes the application layer event information domain
  • the application layer event information domain includes custom information of the application layer
• The operating system can agree on its own "grandchild frames", where the "grandchild frame" can also follow the KLV format, so that the network can participate in frame parsing and improve forwarding efficiency.
  • the predefinition of the system subframe may also include the following types:
  • the relationship between the application layer event information domain, the operating system layer event information domain and the network layer data domain can be referred to in FIG. 9 .
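• As an illustration of the KLV layering described above, the sketch below packs a few operating-system-layer subframes into a network-layer data field; the one-byte keys and the length encoding are assumptions for illustration, not the frame format defined by this application.

```python
import struct

# Assumed one-byte subframe keys (illustrative only).
K_ROUTING_DOMAIN = 0x01   # routing domain ID of the destination
K_CONTEXT_ID     = 0x02   # data session (context) ID the frame belongs to
K_SOURCE_QUEUE   = 0x03   # ID of the queue that sent the frame
K_OS_DATA        = 0x04   # data carried for an operating system service

def pack_klv(key: int, value: bytes) -> bytes:
    """Key (1 byte) + length (2 bytes, network order) + value."""
    return struct.pack("!BH", key, len(value)) + value

def unpack_klv(buf: bytes):
    off = 0
    while off < len(buf):
        key, length = struct.unpack_from("!BH", buf, off)
        off += 3
        yield key, buf[off:off + length]
        off += length

# Operating-system-layer event information domain inside the network layer data field.
payload = (pack_klv(K_CONTEXT_ID, b"CID1")
           + pack_klv(K_SOURCE_QUEUE, b"EQ3")
           + pack_klv(K_OS_DATA, b"config"))

# Message attribute info (target queue) + network-layer length field + data field.
frame = pack_klv(0x10, b"EQ4") + struct.pack("!H", len(payload)) + payload

for key, value in unpack_klv(payload):
    print(hex(key), value)
```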
  • the embodiment of the present application provides a message processing method, which processes event messages after dynamic resource allocation based on events.
  • the process of processing messages may include the following steps:
  • Step S1201 the first processing unit receives a first event message.
  • the first processing unit may be a first microengine or a first accelerator.
  • the first processing unit may refer to the signal processing accelerator 1, or may refer to the microengine XPU 0.
  • the first processing unit is the signal processing accelerator 1 as an example for description.
• The video call terminal 1100 can transmit event messages between signal processing accelerator 1 and XPU 0. During message processing in the video call terminal 1100, an event message is transmitted between signal processing accelerator 1 and XPU 0: first, the signal processing accelerator 1 obtains the event message Mes.1.
  • the first processing unit is a first microengine
  • the first event message may be generated by the first processing unit based on a processing request of an application program.
  • Step S1202 the first processing unit processes the first event message to obtain the second event message.
  • the signal processing accelerator 1 processes the event message Mes.1 to obtain the event message Mes.2.
• The context further includes operation configuration information. That the first processing unit processes the first event message to obtain the second event message may specifically be: the first processing unit acquires, from the context, the first operation configuration information corresponding to the first processing unit; the first processing unit then processes the first event message according to the first operation configuration information.
  • the context also includes operation configuration information for computing resources; the computing resources include microengines and accelerators; when the application starts, the context and the context identifier are allocated according to the resource configuration information.
• The context identifier is used to indicate the context of the application.
  • the context identifier is included in all event messages corresponding to the same processing request of the application program, for example, the first event message and the second event message, and the context identifier can be used to obtain the context.
• For example, the context includes operation configuration information CZXX1 for the computing resources, where the operation configuration information CZXX1 is "CID1, in: ADC, via: FFT, via: SHT, out: Fra, bit width, number of sampling points, period, data sub-block time slice, double floating-point precision, ...".
  • the context corresponding to the voice call processing request Voice01 and the context identifier CID1 are allocated according to the resource configuration information, and the context identifier CID1 is included in the event message Mes.1, event message Mes.2 and event message Mes.3.
  • the context identifier CID1 may be used to acquire the operation configuration information CZXX1 corresponding to the voice call processing request Voice01.
• The process by which the signal processing accelerator 1 processes the event message Mes.1 is as follows: first, according to the context identifier CID1 included in the event message Mes.1, the corresponding first operation configuration information CZXX1_1 for the signal processing accelerator 1 is obtained; for example, let the first operation configuration information CZXX1_1 be "perform FFT transformation on received event messages of this context ID"; then, the signal processing accelerator 1 processes Mes.1 according to the first operation configuration information CZXX1_1.
• Similarly, when the audio accelerator 1 processes the event message Mes.3, the audio accelerator 1 may first obtain, according to the context identifier CID1 included in the event message Mes.3, the corresponding second operation configuration information CZXX1_2 for the audio accelerator 1; assuming that the second operation configuration information CZXX1_2 is "encode received event messages of this context ID to MP4", the audio accelerator 1 then processes Mes.3 according to the second operation configuration information CZXX1_2.
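• A minimal sketch of this lookup is given below: a processing unit fetches its per-context operation configuration from the context identifier carried in the event message and then processes the payload. The table contents paraphrase the CZXX1 examples above; the function and variable names are hypothetical.

```python
# Per-processing-unit operation configuration, indexed by context identifier.
operation_config = {
    ("CID1", "signal processing accelerator 1"): "apply FFT to event messages of this context",
    ("CID1", "audio accelerator 1"): "encode event messages of this context to MP4",
}

def process(unit: str, event_message: dict) -> dict:
    cfg = operation_config[(event_message["cid"], unit)]   # step 1: fetch config by CID
    # step 2: process the payload according to cfg (placeholder transformation here)
    return {"cid": event_message["cid"], "payload": f"{cfg} -> {event_message['payload']}"}

mes1 = {"cid": "CID1", "payload": "raw ADC samples"}
mes2 = process("signal processing accelerator 1", mes1)
print(mes2["payload"])
```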
  • Step S1203 the first processing unit sends the second event message to the second processing unit according to the context information, where the context information includes routing information from the first processing unit to the second processing unit.
  • the second processing unit may be a second microengine or a second accelerator, and the context information is generated based on a processing request of an application program.
• When the first processing unit and the second processing unit transmit event messages, it may specifically be that: the first processing unit is the first micro-engine and the second processing unit is the second accelerator; or the first processing unit is the first accelerator and the second processing unit is the second micro-engine; or the first processing unit is the first micro-engine and the second processing unit is the second micro-engine; or the first processing unit is the first accelerator and the second processing unit is the second accelerator.
  • the first processing unit is a signal processing accelerator 1
  • the second processing unit is a microengine XPU 0.
  • Signal processing accelerator 1 sends event message Mes.2 to microengine XPU 0 according to the context.
  • the context includes routing information from signal processing accelerator 1 to microengine XPU 0.
• In a possible implementation in which the first processing unit sends the second event message to the second processing unit according to the context information, the first processing unit may first send the second event message, according to the routing information, to the event queue corresponding to the second processing unit; then, the second processing unit acquires the second event message from that event queue.
• Each computing resource, including threads and accelerators, has its own event queue; a thread or accelerator sends event messages that need to be processed by other computing resources to the downstream micro-engine/accelerator, that is, messages are sent through the thread's or accelerator's own event queue.
  • the application/CPU may also have its own event queue, so that event messages can be transmitted among the application/CPU, threads, and accelerators.
  • a thread sends an event message through its own corresponding event queue, it specifically forwards the event message through the event queue of the microengine it is created in.
  • the event queue of the microengine is the event queue of threads running on the microengine.
  • event queue 4 corresponds to signal processing accelerator 1
  • event queue 3 corresponds to audio collection thread
  • event queue 0 corresponds to audio processing thread
  • audio accelerator 1 corresponds to event queue 5 in FIG. 11 .
  • the audio collection thread on the XPU 3 obtains the data request Data-1, and then sends the event message Mes.1 generated according to the data request Data-1 to the event queue 4 through the event queue 3 according to the routing information included in the context of the application program;
• In response to event queue 4 receiving the event message Mes.1, the signal processing accelerator 1 acquires the event message Mes.1 from event queue 4, processes it and generates the event message Mes.2, and then, according to the routing information included in the context of the application program, sends the event message Mes.2 to event queue 0 corresponding to XPU 0. The audio processing thread running on XPU 0 generates the event message Mes.3 based on the event message Mes.2 and then, according to the routing information included in the context of the application program, sends Mes.3 to event queue 5 through event queue 0; after Mes.3 is sent to event queue 5, the audio accelerator 1 obtains the event message Mes.3 from event queue 5 and processes it.
  • the second event message includes a target event queue identifier
  • the target event queue identifier is a queue identifier of an event queue corresponding to the second processing unit.
• That the first processing unit sends the second event message to the event queue corresponding to the second processing unit according to the routing information may be: the first processing unit determines the event message routing information to be added, where the event message routing information includes a target event queue identifier and the target event queue identifier is the queue identifier of the event queue corresponding to the second processing unit; the first processing unit adds the event message routing information to the second event message; the first processing unit sends the second event message to which the event message routing information has been added, and that second event message is sent to the event queue corresponding to the second processing unit.
  • the event message routing information may also be referred to as flow information, and the routing information included in the context information may also be referred to as flow sequence information corresponding to the application.
  • the context identifier is used to indicate the context of the application, and can indicate the flow sequence information corresponding to the application.
  • the signal processing accelerator 1 sends the event message Mes.2 to the event queue 0 corresponding to the microengine XPU 0 according to the routing information included in the context of the application program.
• Specifically, the signal processing accelerator 1 obtains, according to the context identifier CID1 included in the event message, the flow sequence information corresponding to the application program. Assuming that the flow sequence information is "CID1, event queue 3, event queue 4, event queue 0, event queue 5", representing the transfer order audio collection thread, signal processing accelerator 1, audio processing thread, audio accelerator 1, the signal processing accelerator 1 then determines, according to the flow sequence information, the flow information to be added to the event message Mes.2.
  • the transfer information includes the target event queue identifier, and the target event queue identifier included in the transfer information of the event message Mes.2 is the queue identifier of the event queue 0 corresponding to the microengine XPU 0.
  • the signal processing accelerator 1 adds the aforementioned determined flow information to the event message Mes.2.
  • the signal processing accelerator 1 can send the event message Mes.2 with the flow information added, and the event message Mes.2 with the flow information added is sent to the event queue 0 corresponding to the microengine XPU 0.
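• The determination of the flow (routing) information from the flow sequence information can be sketched as follows; the data layout is an assumption for illustration only.

```python
# Flow sequence information per context: the ordered list of event queues.
flow_sequence = {"CID1": ["event queue 3", "event queue 4", "event queue 0", "event queue 5"]}

def add_flow_info(event_message: dict, my_queue: str) -> dict:
    """Stamp the outgoing message with the queue ID of the next processing unit."""
    order = flow_sequence[event_message["cid"]]
    target = order[order.index(my_queue) + 1]      # next hop after this unit's queue
    event_message["routing"] = {"target_event_queue": target}
    return event_message

mes2 = {"cid": "CID1", "name": "Mes.2", "payload": "FFT output"}
# Signal processing accelerator 1 owns event queue 4; the next hop is event queue 0 (XPU 0).
print(add_flow_info(mes2, "event queue 4")["routing"])
```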
  • the routing information further includes a target routing field, where the target routing field is used to indicate the target server, the target server is different from the source server, and the source server is the server where the first processing unit is located.
  • the flow sequence information corresponding to the application program also includes a first target routing field.
  • the flow information also includes a first target routing field.
• The first target routing field is used to indicate the first target server, and the first target server is not the same server as the source server where the signal processing accelerator 1 in FIG. 11 is located.
  • a thread or an accelerator can obtain routing information according to the context and forward the event message requiring downstream processing to a next-level processing unit, which can be a thread or an accelerator, or an application/CPU.
  • the process of one processing unit sending an event message to another processing unit is similar to the process of signal processing accelerator 1 sending event message Mes.2 to microengine XPU 0, and will not be repeated here.
• When the first processing unit is a micro-engine, a thread running on the micro-engine can obtain a data request and generate the first event message based on the data request.
  • the data request is request information for requesting a response to specific data corresponding to the processing request of the application program.
  • the processing request may be a data acquisition request or a data processing request. Wherein, the data acquisition request is used to request to obtain the target data corresponding to the data information contained in the request message, and the data processing request is used to request to process the data information contained in the request message.
  • the data request is a data processing request Data-1 as an example for description, and Data-1 is used to request to respond to the digital signal corresponding to the trigger event of "click to talk".
• When the application starts, the data processing system receives the voice call processing request Voice01, and the audio collection thread running on micro-engine XPU 3 collects the audio signal from the microphone through the ADC, obtains the data request Data-1 corresponding to the "click to talk button" trigger event, and generates the event message Mes.1 according to the data request Data-1, see FIG. 13.
  • Video call terminal 1100 can transmit event messages between microengine XPU 0 and audio accelerator 1.
  • the process of transmitting event messages between microengine XPU 0 and audio accelerator 1 is similar to the process of transmitting event messages between signal processing accelerator 1 and XPU 0.
• When an event message is transmitted between micro-engine XPU 0 and audio accelerator 1, the micro-engine XPU 0 first obtains the event message Mes.2; the micro-engine XPU 0 processes the event message Mes.2 to obtain the event message Mes.3; the micro-engine XPU 0 then sends the event message Mes.3 to the audio accelerator 1 according to the context.
  • the context includes routing information from microengine XPU0 to audio accelerator 1.
  • the above-mentioned event message Mes.1, event message Mes.2, and event message Mes.3 include a context identifier, such as a context identifier CID1.
  • the context identifier CID1 is used to indicate the context of the application program.
• The message processing method further includes releasing a first thread, where the first thread is one of the at least two threads; if, after the release, no thread is running on the engine where the first thread was located before being released, that engine is closed.
• In response to receiving an instruction to release the first thread, the first thread running on the engine is released; if, after the first thread is released, there is no thread running on the engine where the first thread was located before being released, the engine where the first thread was located before being released is closed.
  • the instruction to release the first thread may be generated in response to a release event corresponding to the trigger event.
  • the data processing system releases the first thread running on the first microengine.
  • the release event is an event set to stop data processing corresponding to the processing request after the processing request is started.
  • the release event may be clicking the stop call key or hanging up the video call call.
• The video call terminal 1100 releases the audio collection thread running on XPU 3 in response to receiving an instruction to release the audio collection thread, the instruction corresponding to the occurrence of the release event "click the stop call button".
• After the audio collection thread running on XPU 3 is released, if there are no more running threads on XPU 3, XPU 3 will be turned off to achieve near-zero standby power consumption.
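• The release-and-power-off behaviour can be sketched as simple bookkeeping over the threads of each engine; the structures and function names below are illustrative assumptions, not the claimed mechanism.

```python
# Threads currently running on each micro-engine (illustrative bookkeeping only).
engine_threads = {"XPU 3": {"audio collection thread"}, "XPU 0": {"audio processing thread"}}

def power_off(engine: str) -> None:
    print(f"{engine} powered off (near-zero standby power consumption)")

def release_thread(engine: str, thread: str) -> None:
    """Release a thread; close the engine if it has no remaining threads."""
    engine_threads[engine].discard(thread)
    if not engine_threads[engine]:
        power_off(engine)

# "Click the stop call button" triggers the release of the audio collection thread.
release_thread("XPU 3", "audio collection thread")
```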
  • the data request is a request for responding to specific data corresponding to the processing request of the application program.
  • the processing request may be a data acquisition request, and may also be a data processing request, wherein the data acquisition request is used to request to obtain data information, and the data processing request is used to request to process the data information included in the request message.
  • the data request may be a request to obtain data according to the specific data corresponding to the processing request of the application program; in some other embodiments, the data request may be a request for processing the application program The specific data corresponding to the request is processed.
  • a possible implementation manner is that the data request is used to request acquisition of target data, the target data is stored in the memory of the second server, and the computing resource for executing the processing request further includes a third processing unit and a fourth processing unit; at least two An engine includes a first processing unit, a second processing unit and a third processing unit; the fourth processing unit is an accelerator; the first event message and the second event message include the identification of the target data, and the first processing unit and the second processing unit Located on the first server, the third processing unit and the fourth processing unit are located on the second server; the context also includes routing information from the second processing unit to the third processing unit, and from the third processing unit to the fourth processing unit;
  • the method further includes:
  • the second processing unit encapsulates the second event message based on the second event message to generate a third event message
  • the second processing unit sends the third event message to the third processing unit located in the second server according to the context;
  • the third processing unit decapsulates the third event message based on the third event message to obtain a fourth event message, and sends the fourth event message to the fourth processing unit according to the context;
  • the fourth processing unit acquires the identifier of the target data from the received fourth event message, acquires the target data from the memory of the second server according to the identifier of the target data, and obtains the fifth event message according to the target data; the fifth event message is used for The target data is sent to the first server.
  • the data request can be a data acquisition request Req1, and Req1 is used to request acquisition of target data
  • the target data is stored in the memory of the second server S2
• the computing resources for executing the processing request include micro-engine XPU 3', micro-engine XPU 1', micro-engine XPU 0" and semantic memory accelerator 1"
• event message Mes.1' and event message Mes.2' include the target data identifier DTM1; micro-engine XPU 3' and micro-engine XPU 1' are located at the first server S1, and micro-engine XPU 0" and semantic memory accelerator 1" are located at the second server S2
• the context includes at least routing information from micro-engine XPU 3' to micro-engine XPU 1', from micro-engine XPU 1' to micro-engine XPU 0", and from micro-engine XPU 0" to semantic memory accelerator 1"
• the event message processing method includes: the micro-engine XPU 3' sends the event message Mes.1' to the micro-engine XPU 1' according to the context, and so on for the subsequent steps described above.
• Based on the context, the transmission of event messages between different processing units is realized; this method can avoid the performance bottleneck caused by transmission scheduling and can therefore improve system processing performance.
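• The cross-server encapsulation and decapsulation steps described above can be sketched as follows; the wrapper format and server names are assumptions for illustration.

```python
def encapsulate(event_message: dict, target_server: str) -> dict:
    """Second processing unit wraps the event message for cross-server transport."""
    return {"target_server": target_server, "inner": event_message}

def decapsulate(wrapped: dict) -> dict:
    """Third processing unit on the target server unwraps the inner event message."""
    return wrapped["inner"]

mes2 = {"cid": "CID1", "data_id": "DTM1"}            # carries the target-data identifier
mes3 = encapsulate(mes2, target_server="S2")          # sent from server S1 to server S2
mes4 = decapsulate(mes3)                              # recovered on server S2
print(mes4["data_id"])                                # the accelerator reads DTM1 and then
                                                      # fetches the data from S2's memory
```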
  • the message processing method of this application can be applied to scenarios such as edge intelligent computing, high-performance supercomputing centers, self-driving cars, robots, unmanned factories, unmanned mines, etc., requiring both large computing power and high energy efficiency.
  • the message processing method provided by the embodiment of the present application will be further described in combination with edge intelligent computing and high-performance supercomputing as two main scenarios.
  • video call terminals support artificial intelligence calculations such as face recognition and background replacement, which require higher and higher computing power and low power consumption, especially in scenarios such as mobile office and emergency command.
  • a video call terminal is used as a typical scenario of edge intelligent computing.
  • the video call terminal is configured with a data processing system.
  • the following introduces an implementation scheme for dynamically deploying call-related threads based on an event-triggered method to implement data sessions for voice sessions, thereby offloading software computing loads.
  • the event may be a call connection.
  • the voice session of the video call terminal may involve audio collection, transformation such as FFT, audio codec, and data exchange with the call peer through a TCP/IP connection.
• The voice call application program of this application creates three threads on different micro-engines through the highly dynamic operating system, where:
  • the audio collection thread is mainly responsible for collecting audio signals from the microphone through the ADC, collecting audio digital signals according to a fixed time slice, such as 1ms, and packaging them into event messages;
  • the audio processing thread is mainly responsible for converting the audio signal after denoising and other processing into an audio transmission message according to the MP3 or H264 encoding format;
  • the TCP/IP thread is mainly responsible for establishing and maintaining the IP session connection with the call peer, and the voice session will have an independent port number.
  • the resource configuration information of the data processing software package is loaded and registered through the highly dynamic operating system.
  • a voice call application program is installed on the video call terminal.
  • the resource configuration information includes but not limited to some or all of the following items: accelerator type, number of accelerators, number of micro-engines, operation configuration information, flow sequence information, and trigger events.
  • the flow sequence information represents the order in which each computing resource corresponding to the processing request of the application program responds to the processing request.
  • the operation configuration information and the flow sequence information may be obtained through the data session information set by the application layer.
• The accelerator types in the resource configuration information of the voice call application program may be: signal processing accelerator, audio processing accelerator and session connection accelerator; the accelerator numbers corresponding to the three accelerator types may be "1, 1, 1", and the number of micro-engines may be "3".
  • the accelerator number of the signal processing accelerator is "1", indicating that the high dynamic operating system will configure one signal processing accelerator for the voice call application according to the accelerator number "1" of the signal processing accelerator.
  • the configured signal processing accelerators, audio processing accelerators, and session connection accelerators are respectively: signal processing accelerator A, audio processing accelerator B, and session connection accelerator C.
  • the triggering event of the voice call application program may be a call connection.
  • the call connection is a pre-set event for initiating a session processing request after the data processing system loads the data processing software package of the voice call application.
• When the call connection occurs, a session processing request Chat01 is sent and the voice call application starts.
  • the following is a detailed introduction to the configuration process of computing resources by the high dynamic operating system when the voice call application is started:
• Step K1: in response to the instruction to start the application program, the highly dynamic operating system determines the computing resources used by the application program according to the resource configuration information of the application program, and, in response to the session processing request, determines the tasks corresponding to the processing request: an audio collection task, an audio processing task and a session connection task.
  • computing resources include microengine XPU 3, signal processing accelerator A, microengine XPU 0, audio processing accelerator B, microengine XPU 2, session connection accelerator C; signal processing accelerator A corresponds to event queue EQ1; audio Processing accelerator B corresponds to event queue EQ2; session connection accelerator C corresponds to event queue EQ4; microengine XPU 3 corresponds to event queue EQ0; microengine XPU 0 corresponds to event queue EQ3; microengine XPU 2 corresponds to event queue EQ5.
• The tasks include at least a first task, a second task and a third task; for example, the first task is the audio collection task, the second task is the audio processing task, and the third task is the session connection task.
  • the resource configuration information includes the number of engines, the accelerator type and the number of accelerators; when the application starts, in response to the start of the application, the resource configuration information of the application is obtained, and the application is selected according to the resource configuration information and the load of the candidate engine. engine, and select the accelerator used by the application program according to the resource configuration information, and the selected accelerator includes the first accelerator and the second accelerator.
  • the accelerator type in the resource configuration information of the voice call application may include "signal processing accelerator", and the accelerator number corresponding to the "signal processing accelerator” type accelerator is "3".
• When the highly dynamic operating system determines, according to the resource configuration information, the computing resources used by the voice call application, the accelerator pool corresponding to the accelerator type "signal processing accelerator" can be determined according to the accelerator type "signal processing accelerator", and three accelerators are selected from that accelerator pool according to the accelerator number "3".
• The three selected accelerators can respectively be signal processing accelerator A, audio processing accelerator B and session connection accelerator C. Similarly to the process of determining the accelerators, assuming that the number of micro-engines included in the resource configuration information is "3", the highly dynamic operating system selects three micro-engines according to the micro-engine number "3" and the load of the candidate engines, obtaining, for example, micro-engine XPU 3, micro-engine XPU 0 and micro-engine XPU 2.
  • selecting the engine used by the application program may be to select a specified number of micro-engines from the candidate engines according to the order of load rate from low to high; in other embodiments, it may also be based on load requirements from A specified number of micro-engines meeting the load requirements are selected from the candidate engines, where the load requirements can be obtained from resource configuration information.
  • Step K2 after generating the audio collection task corresponding to the processing request, the audio processing task and the session connection task in response to the session processing request Chat01 of the application program, an audio collection thread for processing the audio collection task is created on the XPU 3, Create an audio processing thread on XPU 0 for processing audio processing tasks, create a TCP/IP thread on XPU 2 for processing session connection tasks, and determine the computing resources corresponding to audio collection tasks, audio processing tasks, and session connection tasks .
  • the computing resources corresponding to the audio collection task include XPU 3 and signal processing accelerator A
  • the computing resources corresponding to the audio processing task include XPU 0 and audio processing accelerator B
• The computing resources corresponding to the session connection task include XPU 2 and session connection accelerator C, as shown in FIG. 14.
• The process of configuring computing resources by the highly dynamic operating system may be to allocate computing resources to the multiple tasks including the first task and the second task in response to the received processing request and then create the thread corresponding to each task; it is also possible to first create the thread corresponding to each task and then determine the computing resources corresponding to the multiple tasks including the first task and the second task.
  • Step K3 assigning a context ID for indicating the context according to the resource configuration information.
  • the context includes operation configuration information corresponding to the application program.
  • the resource configuration information includes operation configuration information for computing resources; the computing resources include microengines and accelerators; when the application starts, a context identifier is allocated according to the resource configuration information.
  • the context identifier is used to indicate the operation configuration information corresponding to the same processing request of the application program.
  • the context ID is included in all event messages corresponding to the same processing request of the application.
  • the operation configuration information may be a data session set by the user through the application layer
  • the context identifier used to indicate the context of the voice call application may be based on the data session set by the user through the application layer, such as "Create Session(CID2, in: ADC, via: FFT,..., out: Framer, bit width, number of sampling points, period, data sub-block time slice, double floating-point precision,...)" to get CID2.
  • the context identifier is also used to indicate the flow sequence information corresponding to the application; the computing resource used by the application sends the event message to the next station according to the flow sequence information.
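• The application-layer data-session call quoted above (Create Session(CID2, in: ADC, via: FFT, ..., out: Framer, ...)) could be mimicked in Python as follows; the function signature and parameter names are assumptions for illustration only.

```python
def create_session(cid: str, chain: list, **params) -> dict:
    """Illustrative stand-in for the application-layer data-session call."""
    return {"cid": cid, "chain": chain, "params": params}

session = create_session(
    "CID2",
    chain=["in: ADC", "via: FFT", "out: Framer"],
    bit_width=16,
    sampling_points=1024,
    period_ms=1,
    precision="double",
)
print(session["cid"], "->", " -> ".join(session["chain"]))
```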
• Step K4: establish the first route Line1 between XPU 3 and signal processing accelerator A, the second route Line2 between XPU 0 and audio processing accelerator B, the third route Line3 between signal processing accelerator A and XPU 0, the fourth route Line4 between audio processing accelerator B and XPU 2, and the fifth route Line5 between XPU 2 and session connection accelerator C.
• The establishment of the first route Line1 between XPU 3 and signal processing accelerator A may be to set the first routing information Line1_LM1 corresponding to the audio collection thread, where the first routing information Line1_LM1 includes the first target event queue identifier Line1_TQM1, and the first target event queue identifier Line1_TQM1 is the event queue EQ1 shown in FIG. 14. The event message Mes.1 includes the first routing information Line1_LM1, that is, a communication link is established between the audio collection thread and the event queue EQ1; the event queue EQ1 corresponds to the signal processing accelerator A, and the communication link established between the audio collection thread and the event queue EQ1 is the first route Line1.
• Establishing the second route Line2 between XPU 0 and audio processing accelerator B can be setting the second routing information Line2_LM2 corresponding to the audio processing thread, where the second routing information Line2_LM2 includes the second target event queue identifier Line2_TQM2, the second target event queue identifier Line2_TQM2 is the event queue EQ2, and the event message Mes.3 includes the second routing information Line2_LM2.
  • The establishment process of Line3 to Line5 is similar to the establishment process of Line1 and Line2, and will not be repeated here.
  • the event message also includes routing domain information.
  • The first routing information Line1_LM1 may further include a first target routing domain, which is used to indicate the first target server; the first target server may be a server different from the source server where XPU 3 in Figure 14 is located.
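  • The following C sketch illustrates one possible representation of the per-hop routing information established in step K4, including the optional target routing domain used when the next station sits on a different server; the structure layout and the queue numbers used for illustration are assumptions.
      #include <stdint.h>

      /* One hop of a context's routing information: the event queue identifier of
       * the next station and, optionally, the routing domain of a target server
       * different from the source server. */
      typedef struct {
          uint16_t target_eq_id;    /* e.g. EQ1 for Line1, EQ2 for Line2           */
          uint16_t target_domain;   /* 0 = local server, otherwise a remote domain */
      } route_hop_t;

      /* Routes of the voice call context (only EQ1, EQ2, EQ3 and EQ4 are named in
       * this embodiment; the numeric identifiers below are placeholders). */
      static const route_hop_t line1 = { 1 /* EQ1 */, 0 };  /* XPU 3 -> accelerator A */
      static const route_hop_t line2 = { 2 /* EQ2 */, 0 };  /* XPU 0 -> accelerator B */
      static const route_hop_t line3 = { 3 /* EQ3 */, 0 };  /* accelerator A -> XPU 0 */
      static const route_hop_t line5 = { 4 /* EQ4 */, 0 };  /* XPU 2 -> accelerator C */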
  • the data processing system can run normally.
  • the following describes an example of data processing after the voice call application is started.
  • After the voice call application program is started, the following data processing process is performed when the audio data corresponding to the user's call connection is received:
  • Step L1: in response to receiving the data request Data-1' of the audio collection task, the audio collection thread used to process the audio collection task sends the event message Mes.1_1 generated according to the data request Data-1' to the event queue EQ1 corresponding to the audio collection task according to the context.
  • Referring to Figure 15, in response to the event queue EQ1 receiving the event message Mes.1_1, the signal processing accelerator A corresponding to the audio collection task processes Mes.1_1, generates an event message Mes.2_1 according to the processing result, and, according to the context, sends the event message Mes.2_1 to the audio processing thread for processing the audio processing task.
  • The context identifier CID2 is used to indicate the corresponding context of the application, and the context includes the routing information for passing event messages in turn between microengine XPU 3, signal processing accelerator A, microengine XPU 0, audio processing accelerator B, microengine XPU 2, and session connection accelerator C.
  • The routing information included in the context can also be referred to as the flow sequence information corresponding to the application; each event message, such as event message Mes.1_1, event message Mes.2_1 and event message Mes.3_1, contains the context identifier CID2.
  • The audio collection thread obtains the first flow information for the audio collection thread from the flow sequence information corresponding to the application program according to the context identifier CID2 included in the event message Mes.1_1, and, according to the first flow information for the audio collection thread, sends the event message Mes.1_1 generated according to the data request Data-1' to the event queue EQ1 corresponding to the audio collection task.
  • the flow information may be an identifier of the event queue.
  • the first flow information for the audio collection thread may be the identifier of the event queue EQ1;
  • the second flow information for the signal processing accelerator A may be the identifier of the event queue EQ3 corresponding to the audio processing thread.
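  • A minimal sketch of this send path is given below, assuming a simplified event message header and two placeholder functions for the flow lookup and the queue operation; none of these names are defined by this embodiment.
      #include <stdint.h>

      /* Simplified event message header: the target event queue identifier is
       * filled in from the flow information of the context (field names assumed). */
      typedef struct {
          uint16_t target_eq;    /* e.g. EQ1 for the audio collection thread's hop */
          uint16_t context_id;   /* e.g. CID2                                      */
          uint32_t length;       /* total length of the event message              */
          uint8_t  payload[256];
      } event_msg_t;

      extern uint16_t flow_next_eq(uint16_t cid);                  /* assumed lookup */
      extern int      eq_push(uint16_t eq, const event_msg_t *m);  /* assumed queue  */

      /* Look up the flow information for this station by CID, stamp the target
       * queue into the message and enqueue it. */
      static int send_by_context(event_msg_t *m)
      {
          m->target_eq = flow_next_eq(m->context_id);
          return eq_push(m->target_eq, m);
      }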
  • A possible implementation manner is that the signal processing accelerator A processes the first event message in the event queue EQ1; specifically, the signal processing accelerator A acquires the corresponding first operation configuration information for the signal processing accelerator A, and processes the first event message according to the first operation configuration information for the signal processing accelerator A.
  • the context includes operation configuration information for computing resources; the computing resources include microengines and accelerators; when the application starts, the context and the context identifier are allocated according to the operation configuration information.
  • the context ID is used to indicate the context corresponding to the same processing request of the application.
  • the context identification is included in the first event message and the second event message.
  • The first operation configuration information for the signal processing accelerator A specifies performing transformations such as FFT on the received event messages of this context ID.
  • The signal processing accelerator A obtains the corresponding first operation configuration information for the signal processing accelerator A, such as "perform FFT and other transformations", according to the context identifier CID2 included in the first event message, and performs FFT and other transformations on the first event message.
  • When the event queue of the signal processing accelerator A receives an event message, an asynchronous handshake signal can be used to trigger the signal processing accelerator A to respond to the event message in real time and find the corresponding operation configuration information according to CID2.
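  • The event-triggered processing described above can be pictured with the following C sketch; the message layout, the configuration lookup and the FFT kernel are assumed placeholders, not interfaces defined by this embodiment.
      #include <stdint.h>
      #include <string.h>

      typedef struct { uint16_t context_id; uint32_t n; float samples[64]; } evt_t;

      extern const char *op_config_for(uint16_t cid);           /* e.g. returns "FFT" */
      extern void        fft_transform(float *x, uint32_t n);   /* assumed kernel     */
      extern void        forward_by_context(uint16_t cid, evt_t *m);

      /* Called when the accelerator's event queue signals, via the asynchronous
       * handshake, that a complete event message has arrived. */
      static void on_event(evt_t *m)
      {
          const char *op = op_config_for(m->context_id);   /* CID2 -> "FFT"         */
          if (op != NULL && strcmp(op, "FFT") == 0)
              fft_transform(m->samples, m->n);
          forward_by_context(m->context_id, m);            /* next station per flow */
      }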
  • Step L2: the audio processing thread generates an event message Mes.3_1 based on the event message Mes.2_1 and sends the event message Mes.3_1 to the event queue EQ2 corresponding to the audio processing task according to the context; in response to the event queue EQ2 receiving the event message Mes.3_1, the audio processing accelerator B processes the event message Mes.3_1, generates an event message Mes.5_1 according to the processing result, and sends the event message Mes.5_1 to the TCP/IP thread for processing the session connection task according to the context.
  • The process in which the audio processing thread sends the event message Mes.3_1 to the event queue EQ2 corresponding to the audio processing task according to the context, and the process in which the audio processing accelerator B sends the event message Mes.5_1 to the TCP/IP thread for processing the session connection task according to the context,
  • are similar to the process in which the audio collection thread sends the event message Mes.1_1 to the event queue EQ1 corresponding to the audio collection task according to the context, and will not be repeated here.
  • The second operation configuration information for the audio processing accelerator B may specify performing transformations such as FFT on the received event messages of this context ID.
  • the process of processing the event message Mes.3_1 by the audio processing accelerator B is similar to the process of processing the first event message in the event queue EQ1 by the aforementioned signal processing accelerator A, and details will not be repeated here.
  • Step L3: the TCP/IP thread generates an event message Mes.6_1 based on the event message Mes.5_1 and sends the event message Mes.6_1 to the event queue EQ4 corresponding to the session connection task according to the context; in response to the event queue EQ4 receiving the event message Mes.6_1, the session connection accelerator C corresponding to the session connection task processes the event message Mes.6_1.
  • The session connection accelerator C can also send the processing result data to the corresponding next station according to the context; for example, it may generate a new event message, assuming the new event message is the event message Mes.7_1, and send the event message Mes.7_1 according to the context to the following nodes, such as a network card, the application/CPU, or other threads or accelerators.
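  • How the next station named by the context might be dispatched to is sketched below; the enumeration of station kinds and the helper functions are assumptions used only to illustrate that the next node can be a network card, the application/CPU, or another thread or accelerator.
      #include <stdint.h>

      typedef enum { NEXT_NIC, NEXT_CPU, NEXT_EQ } next_kind_t;
      typedef struct { next_kind_t kind; uint16_t eq_id; } next_station_t;

      extern next_station_t next_station_of(uint16_t cid);          /* from the context */
      extern void send_to_nic(const void *m, uint32_t len);
      extern void notify_cpu(const void *m, uint32_t len);
      extern void enqueue(uint16_t eq, const void *m, uint32_t len);

      /* Forward the processing result (e.g. event message Mes.7_1) to whatever
       * next station the context names. */
      static void forward_result(uint16_t cid, const void *msg, uint32_t len)
      {
          next_station_t n = next_station_of(cid);
          if (n.kind == NEXT_NIC)       send_to_nic(msg, len);
          else if (n.kind == NEXT_CPU)  notify_cpu(msg, len);
          else                          enqueue(n.eq_id, msg, len);
      }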
  • the release event of the voice call application in this embodiment may be "call rejection".
  • The voice call application program releases the audio collection thread running on XPU 3 in response to the "call rejection" release event that occurs. After releasing the audio collection thread running on XPU 3, if there are no more running threads on XPU 3, XPU 3 is further shut down to achieve near-zero standby power consumption.
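  • A minimal sketch of this release path follows; the function names are assumptions, and the logic simply mirrors the rule stated above: release the thread, and power off the micro-engine if it no longer runs any thread.
      /* Hypothetical platform hooks. */
      extern void     xpu_release_thread(int xpu_id, int thread_id);
      extern unsigned xpu_thread_count(int xpu_id);
      extern void     xpu_power_off(int xpu_id);

      /* Handle a release event such as "call rejection" for the audio collection
       * thread running on XPU 3. */
      static void on_release_event(int xpu_id, int thread_id)
      {
          xpu_release_thread(xpu_id, thread_id);
          if (xpu_thread_count(xpu_id) == 0)
              xpu_power_off(xpu_id);          /* near-zero standby power consumption */
      }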
  • The above-mentioned embodiment adopts the high dynamic computing mode and does not need a high-frequency CPU or a PCI-E bus, so the system manufacturing cost can be greatly reduced; it also provides longer battery life; and resources such as micro-engines and accelerators remain unchanged once allocated, which can ensure a deterministic business experience.
  • New data-driven computing technologies such as machine learning will be widely adopted by high-performance supercomputing centers for fields such as weather forecasting, oil exploration, and pharmaceuticals, which exposes a key problem: massive data sharing.
  • Thousands or even tens of thousands of servers need to share static data and dynamic data.
  • The requirement for cross-server transmission delay is becoming increasingly stringent, and the delay is expected to be below the microsecond level.
  • This embodiment describes a technical solution for large-scale parallel computing that uses high dynamic computing to realize massive data sharing, focusing on the implementation mechanism of data sharing; the other mechanisms, including the data mechanisms, can completely reuse the implementation of edge intelligent computing.
  • High dynamic computing adopts the semantic-driven data sharing method.
  • Massive shared data is structured and loaded into memory by defining the data semantic context through the application layer; computing tasks are then deployed to servers closer to the data, and the corresponding routing is adjusted by defining the computing semantic context through the application layer,
  • which optimizes the network transmission delay, reduces the data transmission delay, improves the performance of parallel computing and reduces power consumption.
  • the semantic mapping mechanism between the application layer and the hardware layer is shown in Figure 16.
  • The application layer defines the hierarchical semantics of multi-scale data through the administrative area, as shown in Figure 16, from the root layer downwards; it then specifies the event queue ID of the corresponding storage server and assigns the corresponding object ID, grid ID, and so on.
  • A storage message request for data access is sent via the event queue ID to the corresponding server; the shared memory accelerator of that server then parses the storage message, finds the corresponding page-table data by ID, packages it into a corresponding event message, and sends it back to the server requesting the data.
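  • The request/reply handling of the shared memory (semantic memory) accelerator might look like the C sketch below; the message layouts, the page-table lookup and the send routine are assumptions used to illustrate the mechanism, not interfaces of this embodiment.
      #include <stdint.h>
      #include <string.h>

      typedef struct { uint16_t src_eq; uint16_t src_domain; uint64_t object_id; } mem_req_t;
      typedef struct { uint16_t dst_eq; uint16_t dst_domain; uint32_t len; uint8_t data[1024]; } mem_rep_t;

      extern const void *semantic_lookup(uint64_t object_id, uint32_t *len);  /* page-table mapping */
      extern void        send_event(uint16_t domain, uint16_t eq, const mem_rep_t *rep);

      /* Parse a storage message, resolve the object ID through the local mapping
       * and send the data back to the server that requested it. */
      static void on_storage_request(const mem_req_t *req)
      {
          static mem_rep_t rep;                 /* static: keep the reply off the stack */
          uint32_t len = 0;
          const void *p = semantic_lookup(req->object_id, &len);
          rep.dst_eq = req->src_eq;
          rep.dst_domain = req->src_domain;
          rep.len = (p != NULL && len <= sizeof(rep.data)) ? len : 0;
          if (rep.len > 0)
              memcpy(rep.data, p, rep.len);
          send_event(rep.dst_domain, rep.dst_eq, &rep);
      }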
  • this solution uses a network card or smart network card to connect to the data center network.
  • the network card is connected to a micro-engine, and an accelerator for semantically driven memory is added.
  • The micro-engine deploys the Ethernet processing protocol to identify event messages destined for accelerators, such as event messages of the semantic memory accelerator; once identified, the event messages, such as request messages, are forwarded to the semantic memory accelerator through the routing network according to the local data context, the corresponding data is found according to the semantics defined above, and the data message is then sent back to the source server as a corresponding event queue message.
  • Each server corresponds to a routing domain, and the semantics created by the application layer are assigned to the event queue ID of a specific semantic accelerator.
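  • One possible shape of the registry created when the application-layer semantics are bound to a specific semantic accelerator is sketched below; the identifiers and table contents are illustrative only.
      #include <stdint.h>

      /* Each semantic identifier maps to the routing domain of the server holding
       * the data and to the event queue of that server's semantic memory accelerator. */
      typedef struct { uint32_t semantic_id; uint16_t routing_domain; uint16_t eq_id; } semantic_entry_t;

      static const semantic_entry_t g_semantic_dir[] = {
          { 0x1001u, 2, 7 },   /* placeholder: object family stored on server 2 */
          { 0x1002u, 3, 7 },   /* placeholder: object family stored on server 3 */
      };

      static const semantic_entry_t *semantic_find(uint32_t id)
      {
          for (unsigned i = 0; i < sizeof(g_semantic_dir) / sizeof(g_semantic_dir[0]); ++i)
              if (g_semantic_dir[i].semantic_id == id)
                  return &g_semantic_dir[i];
          return 0;
      }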
  • The parallel computing thread of server 1 finds the corresponding semantic ID according to the object required for the calculation, constructs an event queue message according to the event queue ID at the opposite end of the semantic ID and the routing domain of the server to which it belongs, and forwards the message to the Ethernet protocol processing thread according to the remote data session context;
  • The Ethernet protocol processing thread of server 1 receives the event queue message, finds the MAC address of the other party and the dedicated VLAN ID (Virtual Local Area Network identifier) for data sharing according to the routing domain in the routing range field, and constructs an Ethernet protocol frame header;
  • the event message is carried after the frame header, forwarded to the network card, forwarded by the data center switch through the network card, and finally delivered to server 2;
  • The Ethernet protocol processing thread of server 2 parses the Ethernet protocol frame received by the network card of server 2 to take out the event message, and forwards it through the internal routing network to the semantic memory accelerator according to the event queue ID;
  • the Semantic Memory Accelerator of Server 2 parses the event message, extracts the object ID and maps it to the local memory, obtains the corresponding data, and then forwards it to the server requesting the data according to the source routing information of the event queue.
  • The subsequent process is consistent with the above and will not be repeated here.
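  • The Ethernet encapsulation step performed by the protocol processing thread of server 1 could be sketched as follows; the VLAN handling reflects the description above, while the EtherType value, the MAC lookup and the NIC interface are assumptions.
      #include <stdint.h>
      #include <string.h>

      typedef struct {
          uint8_t  dst_mac[6];
          uint8_t  src_mac[6];
          uint16_t tpid;        /* 0x8100: IEEE 802.1Q VLAN tag               */
          uint16_t tci;         /* carries the dedicated data-sharing VLAN ID */
          uint16_t ethertype;   /* assumed type reserved for event messages   */
      } __attribute__((packed)) eth_hdr_t;

      extern int lookup_peer(uint16_t routing_domain, uint8_t mac[6], uint16_t *vlan);
      extern int nic_send(const void *frame, uint32_t len);

      /* Carry the event message after the Ethernet frame header and hand it to
       * the network card for forwarding by the data center switch. */
      static int send_cross_server(uint16_t domain, const void *evt, uint32_t evt_len)
      {
          uint8_t frame[1514];
          eth_hdr_t *h = (eth_hdr_t *)frame;
          uint16_t vlan = 0;

          if (evt_len > sizeof(frame) - sizeof(*h) || !lookup_peer(domain, h->dst_mac, &vlan))
              return -1;
          memset(h->src_mac, 0, sizeof(h->src_mac));   /* filled in by the local NIC */
          h->tpid = 0x8100;
          h->tci = vlan;
          h->ethertype = 0x88B5;                       /* assumed experimental type  */
          memcpy(frame + sizeof(*h), evt, evt_len);
          return nic_send(frame, (uint32_t)(sizeof(*h) + evt_len));
      }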
  • The semantic data sharing mechanism of the highly dynamic computing mode is used to reduce software processing overhead, shorten the transmission delay of cross-server data sharing, and increase the parallelism of multiple computing tasks inside a server, thereby improving the performance of the entire supercomputing center and reducing power consumption.
  • The message processing apparatus 1900 includes a first running module 1901, and the apparatus 1900 can be used to implement the method described in the above message processing method embodiments.
  • The first running module 1901 is configured to process the first event message through the first processing unit to obtain the second event message, where the first event message is received by the first processing unit, or the first event message is generated by the first processing unit based on a processing request of the application program;
  • the first processing unit sends the second event message to the second processing unit according to the context information, the context information includes routing information from the first processing unit to the second processing unit, and the context information is generated based on the processing request of the application program;
  • The first processing unit is a first engine and the second processing unit is a second accelerator; or the first processing unit is a first accelerator and the second processing unit is a second engine; or the first processing unit is a first engine and the second processing unit is a second engine; or the first processing unit is a first accelerator and the second processing unit is a second accelerator.
  • The message processing apparatus 1900 further includes a resource configuration module 1902, and the resource configuration module 1902 is configured to:
  • generate the context information based on the processing request of the application program.
  • The first processing unit or the second processing unit is selected from multiple processing units by the resource configuration module 1902 based on the status information of the multiple processing units when the processing request of the application program is received, and the status information of a processing unit includes network topology properties.
  • The resource configuration module 1902 is also configured to: load at least two threads to run on at least two engines, wherein different threads run on different engines.
  • The resource configuration module 1902 is specifically configured to: obtain the semantics of the processing request, where the semantics of the processing request includes at least two task semantics, and determine a corresponding task according to each of the at least two task semantics.
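  • A minimal sketch of how the resource configuration module could split the semantics of a processing request into tasks and place each task's thread on a distinct engine is given below; the parsing, engine selection and thread creation functions are placeholders assumed for illustration.
      #define MAX_TASKS 8

      extern unsigned parse_task_semantics(const char *request, const char *tasks[], unsigned max);
      extern int      pick_engine(unsigned index);                    /* distinct engine per index */
      extern int      create_thread_on_engine(int engine, const char *task_semantic);

      /* Determine the tasks contained in the processing request and create one
       * thread per task, each on a different engine. */
      static int configure_resources(const char *request)
      {
          const char *tasks[MAX_TASKS];
          unsigned n = parse_task_semantics(request, tasks, MAX_TASKS);   /* at least two */
          for (unsigned i = 0; i < n; ++i) {
              if (create_thread_on_engine(pick_engine(i), tasks[i]) != 0)
                  return -1;
          }
          return (n >= 2) ? 0 : -1;
      }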
  • Each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may physically exist separately, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • If the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • The technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor execute all or part of the steps of the methods in various embodiments of the present application.
  • The aforementioned storage medium includes media that can store program code, such as a USB flash drive, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
  • the embodiment of the present application also provides a schematic structural diagram of a message processing device 2000 .
  • the device 2000 may be used to implement the method described in the above embodiment of the message processing method applied to the data processing system, and reference may be made to the description in the above method embodiment.
  • Device 2000 may be in, or be, a data processing system.
  • the Device 2000 includes one or more processors 2001 .
  • the processor 2001 may be a general-purpose processor or a special-purpose processor. For example, it could be a central processing unit.
  • the central processing unit may be used to control a message processing device (such as a terminal, or a chip, etc.), execute a software program, and process data of the software program.
  • the message processing device may include a transceiver unit to implement input (reception) and output (transmission) of signals.
  • the transceiver unit may be a transceiver, a radio frequency chip, and the like.
  • the device 2000 includes one or more processors 2001, and the one or more processors 2001 can implement the methods of the data processing system in the above-mentioned embodiments.
  • the processor 2001 may also implement other functions in addition to implementing the methods in the above-mentioned embodiments.
  • the processor 2001 may execute instructions, so that the device 2000 executes the methods described in the foregoing method embodiments.
  • The instructions may be stored in whole or in part in the processor, such as instruction 2003, or may be stored in whole or in part in the memory 2002 coupled to the processor, such as instruction 2004; the instructions 2003 and 2004 may jointly cause the device 2000 to execute the methods described in the above method embodiments.
  • the message processing device 2000 may also include a circuit, and the circuit may implement the functions of the data processing system in the foregoing method embodiments.
  • the device 2000 may include one or more memories 2002 on which instructions 2004 are stored, and the instructions may be executed on a processor, so that the device 2000 executes the methods described in the above method embodiments.
  • data may also be stored in the memory.
  • Instructions and/or data may also be stored in the optional processor.
  • One or more memories 2002 may store the correspondences described in the foregoing embodiments, or the relevant parameters or tables involved in the foregoing embodiments. The processor and the memory may be provided separately or integrated together.
  • the device 2000 may further include a transceiver 2005 and an antenna 2006 .
  • the processor 2001 may be called a processing unit, and controls the device.
  • the transceiver 2005 may be called a transceiver, a transceiver circuit, or a transceiver unit, etc., and is used to realize the transceiver function of the device through the antenna 2006 .
  • the processor in the embodiment of the present application may be an integrated circuit chip, which has a signal processing capability.
  • each step of the above-mentioned method embodiments may be completed by an integrated logic circuit of hardware in a processor or instructions in the form of software.
  • The above-mentioned processor can be a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, register.
  • the storage medium is located in the memory, and the processor reads the information in the memory, and completes the steps of the above method in combination with its hardware.
  • the memory in the embodiments of the present application may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memories.
  • The non-volatile memory can be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM) or a flash memory.
  • Volatile memory can be random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM) and direct rambus random access memory (DR RAM).
  • the embodiment of the present application also provides a computer-readable medium, on which a computer program is stored, and when the computer program is executed by a computer, the message processing method of any one of the above method embodiments applied to a data processing system is implemented.
  • An embodiment of the present application further provides a computer program product, which implements the message processing method in any of the above method embodiments applied to a data processing system when the computer program product is executed by a computer.
  • all or part of them may be implemented by software, hardware, firmware or any combination thereof.
  • When implemented using software, it may be implemented in whole or in part in the form of a computer program product.
  • a computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on the computer, the processes or functions according to the embodiments of the present application are generated in whole or in part.
  • a computer can be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • Computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrating one or more available media. The available media can be magnetic media (e.g., a floppy disk, a hard disk, a magnetic tape), optical media (e.g., a high-density digital video disc (DVD)), or semiconductor media (e.g., a solid state disk (SSD)), and so on.
  • the embodiment of the present application also provides a processing device, including a processor and an interface; the processor is configured to execute the message processing method in any one of the above method embodiments applied to a data processing system.
  • the above-mentioned processing device may be a chip, and the processor may be implemented by hardware or by software.
  • When implemented by hardware, the processor may be a logic circuit, an integrated circuit, etc.; when implemented by software, the processor may be a general-purpose processor that is implemented by reading software codes stored in a memory.
  • the memory may be integrated in the processor, or may be located outside the processor and exist independently.
  • The embodiment of the present application also provides a chip 2100, including an input/output interface 2101 and a logic circuit 2102; the input/output interface 2101 is used to receive/output code instructions or information, and the logic circuit 2102 is used to execute the code instructions or, according to the information, to execute the message processing method in any of the above method embodiments applied to a data processing system.
  • the chip 2100 may implement the functions shown by the processing unit and/or the transceiver unit in the foregoing embodiments.
  • the input and output interface 2101 is used to input resource configuration information of the data processing system, and the input and output interface 2101 is also used to output request information for acquiring target data stored in the shared memory.
  • the input and output interface 2101 may also be used to receive a code instruction, where the code instruction is used to instruct to obtain a data request from an application program.
  • An embodiment of the present application further provides a data processing system, including the message processing device in the foregoing embodiments, and the message processing device is configured to execute the message processing method in any one of the foregoing method embodiments.
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • The division of units is only a logical function division, and there may be other division methods in actual implementation.
  • Multiple units or components can be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices or units, and may also be electrical, mechanical or other forms of connection.
  • a unit described as a separate component may or may not be physically separated, and a component displayed as a unit may or may not be a physical unit, that is, it may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present application.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another.
  • a storage media may be any available media that can be accessed by a computer.
  • Computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • Any connection can suitably be a computer-readable medium.
  • For example, if the software is transmitted from a website, server, or other remote source using coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium.
  • Disk and disc include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

本申请公开了一种消息处理方法及装置,用以提升数据处理系统的资源利用率。一种实施例中,第一处理单元对第一事件消息进行处理,得到第二事件消息;第一事件消息是第一处理单元接收到的,或者第一事件消息是第一处理单元基于应用程序的处理请求生成的;第一处理单元根据上下文信息,将第二事件消息发送给第二处理单元,上下文信息包括第一处理单元到第二处理单元的路由信息,上下文信息是基于应用程序的处理请求生成的;其中,第一处理单元可以为引擎或加速器;第二处理单元也可以为引擎或加速器;第一处理单元与第二处理单元不同。在该方法中,由于事件消息在不同处理单元之间的传输,是基于上下文信息实现的,可以提高系统处理性能。

Description

一种消息处理方法及装置 技术领域
本申请实施例涉及计算机技术领域,尤其涉及一种消息处理方法及装置。
背景技术
高性能处理器(Central Processing Unit,CPU)的时钟频率一直没有太大的变化,性能提升缓慢。在功耗方面,每平方厘米的功耗从十几毫瓦变到一瓦左右,也到达了极限,限制了性能的提高。
为了提升CPU的性能,业界希望将CPU的通用算力和专业的计算芯片的加速算力融合进行异构计算。通常地,异构计算的任务依赖CPU来调度,异构计算资源需要等待CPU把数据搬移上,数据处理系统对异构资源的调度利用存在性能瓶颈。
因此,提供一种消息处理方法,以解决数据处理系统在调度异构资源时存在的资源利用率低的问题,具有现实意义。
发明内容
本申请实施例提供一种消息处理方法及装置,用以提升数据处理系统的资源利用率。
第一方面,提供一种消息处理方法,包括:
第一处理单元对第一事件消息进行处理,得到第二事件消息,所述第一事件消息是所述第一处理单元接收到的,或者所述第一事件消息是所述第一处理单元基于应用程序的处理请求生成的;
所述第一处理单元根据上下文信息,将所述第二事件消息发送给第二处理单元,所述上下文信息包括所述第一处理单元到所述第二处理单元的路由信息,所述上下文信息是基于所述应用程序的处理请求生成的;
其中,所述第一处理单元为第一引擎、所述第二处理单元为第二加速器,或者,所述第一处理单元为第一加速器、所述第二处理单元为第二引擎,或者,所述第一处理单元为第一引擎、所述第二处理单元为第二引擎,或者所述第一处理单元为第一加速器,所述第二处理单元为第二加速器。
本申请提供了一种方法,包括:第一处理单元对第一事件消息进行处理,得到第二事件消息;第一事件消息是第一处理单元接收到的,或者第一事件消息是第一处理单元基于应用程序的处理请求生成的;第一处理单元根据上下文信息,将第二事件消息发送给第二处理单元,上下文信息包括第一处理单元到第二处理单元的路由信息,上下文信息是基于应用程序的处理请求生成的;其中,第一处理单元可以为引擎或加速器;第二处理单元也可以为引擎或加速器;第一处理单元与第二处理单元不同。在该方法中,由于事件消息在不同处理单元之间的传输,是基于上下文信息实现的,相比于采用调度的方式(比如使用调度器等进行消息调度)进行事件消息的传输调度,上述实现方式可以避免传输调度所导致的性能瓶颈,进而可以提高系统处理性能。
在一种可能的设计中,所述第一处理单元根据上下文信息,将所述第二事件消息发送 给第二处理单元,包括:
所述第一处理单元根据所述路由信息,将所述第二事件消息发送给所述第二处理单元对应的事件队列;
所述第二处理单元从所述事件队列获取所述第二事件消息。
通过上述设计,基于事件队列在不同处理单元之间传递消息,比如,线程可将需要加速器处理的数据通过事件消息发送到加速器对应的事件队列,从而由对应的加速器对该事件消息进行处理,降低了线程与加速器之间的耦合度,进而可以提高资源分配的灵活性,提升数据处理过程的资源利用率。
在一种可能的设计中,所述第二事件消息包括目标事件队列标识,所述目标事件队列标识为所述第二处理单元对应的事件队列的队列标识。
通过上述设计,可以根据上下文信息,在消息中添加“目标消息队列标识”,从而基于事件队列实现消息的路由传输,相比传统总线,可实现在动态调度的计算资源之间的数据通讯,转发效率更高,进一步提升数据处理过程的资源利用率。
在一种可能的设计中,所述路由信息还包括目标路由域,所述目标路由域用于指示目标服务器,所述目标服务器与源服务器不同,所述源服务器是所述第一处理单元所在的服务器。
通过上述设计,路由信息还包括目标路由域,该目标路由域用于指示目标服务器,从而使得目标服务器与源服务器可以不同。该方法能够以跨路由域的方式形成通信链路,可以组建跨路由域的通信链路网络,具有更好的调度弹性和可扩展性。
在一种可能的设计中,所述第二处理单元为第二加速器;所述第一处理单元根据上下文信息,将所述第二事件消息发送给第二处理单元,包括:
所述第一处理单元根据所述路由信息,将所述第二事件消息发送给加速器池对应的事件队列,所述加速器池中包括多个加速器,所述多个加速器的类型相同;根据所述多个加速器的状态,从所述多个加速器中确定所述第二加速器;
将所述第二事件消息发送给所述第二加速器。
通过上述设计,通过加速器池、加速器池的事件分配器、加速器池的事件队列实现加速器发送事件消息,提供一种共享加速器的资源调度机制,可以提高系统处理性能。
在一种可能的设计中,所述第一处理单元接收第一事件消息之前,还包括:
接收来自于应用程序的处理请求;
根据所述应用程序的处理请求,确定计算资源,所述计算资源包括所述第一处理单元和所述第二处理单元;
根据所述应用程序的处理请求,生成所述上下文信息。
通过上述设计,基于事件触发,进行计算资源动态分配,并生成上下文(即创建会话),提供了基于事件触发的实时动态调度资源机制,进而可以实现在动态调度的计算资源之间的数据通讯,资源利用率更高。
在一种可能的设计中,所述第一处理单元或所述第二处理单元是基于接收到所述应用程序的处理请求时多个处理单元的状态信息,从所述多个处理单元中选择的,所述处理单元的状态信息包括网络拓扑性能。
通过上述设计,计算资源分配时,获取硬件(线程、加速器等)的硬件状态信息,依据当前的硬件状态分配最优的硬件,从而使得分配的计算资源更合理,其中,硬件状态信 息包括网络拓扑性能,最优的硬件可能是分配当前性能最优的硬件,也有可能是分配性能最匹配的硬件。该方法可以基于与接收到的处理请求对应的事件,触发实时动态调度资源,避免资源浪费从而可进一步提高系统性能。
在一种可能的设计中,所述接收来自于应用程序的处理请求之后,还包括:
确定所述处理请求包括的至少两个任务;
创建所述至少两个任务对应的至少两个线程;
将所述至少两个线程加载到至少两个引擎上运行,其中,不同的线程运行在不同的引擎上。
通过上述设计,基于事件触发,进行任务划分,将不同任务对应的线程分配到不同的引擎上运行,从而可以提高系统性能,并可以提高计算资源的利用率。
在一种可能的设计中,所述确定所述处理请求包括的至少两个任务,包括:
获取所述处理请求的语义,所述处理请求的语义包括至少两个任务语义;
根据所述至少两个任务语义中的每个任务语义,确定对应的一个任务。
通过上述设计,可以基于处理请求的语义,构建归属于该处理请求的多个任务,不同的任务具有不同的任务语义,可以根据实时事件动态创建计算任务,并高效地将复杂计算任务拆分为多个任务,简单易实现,减少资源浪费。
在一种可能的设计中,所述方法还包括:
释放第一线程,所述第一线程为所述至少两个线程中的一个;
若释放所述第一线程后,所述第一线程被释放前所在的引擎上已无线程运行,关闭所述第一线程被释放前所在的引擎。
通过上述设计,该方法可以根据需要停止线程或关闭相应硬件,可以实现近零的待机功耗,确保消息处理方法的低功耗。
在一种可能的设计中,所述处理请求用于请求获取目标数据,所述目标数据存储于第二服务器的内存中;用于执行所述处理请求的所述计算资源还包括第三处理单元和第四处理单元;所述至少两个引擎包括所述第一处理单元、所述第二处理单元和所述第三处理单元;所述第四处理单元为加速器;所述第一事件消息和所述第二事件消息中包括所述目标数据的标识,所述第一处理单元和所述第二处理单元位于第一服务器,所述第三处理单元和所述第四处理单元位于所述第二服务器;所述上下文还包括所述第二处理单元到所述第三处理单元、所述第三处理单元到所述第四处理单元的路由信息;
在所述第一处理单元根据上下文,将所述第二事件消息发送给第二处理单元之后,所述方法还包括:
所述第二处理单元基于所述第二事件消息将所述第二事件消息封装,以生成第三事件消息;
所述第二处理单元根据所述上下文,将所述第三事件消息发送给位于所述第二服务器的所述第三处理单元;
所述第三处理单元基于所述第三事件消息对所述第三事件消息解封装,得到第四事件消息,并根据所述上下文,将所述第四事件消息发送给所述第四处理单元;
所述第四处理单元从接收到的所述第四事件消息获取所述目标数据的标识,根据所述目标数据的标识从所述第二服务器的内存中获取所述目标数据,并根据所述目标数据得到所述第五事件消息;所述第五事件消息用于将所述目标数据发送给所述第一服务器。
通过上述设计,提供了获取存储于共享内存中的目标数据的方法,通过目标数据的标识获取对应的内存地址,并根据内存地址从共享内存中获取目标数据,该方法可避免出现采用全局页面共享方式时存在的占用大量内存的问题,进一步提升数据处理过程的资源利用率。
在一种可能的设计中,所述上下文信息还包括操作配置信息;
所述第一处理单元对所述第一事件消息进行处理,得到第二事件消息,包括:
所述第一处理单元根据所述操作配置信息对所述第一事件消息进行处理,得到第二事件消息。
通过上述设计,上下文中还包括操作配置信息(比如位宽,点数等),以使得处理单元可根据该操作配置信息进行处理,可以在收到事件消息后自动触发相应的处理的机制,提升了事件驱动的高能效优势,提升资源利用率。
在一种可能的设计中,所述第一事件消息和所述第二事件消息中包括所述上下文信息的标识,所述上下文信息的标识用于获取所述上下文信息。
通过上述设计,事件消息中包括上下文信息的标识(CID),上下文信息的标识用于指示应用程序的上下文信息,从而使处理单元可以快速高效地获取对应的操作配置信息或路由信息,提高了数据处理过程的资源利用率。
在一种可能的设计中,所述第二事件消息,包括:
消息属性信息域,包括事件消息路由信息,所述事件消息路由信息包括目标事件队列标识,所述目标事件队列标识为所述第二处理单元对应的事件队列的队列标识;
消息长度域,包括所述第二事件消息的总长度信息;
数据域,包括所述第二事件消息的负荷。
在一种可能的设计中,所述数据域中包括第一事件信息域,所述第一事件信息域包括以下至少一项:
路由范围、所述上下文信息的标识、源消息队列标识或者自定义属性,所述路由范围包括至少一个路由域。
在一种可能的设计中,所述数据域中包括第二事件信息域,所述第二事件信息域包括应用层的自定义信息。
通过上述设计,定义了事件消息的帧结构,该帧结构可以从最外层开始可以依次包括:网络层子帧、操作系统层子帧,应用层子帧,事件消息的帧结构支持根据应用场景做动态扩展,不同的场景下封装不同格式的事件消息,进一步使得本申请提供的方案灵活地应用到不同的应用场景中,提升了在数据处理时的适应性,提升数据转发效率。
在一种可能的设计中,所述方法还包括:
获取所述应用程序的资源配置信息,所述资源配置信息包括引擎数量,以及加速器类型或加速器数量中的一种或多种;
根据所述资源配置信息,确定所述应用程序使用的引擎,所述应用程序使用的引擎包括所述第一引擎和/或所述第二引擎;
根据所述资源配置信息,确定所述应用程序使用的加速器,所述应用程序使用的加速器中包括所述第一加速器和/或所述第二加速器。
通过上述设计,可以根据接收到的处理请求,获取应用程序的资源配置信息,确定所述应用程序使用的加速器和引擎,资源配置信息包括但不限于引擎数量、加速器类型和加 速器数量,可以根据所述资源配置信息和候选计算资源的资源状态,选取应用程序使用的引擎和加速器,从而实现即时适应资源状态的实时动态分配,即保证性能要求,又保证低功耗。
在一种可能的设计中,所述第一处理单元为第一引擎;所述第二处理单元为第二加速器;所述第一单元将所述第二事件消息发送给所述第二处理单元对应的事件队列,包括:
所述第一引擎执行所述第二加速器的第一重译指令,以将所述第二事件消息发送给所述第二加速器对应的事件队列;所述第一重译指令是通过加载所述第二加速器,并为所述第二加速器分配所述第二加速器对应的事件队列的标识后,根据所述第二加速器对应的事件队列的标识,修改所述第二加速器的机器码得到的;所述第一重译指令被执行时,所述第一引擎向所述第二加速器对应的事件队列发送所述第二事件消息。
通过上述设计,通过根据加速器的事件队列的标识,修改加速器的指令集,修改后的指令集中的指令被引擎上运行的线程执行时,引擎的事件队列发送事件消息,例如,可以响应于第二加速器被加载,为第二加速器分配第二事件队列的标识;根据第二事件队列的标识,修改第二加速器的指令集,修改后的指令集中的指令被第一引擎上的第一线程执行时,第一线程向第二事件队列发送第二事件消息,该方法通过事件队列的标识代替加速器的指令,从而在不断扩展不同的加速器时,微引擎不需要进行修改就可以重用。
第二方面,本申请实施例还提供了一种消息处理装置,包括:
第一运行模块,所述第一运行模块用于:通过第一处理单元对第一事件消息进行处理,得到第二事件消息,所述第一事件消息是所述第一处理单元接收到的,或者所述第一事件消息是所述第一处理单元基于应用程序的处理请求生成的;
通过所述第一处理单元根据上下文信息,将所述第二事件消息发送给第二处理单元,所述上下文信息包括所述第一处理单元到所述第二处理单元的路由信息,所述上下文信息是基于所述应用程序的处理请求生成的;
其中,所述第一处理单元为第一引擎、所述第二处理单元为第二加速器,或者,所述第一处理单元为第一加速器、所述第二处理单元为第二引擎,或者,所述第一处理单元为第一引擎、所述第二处理单元为第二引擎,或者所述第一处理单元为第一加速器,所述第二处理单元为第二加速器。
第三方面,本申请提供实施例提供一种消息处理设备,包括处理器和存储器,
所述存储器,用于存储可执行程序;
所述处理器,用于执行存储器中的计算机可执行程序,使得第一方面中任一项所述的方法被执行。
第四方面,本申请提供实施例提供一种计算机可读存储介质,所述计算机可读存储介质存储有计算机可执行程序,所述计算机可执行程序在被计算机调用时,使所述计算机执行如第一方面中任一项所述的方法。
第五方面,本申请实施例还提供了一种芯片,包括:逻辑电路和输入输出接口,所述输入输出接口用于接收代码指令或信息,所述逻辑电路用于执行所述代码指令或根据所述信息,以执行如第一方面中任一项所述的方法。
第六方面,本申请实施例还提供了一种数据处理系统,所述数据处理系统包括如第二方面所述的消息处理装置。
第七方面,本申请实施例还提供了一种计算机程序产品,所述计算机程序产品包括计 算机指令,当所述计算机指令被计算设备执行时,所述计算设备可以执行如第一方面中任一项所述的方法。
上述第二方面至第七方面中任一方面及其任一方面中任意一种可能的实现可以达到的技术效果,请参照上述第一方面及其第一方面中相应实现可以带来的技术效果描述,这里不再重复赘述。
附图说明
图1为本申请实施例中提供的一种数据处理系统的结构示意图;
图2为本申请实施例中提供的一种微引擎对指令的流水线进行处理的流程示意图;
图3为本申请实施例中提供的一种实现语义驱动数据共享的示意图;
图4为本申请实施例中提供的一种加速器池的选通模式示意图;
图5为本申请实施例中提供的一种加速器池的多播模式示意图;
图6为本申请实施例中提供的一种多路由域的高弹性网络的架构示意图;
图7为本申请实施例中提供的一种高弹性网络的异步接口设计的示意图;
图8为本申请实施例中提供的一种高弹性网络传输的帧的基本结构的示意图;
图9为本申请实施例中提供的一种高弹性网络传输的子帧的结构的示意图;
图10为本申请实施例中提供的一种高动态操作系统的组成结构的示意图;
图11为本申请实施例中提供的一种边缘智能计算的设计方案示意图;
图12为本申请实施例中提供的一种消息处理方法的流程示意图;
图13为本申请实施例中提供的一种边缘智能计算的计算资源调用示意图;
图14为本申请实施例中提供的一种视频通话的设计方案示意图;
图15为本申请实施例中提供的一种视频通话的计算资源调用示意图;
图16为本申请实施例中提供的一种超算中心的语义定义共享数据机制的示意图;
图17为本申请实施例中提供的一种超算服务器的设计方案示意图;
图18为本申请实施例中提供的一种超算中心的计算资源调用示意图;
图19为本申请实施例中提供的一种消息处理装置的结构示意图;
图20为本申请实施例中提供的一种消息处理设备的结构示意图;
图21为本申请实施例中提供的一种芯片的结构示意图。
具体实施方式
在本申请实施例的描述中,除非另有说明,“/”表示或的意思,例如,A/B可以表示A或B;本文中的“和/或”仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。以下至少一项(个)”或其类似表达,是指的这些项中的任意组合,包括单项(个)或复数项(个)的任意组合。例如,a,b,或c中的至少一项(个),可以表示:a,b,c,a-b,a-c,b-c,或a-b-c,其中a,b,c可以是单个,也可以是多个。
以下,术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个该特征。在本申请实施例的描述中,除非另有说明,“多个”的 含义是两个或两个以上。
在本申请实施例中,“示例性的”或者“例如”等词用于表示作例子、例证或说明。本申请实施例中被描述为“示例性的”或者“例如”的任何实施例或设计方案不应被解释为比其它实施例或设计方案更优选或更具优势。确切而言,使用“示例性的”或者“例如”等词旨在以具体方式呈现相关概念。
应注意到本申请实施例中,相似的标号和字母在下面的附图中表示类似项,因此,一旦某一项在一个附图中被定义,则在随后的附图中不需要对其进行进一步定义和解释。
在本申请的描述中,还需要说明的是,除非另有明确的规定和限定,术语“设置”、“安装”、“相连”、“连接”应做广义理解,例如,可以是固定连接,也可以是可拆卸连接,或一体地连接;可以是机械连接,也可以是电连接;可以是直接相连,也可以通过中间媒介间接相连,可以是两个元件内部的连通。对于本领域的普通技术人员而言,可以具体情况理解上述术语在本申请中的具体含义。下面对本申请实施例涉及到的一些名词和术语进行解释。
(1)应用程序:应用程序指为完成某项或多项特定工作的计算机程序,它运行在用户模式,可以和用户进行交互,具有可视的用户界面。
(2)异构计算:异构计算是将CPU的通用算力和专业的芯片的定向加速算力融合在一起的新计算模式,最终达到性能、功耗和灵活性的统一。
(3)加速器:异构计算要使用不同类型的处理器来处理不同类型的计算任务。常见的计算单元包括CPU、ASIC(Application-Specific Integrated Circuit,应用定制集成电路)、GPU(Graphics Processing Unit,图像处理单元/加速器)、NPU(Neural Processing Unit,神经网络处理单元/加速器)、FPGA(Field Programmable Gate Arrays,可编程逻辑阵列)等。加速器是指上述的ASIC、GPU、NPU、FPGA等专业的芯片。异构计算架构中,CPU负责逻辑复杂的调度和串行任务,加速器负责并行度高的任务,实现计算加速。例如,本申请的实施例中,fp32加速器是一种负责fp32浮点运算的加速器。
(4)事件:事件是可以被控件识别的操作,如按下确定按钮,选择某个单选按钮或者复选框。每一种控件有自己可以识别的事件,如窗体的加载、单击、双击等事件,编辑框(文本框)的文本改变事件,等等。
(5)引擎:本申请实施例中提到的引擎是指融合计算微引擎(Convergent Process Engine,XPU),也可以称为微引擎。微引擎是一个处理单元,该处理单元用于对指令的流水线进行处理。其中,该流水线为可动态可扩展的。微引擎可以支持CPU、GPU、NPU等异构计算所需的计算任务、进程或线程。
(6)线程:线程是操作系统能够进行运算调度的最小单位。它被包含在进程之中,是进程中的实际运作单位。一条线程指的是进程中一个单一顺序的控制流,一个进程中可以并发多个线程,每条线程并行执行不同的任务。同一进程中的多条线程将共享该进程中的全部系统资源,如虚拟地址空间,文件描述符和信号处理等等。但同一进程中的多个线程有各自的调用栈,自己的寄存器环境,自己的线程本地存储。
(7)事件队列:本申请实施例中,事件队列是在消息的传输过程中保存消息的容器。事件队列可以看作一个事件消息的链表。
(8)网络拓扑性能:网络拓扑性能是指网络拓扑的链接关系、吞吐量、可用路由、可用带宽、时延等。网络拓扑是指用传输媒体互连各种硬件或设备的物理布局,特别是硬 件分布的位置以及电缆如何通过它们。
(9)应用层:应用层主要为系统提供应用接口。
(10)网络层:网络层主要负责定义逻辑地址,实现数据从源到目的地的转发过程。
基于背景技术中的介绍,高性能处理器的时钟频率一直没有太大的变化,性能提升缓慢。在功耗方面,每平方厘米的功耗从十几毫瓦变到一瓦左右,也到达了极限,限制了性能的提高。
为了提升CPU的性能,业界希望将CPU的通用算力和专业的计算芯片的加速算力融合进行异构计算。通常地,异构计算的任务依赖CPU来调度,异构计算资源需要等待CPU把数据搬移上,数据处理系统对异构资源的调度利用存在性能瓶颈。
因此,提供一种消息处理方法,以解决数据处理系统在调度异构资源时存在的资源利用率低的问题,具有现实意义。
为了便于理解,首先介绍本申请实施例涉及的技术特征。
本申请的实施例提供一种数据处理系统,参阅图1,数据处理系统100有五个核心网元:融合计算微引擎(Convergent Process Engine,XPU)、语义驱动数据共享(Semantic-Driven Data Sharing,SDS)、语义驱动加速器池(Semantic-Driven Accelerator,SDA)、高弹性路由网络(Ultra Elastic Network over Chip,UEN)和高动态操作系统(High-dynamic Operating System,HOS)。其中,高弹性路由网络用于实现微引擎、加速器和事件队列的高速互连,并支撑系统性能和容量可水平扩展;高动态操作系统用于实现灵活调度资源和分配计算任务。本申请的以下实施例中,融合计算微引擎也可以简称为微引擎,微引擎和加速器可以称为处理单元。通常地,在没有特殊指定时,一个处理单元可以是微引擎,也可以是加速器。
下面对图1的数据处理系统100的结构进行简要说明,以便于更清楚地理解本申请实施例。下面介绍图1中的各个核心网元的技术特征。
(1)、融合计算微引擎(XPU)。
融合计算微引擎是一个处理单元,该处理单元用于对指令的流水线进行处理。其中,该流水线为可动态可扩展的。微引擎可以支持CPU、GPU(Graphics Processing Unit,图像处理单元/加速器)、NPU(Neural Processing Unit,神经网络处理单元/加速器)等异构计算所需的计算任务、进程或线程。
针对应用而言,本申请实施例中的微引擎类似于硬化的容器或线程处理器,可以根据不同业务场景的计算任务的负载要求来动态分配对应的微引擎,保证该业务所需的算力和优化的时延。
需要指出的是,本申请的实施例中的微引擎在处理指令的流水线时,通过事件队列ID(Identity Document,身份标识号)代替不同的指令。
微引擎对指令的流水线进行处理,具体过程可以是:在增加新的加速器之后,系统分配对应的事件队列ID号,其中,若该新的加速器对应的程序为第一次在系统安装,则通过即时编译器对该程序重新编译一次,将该程序的机器码替换为向事件队列发消息的通用格式的指令。当加速器程序加载到微引擎时,微引擎响应于与加速器程序对应的加速器指令,将所需处理的数据送到对应的事件队列。
以fp32加速器为例,如图2所示,在新增加了一个fp32加速器时,系统为该fp32加速器分配事件队列号为EQ-ID1。假设该fp32加速器对应的程序为第一次在数据处理系统中安装,则通过即时编译器对与该fp32加速器对应的程序重新编译一次,将fp32的机器 码“fp32rx,ax,bx”替换为如表1所示的向事件队列发消息的通用格式的指令:
表1
Figure PCTCN2021133267-appb-000001
其中,表1中所示的内容中包含的“Insteq EQ-ID1,v”,表示向事件队列号为EQ-ID1的事件队列发送包含数据“v”的消息。
在与图2示出的fp32加速器对应的fp32程序加载到微引擎XPU-ID1之后,微引擎响应于与fp32程序对应的加速器指令,将所需处理的数据送到事件队列EQ-ID1,然后等待事件队列EQ-ID1的返回的结果,回写到寄存器或内存中,至此,完成一次fp32浮点运算。
(2)、语义驱动数据共享(SDS)。
语义驱动数据共享用于通过事件队列连续传输数据和上下文信息,实现数据处理系统内跨计算资源的数据共享。其中,计算资源可以是融合计算微引擎,加速器等。
本申请的实施例中,采用异步电路或异步NOC(Networks On Chip,片上网络)来实现事件消息的收发器,同时在收到完整事件消息之后,自动采用事件触发相应的处理机制,如FFT(fast Fourier transform,快速傅里叶变换),浮点计算等。
需要指出的是,本申请的实施例中,上下文信息也可以称为上下文;相对应地,上下文信息的标识也可以称为上下文的标识,或者简称为上下文标识。
图3示出了本申请一种实施例提供的实现语义驱动数据共享的示意图。参见图3,为了实现数据处理系统内跨计算资源的数据共享,在软件开发过程中,通过应用层对数据共享的上下文进行定义。在创建该数据会话之后,第一计算资源根据语义配置指令构建事件消息块,并通过第一计算资源的事件队列向第一计算资源对应的下一个的第二计算资源的事件队列发送事件消息,以使当第二计算资源的事件队列接收到事件消息时,自动触发该第二计算资源对该事件消息进行处理。
具体实施时,若存在与该第二计算资源对应的下一个计算资源,则在计算完成之后,该第二计算资源直接将处理结果构建事件消息并通过发送队列向与该第二计算资源对应的该下一个计算资源。
以语音FFT变换为例,如图3所示,通过应用调度器创建一个从ADC(Analog-to-digital converter,模拟数字转化)、FFT加速器到成帧器Framer的数据会话,从而得到数据共享的上下文;该数据会话通过编译器或加速库等机制,可以分解得到该上下文相关的各个计算资源的语义配置指令,例如图3中的ADC、FFT加速器和成帧器的语义配置指令。
在创建该数据会话之后,ADC根据配置信息构建事件消息,再通过自身的事件队列向指定的FFT队列发送事件消息;FFT加速器的事件队列接收到ADC的事件队列发送的事件消息时,自动触发FFT加速器对该接收到的事件消息中的数据块做计算,在计算完成之后直接将计算结果构建事件消息块并通过发送队列向成帧器发送该根据计算结果构建的事件消息;成帧器的事件队列接收到该根据计算结果构建的事件消息时,自动触发成帧器对该根据计算结果构建的事件消息的数据块进行对应的协议分析。
其中,如果FFT加速器需要做双精度计算,也可以按照上述同样机制向FP32加速器发送事件消息请求做双精度计算。
作为一种示例,如图3所示,假设FFT加速器需要做双精度计算,可以将需要进行双精度计算的数据包构建事件消息块,再通过自身的事件队列向FP32加速器的事件队列发送事件消息;FP32加速器的事件队列接收到FFT加速器的事件队列发送的事件消息时,自动触发FP32加速器对该接收到的事件消息中的数据块做计算,在计算完成之后直接将双精度计算结果构建事件消息块并通过自身的发送队列向FFT加速器发送该根据双精度计算结果构建的事件消息;FFT加速器的事件队列接收到FP32加速器的事件队列发送的事件消息时,可以对接收到的事件消息做进一步处理后,将处理结果构建事件消息块并通过发送队列向成帧器发送该FFT加速器根据计算结果构建的事件消息;成帧器的事件队列接收到该FFT加速器根据计算结果构建的事件消息时,自动触发成帧器对该FFT加速器根据计算结果构建的事件消息的数据块进行对应的协议分析。
与图3示出的加速器与加速器级联连接相类似地,在本申请的一些实施例中,还可以是一个线程将事件消息发送给一个加速器A处理,加速器A根据处理结果生成的新的事件消息,并发送给另一个加速器B进行处理,加速器B处理完后,再向加速器B的下一个单元传递事件消息。
在一些可选的实施例中,数据处理系统包括第一处理单元和第二处理单元,第一处理单元为第一加速器,第二处理单元为第二加速器;数据处理系统对消息进行处理的过程包括:第一加速器接收第一事件消息,第一加速器对第一事件消息进行处理,得到第二事件消息,第一加速器根据上下文信息,将第二事件消息发送给第二加速器,上下文信息包括第一加速器到第二加速器的路由信息,该上下文信息是基于应用程序的处理请求生成的。
示例性地,以第一处理单元为第一子加速器Task1_A,第二处理单元为第二子加速器Task2_B为例,在一种实施例中,还可以通过应用调度器创建一个从第一线程、第一子加速器Task1_A、第二子加速器Task2_B、第二线程到第二加速器的数据会话,从而得到数据共享的上下文CID0(该上下文中包含事件消息的路由信息)。在创建该数据会话之后,第一子加速器Task1_A可以获取第一线程发送的事件消息Mes.A_1(这里称为第一事件消息),对事件消息Mes.A_1进行处理,得到事件消息Mes.A_2(为了与第一事件消息区别,这里可以称为第二事件消息),并根据上下文中的路由信息将事件消息Mes.A_2发送给第二子加速器Task2_B(比如,根据该上下文中的路由信息将事件消息Mes.A_2的目的事件队列标识设置为第二子加速器Task2_B对应的事件队列的标识)。在此之后,与前述过程类似地,第二子加速器Task2_B可以接收事件消息Mes.A_2,对事件消息Mes.A_2进行处理,得到事件消息Mes.A_3,并根据上下文中的路由信息将事件消息Mes.A_3发送给后续的第二线程。
在一种实现方式中,若接收到应用层的删除数据会话的指示,删除数据会话。
在一种实现方式中,若应用层没有删除数据会话,则数据会话持续存在。
示例性地,图3中,如果系统配置需要拆除该会话,需要指示软件主动删除该数据会话并回收相应的资源。
(3)、语义驱动加速器池(SDA)。
语义驱动加速池提供一种加速器的资源调度机制。融合计算微引擎或加速器都通过事件队列对外进行通讯,以实现对特定功能请求加速处理。
例如,FP32加速器对应的特定功能为“浮点计算”,通过事件队列对外进行通讯。系统可以通过FP32加速器的事件队列对FP32加速器进行通讯,请求进行与图4中的FP32 加速器对应的浮点计算的加速处理。
语义驱动加速器池的资源调度机制的工作原理如下:
根据SOC(System on Chip,片上系统)芯片规划确定一组加速器组成一个共享加速器池,该共享加速器池有配套的事件分配器和加速器池事件队列。本申请的以下实施例中,加速器池事件队列可以简称为池队列。
事件消息从池队列到加速器有两种,一种是选通模式,以实现对加速器的多选一,参见图4;另一种是多播模式,以实现加速器选择时的一进多出,见图5。
下面分别介绍两种加速器的调用模式:
在采用选通模式下,当系统请求加速时,可直接向池队列发送事件消息进行请求,无需具体指定加速器;当池队列有事件消息时,会自动触发事件分配器根据加速器的空闲状态通过RR仲裁选择共享加速器池中的一个加速器来处理该事件消息,然后触发选通电路打通池队列和加速器的电路连接,同时向池队列和加速器发出读事件的消息,则将事件消息从池队列传输到加速器。
在采用多播模式下,当系统同时请求多个同类型的加速器时,可直接向池队列发送请求,无需具体指定加速器;当池队列有事件消息时,会自动触发事件分配器根据多播加速请求的配置信息并检测到相应的处在空闲状态的加速器,将同时选通多个加速器,打通池队列和加速器的电路连接,同时向池队列和加速器发出读事件的消息,则将事件消息从池队列同时传输到加速器上。
在一些可选的实施例中,第二处理单元为第二加速器;第一处理单元根据上下文信息,将第二事件消息发送给第二处理单元,包括:第一处理单元根据路由信息,将第二事件消息发送给加速器池对应的事件队列,加速器池中包括多个加速器,多个加速器的类型相同;根据多个加速器的状态,从多个加速器中确定第二加速器;将第二事件消息发送给第二加速器。
具体来说,数据处理系统包括第一处理单元和第二处理单元,其中,第二处理单元为第二加速器。数据处理系统的第一处理单元根据上下文信息,将第二事件消息发送给第二处理单元,具体为通过下述过程实现:第一处理单元根据上下文信息中包括的路由信息,将第二事件消息发送给加速器池对应的事件队列,加速器池中包括多个加速器,多个加速器中包括第二加速器,多个加速器的类型相同;事件分配器根据加速器池中的加速器的状态,从加速器池中选取第二加速器;事件分配器将加速器池对应的事件队列中的第二事件消息发送给第二加速器。
示例性地,以第二处理单元为图4中的FP32加速器1为例进行说明,数据处理系统的第一处理单元可以根据上下文中包括的路由信息,将事件消息Info.i发送给FP32池对应的事件队列,FP32池中包括至少一个加速器,该至少一个加速器中包括FP32加速器1,且该至少一个加速器的类型相同;FP32池对应的事件分配器根据FP32池中的加速器的状态,从FP32池中选取FP32加速器1;事件分配器将FP32池对应的事件队列中的事件消息Info.i发送给FP32加速器1。
在本申请的一些实施例中,可以进行基于上下文的多播事件消息处理。具体地,上下文可设定多播方式,线程或加速器可以根据上下文设定的多播方式,通过线程或加速器的事件队列启动多播功能对需要下游处理的事件消息进行复制,发送多个下一级处理单元,该单元可以是线程或加速器,也可以是应用/CPU。
(4)、高弹性网络(UEN)。
高弹性网络提供一种可弹性调度的互连机制。高弹性网络可实现单一片上系统SOC内的多个融合计算微引擎和多个加速器的公共物理连接基础设施,也称为单一路由域,同时也是事件消息、微引擎的任务管理和加速器的配置管理等管控通道的统一承载层;还是实现跨SOC间融合计算微引擎和加速器的级联和路由,也称为多路由域,如图6。
本申请实施例提供了高弹性网络,其中,路由器与计算资源可以直连,其中计算资源可以是融合计算微引擎,加速器等;每个计算资源要集成一个收发器与路由器的收发器背靠背连接,可采用同步或异步接口设计。
本申请的一种实施例中,每个计算资源要集成一个收发器与路由器的收发器背靠背连接时,采用异步接口设计,参见图7,因不同微引擎和加速器可能工作在不同的主频,该连接方式可以显著减少高弹性网络在传输和接收数据时的阻塞和超时。
高弹性网络中,收发器采用帧形式或报文来传输和接收数据,收发器可以向路由器发送报文或接受路由器来的报文。其中,高弹性网络传输的帧的基本结构,请参阅图8。
路由器在接收到报文之后,根据解析对应的帧并取出对应的目的端口号,查找对应的路由表找到对应的出端口,向该端口发送报文;如出现多个端口向一个端口发送,需要采用公平仲裁逐个相应的发送报文。
本申请的实施例中,将高弹性网络传输的未经扩展的帧称为“基本帧”。高弹性网络传输的基本帧的结构支持根据应用场景做动态扩展,以适应不同语义的数据格式。
在一种实施例中,高弹性网络传输的帧采用KLV(Key-Length-Value)的扩展格式定义。
其中,
Key字段,在帧的结构中位于最前面,用于描述该字段的属性名称,可以是固定长度或应用可约定;
Length字段,紧接着Key字段,用于描述该字段的长度,可以是固定长度或应用可约定;
Value字段,紧接着length字段,用于承载要传输的数据,长度由Length字段约定。
以下实施例将扩展后得到的帧称为“子帧”,图9提供了本申请实施例提供的一种高弹性网络的子帧的格式示意图。
子帧是分层定义的,最底层是网络子帧,其上是系统子帧,之后是应用子帧,每层都可以独立定义,但传输的顺序严格按照以下方式传输相应的子帧:先网络子帧,再系统子帧,然后应用子帧。网络子帧和系统子帧是预先定义,应用子帧可由开发者或加速器设计时自行约定。
在本申请的一种实施例中,系统子帧的预定义,采用如下类型:
Key=0,代表路由范围,该子帧的数据域是目的地所处的路由域ID;
Key=1,代表上下文会话,该子帧的数据域是该帧所属的数据会话ID;
Key=2,代表源路由地址,该子帧的数据域是发出该帧的队列ID,如果跨域传输子帧,还需要在该子帧内携带路由范围;
Key=3,代表操作系统自定义子帧,该子帧的数据域是操作系统服务所传输的数据,例如:配置数据,程序镜像等。该子帧内,操作系统可以约定自己的“孙帧”,其中,“孙帧”也可以是遵循KLV格式,以便网络可参与帧解析,提高转发效率。
Key=4,代表应用层自定义子帧,该子帧的数据域是应用之间共享的数据,这个子帧 内,应用之间可以约定自己的“孙帧”,其中,应用的“孙帧”也可以是遵循KLV格式以便网络可参与帧解析,提高转发效率。
(5)、高动态操作系统(HOS)。
高动态操作系统提供一种资源调度以及消息通信机制。其中,该资源调度以及消息通信机制,让应用开发者和硬件开发者更好协同设计,又可以相互之间解耦,只要在语义上达成共识就可以实现互操作,让这个系统具备面向高动态环境的按需重构和按需调度的高动态计算能力。
图10示出了一种高动态操作系统的组成结构的示意图,高动态操作系统主要提供三个主要服务:语义驱动计算服务,语义驱动数据服务,语义驱动会话服务。
下面分别介绍三个主要服务的主要功能:
1)语义驱动计算服务。
语义驱动计算服务的主要功能包括:加速池管理、路由管理、即时编译和计算管理。
其中,加速池管理是指高动态操作系统发现硬件上所有连接的加速器池及其支持语义和所在的网络位置,登记该加速器的语义、位置及数量,将其作为即时编译和动态路由的输入参数,同时也会将语义加速器清单暴露给应用层、语义驱动会话服务和语义驱动数据服务。
路由管理是指高动态操作系统发现硬件上所有连接的路由网路及路由域,建立系统全局的路由表,包含路由域列表、每个路由域的路由端口列表以及端口所连的单元类型(含加速器、微引擎、路由器等),将之作为即时编译和计算管理的输入参数。其中,每个加速器或加速器池其所连接的路由器的端口号也就是事件队列号或事件消息的目的端口号。
即时编译是指高动态操作系统根据加速器管理和路由管理的语义加速器及全局路由表,建立语义加速器指令到事件队列的编译映射表,编译映射表的格式如表2所示。将编译映射表作为操作系统在计算管理加载线程或程序时候判断是否启动即时编译的检查列表。
表2
语义加速器指令 语义加速器/池名称 事件队列号 数据格式
Fp32 浮点计算 EQ-ID1 (ax,bx,cx)
FFT 傅里叶变换 EQ-ID2 (ax[],bx[],cx[])
计算管理是指高动态操作系统将微引擎视为线程处理器或容器,提供相应的资源申请API(Application Programming Interface,应用程序接口)接口给应用,让应用可以动态创建线程或任务,发挥海量的多线程多任务并行计算的高动态计算能力,同时也会将微引擎创建任务的接口API暴露给应用层。
2)语义驱动数据服务。
语义驱动数据服务的主要功能包括:语义数据索引、数据管理、内存分配、语义寻址映射。
其中,语义数据索引是指高动态操作系统提供创建结构化的内存共享数据索引的服务,代替页面+偏移地址的全局地址表及其元数据管理,对外发布语义信息,更适合于众核架构、高性能计算和超算等场景的海量数据共享。
数据管理是指高动态操作系统提供在上述创建的内存共享数据索引上做“增删改查”的数据操作接口,添加数据到上述的索引上,后续应用还可以对数据做修改操作。
内存分配是指高动态操作系统在添加数据之后,在本地分配与该添加的数据对应的内存并关联到对应的索引上,考虑到提升内存的访问效率,应用层应尽量让语义共享数据块颗粒尽可能块,这样可有利于发挥语义数据共享优势。
语义寻址映射是指高动态操作系统在外部通用语义访问共享数据时,在系统内将外部通用语义转换到页面+偏移地址的形式,以确定存储到本地内存的数据。
3)语义驱动会话服务。
语义驱动会话服务的主要功能包括:语义会话索引、语义加速库、语义上下文管理、会话性能管理。
其中,语义会话索引是指高动态操作系统提供应用层创建数据会话的接口并生成对应索引,也称为上下文ID(Context ID,CID)。
语义加速库是指高动态操作系统提供应用层该操作系统可使用的语义加速库清单,用于创建上下文所涉及多个加速库,提供自动地动态分配的加速池服务,无需应用参与指定具体资源,让应用程序可以自动适配到高动态计算的硬件上。
语义上下文管理是指高动态操作系统提供上下文相关的微引擎、加速器和事件队列等相关的硬件配置模板及配置服务,让应用层可灵活创建复杂逻辑的数据会话,从而达到将软件处理高频重复性计算任务卸载到硬件处理,实现高能效的计算能力。
会话性能管理是指高动态操作系统提供应用层所创建会话的性能监控服务,也提供应用层指定性能要求,如带宽、速率、时延等参数,在出现性能劣化情况下,主动向应用层上报异常以便做后续优化和调整处理,如触发路由重建等操作。
以图1的数据处理系统100为例,在数据处理系统100首次启动时,数据处理系统100的高动态操作系统通过语义驱动计算服务发现系统硬件的资源。例如,高动态操作系统通过语义驱动计算服务发现系统硬件的资源,如:加速器、微引擎和路由网络等。高动态操作系统可以根据通过语义驱动计算服务发现系统硬件的资源,建立相应的系统硬件资源清单并保存,再次启动如检查到硬件有变更则刷新该系统硬件资源清单,否则可直接使用之前的系统硬件资源清单进行快速启动。
在数据处理系统100启动之后,应用层首先通过数据处理系统100的高动态操作系统的语义驱动数据服务创建所需共享的内存数据并建立相应的语义数据索引及语义寻址映射的本地内存地址清单。
在数据处理系统100的共享数据创建完毕之后,应用层就可以通过数据处理系统100的高动态操作系统的语义驱动计算服务分配微引擎,并加载计算任务对应的代码;同时,应用层也可以通过数据处理系统100的高动态操作系统的语义驱动会话服务创建数据会话,将高频计算任务通过多个语义加速器和微引擎直接通过事件队列交换。
本申请实施例的上述描述的数据处理系统架构以及业务场景是为了更加清楚的说明本申请实施例的技术方案,并不构成对于本申请实施例提供的技术方案的限定。本领域普通技术人员可知,随着数据处理系统架构的演变和新业务场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
本申请实施例中提供了一种消息处理方法及装置,该方法中,包括:第一处理单元对第一事件消息进行处理,得到第二事件消息;第一事件消息是第一处理单元接收到的,或 者第一事件消息是第一处理单元基于应用程序的处理请求生成的;第一处理单元根据上下文信息,将第二事件消息发送给第二处理单元,上下文信息包括第一处理单元到第二处理单元的路由信息,上下文信息是基于应用程序的处理请求生成的;其中,第一处理单元为第一引擎、第二处理单元为第二加速器,或者,第一处理单元为第一加速器、第二处理单元为第二引擎,或者,第一处理单元为第一引擎、第二处理单元为第二引擎,或者第一处理单元为第一加速器,第二处理单元为第二加速器。在该方法中,由于事件消息在不同处理单元之间的传输,是基于上下文信息实现的,相比于采用调度的方式(比如使用调度器等进行消息调度)进行事件消息的传输调度,上述实现方式可以避免传输调度所导致的性能瓶颈,进而可以提高系统处理性能。
下面结合具体实施例对本申请提供的方案进行详细说明。
本申请实施例中的消息处理方法可以应用于图1所示的数据处理系统100中。
本申请实施例提供的消息处理方法,在进行消息处理之前,先基于事件进行动态的资源分配。下面首先介绍数据处理系统的动态资源分配过程。在本申请的以下实施例中,以引擎是融合计算微引擎为例进行说明。需要指出的是,本申请实施例中,融合计算微引擎也可以简称为微引擎。
具体地,应用程序启动时,高动态操作系统接收到应用程序的处理请求,获取处理请求的语义,根据处理请求的语义,确定处理请求包括的至少两个任务。
具体实施时,处理请求包括的任务与任务语义一一对应。该处理请求的语义包括至少两个任务语义,且根据该至少两个任务语义中的每个任务语义,确定对应的一个任务。
例如,该处理请求包括的至少两个任务可以为第一任务和第二任务,第一任务对应于第一任务语义,第二任务对应于第二任务语义,处理请求的语义包括第一任务语义和第二任务语义,第一任务与第二任务不同,第一任务语义和第二任务语义不同。
在响应于接收到的应用程序的处理请求,建立归属于应用程序的该处理请求的至少两个任务时,高动态操作系统还响应于接收到的该处理请求,根据应用程序的资源配置信息确定用于执行该处理请求的计算资源,该计算资源至少包括第一计算资源、第二计算资源和第三计算资源,生成应用程序的上下文,该上下文至少包括第一计算资源到第二计算资源、第二计算资源到第三计算资源的路由信息。对已分配的计算资源,系统还可以根据上下文和各个计算资源的事件队列,打通各个计算资源的通信链路。可以理解地,用于执行该处理请求的计算资源的数量可以是3,也可以是4个或者更多个,本申请的技术方案对可以分配的用于执行该处理请求的计算资源的数量不作具体限定。
为了更清楚地对本申请实施例的技术方案进行说明,下面以该处理请求使用的计算资源为计算资源Resource1、计算资源Resource2、计算资源Resource3、计算资源Resource4为例进行说明。为了与前述的处理请求的至少两个任务分别对应,在一些实施例中,可以是计算资源Resource1和计算资源Resource3为两个不同的微引擎,计算资源Resource2和计算资源Resource4可以为两个不同的加速器;在另一些实施例中,也可以是计算资源Resource1、计算资源Resource2和计算资源Resource3为三个不同的微引擎,计算资源Resource4为加速器;在其他一些实施例中,还可以是计算资源Resource1和计算资源Resource4为两个不同的微引擎,计算资源Resource2和计算资源Resource3可以为两个不同的加速器。
高动态操作系统还创建至少两个任务对应的至少两个线程;将该至少两个线程加载到 至少两个引擎上运行,其中,不同的线程运行在不同的引擎上,不同的线程对应于不同的任务。
以计算资源Resource1和计算资源Resource3为两个不同的微引擎,计算资源Resource2和计算资源Resource4可以为两个不同的加速器为例,进行说明。为了更清楚,根据Resource1~Resource4的类型,可以将计算资源Resource1、计算资源Resource2、计算资源Resource3、计算资源Resource4分别记为微引擎XPU_A、加速器SDA_A、微引擎XPU_B、加速器SDA_B。第一任务的计算资源可以包括微引擎XPU_A和加速器SDA_A,第二任务的计算资源包括微引擎XPU_B和加速器SDA_B,高动态操作系统在基于上述过程动态分配计算资源之后,在微引擎XPU_A上创建对应于第一任务的第一线程,在微引擎XPU_B上创建对应于第二任务的第二线程。其中,微引擎XPU_A与微引擎XPU_B不同,加速器SDA_A和加速器SDA_B不同,加速器SDA_A与第一事件队列对应。
本申请实施例中,每个线程、加速器和应用/CPU可以都有自己对应的事件队列,线程或加速器通过自己的事件队列对需要下游处理的事件消息转发下一级处理单元的事件队列,该单元可以是线程或加速器,也可以是应用/CPU。
需要说明的是,上述实施例中,响应于接收到的处理请求,建立归属于应用程序的处理请求的两个任务,即第一任务和第二任务,仅仅是为了对本申请实施例中的消息处理方法进行示例说明。在其他实施例中,响应于接收到的处理请求,还可以建立归属于应用程序的处理请求的多个任务,例如:第1任务、第2任务、…、第N任务,并创建与各个任务对应的线程。
另外,上述实施例中,根据应用程序的资源配置信息确定第一任务和第二任务使用的计算资源,第一任务的计算资源包括微引擎XPU_A和加速器SDA_A,第二任务的计算资源包括微引擎XPU_B和加速器SDA_B,其中,第一任务和第二任务使用的计算资源中的加速器的数量是1个,仅仅是为了对确定任务使用的计算资源的过程进行示例说明。在其他一些实施例中,对于归属于应用程序的处理请求的多个任务,与该多个任务中的至少一个任务对应的计算资源包括一个引擎和至少一个加速器;与该多个任务中的其他的任务对应的计算资源除了包括一个引擎之外,加速器的数量可以是:0个、1个、2个或2个以上。也即,归属于应用程序的处理请求的任务,不但可以使用一个引擎和一个加速器作为计算资源;个别任务也可以只使用一个引擎,而不使用任何加速器;个别任务还可以是使用一个引擎和多个加速器。
本申请的实施例中,资源配置信息是接收到的应用层发送的参数。
需要说明的是,用户可以针对不同的应用场景,通过本申请实施例提供的数据处理系统100的应用层,开发数据处理任务软件包,以得到用于进行数据处理的应用程序的安装文件。
一种可能的实现方式为,资源配置信息包括触发事件;在应用程序启动的过程中,响应于应用程序的一个处理请求,确定与该处理请求对应的任务,可以通过以下方式实现:响应于与触发事件对应的应用程序的处理请求,确定与该处理请求对应的任务。其中,触发事件是用于在数据处理系统装载应用程序的数据处理任务软件包后,启动处理请求的预先设置的事件。
示例性地,视频通话终端是边缘智能计算的一种典型场景,目前视频通话终端支持人脸识别、背景替换等人工智能计算,要求越来越高的算力,同时也需要低功耗,特别是移 动办公,应急指挥等场景。
图11为本申请实施例中提供的一种边缘智能计算的设计方案示意图。参见图11,视频通话终端1100为在现有硬件基础上进行扩展得到,考虑最大程度重用现有硬件。视频通话终端1100中,CPU可以完全利用现有硬件,例如x86架构、ARM架构、RISC-V架构等CPU,相比现有硬件做如下扩展:
1)PCI-E(Peripheral Component Interconnect Express,外设部件互连标准)或AMBA(Advanced Microcontroller Bus Architecture,片上总线协议)等总线上扩展支持事件队列的传输机制,作为高弹性路由网络的端口;
2)操作系统层面可在Linux基础上增加高动态操作系统的三大服务并对上开放应用API;
3)通话软件要支持调度中心等能力,可实现将音频采集、音视频编解码、网络会话等线程部署到高动态计算硬件上;
4)增加高动态计算的硬件,配置相应的微引擎、路由网络、加速器(如FFT变换、视频渲染、DNN网络等)及与相应的外设(显存、摄像头、网卡、话筒等)相连。
对于图11示出的视频通话终端1100,触发事件可以是点击通话键。在进行数据处理前,先基于“点击通话键”的触发事件进行动态的资源分配。假设第一计算资源为图11中的XPU 3,第二计算资源为图11中的信号处理加速器1,第三计算资源为图11中的XPU0,第四计算资源为图11中的音频加速器1。当发生“点击通话键”的触发事件时,应用程序启动,数据处理系统接收到与“点击通话键”对应的语音通话处理请求Voice01,响应于应用程序的语音通话处理请求Voice01,获取语音通话处理请求Voice01的语义,例如该语音通话处理请求Voice01的语义可以是“语音会话”,假定语音通话处理请求Voice01的语义“语音会话”包括第一任务语义“音频采集”和第二任务语义“音频处理”,高动态操作系统根据语音通话处理请求Voice01的语义“语音会话”,确定与该语音通话处理请求Voice01对应的多个任务,该多个任务至少包括第一任务和第二任务,假设第一任务为音频采集任务、第二任务为音频处理任务,其中,音频采集任务对应于第一任务语义“音频采集”,音频处理任务对应于第二任务语义“音频处理”。上述的音频采集任务和音频处理任务归属于语音通话处理请求Voice01。
可以理解地,本申请的实施例对处理请求的语义包含的任务语义的数量不作限定,当处理请求的语义包含的任务语义的数量为N时,数据处理系统可以确定处理请求包括的N个任务。
进一步地,建立音频采集任务和音频处理任务时,还响应于接收到的该语音通话处理请求Voice01,根据应用程序的资源配置信息确定用于执行该语音通话处理请求Voice01的计算资源,该计算资源包括图11中的XPU 3、信号处理加速器1、XPU 0和音频加速器1,生成应用程序的上下文,该上下文包括XPU 3到信号处理加速器1、信号处理加速器1到XPU 0、XPU 0到音频加速器1的路由信息。并对上述已分配的计算资源,根据上下文和各个计算资源的事件队列,打通通信链路。例如,在XPU 3、信号处理加速器1之间建立第一通信链路,在XPU 0、音频加速器1之间建立第二通信链路。在XPU 3上创建用于处理音频采集任务的音频采集线程,在XPU 0上创建用于处理音频处理任务的音频处理线程;音频采集线程对应于音频采集任务,音频处理线程对应于音频处理任务。
在本申请的实施例中,还可以设置上下文的标识,该上下文标识用于指示应用程序的 上下文。例如上下文标识CID1可以指示上述的视频通话终端1100生成的应用程序的上下文,该上下文包括XPU 3到信号处理加速器1、信号处理加速器1到XPU 0、XPU 0到音频加速器1的路由信息。
在本申请的一些实施例中,高动态操作系统可以根据应用程序的资源配置信息确定音频采集任务和音频处理任务使用的计算资源。例如,可以确定音频采集任务的计算资源包括图11中的XPU 3和信号处理加速器1,音频处理任务的计算资源包括图11中的XPU 0和音频加速器1。
在本申请的一些实施例中,第一处理单元或第二处理单元是基于接收到应用程序的处理请求时多个处理单元的状态信息,从多个处理单元中选择的,处理单元的状态信息包括网络拓扑性能。
具体实施时,确定用于执行处理请求的计算资源,具体是基于接收到处理请求时的硬件状态信息,为处理请求分配计算资源,硬件状态信息包括网络拓扑性能。为第一任务和第二任务配置计算资源,可以考虑硬件(线程、加速器等)的实时状态,进而在满足第一任务和第二任务的需求的前提下,为其分配最优的硬件。在操作系统启动的时候会根据全部硬件状态建立一个硬件状态表,然后每当有硬件的状态发生变化时,就自动更新这个硬件状态表,然后在为第一任务和第二任务分配计算资源的时候,会参考该硬件状态表里面的参数。本申请的实施例中,考虑的硬件状态的参数,除了包括资源本身的使用率之外,还包括网络拓扑性能。网络拓扑性能具体包括网络拓扑的链接关系、吞吐量、可用路由、可用带宽、时延等。
作为一种示例,为音频采集任务和音频处理任务分配计算资源,可以是基于接收到语音通话处理请求时的硬件状态信息,为音频采集任务和音频处理任务分配计算资源;其中,硬件状态信息包括网络拓扑性能。
需要说明的是,上述的分配最优的硬件,可以是分配当前性能最优的硬件,也有可以是分配性能最匹配的硬件,避免资源浪费。另外,硬件状态信息既可以通过建立硬件状态的列表并实时刷新的方式获得,也可以是在配置计算资源的时候获取每个硬件的硬件状态。
在另一种实现方式中,确定音频采集任务和音频处理任务对应的计算资源的过程,可以具体为,当发生“点击通话键”的触发事件时,启动语音通话处理请求Voice01。响应于应用程序的语音通话处理请求Voice01,生成与该处理请求对应的音频采集任务和音频处理任务,在XPU 3上创建用于处理音频采集任务的音频采集线程,在XPU 0上创建用于处理音频处理任务的音频处理线程后,确定音频采集任务和音频处理任务对应的计算资源。其中,音频采集任务对应的计算资源包括XPU 3和信号处理加速器1,音频处理任务对应的计算资源包括XPU 0和音频加速器1。
需要特别指出的是,本申请的实施例中,一个任务对应的计算资源可以包括一个引擎和一个加速器,也可以是包括一个引擎和多个加速器;多个任务中的部分任务还可以是只包括一个引擎。
一种可能的实现方式为,在应用程序启动时,还包括以下步骤:
步骤A1,响应于应用程序启动,获取应用程序的资源配置信息。
其中,资源配置信息包括引擎数量,以及加速器类型和加速器数量。
示例性地,假设加速器池Pool1中包含10个信号处理加速器,加速器池Pool2中包含10个音频加速器,微引擎的总数量是20个。响应于应用程序启动,为了构建视频通话终 端1100,获取到的应用程序的资源配置信息包括:引擎为微引擎,微引擎对应的数量“2”、加速器类型为“信号处理加速器”和“音频加速器”、与加速器类型“信号处理加速器”对应的加速器数量为“1”,与“音频加速器”对应的加速器数量为“1”。
步骤A2,根据资源配置信息,以及候选引擎的负载,选取应用程序使用的引擎。
其中,选取的引擎里包括第一引擎和/或第二引擎。
示例性地,根据微引擎数量“2”,以及候选引擎的负载,选取2个微引擎,其中,该2个微引擎中包括微引擎XPU 3和XPU 0,微引擎XPU 3与微引擎XPU 0不同。
具体实施时,选取应用程序使用的引擎,可以是在候选引擎中按照负载率从低到高的顺序选取指定数量的微引擎;还可以是根据负载要求从候选引擎中选取满足该负载要求的指定数量的微引擎,其中负载要求可以是由资源配置信息获取的。
步骤A3,根据资源配置信息,选取应用程序使用的加速器,选取的加速器中包括第一加速器和/或第二加速器。
示例性地,根据加速器类型“信号处理加速器”和“音频加速器”,可以确定:与“信号处理加速器”对应的加速器池为加速器池Pool1,与“音频加速器”对应的加速器池为加速器池Pool2。其中,从加速器池Pool1选取的应用程序使用的加速器中包括信号处理加速器1,从加速器池为Pool2选取的应用程序使用的加速器中包括音频加速器1,其中,信号处理加速器1和音频加速器1不同。
一种可能的实现方式为,在XPU 3和信号处理加速器1之间建立第一通信链路,具体为在XPU 3和事件队列4之间建立通信链路,事件队列4对应于信号处理加速器1。这样,XPU 3上运行音频采集线程可以将事件消息Mes.1发送给事件队列4,信号处理加速器1可以从事件队列4中获取事件消息Mes.1。同样地,在XPU 0和音频加速器1之间建立第二通信链路,可以具体为在XPU 0和事件队列5之间建立通信链路,事件队列5对应于音频加速器1。这样,XPU 0上运行的音频处理线程可以将事件消息Mes.3发送给事件队列5,音频加速器1可以从事件队列5中获取事件消息Mes.3。
在一种可选的实施例中,在音频采集线程可以将事件消息Mes.1发送给事件队列4时,具体地,音频采集线程执行信号处理加速器1的重译指令,以将事件消息Mes.1发送给事件队列4。其中,信号处理加速器1的重译指令是通过加载信号处理加速器1,并为信号处理加速器1分配事件队列4的标识后,根据事件队列4的标识,修改信号处理加速器1的机器码得到的;第一重译指令被执行时,音频采集线程向事件队列4发送事件消息。
为了更高效地进行事件消息的传递,本申请定义了一种新的事件消息信息格式,即通过图1中的事件队列在高弹性网络传输的系统信息。
在一种可选的实施例中,数据处理系统的事件消息采用图9示出的高弹性网络的子帧的格式,仅以事件消息Mes.1为例,事件消息Mes.1包括:网络层消息属性信息域,用于承载事件消息路由信息,事件消息路由信息包括目标事件队列标识,例如,该目标事件队列标识可以为信号处理加速器1事件队列4的标识;网络层消息长度域,用于承载事件消息Mes.1的总长度信息;网络层数据域,用于承载事件消息Mes.1的负荷。
一种可能的实现方式为,网络层数据域中包括操作系统层事件信息域,操作系统层事件信息域包括以下至少一项:路由范围、上下文的标识、源消息队列标识或者自定义属性,路由范围包括至少一个路由域。
示例性地,系统子帧的预定义,可以采用如下类型:
Key=0,代表路由范围,该子帧的数据域是目的地所处的路由域ID;
Key=1,代表上下文会话,该子帧的数据域是该帧所属的数据会话ID;
Key=2,代表源路由地址,该子帧的数据域是发出该帧的队列ID,如果跨域传输子帧,还需要在该子帧内携带路由范围;
Key=3,代表操作系统自定义子帧,该子帧的数据域是操作系统服务所传输的数据,例如:配置数据,程序镜像等。
一种可能的实现方式为,网络层数据域中包括应用层事件信息域,应用层事件信息域包括应用层的自定义信息。
具体实施时,在系统子帧中,操作系统可以约定自己的“孙帧”,其中,“孙帧”也可以是遵循KLV格式,以便网络可参与帧解析,提高转发效率。
示例性地,系统子帧的预定义,还可以包括如下类型:
Key=4,代表应用层自定义子帧,该子帧的数据域是应用之间共享的数据,这个子帧内,应用之间可以约定自己的“孙帧”,其中,应用的“孙帧”也可以是遵循KLV格式。
应用层事件信息域、操作系统层事件信息域与网络层数据域的关系可以参见图9。
本申请实施例提供一种消息处理方法,在基于事件进行动态的资源分配之后,进行事件消息的处理。
在一些可选的实施例中,结合本申请实施例提供的数据处理系统,例如图11示出的视频通话终端1100,对消息进行处理的过程,如图12所示,可以包括如下步骤:
步骤S1201,第一处理单元接收第一事件消息。
其中,第一处理单元可以为第一微引擎或第一加速器。
示例性地,在图11示出的视频通话终端1100中,第一处理单元可以指信号处理加速器1,也可以指微引擎XPU 0。以第一处理单元为信号处理加速器1为例,进行说明。视频通话终端1100可以将事件消息在信号处理加速器1与XPU 0之间传输。该视频通话终端1100的消息过程中,将事件消息在信号处理加速器1与XPU 0之间传输,首先是信号处理加速器1获取事件消息Mes.1。
在另外的实施例中,第一处理单元为第一微引擎,第一事件消息可以是第一处理单元基于应用程序的处理请求生成的。
步骤S1202,第一处理单元对第一事件消息进行处理,得到第二事件消息。
示例性地,信号处理加速器1对事件消息Mes.1进行处理,得到事件消息Mes.2。
一种可能的实现方式为,上下文还包括操作配置信息;第一处理单元对第一事件消息进行处理,得到第二事件消息,具体为:第一处理单元获取上下文中,第一处理单元对应的第一操作配置信息;第一处理单元根据第一操作配置信息对第一事件消息进行处理。
具体实施时,上下文还包括用于计算资源的操作配置信息;计算资源包括微引擎和加速器;应用程序启动时,根据资源配置信息分配上下文和上下文标识。上下文标识用于指示与应用程序的上下文。上下文标识被包括在与该应用程序的同一处理请求相对应的所有事件消息中,例如第一事件消息和第二事件消息中,上下文标识可用于获取上下文。
示例性地,以应用程序的与“点击通话键”对应的语音通话处理请求Voice01为例,假设上下文包括用于计算资源的操作配置信息CZXX1,其中,操作配置信息CZXX1为“CID1,in:ADC,via:FFT,via:SHT,out:Fra,位宽,采样点数,周期,数据子块时间片,双浮点精度,…”。应用程序启动时,根据资源配置信息分配语音通话处理请求Voice01 对应的上下文以及上下文标识CID1,该上下文标识CID1被包括在事件消息Mes.1、事件消息Mes.2和事件消息Mes.3中。上下文标识CID1可以用于获取与语音通话处理请求Voice01对应的操作配置信息CZXX1。
信号处理加速器1对事件消息Mes.1进行处理的过程,具体为:先根据事件消息Mes.1中包括的上下文标识CID1,获取对应的用于信号处理加速器1的第一操作配置信息CZXX1_1,例设第一操作配置信息CZXX1_1为“对接收到的该上下文ID的事件消息进行FFT变换”;然后,信号处理加速器1根据第一操作配置信息CZXX1_1对Mes.1进行处理。类似地,音频加速器1对事件消息Mes.3进行处理的过程,可以是音频加速器1先根据事件消息Mes.3中包括的上下文标识CID1,获取对应的用于音频加速器1的第二操作配置信息CZXX1_2,假设第二操作配置信息CZXX1_2是“对接收到的该上下文ID的事件消息进行MP4编码”,再根据用于音频加速器1的第二操作配置信息CZXX1_2对Mes.3进行处理。
步骤S1203,第一处理单元根据上下文信息,将第二事件消息发送给第二处理单元,上下文信息包括第一处理单元到第二处理单元的路由信息。
其中,第二处理单元可以为第二微引擎或第二加速器,上下文信息是基于应用程序的处理请求生成的。
具体实施时,第一处理单元与第二处理单元进行事件消息的传输时,可以具体为:第一处理单元为第一微引擎、第二处理单元为第二加速器,或者,第一处理单元为第一加速器、第二处理单元为第二微引擎,或者,第一处理单元为第一微引擎、第二处理单元为第二微引擎,或者第一处理单元为第一加速器,第二处理单元为第二加速器。
示例性地,第一处理单元为信号处理加速器1时,第二处理单元为微引擎XPU 0。信号处理加速器1根据上下文,将事件消息Mes.2发送给微引擎XPU 0。上下文包括信号处理加速器1到微引擎XPU 0的路由信息。
一种可能的实现方式为,第一处理单元根据上下文信息,将第二事件消息发送给第二处理单元,可以先由第一处理单元根据路由信息,将第二事件消息发送给第二处理单元对应的事件队列;然后,第二处理单元从事件队列获取第二事件消息。
本申请的实施例中,包括线程、加速器在内的每个计算资源都有自己的事件队列;一个线程或加速器将需要其他计算资源处理的事件消息,通过自己的事件队列向下游的微引擎/加速器的事件队列发送。可以理解地,应用/CPU也可以有自己的事件队列,从而能够在应用/CPU、线程、加速器三者之间进行事件消息的传递。线程通过自己对应的事件队列发送事件消息时,具体为通过其所在的微引擎的事件队列进行事件消息的转发。在本申请的实施例中,微引擎的事件队列就是微引擎上运行的线程的事件队列。
参见图11,事件队列4对应于信号处理加速器1,事件队列3对应于音频采集线程,事件队列0对应于音频处理线程,音频加速器1与图11中的事件队列5对应。XPU 3上的音频采集线程获取数据请求Data-1,然后根据应用程序的上下文中包括的路由信息,将根据数据请求Data-1生成的事件消息Mes.1通过事件队列3发送给事件队列4;响应于事件队列4接收事件消息Mes.1,信号处理加速器1从事件队列4中获取事件消息Mes.1,对事件消息Mes.1进行处理,生成事件消息Mes.2,然后根据应用程序的上下文中包括的路由信息,将事件消息Mes.2发送给XPU 0对应的事件队列0,XPU 0上运行的音频处理线程基于事件消息Mes.2生成事件消息Mes.3,然后根据应用程序的上下文中包括的路由信息,通过事件队列0将Mes.3发送给事件队列5;Mes.3被发送到事件队列5后,响应于事件队列5接收事件消息Mes.3,音频加速器1从事件队列5获取事件消息Mes.3,并对事件消息Mes.3进行处理。
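上述"音频采集线程→信号处理加速器1→音频处理线程→音频加速器1"借助事件队列逐站传递的过程,可用如下示意性草图模拟(队列编号与流转顺序沿用上文示例,负荷内容为占位数据):

    from collections import deque

    queues = {qid: deque() for qid in (0, 3, 4, 5)}   # 事件队列0/3/4/5
    flow = {"CID1": [3, 4, 0, 5]}                      # 流转顺序:EQ3→EQ4→EQ0→EQ5

    def send_next(msg, current_qid):
        """按上下文中的流转顺序,把消息发给当前队列的下一站。"""
        order = flow[msg["cid"]]
        nxt = order[order.index(current_qid) + 1]
        queues[nxt].append(msg)

    mes1 = {"cid": "CID1", "payload": "Data-1"}
    queues[3].append(mes1)                  # 音频采集线程生成Mes.1并放入自己的事件队列3
    send_next(queues[3].popleft(), 3)       # Mes.1由EQ3送往EQ4
    queues[4].popleft()                     # 信号处理加速器1从EQ4取走Mes.1并处理

    mes2 = {"cid": "CID1", "payload": "FFT后数据"}
    send_next(mes2, 4)                      # 信号处理加速器1发出Mes.2,送往EQ0

    mes3 = {"cid": "CID1", "payload": "待编码数据"}
    send_next(mes3, 0)                      # 音频处理线程发出Mes.3,送往EQ5
    print(len(queues[0]), len(queues[5]))   # 1 1,EQ0、EQ5各有一条待处理消息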
一种可能的实现方式为,第二事件消息包括目标事件队列标识,目标事件队列标识为第二处理单元对应的事件队列的队列标识。
具体地,第一处理单元根据路由信息,将第二事件消息发送给第二处理单元对应的事件队列,可以是:第一处理单元根据上下文信息中包括的路由信息,确定第二事件消息中待添加的事件消息路由信息,事件消息路由信息包括目标事件队列标识,目标事件队列标识为第二处理单元对应的事件队列的队列标识;第一处理单元在第二事件消息中添加事件消息路由信息;第一处理单元发送添加了事件消息路由信息的第二事件消息,添加了事件消息路由信息的第二事件消息被发送到第二处理单元对应的事件队列。
在本申请实施例中,事件消息路由信息也可以称为流转信息,上下文信息包括的路由信息也可以称为应用程序对应的流转顺序信息。上下文标识用于指示应用程序的上下文,可以指示应用程序对应的流转顺序信息。
示例性地,信号处理加速器1根据应用程序的上下文中包括的路由信息,将事件消息Mes.2发送给微引擎XPU 0对应的事件队列0,可以是:信号处理加速器1根据事件消息Mes.2中包括的上下文标识CID1,获取应用程序对应的流转顺序信息,假设该流转顺序信息为“CID1,事件队列3,事件队列4,事件队列0,事件队列5”,表征传递顺序依次为音频采集线程、信号处理加速器1、音频处理线程、音频加速器1,进而根据该流转顺序信息确定事件消息Mes.2中待添加的流转信息。该流转信息包括目标事件队列标识,该事件消息Mes.2的流转信息包括的目标事件队列标识为微引擎XPU 0对应的事件队列0的队列标识。然后,信号处理加速器1在事件消息Mes.2中添加前述确定的流转信息。接下来,信号处理加速器1可以发送添加了流转信息的事件消息Mes.2,该添加了流转信息的事件消息Mes.2被发送到微引擎XPU 0对应的事件队列0。
本申请的一些实施例中,路由信息还包括目标路由域,目标路由域用于指示目标服务器,目标服务器与源服务器不同,源服务器是第一处理单元所在的服务器。
示例性地,应用程序对应的流转顺序信息还包括第一目标路由域,确定事件消息Mes.2中待添加的流转信息时,该流转信息还包括第一目标路由域,第一目标路由域用于指示第一目标服务器,第一目标服务器与图11中的信号处理加速器1所在的源服务器不是同一服务器。
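对"根据流转顺序信息确定并添加事件消息路由信息(目标事件队列标识,必要时还包括目标路由域)"这一步,下面给出一个示意性草图;其中路由域的取值与字段名均为本文假设:

    FLOW_INFO = {
        # 上下文标识 -> 按序排列的 (事件队列标识, 所在路由域) 列表,取值为示意
        "CID1": [(3, "域A"), (4, "域A"), (0, "域A"), (5, "域A")],
    }
    LOCAL_DOMAIN = "域A"   # 第一处理单元所在源服务器的路由域(假设)

    def build_routing_info(cid, current_qid):
        """根据上下文标识与当前所在队列,得到待添加到事件消息中的流转信息。"""
        hops = FLOW_INFO[cid]
        idx = [q for q, _ in hops].index(current_qid)
        target_qid, target_domain = hops[idx + 1]
        info = {"目标事件队列标识": target_qid}
        if target_domain != LOCAL_DOMAIN:      # 跨服务器传输时才携带目标路由域
            info["目标路由域"] = target_domain
        return info

    mes2 = {"cid": "CID1", "payload": "FFT后数据"}
    mes2.update(build_routing_info("CID1", current_qid=4))   # 信号处理加速器1所在队列为EQ4
    print(mes2)    # 目标事件队列标识为0,即微引擎XPU 0对应的事件队列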
可以理解地,本申请实施例中,线程或加速器可以根据上下文获取路由信息,并将需要下游处理的事件消息转发给下一级处理单元,该单元可以是线程或加速器,也可以是应用/CPU。一个处理单元向另一个处理单元发送事件消息的过程与信号处理加速器1向微引擎XPU 0发送事件消息Mes.2的过程相似,在此不再赘述。
需要说明的是,数据处理系统对事件消息的处理过程中,在基于事件进行动态的资源分配之后,根据上下文实现事件消息在不同的多个处理单元之间的顺序传输时,该多个处理单元中的首个处理单元是一个微引擎,该微引擎上运行的线程可以获取数据请求,并基于数据请求生成首个事件消息。数据请求是请求信息,用于请求对与应用程序的处理请求对应的具体数据进行响应。需要说明的是,处理请求可以是数据获取请求,还可以是数据处理请求。其中,数据获取请求用于请求获取与该请求消息中包含的数据信息相对应的目标数据,数据处理请求用于请求对该请求消息中包含的数据信息进行处理。
示例性地,以数据请求是数据处理请求Data-1为例进行说明,Data-1用于请求对与"点击通话键"的触发事件相对应的数字信号进行响应。当发生"点击通话键"的触发事件时,应用程序启动,数据处理系统接收到语音通话处理请求Voice01,微引擎XPU 3上运行的音频采集线程从话筒通过ADC采集音频信号,获取与"点击通话键"的触发事件相对应的数据请求Data-1,并根据数据请求Data-1生成事件消息Mes.1,参见图13。
可以理解地,在图11示出的视频通话终端1100中,若第一处理单元指微引擎XPU 0,则第二处理单元是指音频加速器1。视频通话终端1100可以将事件消息在微引擎XPU 0与音频加速器1之间传输。将事件消息在微引擎XPU 0与音频加速器1之间传输的过程,与将事件消息在信号处理加速器1与XPU 0之间传输的过程相似。在该视频通话终端1100的消息处理过程中,将事件消息在微引擎XPU 0与音频加速器1之间传输时,首先是微引擎XPU 0获取事件消息Mes.2;微引擎XPU 0对事件消息Mes.2进行处理,得到事件消息Mes.3;微引擎XPU 0根据上下文,将事件消息Mes.3发送给音频加速器1。上下文包括微引擎XPU 0到音频加速器1的路由信息。
上述的事件消息Mes.1、事件消息Mes.2、事件消息Mes.3包括上下文的标识,例如上下文标识CID1。该上下文标识CID1用于指示应用程序的上下文。
需要指出的是,事件消息在不同处理单元之间传输的方式,均与事件消息在从加速器到微引擎、从微引擎到加速器的传输过程相似。因此,对于事件消息在从加速器到加速器、从微引擎到微引擎的传输过程,具体不再赘述。
在一些可选的实施例中,消息处理方法还包括释放第一线程,该第一线程为至少两个线程中的一个;若释放第一线程后,第一线程被释放前所在的引擎上已无线程运行,关闭第一线程被释放前所在的引擎。
具体实施时,响应于接收到释放第一线程的指令,释放引擎上运行的第一线程;若释放第一线程后,第一线程被释放前所在的引擎上已无线程运行,则关闭第一线程被释放前所在的引擎。
其中,释放第一线程的指令可以是响应于发生的与触发事件对应的释放事件生成的,接收到释放第一线程的指令后,数据处理系统释放第一微引擎上运行的第一线程。其中,释放事件是设置的用于在处理请求启动后,停止与处理请求对应的数据处理的事件。
示例性地,对于图11示出的视频通话终端1100,释放事件可以是点击停止通话键或视频通话呼叫挂断。当用户点击停止通话键时,视频通话终端1100响应于接收到与发生的第二事件“点击停止通话键”相对应的释放音频采集线程的指令,释放XPU 3上运行的音频采集线程。释放XPU 3上运行的音频采集线程之后,若XPU 3上不再有运行的线程,则关闭XPU 3,实现近零的待机功耗。
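"释放第一线程后,若其所在引擎上已无线程运行则关闭该引擎"的判断逻辑,可用如下示意性草图表示(Engine、release_thread等均为说明用的假设名称,并非本申请限定的接口):

    class Engine:
        def __init__(self, name):
            self.name = name
            self.threads = set()
            self.powered_on = True

        def release_thread(self, thread_name):
            self.threads.discard(thread_name)
            if not self.threads:           # 引擎上已无线程运行
                self.powered_on = False    # 关闭引擎,实现近零待机功耗
                print(f"{self.name} 已关闭")

    xpu3 = Engine("XPU 3")
    xpu3.threads.add("音频采集线程")
    # 发生"点击停止通话键"等释放事件后:
    xpu3.release_thread("音频采集线程")
    print(xpu3.powered_on)   # False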
本申请的实施例中,数据请求是对与应用程序的处理请求对应的具体数据进行响应的请求。需要说明的是,处理请求可以是数据获取请求,还可能是数据处理请求,其中,数据获取请求用于请求获取数据信息,数据处理请求用于请求对该请求消息中包含的数据信息进行处理。与处理请求相对应地,在一些实施例中,数据请求可以是请求根据与应用程序的处理请求对应的具体数据获取数据;在其他一些实施例中,数据请求可以是请求对与应用程序的处理请求对应的具体数据进行处理。
一种可能的实现方式为,数据请求用于请求获取目标数据,目标数据存储于第二服务器的内存中,用于执行处理请求的计算资源还包括第三处理单元和第四处理单元;至少两个引擎包括第一处理单元、第二处理单元和第三处理单元;第四处理单元为加速器;第一事件消息和第二事件消息中包括目标数据的标识,第一处理单元和第二处理单元位于第一服务器,第三处理单元和第四处理单元位于第二服务器;上下文还包括第二处理单元到第三处理单元、第三处理单元到第四处理单元的路由信息;
在第一处理单元根据上下文,将第二事件消息发送给第二处理单元之后,方法还包括:
第二处理单元基于第二事件消息将第二事件消息封装,以生成第三事件消息;
第二处理单元根据上下文,将第三事件消息发送给位于第二服务器的第三处理单元;
第三处理单元基于第三事件消息对第三事件消息解封装,得到第四事件消息,并根据上下文,将第四事件消息发送给第四处理单元;
第四处理单元从接收到的第四事件消息获取目标数据的标识,根据目标数据的标识从第二服务器的内存中获取目标数据,并根据目标数据得到第五事件消息;第五事件消息用于将目标数据发送给第一服务器。
示例性地,假定数据请求可以是数据获取请求Req1,Req1用于请求获取目标数据,目标数据存储于第二服务器S2的内存中,用于执行处理请求的计算资源包括微引擎XPU 3'、微引擎XPU 1'、微引擎XPU 0''和语义内存加速器1'';事件消息Mes.1'和事件消息Mes.2'中包括目标数据的标识DTM1,微引擎XPU 3'、微引擎XPU 1'位于第一服务器S1,微引擎XPU 0''和语义内存加速器1''位于第二服务器S2;上下文至少包括微引擎XPU 3'到微引擎XPU 1'、微引擎XPU 1'到微引擎XPU 0''、微引擎XPU 0''到语义内存加速器1''的路由信息;事件消息处理的方法包括:微引擎XPU 3'根据上下文,将事件消息Mes.1'发送给微引擎XPU 1';微引擎XPU 1'基于事件消息Mes.1'将事件消息Mes.1'封装,以生成事件消息Mes.2';例如事件消息Mes.2'可以是第一以太网帧YTZ01;微引擎XPU 1'根据上下文,将事件消息Mes.2'发送给位于第二服务器S2的微引擎XPU 0'';微引擎XPU 0''基于事件消息Mes.2'对事件消息Mes.2'解封装,得到事件消息Mes.3',并根据上下文,将事件消息Mes.3'发送给语义内存加速器1'';语义内存加速器1''从接收到的事件消息Mes.3'获取目标数据的标识DTM1,根据目标数据的标识DTM1从第二服务器S2的内存中获取目标数据Tar_Data1,并将目标数据Tar_Data1发送给第一服务器S1。
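上述跨服务器获取目标数据的流程(封装为以太网帧、远端解封装、按标识取数、回送)可用如下示意性草图串联起来;其中帧结构、内存内容等均为说明用的假设:

    import json

    REMOTE_MEMORY = {"DTM1": b"Tar_Data1"}      # 第二服务器S2内存中的目标数据(假设)

    def encapsulate(event_msg):                  # 微引擎XPU 1':事件消息 -> "以太网帧"
        return b"ETH|" + json.dumps(event_msg).encode()

    def decapsulate(frame):                      # 微引擎XPU 0'':"以太网帧" -> 事件消息
        return json.loads(frame[len(b"ETH|"):])

    def semantic_memory_lookup(event_msg):       # 语义内存加速器1'':按标识取数并构造回送消息
        data = REMOTE_MEMORY[event_msg["目标数据标识"]]
        return {"cid": event_msg["cid"], "payload": data.decode(), "目的": "第一服务器S1"}

    mes1p = {"cid": "CID1", "目标数据标识": "DTM1"}       # 事件消息Mes.1'
    mes2p = encapsulate(mes1p)                             # 事件消息Mes.2'(以太网帧)
    mes3p = decapsulate(mes2p)                             # 事件消息Mes.3'
    reply = semantic_memory_lookup(mes3p)                  # 携带目标数据的回送消息
    print(reply["payload"])                                # Tar_Data1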
基于本申请提供的实施方式,基于上下文实现事件消息在不同处理单元之间的传输,相比于采用调度的方式(比如使用调度器等进行消息调度)进行事件消息的传输调度,该方法可以避免传输调度所导致的性能瓶颈,进而可以提高系统处理性能。
本申请的消息处理方法可适用于边缘智能计算、高性能超算中心、自动驾驶汽车、机器人、无人工厂、无人矿山等既需要大算力又需要高能效的场景。以下结合边缘智能计算和高性能超算作为两个主要的场景,对本申请实施例提供的消息处理方法做进一步说明。
实施例一
目前视频通话终端支持人脸识别、背景替换等人工智能计算,对算力要求越来越高,同时也需要低功耗,特别是移动办公,应急指挥等场景。本实施例以视频通话终端作为边缘智能计算的典型场景,视频通话终端配置有数据处理系统,视频通话终端的计算资源的结构关系请参考图14。
下面介绍一种基于事件触发方式动态部署通话相关的线程、实现语音会话的数据会话,从而卸载软件计算负载的实现方案。其中事件可以是通话呼叫接续。
视频通话终端的语音会话可以涉及音频采集、FFT等变换、音频编解码、通过TCP/IP连接与通话对端进行数据交换。本申请的语音通话应用程序,通过高动态操作系统创建三个线程到不同的微引擎上,其中,
音频采集线程,主要负责从话筒通过ADC采集音频信号,按固定的时间片(如1ms)将采集的音频数字信号打包成事件消息;
音频处理线程,主要负责将经去噪等处理后的音频信号,按MP3或H264编码格式转换为音频传输报文;
TCP/IP线程,主要负责建立和维护与通话对端的IP会话连接,语音会话会有独立的端口号。
通过应用层开发语音通话的数据处理软件包后,通过高动态操作系统装载并注册该数据处理软件包的资源配置信息。经过上述配置操作后,视频通话终端上安装有语音通话应用程序。
其中,资源配置信息包括但不限于以下项目的部分或全部:加速器类型、加速器数量、微引擎数量、操作配置信息、流转顺序信息、触发事件。其中,流转顺序信息表征与应用程序的处理请求对应的各个计算资源响应处理请求的顺序。其中,操作配置信息和流转顺序信息可以是通过应用层设置的数据会话信息获得的。
示例性地,语音通话应用程序的资源配置信息中的加速器类型可以是:信号处理加速器、音频处理加速器、会话连接加速器,该三种类型的加速器对应的加速器数量可以分别为“1、1、1”;微引擎数量可以是“3”。其中,信号处理加速器的加速器数量为“1”,表征高动态操作系统将根据信号处理加速器的加速器数量“1”,为该语音通话应用程序配置1个信号处理加速器。假设配置的信号处理加速器、音频处理加速器、会话连接加速器分别为:信号处理加速器A、音频处理加速器B、会话连接加速器C。语音通话应用程序的触发事件可以是通话呼叫接续。通话呼叫接续是用于在数据处理系统装载语音通话应用程序的数据处理软件包后,启动会话处理请求的预先设置的事件。
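上述语音通话应用程序注册的资源配置信息,可用如下示意性结构表示(字段名与取值沿用上文示例,仅为说明用的草图,并非本申请限定的数据结构):

    VOICE_APP_RESOURCE_CONFIG = {
        "微引擎数量": 3,
        "加速器": [
            {"类型": "信号处理加速器", "数量": 1},
            {"类型": "音频处理加速器", "数量": 1},
            {"类型": "会话连接加速器", "数量": 1},
        ],
        "触发事件": "通话呼叫接续",
        "释放事件": "通话呼叫拒接",
        # 操作配置信息与流转顺序信息可由应用层设置的数据会话得到(见下文CID2示例)
        "数据会话": "Create Session(CID2, in:ADC, via:FFT, ..., out:Framer, ...)",
    }
    print(VOICE_APP_RESOURCE_CONFIG["加速器"][0]["类型"])   # 信号处理加速器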
当用户进行通话呼叫接续时,发出会话处理请求Chat01,该语音通话应用程序启动。下面详细介绍语音通话应用程序启动时,高动态操作系统对计算资源的配置过程:
步骤K1,响应于启动应用程序的指令,高动态操作系统根据应用程序的资源配置信息确定应用程序使用的计算资源,响应于会话处理请求,确定与该处理请求对应的任务:音频采集任务、音频处理任务、会话连接任务。
如图14所示,计算资源包括微引擎XPU 3、信号处理加速器A、微引擎XPU 0、音频处理加速器B、微引擎XPU 2、会话连接加速器C;信号处理加速器A与事件队列EQ1对应;音频处理加速器B与事件队列EQ2对应;会话连接加速器C与事件队列EQ4对应;微引擎XPU 3与事件队列EQ0对应;微引擎XPU 0与事件队列EQ3对应;微引擎XPU 2与事件队列EQ5对应。当发生“通话呼叫接续”的触发事件时,启动会话处理请求Chat01,响应于应用程序的与“通话呼叫接续”对应的会话处理请求Chat01,确定与会话处理请求Chat01对应的任务,任务至少包括第一任务、第二任务和第三任务,例如,第一任务为音频采集任务、第二任务为音频处理任务、第三任务为会话连接任务。
具体实施时,资源配置信息包括引擎数量,以及加速器类型和加速器数量;响应于应用程序启动,获取应用程序的资源配置信息,根据资源配置信息以及候选引擎的负载选取应用程序使用的引擎,并根据资源配置信息选取应用程序使用的加速器,选取的加速器中包括第一加速器和第二加速器。
作为一种示例,语音通话应用程序的资源配置信息中的加速器类型可以包括"信号处理加速器"、"音频处理加速器"和"会话连接加速器",各类型对应的加速器数量均为"1"。在根据语音通话应用程序的资源配置信息确定语音通话应用程序使用的计算资源时,可以是针对每种加速器类型确定对应的加速器池,并根据对应的加速器数量从该加速器池中选取加速器,从而得到信号处理加速器A、音频处理加速器B、会话连接加速器C;与确定加速器的过程类似地,假设资源配置信息包括的微引擎数量是"3",高动态操作系统根据微引擎数量"3"以及候选引擎的负载,选取3个微引擎,例如,得到微引擎XPU 3、微引擎XPU 0、微引擎XPU 2。其中,选取应用程序使用的引擎,在一些实施例中可以是在候选引擎中按照负载率从低到高的顺序选取指定数量的微引擎;在另一些实施例中,还可以是根据负载要求从候选引擎中选取满足该负载要求的指定数量的微引擎,其中负载要求可以是由资源配置信息获取的。
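"根据微引擎数量以及候选引擎的负载选取引擎"这一步,按负载率从低到高选取指定数量微引擎的一种示意性实现如下(负载率数据为假设取值):

    CANDIDATE_ENGINES = {"XPU 0": 0.10, "XPU 1": 0.65, "XPU 2": 0.20, "XPU 3": 0.05}  # 负载率(假设)

    def select_engines(candidates, count, load_limit=None):
        """按负载率从低到高选取count个引擎;也可先按负载要求过滤再选取。"""
        pool = {k: v for k, v in candidates.items() if load_limit is None or v <= load_limit}
        return sorted(pool, key=pool.get)[:count]

    print(select_engines(CANDIDATE_ENGINES, 3))          # ['XPU 3', 'XPU 0', 'XPU 2']
    print(select_engines(CANDIDATE_ENGINES, 3, 0.5))     # 满足负载要求(示例阈值0.5)的3个引擎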
步骤K2,在响应于应用程序的会话处理请求Chat01,生成与该处理请求对应的音频采集任务、音频处理任务和会话连接任务之后,在XPU 3上创建用于处理音频采集任务的音频采集线程,在XPU 0上创建用于处理音频处理任务的音频处理线程、在XPU 2上创建用于处理会话连接任务的TCP/IP线程,并确定音频采集任务、音频处理任务和会话连接任务对应的计算资源。
其中,音频采集任务对应的计算资源包括XPU 3和信号处理加速器A,音频处理任务对应的计算资源包括XPU 0和音频处理加速器B,会话连接任务对应的计算资源包括XPU 2和会话连接加速器C,如图14所示。
需要指出的是,本申请实施例中,线程通过自己对应的事件队列发送事件消息时,具体为通过其所在的微引擎的事件队列进行事件消息的转发。本申请的消息处理方法,高动态操作系统对计算资源进行配置的过程中,可以是响应于接收到的处理请求,在为包括第一任务和第二任务在内的多个任务分配计算资源之后,创建对应于各个任务的线程;还可以是,先创建对应于各个任务的线程,再确定与包括第一任务和第二任务在内的多个任务对应的计算资源。
步骤K3,根据资源配置信息分配用于指示上下文的上下文标识。
其中,上下文包括应用程序对应的操作配置信息。
资源配置信息包括用于计算资源的操作配置信息;计算资源包括微引擎和加速器;应用程序启动时,根据资源配置信息分配上下文标识。上下文标识用于指示与应用程序的同一处理请求对应的操作配置信息。上下文标识被包括在与该应用程序的同一处理请求相对应的所有事件消息中。
例如,操作配置信息可以是用户通过应用层设置的数据会话,该语音通话应用程序的用于指示上下文的上下文标识可以是根据用户通过应用层设置的数据会话,例如“Create Session(CID2,in:ADC,via:FFT,…,out:Framer,位宽,采样点数,周期,数据子块时间片,双浮点精度,…)”得到的CID2。
在一些实施例中,上下文标识还用于指示应用程序对应的流转顺序信息;应用程序使用的计算资源根据流转顺序信息将事件消息发送给下一站。
假设根据用户通过应用层设置的数据会话,例如"Create Session(CID2,in:ADC,via:FFT,…,out:Framer,位宽,采样点数,周期,数据子块时间片,双浮点精度,…)"得到的流转顺序信息为"CID2,事件队列EQ0,事件队列EQ1,事件队列EQ3,事件队列EQ2,事件队列EQ5,事件队列EQ4,…",表征传递顺序依次为音频采集线程、信号处理加速器A、音频处理线程、音频处理加速器B、TCP/IP线程、会话连接加速器C。语音通话应用程序使用的计算资源根据由CID2确定的流转顺序信息将事件消息发送给下一站。
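由应用层数据会话定义得到上下文标识与流转顺序信息的过程,可用如下示意性草图表示;其中会话字符串的解析规则、各处理阶段与事件队列的对应关系均为本文假设,仅用于说明:

    SESSION_DEF = "Create Session(CID2,in:ADC,via:FFT,via:MP3,out:Framer)"

    STAGE_TO_QUEUE = {        # 各处理阶段与事件队列的对应关系(示意)
        "ADC": "EQ0",         # 音频采集线程
        "FFT": "EQ1",         # 信号处理加速器A
        "MP3": "EQ3",         # 音频处理线程(后接EQ2音频处理加速器B)
        "Framer": "EQ5",      # TCP/IP线程(后接EQ4会话连接加速器C)
    }

    def parse_session(defn):
        body = defn[defn.index("(") + 1:defn.rindex(")")]
        fields = body.split(",")
        cid = fields[0]
        stages = [f.split(":", 1)[1] for f in fields[1:] if ":" in f]
        flow = [STAGE_TO_QUEUE[s] for s in stages]
        return cid, flow

    cid, flow = parse_session(SESSION_DEF)
    print(cid, flow)    # CID2 ['EQ0', 'EQ1', 'EQ3', 'EQ5']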
步骤K4,建立XPU 3和信号处理加速器A之间的第一路由Line1,建立XPU 0与音频处理加速器B之间的第二路由Line2、信号处理加速器A与XPU 0之间的第三路由Line3、音频处理加速器B与XPU 2之间的第四路由Line4、XPU 2与会话连接加速器C之间的第五路由Line5。
具体实施时,建立XPU 3和信号处理加速器A之间的第一路由Line1,可以是设置音频采集线程对应的第一路由信息Line1_LM1,第一路由信息Line1_LM1包括第一目标事件队列标识Line1_TQM1,第一目标事件队列标识Line1_TQM1为图14所示的事件队列EQ1,事件消息Mes.1中包括第一路由信息Line1_LM1,也即在音频采集线程和事件队列EQ1之间建立通信链路,事件队列EQ1对应于信号处理加速器A,音频采集线程和事件队列EQ1之间建立的通信链路为第一路由Line1。
建立XPU 0与音频处理加速器B之间的第二路由Line2,可以是设置音频处理线程对应的第二路由信息Line2_LM2,第二路由信息Line2_LM2包括第二目标事件队列标识Line2_TQM2,第二目标事件队列标识Line2_TQM2为事件队列EQ2,第二事件消息Mes.3中包括第二路由信息Line2_LM2。
Line3~Line5的建立过程与Line1和Line2的建立过程相似,在此不再赘述。
本申请的实施例中,事件消息还包括路由域信息。
例如,第一路由信息Line1_LM1还包括第一目标路由域,第一目标路由域用于指示第一目标服务器,该第一目标服务器可以是与图14中的XPU 3所在的源服务器不同的服务器。
通过上述的对应用程序的配置之后,数据处理系统就可以正常运行了。下面对语音通话应用程序启动后的数据处理进行示例说明。
当语音通话应用程序启动后,接收到与用户通话呼叫接续相对应的音频数据时执行下述数据处理过程:
步骤L1,响应于接收到音频采集任务的数据请求Data-1’,用于处理音频采集任务的音频采集线程根据上下文将根据数据请求Data-1’生成的事件消息Mes.1_1发送给音频采集任务对应的事件队列EQ1,参见图15,响应于事件队列EQ1接收事件消息Mes.1_1,与音频采集任务对应的信号处理加速器A对Mes.1_1进行处理,根据处理结果生成事件消息Mes.2_1,并根据上下文将事件消息Mes.2_1发送给用于处理音频处理任务的音频处理线程。
具体实施时,上下文标识CID2用于指示应用程序对应的上下文,上下文包括表征微引擎XPU 3、信号处理加速器A、微引擎XPU 0、音频处理加速器B、微引擎XPU 2、会话连接加速器C之间依次进行事件消息传递的路由信息。本申请实施例中,上下文包括的路由信息也可以称为应用程序对应的流转顺序信息;各个事件消息中均包括上下文标识,例如事件消息Mes.1_1、事件消息Mes.2_1、事件消息Mes.3_1等中包含上下文标识CID2。
示例性地,音频采集线程根据事件消息Mes.1_1中包括的上下文标识CID2,获取应用程序对应的流转顺序信息中用于音频采集线程的第一流转信息,并根据用于音频采集线程的第一流转信息,将根据数据请求Data-1'生成的事件消息Mes.1_1发送给音频采集任务对应的事件队列EQ1。
其中,流转信息可以是事件队列的标识。具体地,用于音频采集线程的第一流转信息可以是事件队列EQ1的标识;用于信号处理加速器A的第二流转信息可以是与音频处理线程对应的事件队列EQ3的标识。
一种可能的实现方式为,信号处理加速器A对事件队列EQ1中的第一事件消息进行处理,具体为:信号处理加速器A根据第一事件消息中包括的上下文标识,获取对应的用于信号处理加速器A的第一操作配置信息,并根据用于信号处理加速器A的第一操作配置信息对第一事件消息进行处理。
具体实施时,上下文包括用于计算资源的操作配置信息;计算资源包括微引擎和加速器;应用程序启动时,根据操作配置信息分配上下文和上下文标识。上下文标识用于指示与应用程序的同一处理请求对应的上下文。上下文标识被包括在第一事件消息和第二事件消息中。
例如,假设用于信号处理加速器A的第一操作配置信息是指定对接收到的该上下文ID的事件消息进行FFT等变换。具体执行时,信号处理加速器A根据第一事件消息Mes.1_1中包括的上下文标识CID2,获取对应的用于信号处理加速器A的第一操作配置信息“对接收到的该上下文ID的事件消息进行FFT等变换”,并根据第一操作配置信息“对接收到的该上下文ID的事件消息进行FFT等变换”对第一事件消息Mes.1_1进行FFT等变换。
可以理解地,从事件队列的角度而言,具体执行时,信号处理加速器A的事件队列收到事件消息时,可采用异步握手信号方式触发信号处理加速器A对事件消息实时响应,根据CID2找到对应的操作配置信息并按约定进行FFT等变换。
步骤L2,音频处理线程基于事件消息Mes.2_1生成事件消息Mes.3_1,并根据上下文将事件消息Mes.3_1发送给与音频处理任务对应的事件队列EQ2,响应于事件队列EQ2接收事件消息Mes.3_1,音频处理加速器B对事件消息Mes.3_1进行处理,根据处理结果生成事件消息Mes.5_1,并根据上下文将事件消息Mes.5_1发送给用于处理会话连接任务的TCP/IP线程。
音频处理线程根据上下文将事件消息Mes.3_1发送给与音频处理任务对应的事件队列EQ2的过程、音频处理加速器B根据上下文将事件消息Mes.5_1发送给用于处理会话连接任务的TCP/IP线程的过程,均与音频采集线程根据上下文将事件消息Mes.1_1发送给音频采集任务对应的事件队列EQ1的过程类似,在此不再赘述。
示例性地,用于音频处理加速器B的第二操作配置信息可以是指定对接收到的该上下文ID的事件消息按MP3或H264等格式进行编码。音频处理加速器B对事件消息Mes.3_1进行处理的过程与前述的信号处理加速器A对事件队列EQ1中的第一事件消息进行处理的过程相似,具体不再赘述。
步骤L3,TCP/IP线程基于事件消息Mes.5_1生成事件消息Mes.6_1,并根据上下文将事件消息Mes.6_1发送给与会话连接任务对应的事件队列EQ4,响应于事件队列EQ4接收事件消息Mes.6_1,与会话连接任务对应的会话连接加速器C对事件消息Mes.6_1进行处理。
在此之后,会话连接加速器C还可以根据上下文将处理结果数据发送给对应的下一站。例如,可以是生成新的事件消息,假定该新的事件消息为事件消息Mes.7_1,并根据上下文将事件消息Mes.7_1发送给后面的节点,例如网卡、应用/CPU或其他的线程或加速器等。
与触发事件是“通话呼叫接续”相对应地,本实施例的语音通话应用程序的释放事件可以是“通话呼叫拒接”。当用户进行通话呼叫拒接操作时,语音通话应用程序响应于发生的“通话呼叫拒接”的释放事件,释放XPU 3上运行的音频采集线程。在释放XPU 3上运行的音频采集线程之后,若XPU 3上不再有运行的线程,进一步关闭XPU 3,实现近零的待机功耗。
上述实施例,采用高动态计算模式,不需要高主频的CPU和PCI-E总线,系统制造成本可大幅度降低;微引擎、加速器等可以动态启动和关闭,系统功耗大幅度降低,可具备更长的续航能力;微引擎、加速器等资源一旦分配就保持不变,这样可以保证业务确定性体验。
实施例二
机器学习等数据驱动的新计算技术将被气象预报、石油探测、制药等高性能超算中心广泛地采用,其中暴露出一个关键问题,那就是海量数据共享问题:几千甚至上万台服务器之间需要共享静态数据和动态数据,对跨服务器的传输时延要求越来越短,期望可以小于微秒。这个实施例描述采用高动态计算实现海量数据共享的大规模并行计算的技术方案,重点描述数据共享的实现机制,其他机制(包括数据相关的机制)完全可以重用边缘智能计算的实施方案。
首先,高动态计算采用语义驱动数据共享的方式,通过应用层定义数据语义上下文将海量共享数据结构化加载到内存,然后再通过应用层定义计算语义上下文将计算任务部署到更接近数据的服务器,并调整对应的路由以优化网络传输的时延,降低数据传输时延,提升并行计算的性能并降低功耗。其中,语义在应用层和硬件层之间的映射机制,参见图16。应用层通过行政区域对多尺度的数据做分层的语义定义,如图16中从根到图层;之后指定对应的存储服务器的事件队列ID,并分配相应的对象ID、网格ID等分层存储位置的定义,事件队列ID会将数据访问的存储消息请求送到对应的服务器,再由服务器的共享内存加速器解析存储消息,根据ID找到对应的页表数据,然后打包成与存储消息相应的事件消息再回送给发出数据请求的服务器。
为了尽可能重用数据中心的网络,该方案采用网卡或智能网卡与数据中心网络相连,超算服务器的高动态计算系统方案请参见图17:网卡连接到一个微引擎,同时增加语义驱动内存的加速器。该微引擎部署以太网协议处理,用于识别加速器的事件消息,如识别语义内存加速器等的事件消息;一旦识别出来,则根据本地的数据上下文将事件消息通过路由网络转发到语义内存加速器,例如对于请求消息,根据上面定义的语义找到对应的数据,然后将数据打包成相应的事件队列消息送回源服务器。每个服务器对应于一个路由域,应用层创建语义时要指定到具体的语义加速器的事件队列ID上。
下面以并行计算线程远程访问语义数据为例,描述数据共享的交互流程,具体参见图18,主要步骤如下:
1)服务器1的并行计算线程根据计算所需的对象找到对应的语义ID,再根据语义ID的对端的事件队列ID及其所属服务器的路由域构造事件队列消息,根据远程数据会话上下文将报文转发给以太网协议处理线程;
2)服务器1的以太网协议处理线程收到事件队列消息,根据路由范围字段的路由域找到对方的MAC地址和数据共享专用的VLAN ID(Virtual Local Area Network,虚拟局域网编号),构造以太网的协议帧头之后承载该事件消息,转发到网卡,通过网卡转发到数据中心交换机,最终送达服务器2;
3)服务器2的以太网协议处理线程解析服务器2的网卡收到的以太网协议帧取出事件消息,根据事件队列ID向内部路由网络转发给语义内存加速器;
4)服务器2的语义内存加速器解析事件消息取出对象ID等映射到本地内存,获得相应的数据,然后根据事件队列的源路由信息转发给请求该数据的服务器,后续流程与上述的一致,此处不再赘述。
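上述步骤1)~4)描述的跨服务器语义数据访问流程,可用如下示意性草图串联起来;其中MAC地址、VLAN ID、语义表等均为说明用的假设取值:

    DOMAIN_TABLE = {"服务器2": {"mac": "aa:bb:cc:dd:ee:02", "vlan": 100}}   # 路由域 -> 二层信息(假设)
    SEMANTIC_MEMORY = {"对象ID-7": "网格数据块..."}                          # 服务器2本地内存(假设)

    def step1_build_msg(object_id):      # 服务器1并行计算线程:构造事件队列消息
        return {"目标事件队列ID": "语义内存加速器EQ", "路由范围": "服务器2",
                "源队列ID": "并行计算线程EQ@服务器1", "对象ID": object_id}

    def step2_to_ethernet(msg):          # 服务器1以太网协议处理线程:封装以太网帧
        l2 = DOMAIN_TABLE[msg["路由范围"]]
        return {"dst_mac": l2["mac"], "vlan": l2["vlan"], "payload": msg}

    def step3_dispatch(frame):           # 服务器2以太网协议处理线程:解析帧并按队列ID转发
        return frame["payload"]

    def step4_lookup(msg):               # 服务器2语义内存加速器:取数并按源路由回送
        return {"目标事件队列ID": msg["源队列ID"], "数据": SEMANTIC_MEMORY[msg["对象ID"]]}

    reply = step4_lookup(step3_dispatch(step2_to_ethernet(step1_build_msg("对象ID-7"))))
    print(reply["数据"])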
上述实施例,采用高动态计算模式的语义数据共享机制,减少软件处理开销,缩短跨服务器数据共享的传输时延,并提升服务器内部多计算任务的并行度,从而提升整个超算中心的性能并降低功耗。
以上结合图1至图18详细说明了本申请实施例的消息处理方法,基于与上述消息处理方法的同一技术构思,本申请实施例还提供一种消息处理装置1900,如图19所示,消息处理装置1900包括:第一运行模块1901,装置1900可用于实现上述消息处理的方法实施例中描述的方法。
第一运行模块1901,用于通过第一处理单元对第一事件消息进行处理,得到第二事件消息,第一事件消息是第一处理单元接收到的,或者第一事件消息是第一处理单元基于应用程序的处理请求生成的;
通过第一处理单元根据上下文信息,将第二事件消息发送给第二处理单元,上下文信息包括第一处理单元到第二处理单元的路由信息,上下文信息是基于应用程序的处理请求生成的;
其中,第一处理单元为第一引擎、第二处理单元为第二加速器,或者,第一处理单元为第一加速器、第二处理单元为第二引擎,或者,第一处理单元为第一引擎、第二处理单元为第二引擎,或者第一处理单元为第一加速器,第二处理单元为第二加速器。
在一种可能的设计中,消息处理装置1900还包括资源配置模块1902,资源配置模块1902用于:
接收来自于应用程序的处理请求;
根据应用程序的处理请求,确定计算资源,计算资源包括第一处理单元和第二处理单元;
根据应用程序的处理请求,生成上下文信息。
在一种可能的设计中,第一处理单元或第二处理单元是通过资源配置模块1902基于接收到应用程序的处理请求时多个处理单元的状态信息,从多个处理单元中选择的,处理单元的状态信息包括网络拓扑性能。
在一种可能的设计中,资源配置模块1902,还用于:
确定处理请求包括的至少两个任务;
创建至少两个任务对应的至少两个线程;
将至少两个线程加载到至少两个引擎上运行,其中,不同的线程运行在不同的引擎上。
在一种可能的设计中,资源配置模块1902,具体用于:
获取处理请求的语义,处理请求的语义包括至少两个任务语义;
根据至少两个任务语义中的每个任务语义,确定对应的一个任务。
需要说明的是,本申请实施例中对模块的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)或处理器(processor)执行本申请各个实施例方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
基于与上述消息处理方法相同的构思,如图20所示,本申请实施例还提供了一种消息处理设备2000的结构示意图。设备2000可用于实现上述应用于数据处理系统的消息处理方法实施例中描述的方法,可以参见上述方法实施例中的说明。设备2000可以处于数据处理系统中,或者,为数据处理系统。
设备2000包括一个或多个处理器2001。处理器2001可以是通用处理器或者专用处理器等。例如可以是中央处理器。中央处理器可以用于对消息处理装置(如,终端、或芯片等)进行控制,执行软件程序,处理软件程序的数据。消息处理设备可以包括收发单元,用以实现信号的输入(接收)和输出(发送)。例如,收发单元可以为收发器,射频芯片等。
设备2000包括一个或多个处理器2001,一个或多个处理器2001可实现上述所示的实施例中数据处理系统的方法。
可选的,处理器2001除了实现上述所示的实施例的方法,还可以实现其他功能。
可选的,一种设计中,处理器2001可以执行指令,使得设备2000执行上述方法实施例中描述的方法。指令可以全部或部分存储在处理器内,如指令2003,也可以全部或部分存储在与处理器耦合的存储器2002中,如指令2004,也可以通过指令2003和2004共同使得设备2000执行上述方法实施例中描述的方法。
在又一种可能的设计中,消息处理设备2000也可以包括电路,电路可以实现前述方法实施例中数据处理系统的功能。
在又一种可能的设计中,设备2000中可以包括一个或多个存储器2002,其上存有指令2004,指令可在处理器上被运行,使得设备2000执行上述方法实施例中描述的方法。可选的,存储器中还可以存储有数据。可选的,处理器中也可以存储指令和/或数据。例如,一个或多个存储器2002可以存储上述实施例中所描述的对应关系,或者上述实施例中所涉及的相关的参数或表格等。处理器和存储器可以单独设置,也可以集成在一起。
在又一种可能的设计中,设备2000还可以包括收发器2005以及天线2006。处理器2001可以称为处理单元,对设备进行控制。收发器2005可以称为收发机、收发电路、或者收发单元等,用于通过天线2006实现设备的收发功能。
应注意,本申请实施例中的处理器可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法实施例的各步骤可以通过处理器中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器可以是通用处理器、数字信号处理器(digital signal processor,DSP)、专用集成电路(application specific integrated circuit,ASIC)、现成可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器,处理器读取存储器中的信息,结合其硬件完成上述方法的步骤。
可以理解,本申请实施例中的存储器可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(ROM)、可编程只读存储器(programmable ROM,PROM)、可擦除可编程只读存储器(erasable PROM,EPROM)、电可擦除可编程只读存储器(electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(static RAM,SRAM)、动态随机存取存储器(dynamic RAM,DRAM)、同步动态随机存取存储器(synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(double data rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(synchlink DRAM,SLDRAM)和直接内存总线随机存取存储器(direct rambus RAM,DR RAM)。应注意,本文描述的系统和方法的存储器旨在包括但不限于这些和任意其它适合类型的存储器。
本申请实施例还提供了一种计算机可读介质,其上存储有计算机程序,该计算机程序被计算机执行时实现上述应用于数据处理系统的任一方法实施例的消息处理方法。
本申请实施例还提供了一种计算机程序产品,该计算机程序产品被计算机执行时实现上述应用于数据处理系统的任一方法实施例的消息处理方法。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行计算机指令时,全部或部分地产生按照本申请实施例的流程或功能。计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(digital subscriber line,DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,高密度数字视频光盘(digital video disc,DVD))、或者半导体介质(例如,固态硬盘(solid state disk,SSD))等。
本申请实施例还提供了一种处理装置,包括处理器和接口;处理器,用于执行上述应用于数据处理系统的任一方法实施例的消息处理方法。
应理解,上述处理装置可以是一个芯片,处理器可以通过硬件来实现也可以通过软件来实现,当通过硬件实现时,该处理器可以是逻辑电路、集成电路等;当通过软件来实现时,该处理器可以是一个通用处理器,通过读取存储器中存储的软件代码来实现,该存储器可以集成在处理器中,可以位于处理器之外,独立存在。
如图21所示,本申请实施例还提供了一种芯片2100,包括输入输出接口2101和逻辑电路2102,输入输出接口2101用于接收/输出代码指令或信息,逻辑电路2102用于执行代码指令或根据信息,以执行上述应用于数据处理系统的任一方法实施例的消息处理方法。
芯片2100可以实现上述实施例中处理单元和/或收发单元所示的功能。
例如,输入输出接口2101用于输入数据处理系统的资源配置信息,输入输出接口2101还用于输出获取存储于共享内存中的目标数据的请求信息。可选的,输入输出接口2101还可以用于接收代码指令,该代码指令用于指示获取来自应用程序的数据请求。
本申请实施例还提供了一种数据处理系统,包括上述实施例中的消息处理装置,消息处理装置用于执行上述任一方法实施例的消息处理方法。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、计算机软件或者二者的结合来实现,为了清楚地说明硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另外,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口、装置或单元的间接耦合或通信连接,也可以是电的,机械的或其它的形式连接。
作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本申请实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以是两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本申请可以用硬件实现,或软件实现,或固件实现,或它们的组合方式来实现。当使用软件实现时,可以将上述功能存储在计算机可读介质中或作为计算机可读介质上的一个或多个指令或代码进行传输。计算机可读介质包括计算机存储介质和通信介质,其中通信介质包括便于从一个地方向另一个地方传送计算机程序的任何介质。存储介质可以是计算机能够存取的任何可用介质。以此为例但不限于:计算机可读介质可以包括RAM、ROM、EEPROM、CD-ROM或其他光盘存储、磁盘存储介质或者其他磁存储设备、或者能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质。此外,任何连接可以适当地成为计算机可读介质。例如,如果软件是使用同轴电缆、光纤光缆、双绞线、数字用户线(DSL)或者诸如红外线、无线电和微波之类的无线技术从网站、服务器或者其他远程源传输的,那么同轴电缆、光纤光缆、双绞线、DSL或者诸如红外线、无线和微波之类的无线技术包括在所属介质的定义中。如本申请所使用的,盘(Disk)和碟(disc)包括压缩光碟(CD)、激光碟、光碟、数字通用光碟(DVD)、软盘和蓝光光碟,其中盘通常磁性地复制数据,而碟则用激光来光学地复制数据。上面的组合也应当包括在计算机可读介质的保护范围之内。
总之,以上仅为本申请技术方案的较佳实施例而已,并非用于限定本申请的保护范围。凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (21)

  1. 一种消息处理方法,其特征在于,包括:
    第一处理单元对第一事件消息进行处理,得到第二事件消息,所述第一事件消息是所述第一处理单元接收到的,或者所述第一事件消息是所述第一处理单元基于应用程序的处理请求生成的;
    所述第一处理单元根据上下文信息,将所述第二事件消息发送给第二处理单元,所述上下文信息包括所述第一处理单元到所述第二处理单元的路由信息,所述上下文信息是基于所述应用程序的处理请求生成的;
    其中,所述第一处理单元为第一引擎、所述第二处理单元为第二加速器,或者,所述第一处理单元为第一加速器、所述第二处理单元为第二引擎,或者,所述第一处理单元为第一引擎、所述第二处理单元为第二引擎,或者所述第一处理单元为第一加速器,所述第二处理单元为第二加速器。
  2. 如权利要求1所述的方法,其特征在于,所述第一处理单元根据上下文信息,将所述第二事件消息发送给第二处理单元,包括:
    所述第一处理单元根据所述路由信息,将所述第二事件消息发送给所述第二处理单元对应的事件队列;
    所述第二处理单元从所述事件队列获取所述第二事件消息。
  3. 如权利要求2所述的方法,其特征在于,所述第二事件消息包括目标事件队列标识,所述目标事件队列标识为所述第二处理单元对应的事件队列的队列标识。
  4. 如权利要求3所述的方法,其特征在于,所述路由信息还包括目标路由域,所述目标路由域用于指示目标服务器,所述目标服务器与源服务器不同,所述源服务器是所述第一处理单元所在的服务器。
  5. 如权利要求1所述的方法,其特征在于,所述第二处理单元为第二加速器;所述第一处理单元根据上下文信息,将所述第二事件消息发送给第二处理单元,包括:
    所述第一处理单元根据所述路由信息,将所述第二事件消息发送给加速器池对应的事件队列,所述加速器池中包括多个加速器,所述多个加速器的类型相同;根据所述多个加速器的状态,从所述多个加速器中确定所述第二加速器;
    将所述第二事件消息发送给所述第二加速器。
  6. 如权利要求1-5任一项所述的方法,其特征在于,所述第一处理单元接收第一事件消息之前,还包括:
    接收来自于应用程序的处理请求;
    根据所述应用程序的处理请求,确定计算资源,所述计算资源包括所述第一处理单元和所述第二处理单元;
    根据所述应用程序的处理请求,生成所述上下文信息。
  7. 如权利要求6所述的方法,其特征在于,所述第一处理单元或所述第二处理单元是基于接收到所述应用程序的处理请求时多个处理单元的状态信息,从所述多个处理单元中选择的,所述处理单元的状态信息包括网络拓扑性能。
  8. 如权利要求6或7所述的方法,其特征在于,所述接收来自于应用程序的处理请求之后,还包括:
    确定所述处理请求包括的至少两个任务;
    创建所述至少两个任务对应的至少两个线程;
    将所述至少两个线程加载到至少两个引擎上运行,其中,不同的线程运行在不同的引擎上。
  9. 如权利要求8所述的方法,其特征在于,所述确定所述处理请求包括的至少两个任务,包括:
    获取所述处理请求的语义,所述处理请求的语义包括至少两个任务语义;
    根据所述至少两个任务语义中的每个任务语义,确定对应的一个任务。
  10. 如权利要求8或9所述的方法,其特征在于,所述方法还包括:
    释放第一线程,所述第一线程为所述至少两个线程中的一个;
    若释放所述第一线程后,所述第一线程被释放前所在的引擎上已无线程运行,关闭所述第一线程被释放前所在的引擎。
  11. 如权利要求8-10任一项所述的方法,其特征在于,所述处理请求用于请求获取目标数据,所述目标数据存储于第二服务器的内存中;用于执行所述处理请求的所述计算资源还包括第三处理单元和第四处理单元;所述至少两个引擎包括所述第一处理单元、所述第二处理单元和所述第三处理单元;所述第四处理单元为加速器;所述第一事件消息和所述第二事件消息中包括所述目标数据的标识,所述第一处理单元和所述第二处理单元位于第一服务器,所述第三处理单元和所述第四处理单元位于所述第二服务器;所述上下文还包括所述第二处理单元到所述第三处理单元、所述第三处理单元到所述第四处理单元的路由信息;
    在所述第一处理单元根据上下文,将所述第二事件消息发送给第二处理单元之后,所述方法还包括:
    所述第二处理单元基于所述第二事件消息将所述第二事件消息封装,以生成第三事件消息;
    所述第二处理单元根据所述上下文,将所述第三事件消息发送给位于所述第二服务器的所述第三处理单元;
    所述第三处理单元基于所述第三事件消息对所述第三事件消息解封装,得到第四事件消息,并根据所述上下文,将所述第四事件消息发送给所述第四处理单元;
    所述第四处理单元从接收到的所述第四事件消息获取所述目标数据的标识,根据所述目标数据的标识从所述第二服务器的内存中获取所述目标数据,并根据所述目标数据得到第五事件消息;所述第五事件消息用于将所述目标数据发送给所述第一服务器。
  12. 如权利要求1-11任一项所述的方法,其特征在于,所述上下文信息还包括操作配置信息;
    所述第一处理单元对所述第一事件消息进行处理,得到第二事件消息,包括:
    所述第一处理单元根据所述操作配置信息对所述第一事件消息进行处理,得到第二事件消息。
  13. 如权利要求1-12任一项所述的方法,其特征在于,所述第一事件消息和所述第二事件消息中包括所述上下文信息的标识,所述上下文信息的标识用于获取所述上下文信息。
  14. 如权利要求1-13任一项所述的方法,其特征在于,所述第二事件消息,包括:
    消息属性信息域,包括事件消息路由信息,所述事件消息路由信息包括目标事件队列标识,所述目标事件队列标识为所述第二处理单元对应的事件队列的队列标识;
    消息长度域,包括所述第二事件消息的总长度信息;
    数据域,包括所述第二事件消息的负荷。
  15. 如权利要求14所述的方法,其特征在于,所述数据域中包括第一事件信息域,所述第一事件信息域包括以下至少一项:
    路由范围、所述上下文信息的标识、源消息队列标识或者自定义属性,所述路由范围包括至少一个路由域。
  16. 如权利要求15所述的方法,其特征在于,所述数据域中包括第二事件信息域,所述第二事件信息域包括应用层的自定义信息。
  17. 一种消息处理装置,其特征在于,包括:
    第一运行模块,所述第一运行模块用于:通过第一处理单元对第一事件消息进行处理,得到第二事件消息,所述第一事件消息是所述第一处理单元接收到的,或者所述第一事件消息是所述第一处理单元基于应用程序的处理请求生成的;
    通过所述第一处理单元根据上下文信息,将所述第二事件消息发送给第二处理单元,所述上下文信息包括所述第一处理单元到所述第二处理单元的路由信息,所述上下文信息是基于所述应用程序的处理请求生成的;
    其中,所述第一处理单元为第一引擎、所述第二处理单元为第二加速器,或者,所述第一处理单元为第一加速器、所述第二处理单元为第二引擎,或者,所述第一处理单元为第一引擎、所述第二处理单元为第二引擎,或者所述第一处理单元为第一加速器,所述第二处理单元为第二加速器。
  18. 一种消息处理设备,其特征在于,包括处理器和存储器,
    所述存储器,用于存储可执行程序;
    所述处理器,用于执行存储器中的计算机可执行程序,使得权利要求1-16中任一项所述的方法被执行。
  19. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质存储有计算机可执行程序,所述计算机可执行程序在被计算机调用时,使所述计算机执行如权利要求1-16任一项所述的方法。
  20. 一种芯片,其特征在于,包括:逻辑电路和输入输出接口,所述输入输出接口用于接收代码指令或信息,所述逻辑电路用于执行所述代码指令或根据所述信息,以执行如权利要求1-16任一项所述的方法。
  21. 一种计算机程序产品,其特征在于,所述计算机程序产品包括计算机指令,当所述计算机指令被计算设备执行时,所述计算设备可以执行权利要求1-16任一项所述的方法。