WO2023159652A1 - AI system, memory access control method, and related device - Google Patents

AI system, memory access control method, and related device

Info

Publication number
WO2023159652A1
Authority
WO
WIPO (PCT)
Prior art keywords
qos
memory access
memory
target
priority
Prior art date
Application number
PCT/CN2022/078504
Other languages
English (en)
French (fr)
Inventor
屈明广
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/CN2022/078504
Publication of WO2023159652A1


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 — Arrangements for program control, e.g. control units
    • G06F9/06 — Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 — Multiprogramming arrangements
    • G06F9/50 — Allocation of resources, e.g. of the central processing unit [CPU]

Definitions

  • The present invention relates to the technical field of electronic devices, and in particular to an AI system, a memory access control method, and related devices.
  • AI: artificial intelligence
  • ICT: information and communication technology
  • SoC: System on Chip
  • Model | Release time | Parameters | Pre-training data
    GPT | June 2018 | 117 million | about 5 GB
    GPT-2 | February 2019 | 1.5 billion | about 40 GB
    GPT-3 | May 2020 | 175 billion | about 45 TB
  • From the perspective of the application process, AI computing can be divided into two categories: training and inference.
  • To complete the training of such large-scale AI network models, a high-performance AI computing cluster must be used to achieve the business goal within an acceptable time scale.
  • Training relies on the powerful computing power of the AI computing cluster servers, and a large amount of input, output, and intermediate data is generated during computation, such as sample data of various kinds (text, voice, images, videos), the weights/parameters of neural networks, gradient data, and feature maps obtained during model training. These data are often stored in high-speed memory on the SoC.
  • There are a large number of concurrent computing hardware units on the SoC used for AI computing, and the AI chip needs to frequently access memory data on the SoC during computation, for example temporarily storing data in memory or reading data from memory, while memory bandwidth is often a key bottleneck that limits AI computing performance.
  • During training and inference, large-scale model computation must be completed, so the nodes of an AI cluster often work together to execute training tasks.
  • Within each service node (server) of the cluster, and between servers, there are concurrent communication flows (such as model feature maps and parameter weights), AI computation flows, and data flows of various dedicated hardware accelerators, such as the Davinci Vision Pre-Processor (DVPP), the Audio Signal Processor (ASP), and the Image Signal Processor (ISP). If these highly concurrent data flows are not controlled, they also cause serious system performance degradation when accessing memory.
  • Embodiments of the present application provide an AI system, a memory access control method, and related devices, so as to improve the computing performance of the AI system.
  • In a first aspect, an embodiment of the present application provides an artificial intelligence (AI) system that includes an AI system-on-chip (SoC). The AI SoC includes M subsystems and N memory controllers, interconnected through an SoC bus. The M subsystems include a target subsystem, which is any one of the M subsystems; the target subsystem includes S processing nodes, and M, N, and S are all integers greater than or equal to 1. A target processing node among the S processing nodes (any one of the S nodes) is configured to: receive a computing task to be executed, where the task carries a quality of service identifier (QoS ID) indicating the category to which the task belongs; generate a memory access request for the task, carrying the QoS ID in the request; and send the request to a target memory controller among the N memory controllers. The target memory controller is configured to perform memory access QoS control on the request according to the QoS priority corresponding to the carried QoS ID.
  • In this AI system, an on-chip memory access quality of service (QoS) control technique is introduced: every computing task to be dispatched to the AI SoC is given a QoS mark, and different categories of computing tasks carry different QoS IDs (classified, for example, by the business flow to which the task belongs, or by the task's memory access latency requirements). The QoS ID carried by each task subsequently determines the QoS priority of that task's memory access requests, and QoS control is finally applied to each request based on the determined priority, so that the AI system controls memory access QoS at the granularity of individual computing tasks. Memory access QoS control for different types of memory can also be achieved, the memory access requirements of different categories of computing tasks (such as different business flows) can be met, and different memory access service guarantees can be provided.
  • On the basis of the AI system's existing computing power and memory bandwidth resources, better AI computing performance can thus be obtained. This differs from the prior art, which performs memory access control only at the processing-node level of the SoC (that is, all memory access requests from the same processing node receive the same memory access service quality) and therefore cannot satisfy the actual memory access requirements of the various computing tasks in the AI system (such as tasks belonging to different business flows), ultimately degrading the AI system's computing performance. A representation of such a QoS-tagged request is sketched below.
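  • To make the scheme concrete, the following minimal C sketch (not from the patent; all type and field names are illustrative assumptions) shows how a QoS-tagged memory access request could be represented:

```c
/* Minimal sketch, assuming illustrative field names: a memory access
 * request that carries the QoS ID of its originating computing task. */
#include <stdint.h>

typedef enum { MEM_READ, MEM_WRITE } mem_op_t;

typedef struct {
    uint64_t addr;      /* memory address to access                    */
    uint32_t len;       /* transfer length in bytes                    */
    mem_op_t op;        /* read or write                               */
    uint8_t  qos_id;    /* category of the originating computing task  */
    uint8_t  qos_prio;  /* current QoS priority; the QoS ID stays fixed
                           end to end, while this field may be adjusted
                           by the sub-scheduler, SoC bus, or MATA      */
} mem_access_req_t;
```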
  • In the embodiments of the present application, computing tasks are allocated to the processing nodes in each subsystem of the AI SoC (for convenience of description, subsequent embodiments may use "Master" to refer to a processing node; this is not repeated below).
  • The QoS ID carried in a computing task indicates the category to which the task belongs, and the corresponding memory access QoS priority can ultimately be determined from that category. This is based on the fact that, in the field of AI computing, different categories of computing tasks (such as tasks under different business flows) have different requirements on memory access service quality, and while some categories of tasks compete with each other for memory access, others do not. Setting a matching QoS priority for a task's memory access requests according to its category therefore better satisfies the memory access requirements of different types of computing tasks (understandably, different categories of tasks correspond to different QoS IDs, but different QoS IDs may correspond to the same or to different QoS priorities). Further, while executing a received computing task, each processing node can generate memory access requests according to the memory addresses and data the task needs to access, and continues to carry the task's QoS ID in each request; that is, the QoS ID is transferred along with the computing task flow into its memory access requests, so that when a memory controller subsequently receives a request, it can perform memory access control at the priority corresponding to the carried QoS ID.
  • In this way, the memory controller can provide better memory access service quality for requests carrying a given QoS ID; that is, different memory access QoS control is performed for computing tasks with different memory access priority requirements, avoiding the serious system performance degradation caused in the prior art by treating all requests alike and letting them randomly preempt critical memory bandwidth resources.
  • The embodiment of the present application thus controls memory access service quality at the granularity of computing tasks, solving the problem that different categories of computing tasks (such as different types of business flows) in AI training and inference contend indiscriminately for memory access.
  • In a possible implementation, the computing task also carries a second QoS priority corresponding to the QoS ID, where the second QoS priority is the initial QoS priority corresponding to the QoS ID in the task.
  • In the embodiment of this application, the computing task assigned to each processing node (such as a Master) in a subsystem of the AI system can also carry the initial QoS priority corresponding to its QoS ID (that is, the second QoS priority). In other words, the QoS ID and the corresponding initial QoS priority can be configured at the moment the computing task is assigned, so that subsequent QoS priority regulation, and the corresponding memory access QoS control, can be performed on that basis.
  • It should be noted that the QoS ID carried in a task's memory access request can remain unchanged as the request travels from the Master to the target memory controller, while its corresponding QoS priority can be adjusted and optimized according to the differing requirements and conditions of memory access requests during scheduling.
  • In a possible implementation, the target subsystem further includes a sub-scheduler. The target processing node is specifically configured to send the memory access request to the sub-scheduler, which schedules it to the target memory controller among the N memory controllers. The sub-scheduler is configured to: receive the memory access requests sent by the S processing nodes in the target subsystem; and schedule those requests onto the SoC bus according to the second QoS priority corresponding to the QoS ID carried in each request, where the second QoS priority is the initial QoS priority corresponding to the QoS ID and indicates the priority with which the corresponding request is dispatched to the SoC bus.
  • In the embodiment of this application, each subsystem of the AI SoC also includes a sub-scheduler, which can schedule the memory access requests of the computing tasks executing on all processing nodes (such as Masters) in that subsystem. The requests generated by the Masters in a subsystem are first scheduled by its internal sub-scheduler, then sent to the SoC bus for arbitration, address resolution, and routing, and finally delivered to the corresponding memory controller for memory access. Since the requests of the tasks executing in each Master carry the QoS IDs of those tasks, the sub-scheduler dispatches requests whose QoS IDs carry higher QoS priorities to the SoC bus first and lower-priority requests later, ensuring that by the time requests reach the SoC bus their QoS priorities have already been taken into account. This provides each computing task with memory access control matching its QoS ID from the very source of the AI system.
  • In a possible implementation, the sub-scheduler is specifically configured to: establish a task queue for each of the S processing nodes, where each task queue holds the memory access requests sent by the corresponding processing node, and the target processing node corresponds to a target task queue; when a target memory access request is inserted into the target task queue, that is, a request whose carried QoS ID corresponds to a second QoS priority exceeding a preset priority, promote the second QoS priority corresponding to the QoS IDs of all requests in the target task queue to a third QoS priority; and send the requests in the S task queues to the SoC bus successively, in order of the second or third QoS priority corresponding to the QoS ID carried in each request.
  • In the embodiment of this application, a task queue is created for the computing tasks in each processing node (such as a Master): all memory access requests generated in a Master are placed in its task queue and sent to the SoC bus successively, according to the QoS priority corresponding to each request's QoS ID. When a request with a higher QoS priority appears in a task queue, the requests at the front of the queue may have QoS priorities that are too low, forcing all requests in the queue to block (head-of-line blocking). To avoid this, the sub-scheduler raises the QoS priority of all requests in that queue (from the second QoS priority to the third QoS priority), so that no request in the queue, and in particular the aforementioned higher-priority request, is stalled behind the queue's low-priority requests; a sketch of this promotion follows.
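  • The head-of-line blocking avoidance described above can be pictured with the following C sketch (an illustration under assumed names, reusing the mem_access_req_t type from the earlier sketch; the threshold value is hypothetical):

```c
/* Sketch of queue-wide priority promotion: when an urgent request is
 * enqueued, every request already waiting is raised to the third QoS
 * priority so the urgent one is not stuck behind low-priority traffic. */
#define PROMOTE_THRESHOLD 5   /* the claim's "preset priority" */

typedef struct req_node {
    mem_access_req_t req;     /* from the earlier sketch */
    struct req_node *next;
} req_node_t;

typedef struct {
    req_node_t *head;
    uint8_t     promoted_prio;   /* the third QoS priority */
} task_queue_t;

void enqueue(task_queue_t *q, req_node_t *n) {
    if (n->req.qos_prio > PROMOTE_THRESHOLD) {
        /* Promote everything already queued ahead of the new request. */
        for (req_node_t *p = q->head; p; p = p->next)
            if (p->req.qos_prio < q->promoted_prio)
                p->req.qos_prio = q->promoted_prio;
    }
    n->next = NULL;                    /* append at the tail (FIFO) */
    if (!q->head) { q->head = n; return; }
    req_node_t *t = q->head;
    while (t->next) t = t->next;
    t->next = n;
}
```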
  • In a possible implementation, the SoC bus is configured to: receive one or more memory access requests of the target task queue sent by the sub-scheduler, the one or more requests including the above memory access request; and restore the third QoS priority corresponding to the QoS ID carried in those requests to the corresponding second QoS priority.
  • In the embodiment of this application, after the sub-scheduler of each processing node (such as a Master) has adjusted the QoS priorities of the requests in its task queues and scheduled them according to the adjusted priorities, those requests have been dispatched to the SoC bus. At this point, the in-subsystem priority adjustment has already eliminated the risk of requests being blocked behind low-QoS-priority requests in each task queue, so the temporary promotion can be undone.
  • In a possible implementation, the SoC bus is further configured to dispatch the one or more memory access requests of the target task queue to the corresponding memory controllers among the N memory controllers, based on the restored second QoS priorities of those requests.
  • In the embodiment of this application, after the SoC bus restores the QoS priority of the QoS ID of each request dispatched from a subsystem to the initial second QoS priority, it can schedule the requests at the bus level based on that restored priority; that is, each request is dispatched to its corresponding memory controller according to the restored second QoS priority, and the memory controller then performs the subsequent memory access QoS control and the memory access itself.
  • In a possible implementation, the AI SoC further includes an advanced memory access agent (MATA). The SoC bus is specifically configured to: send the one or more memory access requests of the target task queue to the MATA, and dispatch them through the MATA to the corresponding memory controllers among the N memory controllers.
  • In a possible implementation, the AI SoC further includes an advanced memory access agent (MATA). The SoC bus is specifically configured to: send the memory access requests sent by the S processing nodes to the MATA, and dispatch them through the MATA to the corresponding memory controllers among the N memory controllers, where the requests sent by the S processing nodes include the above memory access request.
  • In the embodiment of this application, the AI SoC may further include a memory access agent (MATA) for memory access control.
  • MATA: memory access agent
  • Through the MATA, each memory controller can be controlled and managed as a whole, and each received memory access request can be further regulated, for example by further optimizing the second QoS priority corresponding to the QoS ID in each request.
  • In a possible implementation, the MATA is configured to: receive the memory access request and determine the second QoS priority corresponding to the QoS ID carried in it; and, based on that second QoS priority, combined with the historical memory bandwidth statistics corresponding to the QoS ID and the memory access policy control parameters corresponding to the QoS ID, determine the first QoS priority corresponding to the QoS ID. The memory access policy control parameters include one or more of the maximum bandwidth, the minimum bandwidth, and the access priority allowed for requests carrying that QoS ID.
  • In the embodiment of this application, after the MATA receives each memory access request dispatched by the SoC bus, it can further optimize and adjust the initial priority (the second QoS priority) carried in the request. The principle is as follows: before a request is dispatched by the SoC bus to a memory controller, the MATA takes the initial QoS priority corresponding to the QoS ID carried in the request and combines it with the historical memory bandwidth statistics and the access policy control parameters it currently records for that QoS ID, to generate the final corresponding QoS priority (the first QoS priority). The target memory controller can then perform memory access QoS control on the request according to this final priority. In other words, when the MATA performs memory access control it considers not only the QoS priority initially configured by the AI system for each QoS ID, but also the historical bandwidth statistics for that QoS ID (such as the memory bandwidth currently obtained by the class of computing tasks carrying the same QoS ID) and the policy control parameters configured for requests with that QoS ID (such as the maximum bandwidth, minimum bandwidth, and access priority allowed). It thereby comprehensively decides what memory access QoS service to provide for the current request and obtains a QoS priority that matches it, enabling more accurate memory access QoS control and further improving the performance of the AI system.
  • For example, if the requests carrying a certain QoS ID have already occupied a large amount of memory bandwidth, the QoS priority of that QoS ID can be lowered to balance memory bandwidth occupation across QoS IDs; conversely, if the requests carrying a certain QoS ID currently occupy little memory bandwidth, its QoS priority can be raised to compensate. One plausible form of this adjustment is sketched below.
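  • The following C sketch illustrates one plausible policy (an assumption, not the patent's exact algorithm) by which a MATA-like agent could derive the final (first) QoS priority from the initial (second) priority plus per-QoS-ID bandwidth statistics:

```c
#include <stdint.h>

/* Per-QoS-ID state a MATA-like agent might keep; names are assumptions. */
typedef struct {
    uint64_t bytes_in_window;  /* bandwidth recently used by this QoS ID */
    uint64_t min_bw, max_bw;   /* policy control parameters              */
} qos_policy_t;

/* Demote over-budget QoS IDs and promote under-served ones, bounded by
 * the priority range; the memory controller then acts on the result. */
uint8_t mata_final_prio(uint8_t second_prio, const qos_policy_t *p) {
    uint8_t prio = second_prio;
    if (p->bytes_in_window > p->max_bw && prio > 0)
        prio--;                 /* over budget: lower to rebalance   */
    else if (p->bytes_in_window < p->min_bw && prio < UINT8_MAX)
        prio++;                 /* under-served: raise to compensate */
    return prio;                /* the first QoS priority            */
}
```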
  • In a possible implementation, the MATA is further configured to: preset the memory access policy control parameters corresponding to each QoS ID, and collect and record the historical memory bandwidth corresponding to each QoS ID; and update and optimize each QoS ID's policy control parameters according to real-time monitoring information on the AI system's memory access performance.
  • In the embodiment of this application, the MATA also configures the corresponding memory access policy control parameters for each QoS ID and records the historical memory bandwidth per QoS ID, using these two kinds of information to decide whether to raise or lower a request's priority relative to the initial QoS priority of its QoS ID, and so to determine the ultimate QoS priority of each request, according to which the memory controller performs the concrete memory access QoS control. For example, the MATA can set the maximum bandwidth, minimum bandwidth, and allowed access priority for requests carrying a certain QoS ID. Furthermore, the MATA can update and optimize each QoS ID's policy control parameters according to real-time monitoring information on the AI system's memory access performance, for example adjusting them through optimization algorithms or adaptive machine learning algorithms.
  • In a possible implementation, the MATA is further configured to carry the first QoS priority in the memory access request and to schedule the request to the target memory controller based on the first QoS priority.
  • In the embodiment of this application, the MATA may also continue to carry the QoS ID in the memory access request sent to the memory controller.
  • In a possible implementation, the AI SoC further includes the MATA; the MATA is configured to carry the determined first QoS priority in the memory access request and to dispatch the request to the target memory controller based on the first QoS priority.
  • In the embodiment of this application, the final priority (that is, the first QoS priority) can be carried in the memory access request sent to the corresponding memory controller, so that the controller can perform memory access QoS control according to the first QoS priority.
  • the MATA may also schedule the memory access request to the target memory controller based on the first QoS priority.
  • Optionally, the memory controller can make memory access QoS control decisions jointly according to the first QoS priority and the QoS ID.
  • Optionally, based on the QoS ID, the memory controller can also compute the historical memory bandwidth occupied by the requests carrying that QoS ID, and further optimize memory access QoS control on that basis.
  • In a possible implementation, the target memory controller is specifically configured to perform memory access QoS control on the memory access request based on the first QoS priority corresponding to the QoS ID, combined with the memory access service conditions of the target memory controller, where the service conditions include memory access timing requirements or memory bandwidth/bus utilization.
  • In the embodiment of this application, the memory controller can take its own current service conditions into account when performing memory access QoS control on a request. That is, the memory controller considers not only the QoS priority finally generated by the MATA for each QoS ID, but also its own current service conditions (for example, access timing requirements and memory bandwidth/bus utilization), enabling more precise memory access QoS control and further improving the computing performance of the AI system.
  • Optionally, based on the QoS ID, the memory controller can further compute the historical memory bandwidth that requests carrying that QoS ID have occupied on the controller itself, and further optimize memory access QoS control accordingly.
  • Optionally, the memory access service conditions include memory access timing requirements or memory bandwidth/bus utilization.
  • In the embodiment of this application, the memory controller ultimately performs the final memory access control on each request according to its current memory service conditions, making the memory access QoS control of each request more accurate and reasonable. Rather than controlling access merely by the QoS priority of computing tasks, it can further combine the current actual situation of each memory controller, such as memory access timing requirements and memory bandwidth/bus utilization, to comprehensively decide what memory access QoS service to provide for the current request, as in the sketch below.
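  • As an illustration of combining the final QoS priority with the controller's own service conditions, consider this C sketch (a heuristic of my own for illustration, reusing mem_access_req_t from the earlier sketch; the 50% threshold is an arbitrary assumption):

```c
/* Pick the next request to serve: under heavy load, strictly honor the
 * first QoS priority; under light load, serve FIFO to cut latency. */
typedef struct {
    mem_access_req_t *reqs;         /* pending requests, FIFO order      */
    int               n;
    unsigned          bus_util_pct; /* current bandwidth/bus utilization */
} mc_state_t;

int mc_pick_next(const mc_state_t *mc) {
    if (mc->n == 0)
        return -1;
    if (mc->bus_util_pct < 50)
        return 0;                  /* lightly loaded: oldest first     */
    int best = 0;
    for (int i = 1; i < mc->n; i++)
        if (mc->reqs[i].qos_prio > mc->reqs[best].qos_prio)
            best = i;              /* contended: highest priority wins */
    return best;
}
```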
  • In a possible implementation, the target memory controller is further configured to: when the number of memory access requests received by the target memory controller exceeds a preset threshold, broadcast a back-pressure indication to the M subsystems, where the back-pressure indication instructs one or more of the M subsystems to delay, reduce, or stop sending memory access requests.
  • In the embodiment of this application, when the number of memory access requests received by a certain memory controller is too large, it can instruct the relevant subsystems to reduce, delay, or even stop the memory access requests they are currently sending. After receiving such an indication, a subsystem can adjust its sending according to its own situation, for example suspending the sending of memory access requests to the SoC bus, or stopping it altogether; a minimal sketch follows.
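  • A minimal C sketch of the back-pressure mechanism, assuming a hypothetical broadcast callback and threshold values:

```c
/* The memory controller monitors its pending-request count and
 * broadcasts a throttling level to the M subsystems. */
#define BP_THRESHOLD 1024

typedef enum { BP_NONE, BP_DELAY, BP_STOP } bp_level_t;

void mc_check_pressure(int pending_reqs, void (*broadcast)(bp_level_t)) {
    if (pending_reqs > 2 * BP_THRESHOLD)
        broadcast(BP_STOP);    /* severely overloaded: stop sending */
    else if (pending_reqs > BP_THRESHOLD)
        broadcast(BP_DELAY);   /* ask subsystems to delay or reduce */
    else
        broadcast(BP_NONE);    /* normal operation                  */
}
```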
  • In a possible implementation, the AI system further includes a host. The host is configured to: receive a task to be executed and split it into one or more computing tasks to be executed; identify the business flow types of the resulting computing tasks according to a preset business flow label table, where the table contains the mapping between the predefined business flow types of computing tasks and QoS IDs; and, according to the identification result, attach the corresponding QoS ID to each of the one or more computing tasks.
  • In the embodiment of this application, in addition to the multiple subsystems that execute computing tasks and the multiple memory controllers, the AI system can further include a host that uniformly receives the various computing tasks issued by users. By identifying and marking the business flow types in the AI network model, the host gives the computing tasks under different business flows different business-flow memory access QoS labels, that is, QoS IDs, so that the rest of the AI system can subsequently use these QoS IDs to apply reasonable, matching memory access QoS control to the tasks carrying them, ultimately balancing the memory access load of the entire AI system and improving its overall execution performance and efficiency. A sketch of the tagging step follows.
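  • The host-side tagging step could look like the following C sketch (the flow types and QoS ID values are hypothetical; the patent only specifies that a preset label table maps business flow types to QoS IDs):

```c
#include <stdint.h>

/* Hypothetical business flow categories. */
typedef enum {
    FLOW_AI_COMPUTE,   /* matrix/vector computation tasks     */
    FLOW_COMM,         /* parameter/feature-map communication */
    FLOW_PREPROC,      /* DVPP/ISP/ASP accelerator data flows */
    FLOW_COUNT
} flow_type_t;

/* Preset business flow label table: flow type -> QoS ID. */
static const uint8_t flow_label_table[FLOW_COUNT] = {
    [FLOW_AI_COMPUTE] = 1,
    [FLOW_COMM]       = 2,
    [FLOW_PREPROC]    = 3,
};

/* After splitting a task, the host looks up and attaches the QoS ID. */
uint8_t tag_task(flow_type_t flow) {
    return flow_label_table[flow];
}
```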
  • In a possible implementation, the AI SoC further includes a system scheduler; the host is further configured to send the one or more computing tasks carrying their corresponding QoS IDs to the system scheduler.
  • In the embodiment of this application, after the host has identified the business flows and attached the QoS IDs, it can send the tagged computing tasks to the system scheduler on the AI SoC for subsequent allocation. That is, after the host splits, identifies, and labels the tasks to be executed, it sends the processed computing tasks to the system scheduler, which then schedules and assigns these labeled tasks (that is, tasks carrying matching QoS IDs).
  • In a possible implementation, the host or the target processing node is further configured to preconfigure the corresponding second QoS priority for the QoS ID in the computing task, where the second QoS priority is the initial priority corresponding to the QoS ID.
  • In the embodiment of this application, the host side or the target processing node also configures an initial QoS priority (that is, the second QoS priority) for each computing task, in other words a matching QoS priority for each QoS ID, so that the relevant modules in the AI SoC can later adjust subsequent or final QoS priorities on that basis.
  • the host is further configured to: update and optimize the second QoS priority corresponding to each QoS ID according to the real-time monitoring information of the AI system's memory access performance.
  • the host side in the AI system can also update and optimize the initial QoS priority corresponding to each QoS ID in the system according to the real-time monitoring information of memory access performance.
  • Optionally, automatic QoS optimization is performed adaptively through optimization algorithms or adaptive machine learning algorithms.
  • In a possible implementation, the system scheduler is configured to: receive the one or more computing tasks to be executed sent by the host, where each computing task also carries a task descriptor describing the type of the task; according to the descriptor carried in each task, select a matching subsystem from the M subsystems and a matching processing node from the one or more processing nodes of that subsystem; and schedule each computing task onto the matching processing node in the matching subsystem.
  • In the embodiment of this application, the system scheduler can reasonably allocate all the computing tasks sent by the host. Allocation can follow the task descriptor carried in each computing task: according to the task type the descriptor describes, a suitable subsystem and processing node are selected for each task so that it is executed, or accelerated, as well as possible. For example, an AI matrix computation task is assigned to a suitable AI subsystem and to an idle processing node on that subsystem, as in the sketch below.
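  • A C sketch of descriptor-driven placement (all structures and the matching rule are assumptions; the patent only requires that the descriptor guide subsystem and node selection):

```c
#include <stdint.h>

typedef struct {
    int     task_kind;   /* e.g. matrix compute, vector, DVPP, ... */
    uint8_t qos_id;
} task_desc_t;

typedef struct {
    int task_kind;       /* task kind this subsystem accelerates   */
    int n_nodes;         /* number of Masters (<= 16 here)         */
    int busy[16];        /* 1 = Master currently occupied          */
} subsystem_t;

/* Pick the first subsystem matching the task kind, then an idle
 * Master inside it; returns 0 on success, -1 if none available. */
int schedule_task(const task_desc_t *t, subsystem_t *subs, int m,
                  int *out_sub, int *out_node) {
    for (int s = 0; s < m; s++) {
        if (subs[s].task_kind != t->task_kind)
            continue;
        for (int n = 0; n < subs[s].n_nodes; n++) {
            if (!subs[s].busy[n]) {
                subs[s].busy[n] = 1;
                *out_sub = s;
                *out_node = n;
                return 0;
            }
        }
    }
    return -1;   /* no matching idle node; caller may queue the task */
}
```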
  • In a possible implementation, when the AI system is applied in a virtualization scenario, the AI system includes multiple virtual machines, where each virtual machine corresponds to one or more processes and one process includes one or more computing tasks; the one or more processes run on one or more processing nodes of at least one of the M subsystems. The system scheduler is further configured to assign a VM ID to each virtual machine, where the page tables of the one or more processes corresponding to a virtual machine share that virtual machine's VM ID.
  • In the embodiment of this application, a VM ID is assigned per virtual machine, and all processes under a virtual machine are set to correspond to the same VM ID. The purpose is to isolate different virtual machines from each other, ensuring security isolation and mutual non-interference between the users corresponding to different virtual machines.
  • In a possible implementation, the target subsystem further includes a system memory management unit (SMMU). The target processing node is further configured to send the memory access request of the computing task to the SMMU, which updates the QoS ID carried in the request. The SMMU is configured to: receive the memory access request of the computing task sent by the target processing node; determine the target process to which the computing task belongs according to the virtual address and the service set identifier (SSID) in the request; determine the VM ID of the target virtual machine corresponding to the target process according to the target process's page table; and replace the QoS ID carried in the request with the VM ID of the target virtual machine.
  • In the embodiment of this application, when the AI system operates in a virtualization scenario, the initial QoS ID assignment and transfer process needs to be replaced: QoS IDs are uniformly reassigned according to the virtual machine to which the owning process belongs. That is, through the SMMU in each processing node, the QoS ID carried in each received memory access request is uniformly replaced with the VM ID of the virtual machine corresponding to the process of the computing task that issued the request.
  • In this way, bandwidth security isolation becomes the primary objective, satisfying as far as possible virtual machine users' basic needs for data isolation, computing-power resource isolation, and mutual non-interference. Furthermore, the problems of memory bandwidth isolation and bandwidth commitment among the users of different virtual machines can also be solved; the rewrite step is sketched below.
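  • The SMMU's QoS ID rewrite can be pictured with this C sketch (the SSID-to-VM mapping table stands in for the per-process page tables; all values are hypothetical):

```c
#include <stdint.h>

typedef struct {
    uint32_t ssid;    /* identifies the owning process           */
    uint8_t  vm_id;   /* VM ID shared by all processes of one VM */
} ssid_map_t;

/* Stand-in for page-table lookup: two processes of VM 1, one of VM 2. */
static const ssid_map_t ssid_map[] = {
    { .ssid = 0x10, .vm_id = 1 },
    { .ssid = 0x11, .vm_id = 1 },
    { .ssid = 0x20, .vm_id = 2 },
};

/* Replace the request's per-task QoS ID with the owning VM's VM ID,
 * so downstream QoS control isolates traffic per virtual machine. */
void smmu_rewrite_qos(uint8_t *qos_id, uint32_t ssid) {
    for (unsigned i = 0; i < sizeof ssid_map / sizeof *ssid_map; i++) {
        if (ssid_map[i].ssid == ssid) {
            *qos_id = ssid_map[i].vm_id;
            return;
        }
    }
}
```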
  • In a possible implementation, the AI SoC further includes an L2 cache. The L2 cache is configured to: receive the memory access requests of each computing task and, according to the QoS ID carried in each request, access the corresponding storage region of the L2 cache, where requests carrying different QoS IDs correspond to different storage regions of the L2 cache.
  • In this way, the cache region that each memory access request may access can be controlled, and the corresponding regions are securely isolated from one another. Since the processes under each virtual machine share that machine's VM ID, the VM ID can be carried as the QoS ID in the corresponding memory access requests, and the cache can be partitioned on that basis to achieve security isolation in the virtual machine scenario, as sketched below.
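  • One common way to realize such per-QoS-ID cache isolation is way partitioning; the sketch below assumes that scheme (the patent does not specify the partitioning mechanism):

```c
#include <stdint.h>

#define L2_WAYS 16

/* Per-QoS-ID way mask, configured by system software: each QoS ID
 * (here, a VM ID) owns a disjoint subset of the 16 cache ways. */
static uint16_t l2_way_mask[256];

void l2_configure(uint8_t qos_id, uint16_t way_mask) {
    l2_way_mask[qos_id] = way_mask;   /* e.g. VM 1 -> ways 0..7 */
}

/* On allocation, the cache only victimizes ways owned by the request's
 * QoS ID, so one VM cannot evict another VM's cache lines. */
uint16_t l2_allowed_ways(uint8_t qos_id) {
    return l2_way_mask[qos_id];
}
```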
  • In a second aspect, an embodiment of the present application provides a memory access control method applied to an artificial intelligence (AI) system. The AI system includes an AI system-on-chip (SoC); the AI SoC includes M subsystems and N memory controllers, interconnected through an SoC bus; the M subsystems include a target subsystem, which is any one of the M subsystems; the target subsystem includes S processing nodes, and M, N, and S are all integers greater than or equal to 1. The method includes: receiving, through a target processing node among the S processing nodes (any one of the S nodes), a computing task to be executed, where the task carries a quality of service identifier (QoS ID) indicating the category to which the task belongs; generating a memory access request for the computing task, where the request carries the QoS ID from the task; and sending the memory access request to a target memory controller among the N memory controllers.
  • In a possible implementation, the computing task also carries a second QoS priority corresponding to the QoS ID, where the second QoS priority is the initial QoS priority corresponding to the QoS ID in the task.
  • In a possible implementation, the target subsystem further includes a sub-scheduler, and sending the memory access request from the target processing node to the target memory controller among the N memory controllers includes: sending the request to the sub-scheduler through the target processing node, and scheduling it through the sub-scheduler to the target memory controller among the N memory controllers. The method further includes: receiving, through the sub-scheduler, the memory access requests sent by the S processing nodes in the target subsystem; and scheduling those requests onto the SoC bus according to the second QoS priority corresponding to the QoS ID carried in each request, where the second QoS priority is the initial QoS priority of the corresponding QoS ID and indicates the priority with which the corresponding request is dispatched to the SoC bus.
  • In a possible implementation, scheduling the memory access requests sent by the S processing nodes onto the SoC bus through the sub-scheduler includes: establishing task queues for the S processing nodes through the sub-scheduler, where each task queue holds the requests sent by the corresponding processing node and the target processing node corresponds to a target task queue; when a target memory access request is inserted into the target task queue, that is, a request whose carried QoS ID corresponds to a second QoS priority exceeding a preset priority, promoting the second QoS priority corresponding to the QoS IDs of all requests in the target task queue to a third QoS priority; and sending the requests in the task queues of the S processing nodes to the SoC bus successively, in order of the second or third QoS priority corresponding to each request's QoS ID.
  • In a possible implementation, the method further includes: receiving, through the SoC bus, one or more memory access requests of the target task queue sent by the sub-scheduler, the one or more requests including the above memory access request; and restoring the third QoS priority corresponding to the QoS ID carried in those requests to the corresponding second QoS priority.
  • In a possible implementation, the method further includes: dispatching, through the SoC bus, the one or more memory access requests of the target task queue to the corresponding memory controllers among the N memory controllers, based on the restored second QoS priorities of those requests.
  • In a possible implementation, the AI SoC further includes an advanced memory access agent (MATA), and dispatching the one or more memory access requests of the target task queue to the corresponding memory controllers among the N memory controllers through the SoC bus includes: sending those requests to the MATA through the SoC bus, and dispatching them through the MATA to the corresponding memory controllers among the N memory controllers.
  • In a possible implementation, the AI SoC further includes an advanced memory access agent (MATA); the SoC bus is specifically configured to send the memory access requests sent by the S processing nodes to the MATA and to dispatch them through the MATA to the corresponding memory controllers among the N memory controllers, where the requests sent by the S processing nodes include the above memory access request.
  • In a possible implementation, the method further includes: receiving the memory access request through the MATA and determining the second QoS priority corresponding to the QoS ID carried in it; and, based on that second QoS priority, combined with the historical memory bandwidth statistics corresponding to the QoS ID and the memory access policy control parameters corresponding to the QoS ID, determining the first QoS priority corresponding to the QoS ID, where the policy control parameters include one or more of the maximum bandwidth, the minimum bandwidth, and the access priority allowed for requests carrying that QoS ID.
  • In a possible implementation, the method further includes: presetting, through the MATA, the memory access policy control parameters corresponding to each QoS ID, and collecting and recording the historical memory bandwidth corresponding to each QoS ID; and updating and optimizing each QoS ID's policy control parameters according to real-time monitoring information on the AI system's memory access performance.
  • the memory access policy control parameters corresponding to each QoS ID are updated and optimized through an optimization algorithm and an adaptive machine learning algorithm.
  • In a possible implementation, the method further includes: carrying, through the MATA, the first QoS priority in the memory access request, and dispatching the request to the target memory controller based on the first QoS priority.
  • In a possible implementation, the AI SoC further includes the MATA; the method further includes: carrying, through the MATA, the determined first QoS priority in the memory access request, and dispatching the request to the target memory controller based on the first QoS priority.
  • In a possible implementation, performing memory access QoS control on the memory access request based on the first QoS priority through the target memory controller includes: performing, through the target memory controller, memory access QoS control on the request based on the first QoS priority corresponding to the QoS ID, combined with the memory access service conditions of the target memory controller, where the service conditions include memory access timing requirements or memory bandwidth/bus utilization.
  • In a possible implementation, the method further includes: when the number of memory access requests received by the target memory controller exceeds a preset threshold, broadcasting, through the target memory controller, a back-pressure indication to the M subsystems, where the back-pressure indication instructs one or more of the M subsystems to delay, reduce, or stop sending memory access requests.
  • In a possible implementation, the AI system further includes a host; the method further includes: receiving, through the host, a task to be executed and splitting it into one or more computing tasks to be executed; identifying the business flow types of the resulting computing tasks according to a preset business flow label table, where the table contains the mapping between the predefined business flow types of computing tasks and QoS IDs; and, according to the identification result, attaching the corresponding QoS ID to each of the one or more computing tasks.
  • In a possible implementation, the AI SoC further includes a system scheduler; the method further includes: sending, through the host, the one or more computing tasks carrying their corresponding QoS IDs to the system scheduler.
  • In a possible implementation, the method further includes: preconfiguring, through the host or through the target Master, the corresponding second QoS priority for the QoS ID in the computing task, where the second QoS priority is the initial priority corresponding to the QoS ID.
  • the method further includes: updating and optimizing the second QoS priority corresponding to each QoS ID through the host according to the real-time monitoring information of the memory access performance of the AI system.
  • the second QoS priority corresponding to each QoS ID is updated and optimized through an optimization algorithm and an adaptive machine learning algorithm.
  • In a possible implementation, the method further includes: receiving, through the system scheduler, the one or more computing tasks to be executed sent by the host, where each computing task also carries a task descriptor describing the type of the task; selecting, according to the descriptor carried in each task, a matching subsystem from the M subsystems and a matching processing node from the one or more processing nodes of that subsystem; and scheduling each computing task onto the matching processing node in the matching subsystem.
  • In a possible implementation, when the AI system is applied in a virtualization scenario, the AI system includes multiple virtual machines, where each virtual machine corresponds to one or more processes and one process includes one or more computing tasks; the one or more processes run on one or more processing nodes of at least one of the M subsystems. The method further includes: assigning, through the system scheduler, a VM ID to each virtual machine, where the page tables of the one or more processes corresponding to a virtual machine share that virtual machine's VM ID.
  • In a possible implementation, the target subsystem further includes a system memory management unit (SMMU); the method further includes: sending, through the target processing node, the memory access request of the computing task to the SMMU, and updating, through the SMMU, the QoS ID carried in the request; and receiving, through the SMMU, the memory access request of the computing task sent by the target processing node.
  • SMMU: system memory management unit
  • The method further includes: determining, according to the virtual address and SSID in the memory access request, the target process to which the computing task belongs; determining the VM ID of the target virtual machine corresponding to the target process according to the target process's page table; and replacing the QoS ID carried in the memory access request with the VM ID of the target virtual machine.
  • In a possible implementation, the AI SoC further includes an L2 cache; the method further includes: receiving, through the L2 cache, the memory access requests of each computing task, and accessing the corresponding storage region of the L2 cache according to the QoS ID carried in each request, where requests carrying different QoS IDs correspond to different storage regions of the L2 cache.
  • the present application provides a semiconductor chip, which may include the AI system provided in any one of the implementation manners in the foregoing first aspect.
  • the present application provides a semiconductor chip, which may include: the AI system provided by any one of the implementation manners in the first aspect above, an internal memory coupled to the AI system, and an external memory.
  • the present application provides a semiconductor chip, which may include: the host provided by any one of the implementation manners in the foregoing first aspect.
  • the present application provides a semiconductor chip, which may include: at least one AI SoC provided in any one of the above-mentioned implementation manners of the first aspect.
  • the present application provides a system-on-chip SoC chip
  • the SoC chip includes the AI system provided by any one of the implementation manners in the first aspect above, an internal memory and an external memory coupled to the bus system.
  • the SoC chip may consist of chips, or may include chips and other discrete devices.
  • the present application provides a system-on-a-chip, where the system-on-a-chip includes the AI system provided in any one of the implementation manners in the above-mentioned first aspect.
  • the AI system further includes a memory, and the memory is used to save necessary or related program instructions and data during the operation of the chip system.
  • the system-on-a-chip may consist of chips, or may include chips and other discrete devices.
  • the present application provides an electronic device, and the electronic device may include the AI system provided in any one of the implementation manners in the foregoing first aspect.
  • In another aspect, the present application provides an electronic device, which has the function of implementing any one of the memory access control methods above.
  • This function may be implemented by hardware, or may be implemented by executing corresponding software on the hardware.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • the present application provides an AI device, which has the function of implementing any one of the AI computing methods in the first aspect above.
  • This function may be implemented by hardware, or may be implemented by executing corresponding software on the hardware.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • In another aspect, the present application provides a computer-readable storage medium that stores a computer program; when the computer program is executed by the bus system, the AI computing method flow described in any implementation of the second aspect above is realized.
  • The embodiment of the present application provides a computer program comprising instructions; when the instructions are executed by the bus system, the bus system can execute the AI computing method flow described in any implementation of the second aspect above.
  • FIG. 1A is a schematic diagram of a hardware structure of an AI system provided by an embodiment of the present application.
  • FIG. 1B is a schematic diagram of a hardware structure of another AI system provided by an embodiment of the present application.
  • FIG. 1C is a schematic diagram of a hardware structure of another AI system provided by an embodiment of the present application.
  • FIG. 2A is a schematic diagram of a service flow provided by an embodiment of the present application, and a schematic diagram of a relationship between graph nodes and computing tasks.
  • FIG. 2B is a schematic diagram of a flow direction of a service flow provided by an embodiment of the present application.
  • FIG. 2C is a schematic diagram of the relationship between service flow types and memory access bandwidth involved in running a resnet50 network, provided by an embodiment of the present application.
  • FIG. 3A is a schematic framework diagram of a Davinci software stack provided by the embodiment of the present application.
  • FIG. 3B is a schematic diagram of an interaction flow among various software modules in a Davinci software stack provided by an embodiment of the present application.
  • FIG. 4A is a schematic diagram of a graph compiling phase and a graph running phase provided by an embodiment of the present application.
  • FIG. 4B is a construction script diagram of an AI model of resnet50 provided by the embodiment of the present application.
  • FIG. 4C is a schematic diagram of an executable computing task after graph compilation and optimization provided by the embodiment of the present application.
  • FIG. 5A is a schematic diagram of a software architecture of automatic QoS optimization provided by an embodiment of the present application.
  • FIG. 5B is a schematic flowchart of a method for automatic QoS optimization provided by an embodiment of the present application.
  • FIG. 6A is a software architecture diagram of an AI system in a virtual scene provided by an embodiment of the present application.
  • FIG. 6B is a schematic diagram of an interaction flow among various software modules in an AI system in a virtual application scenario provided by an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of a memory access control method provided by an embodiment of the present application.
  • a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer.
  • an application running on a computing device and the computing device can be components.
  • One or more components can reside within a process and/or thread of execution and a component can be localized on one computer and/or distributed between two or more computers.
  • these components can execute from various computer readable media having various data structures stored thereon.
  • These components can, for example, communicate through local and/or remote processes according to a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, in a distributed system, and/or across a network such as the Internet, interacting with other systems by way of the signal).
  • QoS: Quality of Service
  • Memory access: that is, access to memory.
  • Memory access QoS refers to applying, under limited memory bandwidth resources, corresponding memory access service control to various categories of computing tasks (such as various business flows), for example controlling the maximum bandwidth, the minimum bandwidth, or the access priority of their memory access requests, so as to provide memory access service quality guarantees for the corresponding computing tasks (such as the computing tasks under a certain business flow).
  • Service set identifier (SSID): an SSID can divide a wireless LAN into several sub-networks, each requiring independent authentication; only authenticated users can enter their corresponding sub-network, which prevents unauthorized users from entering the network.
  • SSID: Service Set Identifier
  • TOPS: Tera Operations Per Second
  • Virtual machine (VM): a complete computer system, simulated in software, that has complete hardware system functionality and runs in a fully isolated environment. Everything that can be done on a physical computer can be done in a virtual machine. When creating a virtual machine on a computer, part of the physical machine's hard disk and memory capacity is used as the hard disk and memory capacity of the virtual machine.
  • Davinci Vision Pre-Processor (DVPP): mainly implements video decoding (VDEC), video encoding (VENC), JPEG encoding and decoding (JPEGD/E), PNG decoding (PNGD), visual preprocessing (VPC), and so on.
  • VDEC: video decoding
  • VENC: video encoding
  • JPEGD/E: JPEG encoding and decoding
  • PNGD: PNG decoding
  • VPC: visual preprocessing unit
  • Matrix calculation core (AI Cube core, AIC): used to implement matrix calculation in AI operations.
  • Vector calculation core (AI Vector core, AIV): used to implement vector calculation in AI operations.
  • Graph engine (GE): the graph generation engine in the AI platform, responsible for compiling and converting the intermediate representation (IR) of AI models generated by mainstream AI computing frameworks into computational subgraphs that the AI platform (such as the Davinci platform) can understand and execute.
  • The main functions of GE include graph preparation, graph splitting, graph optimization, graph compilation, graph loading, graph execution, and graph management (the "graph" here refers to the topology graph of the network model).
  • Fusion engine (FE): the graph fusion engine in the AI platform. FE is responsible for interfacing GE with Tensor Boost Engine (TBE) operators, and provides loading and management of the operator information base, fusion rule management, original graph fusion, and subgraph optimization. During the subgraph optimization stage, GE passes the subgraph to FE, which precompiles it according to the operator information base and performs fusion optimization, such as modifying data types and inserting conversion operators; the subgraph is then passed back to GE for subgraph merging and further subgraph optimization.
  • Remote Direct Memory Access (RDMA): RDMA transmits data directly over the network into a computer's storage area, quickly moving data from one system into remote system memory without any impact on the operating system, so that little computer processing power is required. It eliminates the overhead of external memory copies and context switches, freeing memory bandwidth and CPU cycles and improving application performance.
  • The SoC-level direct memory access controller (System Direct Memory Access, SDMA) can serve as a multi-channel, efficient data transfer engine for DMA access and movement of data within each subsystem of the SoC.
  • SDMA System Direct memory Access
  • PCIE DMA Peripheral Component Interconnect Express Direct memory Access
  • Runtime refers to the state in which a program is running (or being executed). In some programming languages, certain reusable programs or instances are packaged or rebuilt into "runtime libraries"; these instances can be linked or invoked by any program while it is running.
  • Compute Architecture for Neural Networks (CANN): a heterogeneous computing architecture launched for AI scenarios. By providing multi-level programming interfaces, it supports users in quickly building applications and services based on AI platforms, improving development efficiency and unleashing the computing power of AI processors.
  • HCCL Huawei Collective Communication Library
  • Operator library: mainly provides broadcast, allreduce, reducescatter, allgather, and other collective communication functions among single-machine multi-card and multi-machine multi-card setups, providing efficient data transfer capabilities for distributed training.
  • High Bandwidth Memory (HBM): a high-performance DRAM based on 3D stacking technology, i.e., a memory chip ("RAM"), characterized by high speed and high bandwidth, and suitable for applications with high memory bandwidth demands, such as graphics processors and network switching and forwarding equipment (such as routers and switches).
  • RAM memory chip
  • A DRV file is a file in a driver package, which can be opened with Notepad or WordPad.
  • A DRV file is a driver file created by connected communication hardware devices, both external and internal, used in the Windows operating system. It contains commands and parameters for setting up operating system devices and communicating with them, and can also be used to install device drivers on a computer.
  • IOCTL Input/Output Control
  • IOCTL: in computing, a system call dedicated to device input and output operations. The call passes in a request code related to the device, and the function of the system call depends entirely on the request code.
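  • As a concrete illustration of such a request-code-driven call, the following C fragment passes a QoS configuration structure to a driver; the device node path, request code, and structure are hypothetical and are not defined by this application:

```c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

/* Hypothetical payload and request code; invented for illustration. */
struct qos_cfg { unsigned int qos_id; unsigned int access_prio; };
#define QOS_SET_POLICY _IOW('q', 1, struct qos_cfg)

int main(void) {
    int fd = open("/dev/ai_qos", O_RDWR);       /* hypothetical device node */
    if (fd < 0) { perror("open"); return 1; }
    struct qos_cfg cfg = { .qos_id = 3, .access_prio = 3 };
    if (ioctl(fd, QOS_SET_POLICY, &cfg) < 0)    /* effect depends on the code */
        perror("ioctl");
    close(fd);
    return 0;
}
```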
  • Fig. 1A is a schematic diagram of the hardware structure of an AI system provided by the embodiment of the present application.
  • The AI system 01 can be located in any electronic device, such as a computer, mobile phone, tablet, server, etc.
  • the hardware structure of the AI system 01 may be a chip or a chipset or a circuit board equipped with a chip or a chipset.
  • the chip or chipset or the circuit board equipped with the chip or chipset can work under the necessary software drive.
  • The AI system 01 may include a host 10 and an AI SoC 20.
  • The AI SoC 20 may include M subsystems (as shown in FIG. 1A: subsystem 201-1, ..., subsystem 201-M), among which is a target subsystem.
  • The target subsystem includes S processing nodes (as shown in FIG. 1A: Master 1, Master 2, ..., Master S). It should be noted that, for convenience of description in the subsequent embodiments of this application, a processing node may be named or translated as a Master, or understood as another type of node that includes a Master.
  • The S Masters include a target Master.
  • The target Master is any Master among the S Masters (for ease of description, Master1 in the target subsystem 201-1 will be taken as the target Master in subsequent embodiments; this example does not constitute any limitation on the target Master itself).
  • M, N, and S are all integers greater than or equal to 1.
  • FIG. 1B is a schematic diagram of the hardware structure of another AI system provided by an embodiment of the present application. Compared with the AI system in FIG. 1A, the AI system in FIG. 1B may further include an advanced memory access agent (MATA) 205, which can be used for overall management of the above N memory controllers (203-1 to 203-N).
  • The host 10 may include a host CPU and an internal memory not shown in FIG. 1A, and optionally may further include physical devices such as a host controller, other input/output controllers, and interfaces. A host system (Host System), such as X86 or ARM, can run on the host CPU. In this embodiment of the application, the host 10 serves as the center of business flow deployment and task management of the AI system 01 or AI system 02, and manages multiple hardware accelerators (Devices) such as various SoCs, including at least the AI SoC 20 described in this application.
  • The functions of the host 10 include managing tasks, communicating instructions, or providing specific services to each SoC. Furthermore, it can identify the type of business flow issued by the user (including AI computing framework identification, model segmentation and identification, etc.) so as to assign an appropriate QoS ID to each computing task in the business flow, that is, to label each computing task with a suitable QoS tag; for example, based on the processing capabilities of the various hardware accelerators (Devices), it allocates to each Device suitable computing tasks that already carry a QoS ID.
  • Optionally, the host 10 may also attach to each computing task the initial QoS priority corresponding to the QoS ID it carries, that is, the second QoS priority.
  • AI SoC 20 is an artificial intelligence system-on-chip. As shown in FIG. 1A or FIG. 1B, the AI SoC 20 may specifically include a system scheduler 200, multiple subsystems (such as subsystem 201-1, ..., subsystem 201-M), a system-on-chip bus 202, multiple memory controllers (such as memory controller 203-1, ..., memory controller 203-N), and, further, multiple memories (memory 204-1, memory 204-2, ..., memory 204-N); that is, each memory controller controls at least one memory, where any memory can be a High Bandwidth Memory (HBM), a double data rate synchronous dynamic random access memory (DDR), etc.
  • HBM High Bandwidth Memory
  • DDR Double Data Rate
  • System scheduler 200: when computing tasks carrying QoS IDs are delivered from the host 10 to the AI SoC 20, they can first pass through the system scheduler 200 in the AI SoC 20, and the system scheduler 200 can, based on the type of each computing task (for example, according to the task descriptor carried in the task), schedule each computing task to a subsystem suitable for executing it, and further to a suitable Master.
  • A subsystem may be an integrated circuit with a dedicated function, or an accelerator for accelerating a certain function.
  • For example, a subsystem can be an artificial intelligence core (AI CORE), a vision preprocessor (DVPP), an image signal processor (ISP), an audio signal processor (ASP), an SoC system-level DMA controller (SDMA), a remote direct memory access controller (RDMA), a PCIe direct memory access controller (PCIE DMA), an encryption and decryption engine, a general-purpose CPU, etc.
  • AI CORE artificial intelligence core device
  • DVPP vision preprocessor
  • ISP image signal processor
  • ASP sound signal processor
  • SDMA SOC system level DMA controller
  • RDMA Remote direct data access controller
  • PCIE DMA peripheral device interconnect expansion bus direct memory access controller
  • The Masters can also be interconnected within the subsystem through a connection bus (Connect bus) 211, and each subsystem can further include a sub-scheduler 212.
  • Not all subsystems are AI subsystems; the AI SoC may also include subsystems that are not used for AI computing but cooperate with the AI subsystems. Correspondingly, not all computing tasks are AI computing tasks; there may also be tasks that cooperate with AI computing, or general-purpose computing tasks.
  • Processing nodes can be understood, in this application, as the requesters that initiate memory access requests, the sources of memory access requests, or data requesters; the one or more processing nodes (Masters) inside each subsystem can represent the multiple cores of that subsystem.
  • the Master inside each subsystem executes the computing task according to the task description in the computing task.
  • Each Master carries the QoS ID of the task together with the data and the address of the memory to be accessed, and this information is sent to the SoC bus 202 through the sub-scheduler 212.
  • When the subsystem is a general-purpose CPU, the multiple Masters within the subsystem can represent general-purpose CPU cores; when the subsystem is a GPU, the multiple Masters can represent GPU cores; and when the subsystem is an NPU, the multiple Masters can represent NPU cores, and the like.
  • The sub-scheduler 212 can be used to schedule the memory access requests generated by all the Masters in its subsystem during the execution of computing tasks. For example, it establishes a corresponding memory access request queue for each Master, used to store, in sequence, the memory access requests generated by that Master; then, according to the order within each queue and the QoS priority corresponding to the QoS ID carried in each memory access request, it schedules the requests in the queues in succession and dispatches them to the system-on-chip bus (SoC Connection BUS) 202 in the AI SoC 20.
  • SoC Connection BUS system on chip bus
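  • The per-Master queue scheduling just described can be sketched as follows; the data structures and the strict highest-priority-first selection are illustrative assumptions, since the application does not prescribe a specific arbitration algorithm:

```c
#include <stddef.h>

#define NUM_MASTERS 8
#define QUEUE_DEPTH 16

typedef struct {                /* one memory access request */
    unsigned int  qos_id;       /* category tag inherited from the task */
    unsigned int  qos_prio;     /* second (initial) QoS priority        */
    unsigned long addr;         /* memory address to be accessed        */
} mem_req_t;

typedef struct {                /* per-Master FIFO of pending requests */
    mem_req_t reqs[QUEUE_DEPTH];
    size_t    head, count;
} req_queue_t;

static req_queue_t queues[NUM_MASTERS];

/* Pop the head-of-queue request with the highest QoS priority across all
 * Masters; returns 0 when every queue is empty. */
int sub_scheduler_dispatch(mem_req_t *out) {
    int best = -1;
    for (int m = 0; m < NUM_MASTERS; m++) {
        if (queues[m].count == 0) continue;
        mem_req_t *r = &queues[m].reqs[queues[m].head];
        if (best < 0 || r->qos_prio > queues[best].reqs[queues[best].head].qos_prio)
            best = m;
    }
    if (best < 0) return 0;
    req_queue_t *q = &queues[best];
    *out = q->reqs[q->head];
    q->head = (q->head + 1) % QUEUE_DEPTH;
    q->count--;
    return 1;                   /* *out would now be driven onto the SoC bus */
}
```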
  • The system-on-chip bus (SoC Connection BUS) 202 realizes the connection between each subsystem (201-1 to 201-M) and each memory controller (203-1 to 203-N) in the AI SoC 20, and realizes the data communication between them in a bus manner. That is, the memory access requests of each subsystem (201-1 to 201-M) can be issued to the corresponding memory controller after arbitration, address resolution, and routing by the system-on-chip bus 202.
  • The specification of the system-on-chip bus can also define the relationships of drivers, timing, strategies, etc., among the various modules in processes such as initialization, arbitration, request transmission, response, and sending/receiving.
  • An advanced memory access agent (MATA) can be used to coordinate and manage the N memory controllers (203-1 to 203-N). It can configure corresponding memory access policy control parameters for each QoS ID (such as one or more of the maximum bandwidth, minimum bandwidth, and access priority allowed for access requests), and generate an optimized QoS priority for each memory access request: the second QoS priority corresponding to the QoS ID of each memory access request (such as the initial priority or default priority corresponding to that QoS ID) is optimized and adjusted to generate the corresponding first QoS priority (which may be raised or lowered relative to the original QoS priority), and the final QoS priority of the memory access request is thus calculated.
  • MATA can also count and record the historical memory bandwidth occupied by each QoS ID on each memory controller.
  • It should be noted that the QoS ID carried in a certain computing task may not change, but the QoS priority corresponding to that QoS ID may change, and the memory access policy control parameters corresponding to a certain QoS ID may also change.
  • MATA can be located outside the N memory controllers or on a certain memory controller; that is, MATA and each memory controller can be independent physical entities, or MATA can be integrated into one or more memory controllers. The embodiments of the present application do not limit the physical relationship between MATA and the memory controllers.
  • The memory controllers (203-1 to 203-N) are used to control the memories and are responsible for data exchange between the memories and the subsystems.
  • According to the address in the memory access request sent by each subsystem, the memory controller determines which memory to write data to, or determines which memory to read data from before returning it to the corresponding subsystem.
  • The memory controller can also perform virtual-to-physical address mapping, memory access control, cache support, and so on. Specifically, after receiving a memory access request scheduled by MATA, the target memory controller parses the read/write address in the memory access request and the QoS priority carried therein (i.e., the first QoS priority), and performs memory access QoS control on the request accordingly, for example deciding whether to allow access immediately or to buffer the command to await the next round of scheduling arbitration. If the memory access request is scheduled in this round, it is sent to the specific memory unit to perform the read/write operation, and the bandwidth statistics for the corresponding QoS ID are updated on the memory controller.
  • It should be noted that the final QoS priority calculated by MATA may be only one of the factors on which the memory controller bases its memory QoS decisions; for example, it may also consider timing factors, a starvation-avoidance mechanism, and the like.
  • That is, the quality of service may be determined by the QoS priority in some cases, but in other cases it is determined by the actual memory situation; therefore a higher QoS priority does not necessarily result in better memory access service quality.
  • The internal memory in this application includes readable and writable running memory, which temporarily stores the computing data of each subsystem (201-1 to 201-M) and exchanges data with the external memory of the AI SoC 20; it can serve as a storage medium for the temporary data of an operating system or other running programs.
  • For example, the processing program, operating program, or operating system running on Master1 in subsystem 201-1 to perform computing tasks transfers the data to be computed from the internal memory 204-2 to Master1 for computation, and when the computation is completed, Master1 transmits the result back.
  • the internal memory may include one or more of dynamic random access memory (DRAM), static random access memory (SRAM), synchronous dynamic random access memory (SDRAM), and the like.
  • DRAM includes double data rate synchronous dynamic random access memory (DDR SDRAM, referred to as DDR), second-generation double data rate synchronous dynamic random access memory (DDR2), third-generation double data rate synchronous dynamic random access memory (DDR3), fourth-generation low-power double data rate synchronous DRAM (Low Power Double Data Rate 4, LPDDR4), fifth-generation low-power double data rate synchronous DRAM (Low Power Double Data Rate 5, LPDDR5), etc.
  • DDR SDRAM Double Data Rate Synchronous Dynamic Random Access Memory
  • DDR2 second generation double data rate synchronous dynamic random access memory
  • DDR3 third generation double data rate synchronous dynamic random access memory
  • LPDDR4 Low Power Double Data Rate 4
  • LPDDR5 Low Power Double Data Rate 5
  • the structure shown in the embodiment of the present application does not constitute a specific limitation on the AI system 01 or the AI system 02 .
  • the AI system 01 or the AI system 02 may include more or fewer components than shown in the illustration, or combine some components, or split some components, or arrange different components.
  • the illustrated components can be realized in hardware, software or a combination of software and hardware.
  • Figure 1C is a schematic diagram of the hardware structure of another AI system provided by the embodiment of the present application.
  • The AI system 03 can also be located in any electronic device, such as a computer, mobile phone, tablet, etc.
  • the hardware structure of the AI system 03 may be a chip or a chipset or a circuit board equipped with a chip or a chipset.
  • the chip or chipset or the circuit board equipped with the chip or chipset can work under the necessary software drive.
  • The difference between the AI system 03 in FIG. 1C and the AI system 01 in FIG. 1A or the AI system 02 in FIG. 1B is that the hardware structure of AI system 03 can support virtualization scenarios, that is, bandwidth isolation between virtual machine tenants, including user data isolation and computing power resource isolation between different virtual machine users, so that the services of different users do not affect each other.
  • From the perspective of hardware architecture, the subsystem of the AI SoC in FIG. 1C further includes a system memory management unit (SMMU) 210 and a level-2 cache (L2 cache) 206; optionally, the AI SoC in AI system 03 may further include an advanced memory access agent MATA 205.
  • SMMU 210 system memory management unit
  • L2 cache secondary cache
  • a system memory management unit (System Memory Management Unit, SMMU) 210 may be located inside each subsystem and between each Master and the connection bus 211 .
  • SMMU 210 can perform rights management: when the address spaces of programs differ, it controls the rights of the different programs. It can perform address mapping, such as converting between virtual addresses and physical addresses. It can perform physical memory management, such as managing the system's physical memory resources and providing user programs with operation interfaces such as allocation and release of physical memory.
  • The SMMU can also be used for isolation in virtualization scenarios, such as address isolation between different processes, isolation of physical address spaces, isolation of memory bandwidth, and the like.
  • each virtual machine corresponds to a unique VM ID
  • The system memory management unit performs address translation that includes converting the QoS ID: specifically, the SMMU searches the page table corresponding to the current process and obtains the virtual machine identifier (VM ID) from that page table. That is to say, the QoS IDs of all the different processes in the same VM are converted into the same QoS ID; and since the VM IDs of different VMs differ, the QoS IDs corresponding to processes under different VMs also end up different.
  • The VM ID in the page table is assigned by the system when the virtual machine is created; the SMMU merely queries it here.
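  • A minimal sketch of this conversion, assuming a hypothetical page-table descriptor that stores the VM ID (the application does not specify the descriptor format):

```c
/* Hypothetical page-table entry fragment carrying the VM ID. */
typedef struct {
    unsigned long phys_base;   /* physical address mapping              */
    unsigned int  vm_id;       /* assigned by the system at VM creation */
} page_table_t;

/* During address translation the SMMU looks up the current process's page
 * table and replaces the request's QoS ID with the VM ID, so that all
 * processes of the same VM end up carrying the same QoS ID. */
unsigned int smmu_convert_qos_id(const page_table_t *pt) {
    return pt->vm_id;          /* QoS ID := VM ID in the virtualization scene */
}
```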
  • The system memory management unit (SMMU) in the embodiment of the present application is a functional unit inside each subsystem (201-1 to 201-M), whose function is to "translate" program addresses into physical addresses, while the memory controllers (203-1 to 203-N) may be devices external to the subsystems (201-1 to 201-M), responsible for mapping physical addresses to specific memory locations.
  • The L2 cache refers to memory capable of high-speed data exchange; in this application it exchanges data with each subsystem (201-1 to 201-M) ahead of the memories (204-1 to 204-N), and is therefore faster. It can be understood that a cache inside a subsystem (201-1 to 201-M) is generally referred to as an L1 cache, while the L2 Cache shown in FIG. 1C is located outside each subsystem, i.e., an external cache, situated between the subsystems and the memories (204-1 to 204-N). In this embodiment of the application, the L2 Cache can be applied in the virtualization scenario.
  • For example, the corresponding VM ID can be configured for each virtual machine through the relevant management unit, and a corresponding storage area (such as the address range and size of the storage space) can be configured for each virtual machine in the L2 Cache; that is, the memory access requests of each virtual machine are only allowed to access the storage area configured for that virtual machine and cannot access the storage areas of other virtual machines, so that the AI system can realize security isolation of the cache in the virtualization scenario.
  • the structure shown in the embodiment of the present application does not constitute a specific limitation on the AI system 03 .
  • the AI system 03 may include more or fewer components than shown in the illustration, or combine some components, or split some components, or arrange different components.
  • the illustrated components can be realized in hardware, software or a combination of software and hardware.
  • FIG. 1A, FIG. 1B, and FIG. 1C are just several exemplary implementations of the embodiments of the present application, and the AI system architectures in the embodiments of the present application include but are not limited to the above architectures.
  • the specific functions implemented by the AI system 01, 02 or 03 may include the following:
  • The target Master among the S processing nodes (Masters) contained in the target subsystem of the AI SoC 20 (taking subsystem 201-1 as an example) is used to: receive the computing task to be executed, where the computing task carries a quality-of-service identifier (QoS ID), the target Master is any one of the S Masters, and the QoS ID indicates the category to which the computing task belongs; generate, according to the memory address and data, a memory access request for the computing task, the memory access request carrying the QoS ID; and send the memory access request of the computing task to the target memory controller among the N memory controllers. The target memory controller (taking memory controller 203-1 as an example) is used to: receive the memory access request and determine the first quality-of-service (QoS) priority corresponding to the QoS ID; and perform memory access QoS control on the memory access request based on the first QoS priority.
  • QoS ID quality of service identifier
  • The corresponding computing tasks carry the quality-of-service identifier (QoS ID); that is, the QoS ID carried in a computing task indicates the category to which the task belongs, and the corresponding memory access QoS priority can ultimately be determined according to that category.
  • The basis for this is that, in the field of AI computing, different categories of computing tasks (such as computing tasks under different business flows) have different requirements for memory access service quality, and memory access competition exists among certain categories of computing tasks but not among others.
  • Setting a matching QoS priority for the memory access request can better meet the memory access requirements of different categories of computing tasks (understandably, different categories of computing tasks can correspond to different QoS IDs, while different QoS IDs may correspond to the same QoS priority or to different QoS priorities). Further, in the process of executing a received computing task, each Master generates the memory access requests of that computing task, and the QoS ID carried in the computing task continues to be carried in those requests; that is, the QoS ID is transferred along with the computing task flow into its corresponding memory access requests, so that when a memory controller subsequently receives a memory access request, it can perform corresponding priority-based memory access control according to the QoS ID carried.
  • In this way, the memory controller can provide better memory access service quality for the memory access requests of a given QoS ID; that is, different memory access QoS controls are performed for computing tasks with different memory access priority requirements, avoiding the indiscriminate treatment of the prior art, in which key memory bandwidth resources are randomly preempted, resulting in serious degradation of system performance.
  • The first QoS priority corresponding to the QoS ID may be the initial QoS priority corresponding to the QoS ID, for example an initial QoS priority set by the target Master, or a QoS priority preset for it by the host in the AI system; or the first QoS priority may be the final QoS priority corresponding to the QoS ID, that is, the QoS priority obtained after adjusting the initial QoS priority. For example, starting from the initial QoS priority corresponding to the QoS ID, adjustments and optimizations may be made during the transfer of the memory access request, such as temporarily raising or lowering the QoS priority, or restoring a temporarily raised or lowered QoS priority; after a series of such temporary adjustments to the initial QoS priority, the final first QoS priority is obtained.
  • This embodiment of the present application does not specifically limit this.
  • Optionally, different service flows may correspond to different QoS IDs; that is, computing tasks belonging to different service flows carry different QoS IDs.
  • The computing tasks split from the same service flow correspond to the same QoS ID; that is, the QoS ID here indicates the service flow type.
  • In other words, the embodiment of the present application can classify computing tasks according to service flow type: computing tasks of the same category correspond to the same QoS ID, and computing tasks of different categories correspond to different QoS IDs; in this embodiment, the classification principle is to classify according to the service flow type to which the computing task belongs.
  • In addition, the QoS ID in this embodiment of the present application can also indicate other categories or classification methods (such as the computation type of the computing task, the importance of the computing task, the time period during which the computing task is executed, the memory access latency requirements of the computing task, or the purpose of the computing task's memory access, etc.). That is, whenever it can be determined in the AI system that competition for memory access exists between two categories of computing tasks, the categories of computing tasks can be divided accordingly, with the same QoS ID assigned to each computing task within a category and different QoS IDs assigned across categories; how computing tasks are classified is not specifically limited.
  • In the embodiment of the present application, an on-chip memory access quality-of-service (QoS) control technology is introduced: each computing task to be assigned to the AI SoC in the AI system is QoS-marked, with different categories of computing tasks carrying different QoS IDs (for example, classified according to the business flow to which the computing task belongs, or according to the differing memory access latency requirements of the computing tasks). Subsequently, the QoS priority corresponding to each computing task's memory access requests can be determined from the QoS ID carried, and QoS control is finally performed on each memory access request based on the determined QoS priority. This realizes memory access QoS control of the AI system at the granularity of computing tasks, satisfies the memory access requirements of different categories of computing tasks (such as those under different business flows), provides differentiated memory access service guarantee functions, and thus obtains better AI computing performance on the basis of the AI system's existing computing power and memory bandwidth resources.
  • In short, the embodiment of the present application realizes control of memory access service quality at the granularity of computing tasks, and solves the problem of insufficient memory bandwidth caused by concurrent competition for memory bandwidth among different categories of computing tasks (such as different types of business flows) in AI training and inference tasks. Moreover, since the embodiment of this application provides memory access service according to the priority corresponding to the QoS ID in each memory access request, it can preferentially guarantee the computing tasks with higher latency requirements in AI training and inference, make fuller and more efficient use of memory bandwidth resources, and finally achieve load balancing of the entire AI system's memory accesses, improving the overall execution performance and efficiency of the AI system.
  • Furthermore, when the categories of computing tasks are divided according to the service flows they belong to, the embodiment of this application can also address the performance jitter problems (such as latency jitter during training) of AI training and inference tasks caused by the lack of business flow priority identification, control, and optimization means; such jitter greatly limits the linearity of AI cluster scaling, preventing the computing power of large AI clusters from being utilized to the greatest extent, wasting valuable AI computing resources, and increasing customers' model training cost and time overhead.
  • That is to say, the embodiment of the present application can avoid performance jitter (such as latency jitter) and improve the linearity of the AI system.
  • In a possible implementation, the computing task also carries a second QoS priority corresponding to the QoS ID; the second QoS priority is the initial QoS priority corresponding to the QoS ID in the computing task (such as a basic priority or default priority).
  • That is, the computing task assigned to each Master in the AI system can also carry the initial QoS priority corresponding to the QoS ID in that computing task (i.e., the second QoS priority).
  • In this way, the QoS ID and the corresponding initial QoS priority can be configured for the computing task at the beginning of task assignment, so that subsequent QoS priority regulation and corresponding memory access QoS control can be performed according to the QoS ID and the initial QoS priority.
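  • For concreteness, this per-task tagging might look like the following record attached to each dispatched computing task; the field names are illustrative, not a format defined by this application:

```c
/* Hypothetical descriptor attached to each computing task on dispatch. */
typedef struct {
    unsigned int task_type;   /* e.g. matrix / vector / scalar task       */
    unsigned int qos_id;      /* category (e.g. service flow) of the task */
    unsigned int second_prio; /* initial QoS priority for this QoS ID     */
} task_desc_t;
```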
  • It should be noted that the QoS ID carried in the memory access request of a computing task can remain unchanged during the transfer from the Master to the target memory controller, but the corresponding QoS priority can be adjusted and optimized according to the differing requirements and situations of memory access requests during scheduling.
  • That is, the QoS ID indicating the category of the computing task naturally does not change; however, memory access requests with a given QoS ID and its corresponding QoS priority may encounter different situations during actual scheduling, and many other factors or conditions may need to be considered during actual access, so the QoS priority corresponding to a certain QoS ID may change.
  • In a possible implementation, the AI system 01 or 02 in the embodiment of the present application may further include a function of scheduling, within each subsystem, the memory access requests generated by the multiple Masters of that subsystem. The following describes in detail, in combination with some embodiments provided by this application, how the AI system 01 or 02 schedules the memory access requests generated by each Master during the execution of computing tasks within a subsystem.
  • In a possible implementation, the target subsystem further includes a sub-scheduler. The target Master is specifically configured to send the memory access request to the sub-scheduler and, through the sub-scheduler, to the target memory controller among the N memory controllers. The sub-scheduler is configured to: receive the memory access requests sent by the S Masters in the target subsystem; and, according to the second QoS priorities corresponding to the QoS IDs carried in the memory access requests sent by the S Masters, dispatch those requests to the SoC bus, where the second QoS priority is the initial QoS priority corresponding to the QoS ID and indicates the priority with which the corresponding memory access request is scheduled onto the SoC bus.
  • That is, each subsystem of the AI SoC in the AI system also includes a sub-scheduler, which can be used to schedule the memory access requests of the computing tasks being executed by all the Masters in that subsystem.
  • The memory access requests generated by a Master are sent, after being scheduled by the subsystem's internal sub-scheduler, to the SoC bus for arbitration, address resolution, and routing, and then to the corresponding memory controller for memory access. Since the memory access requests of the computing tasks executed in each Master carry the QoS IDs of the corresponding tasks, the sub-scheduler in each subsystem can, in the process of scheduling memory access requests, dispatch requests whose QoS IDs carry higher QoS priorities to the SoC bus first and requests whose QoS IDs carry lower QoS priorities later, ensuring that each memory access request's QoS priority has already been taken into account by the time it is dispatched to the SoC bus, thereby providing each computing task with memory access control service matching its QoS ID from the source of the entire AI system.
  • In a possible implementation, the sub-scheduler is specifically configured to: establish task queues for the S Masters respectively, where each task queue contains the memory access requests sent by the corresponding Master, and the target Master corresponds to a target task queue; when a target memory access request is currently inserted into the target task queue, raise the second QoS priorities corresponding to the QoS IDs carried in all memory access requests in the target task queue to a third QoS priority, where the target memory access request is a memory access request whose carried QoS ID corresponds to a second QoS priority exceeding a preset priority; and, according to the second or third QoS priorities corresponding to the QoS IDs carried in the memory access requests in the S Masters' task queues, send those requests to the SoC bus in succession. That is to say, the second QoS priority can be understood as the source QoS priority corresponding to the QoS ID, and the third QoS priority can be understood as the temporarily raised QoS priority.
  • In the embodiment of the present application, in the process of scheduling memory access requests by the sub-scheduler of each subsystem, a task queue is created for the computing tasks in each Master: all memory access requests generated in a Master are placed in that Master's task queue and sent to the SoC bus in succession according to the QoS priority corresponding to the QoS ID each request carries. When a request with a higher QoS priority is currently inserted into a task queue, the sub-scheduler raises the QoS priority of all memory access requests in that queue (that is, from the second QoS priority to the third QoS priority), so that no memory access request (in particular, the higher-priority request mentioned above) is held up because requests with lower QoS priority at the head of the queue block the memory access of the entire task queue; in other words, head-of-line blocking within the queue is avoided.
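  • A sketch of this promotion rule, reusing the `mem_req_t`/`req_queue_t` structures from the earlier sub-scheduler sketch; the threshold value is an illustrative assumption:

```c
#define PROMOTE_THRESHOLD 5   /* hypothetical preset priority threshold */

/* Insert a request; if its initial (second) priority exceeds the preset
 * threshold, promote every request already queued to the same (third)
 * priority so the urgent request is not blocked behind them. */
void enqueue_with_promotion(req_queue_t *q, mem_req_t req) {
    if (req.qos_prio > PROMOTE_THRESHOLD) {
        for (size_t i = 0, idx = q->head; i < q->count;
             i++, idx = (idx + 1) % QUEUE_DEPTH)
            if (q->reqs[idx].qos_prio < req.qos_prio)
                q->reqs[idx].qos_prio = req.qos_prio;  /* third priority */
    }
    q->reqs[(q->head + q->count) % QUEUE_DEPTH] = req;
    q->count++;
}
```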
  • In a possible implementation, the AI system in the embodiment of the present application may further include a function of dispatching the memory access requests issued by the multiple subsystems to the corresponding memory controllers. The following describes in detail, in combination with some embodiments provided by the present application, how the AI system 01 or 02 dispatches the memory access requests of each subsystem to the corresponding memory controllers.
  • In a possible implementation, the SoC bus is configured to: receive one or more memory access requests in the target task queue sent by the sub-scheduler, the one or more memory access requests including the above memory access request; and restore the third QoS priority corresponding to the QoS ID carried in the one or more memory access requests in the target task queue to the corresponding second QoS priority.
  • In a possible implementation, the SoC bus is further configured to: based on the restored second QoS priorities of the one or more memory access requests in the target task queue, dispatch the one or more memory access requests in the target task queue to the corresponding memory controllers among the N memory controllers.
  • That is, after the SoC bus restores the QoS priority of the QoS ID of each memory access request dispatched from a subsystem to the initial second QoS priority, it can schedule memory access requests at the level of the restored second QoS priority; that is, each memory access request is dispatched to the corresponding memory controller according to its restored second QoS priority, so that the memory controller performs the subsequent memory access QoS control and memory access.
  • In a possible implementation, the AI SoC further includes an advanced memory access agent MATA, and the SoC bus is specifically configured to: send one or more memory access requests in the target task queue to the MATA, and dispatch, through the MATA, the one or more memory access requests to the corresponding memory controllers among the N memory controllers.
  • In a possible implementation, the AI SoC further includes an advanced memory access agent MATA, and the SoC bus is specifically configured to: send the memory access requests sent by the S Masters to the MATA, and dispatch, through the MATA, the memory access requests sent by the S Masters to the corresponding memory controllers among the N memory controllers, where the memory access requests sent by the S Masters include the above memory access request.
  • That is, the AI SoC may further include a memory access agent MATA for memory access control.
  • MATA memory access agent
  • Through the MATA, each memory controller can be controlled and managed in an overall manner, and each received memory access request can be further regulated, for example by further optimizing the second QoS priority corresponding to the QoS ID in each memory access request.
  • In a possible implementation, the MATA is configured to: receive the memory access request and determine the second QoS priority corresponding to the QoS ID carried in it; and, based on that second QoS priority, combined with the historical memory bandwidth statistics corresponding to the QoS ID and the memory access policy control parameters corresponding to the QoS ID, determine the first QoS priority corresponding to the QoS ID. The memory access policy control parameters include one or more of the maximum bandwidth, minimum bandwidth, and access priority allowed for access requests.
  • For example, MATA compares the historical memory bandwidth statistics corresponding to the QoS ID (such as the total bandwidth occupied on all N memory controllers by all memory access requests carrying that QoS ID) against the memory access policy control parameters (such as the maximum bandwidth actually configured for that QoS ID), calculates a floating priority for the QoS ID, and then adds the floating priority to or subtracts it from the second QoS priority corresponding to the QoS ID (adding to raise it, subtracting to lower it), finally obtaining the first QoS priority of that QoS ID.
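  • The add/subtract adjustment just described might look as follows; the rule used to derive the floating priority is an assumption, since the application only states that historical bandwidth is compared against the configured parameters:

```c
/* Hypothetical MATA adjustment: compare the measured historical bandwidth of
 * a QoS ID against its configured bounds and float the priority accordingly. */
unsigned int mata_first_qos_prio(unsigned int second_prio,
                                 unsigned long hist_bw,   /* measured, MB/s */
                                 unsigned long min_bw,    /* configured     */
                                 unsigned long max_bw) {  /* configured     */
    int floating = 0;
    if (hist_bw > max_bw)      floating = -1;  /* over budget: lower prio  */
    else if (hist_bw < min_bw) floating = +1;  /* under-served: raise prio */
    int first = (int)second_prio + floating;
    if (first < 0) first = 0;
    return (unsigned int)first;                /* final (first) QoS priority */
}
```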
  • That is, after MATA receives each memory access request dispatched by the SoC bus, it can further optimize and adjust the initial priority (that is, the second QoS priority) carried in each memory access request.
  • The principle may be as follows: before a memory access request is dispatched by the SoC bus to a memory controller, MATA can first take the initial QoS priority (i.e., the second QoS priority) corresponding to the QoS ID carried in the request and, combining it with the historical memory bandwidth statistics for that QoS ID and the access policy control parameters for that QoS ID currently recorded by MATA, generate the final corresponding QoS priority (i.e., the first QoS priority); the target memory controller can then finally perform memory access QoS control on the request according to this final QoS priority. That is to say, when MATA performs memory access control, it not only takes into account the QoS priority initially configured by the AI system for each QoS ID, but further considers the historical bandwidth statistics corresponding to each QoS ID (such as the memory bandwidth currently obtained by the class of computing tasks carrying the same QoS ID) and the memory access policy control parameters configured for requests with that QoS ID (such as the allowed maximum bandwidth, minimum bandwidth, and access priority), in order to comprehensively decide what kind of memory access QoS control service to provide for the current request. It thereby obtains a QoS priority that matches the memory access request, performs more accurate memory access QoS control, and further optimizes and improves the performance of the AI system.
  • For example, if the memory access requests corresponding to a certain QoS ID have already occupied a large amount of memory bandwidth, the QoS priority of that QoS ID can be lowered to balance the memory access bandwidth occupied by each QoS ID; conversely, if the memory access requests corresponding to a certain QoS ID currently occupy little memory bandwidth, the QoS priority corresponding to that QoS ID can be raised to compensate its memory access bandwidth occupation.
  • In a possible implementation, the MATA is also used to: preset the memory access policy control parameters corresponding to each QoS ID, and count and record the historical memory bandwidth corresponding to each QoS ID; and, according to real-time performance monitoring information, update and optimize the memory access policy control parameters corresponding to each QoS ID, for example through an optimization algorithm or an adaptive machine learning algorithm.
  • That is, MATA configures corresponding memory access policy control parameters for each QoS ID and also counts and records the historical memory bandwidth corresponding to each QoS ID, so that, on the basis of these two kinds of information, it can decide whether to raise or lower the QoS priority relative to the initial priority corresponding to a certain QoS ID, and finally determine the ultimate QoS priority of the memory access requests corresponding to that QoS ID, enabling the memory controller to perform specific memory access QoS control according to that ultimate QoS priority. For example, MATA can set the maximum bandwidth, minimum bandwidth, and access priority allowed for memory access requests carrying a certain QoS ID.
  • In a possible implementation, the MATA is further configured to: carry the first QoS priority in the memory access request, and schedule the memory access request to the target memory controller based on the first QoS priority.
  • Optionally, MATA may also continue to carry the QoS ID in the memory access request sent to the memory controller.
  • That is, the final priority (the first QoS priority) can be carried in the memory access request and sent to the corresponding memory controller, so that the corresponding memory controller can perform memory access QoS control according to the first QoS priority.
  • Further, the MATA may schedule the memory access request to the target memory controller based on the first QoS priority.
  • Optionally, when the memory access request also carries the QoS ID, the memory controller can make memory access QoS control decisions jointly according to the first QoS priority and the QoS ID.
  • Optionally, the memory controller can also compute, based on the QoS ID, the historical memory bandwidth occupied by the memory access requests corresponding to that QoS ID, and further optimize memory access QoS control on that basis.
  • In a possible implementation, the AI SoC further includes a MATA, and the MATA is used to: carry the determined first QoS priority in the memory access request, and schedule the memory access request to the target memory controller based on the determined first QoS priority. That is to say, MATA can carry the first QoS priority, determined by itself or by other modules in the AI system, in the memory access request, and, based on that first QoS priority, schedule the memory access request to the corresponding target memory controller.
  • The first QoS priority can be the final QoS priority obtained by MATA through adjustment and optimization of the initial QoS priority carried in the memory access request, or a final QoS priority otherwise determined for the memory access request; this is not specifically limited in the embodiments of the present application.
  • In a possible implementation, the AI system in the embodiment of the present application may further include a function specifying how memory access QoS control is performed. The following describes in detail, in combination with some embodiments provided in this application, how the AI system 01 or 02 provides appropriate memory access QoS control for different memory access requests.
  • In a possible implementation, the target memory controller is specifically configured to: receive the memory access request, determine the first QoS priority corresponding to the QoS ID, and, based on the first QoS priority combined with the memory access service status of the target memory controller, perform memory access QoS control on the memory access request; the memory access service status includes memory access timing requirements or memory bandwidth bus utilization.
  • For example, the memory access service status may include the read/write timing requirements of the DDR controller (because when multiple memory access requests need to access a certain memory controller at the same time, the corresponding timing requirements must be met); or the DDR bandwidth bus utilization, that is, the access efficiency. For instance, the memory controller will give priority to data in the same bank and the same row; or it may apply certain read/write rules, or consider certain read/write conditions of the memory controller itself, and so on.
  • That is, the memory controller can combine its current service status to perform memory access QoS control on the memory access request. When performing memory access QoS control, the memory controller not only takes into account the QoS priority finally generated by MATA for each QoS ID, but further considers the current service situation of each memory controller (for example, access timing requirements, memory bandwidth bus utilization, etc.) to perform more precise memory access QoS control, thereby further optimizing and improving the computing performance of the AI system.
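  • As an illustration of combining the first QoS priority with the controller's own service state, the following scoring sketch prefers row-buffer hits while still weighting QoS priority; the weights are invented for illustration:

```c
typedef struct {
    unsigned int first_qos_prio; /* priority delivered with the request    */
    int          row_hit;        /* 1 if it targets the currently open row */
} pending_req_t;

/* Score a pending request; the controller serves the highest score first.
 * The relative weighting of row locality versus QoS priority is an
 * illustrative assumption, not a rule stated by the application. */
int mc_score(const pending_req_t *r) {
    return (int)r->first_qos_prio * 2 + (r->row_hit ? 3 : 0);
}
```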
  • Optionally, when the memory access request also carries the QoS ID, the memory controller can further compute, based on the QoS ID, the historical memory bandwidth occupied on itself by the memory access requests corresponding to that QoS ID, and further optimize memory access QoS control on that basis.
  • In a possible implementation, the target memory controller is further configured to: when the number of memory access requests received by the target memory controller is greater than a preset threshold, broadcast a back-pressure indication to the M subsystems, where the back-pressure indication is used to instruct one or more of the M subsystems to delay, reduce, or stop sending memory access requests.
  • That is, when the number of memory access requests received by a certain memory controller is too large, it can instruct the relevant subsystems to reduce, delay, or even stop the memory access requests they are currently sending; after receiving such an instruction, a subsystem can adjust its sending of memory access requests according to its own situation, for example suspending the sending of memory access requests to the SoC bus, or stopping it altogether.
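  • A sketch of this back-pressure decision; the threshold values and the graded reaction are illustrative assumptions:

```c
#include <stddef.h>

#define BACKPRESSURE_THRESHOLD 256   /* hypothetical queue-depth threshold */

enum bp_action { BP_NONE, BP_DELAY, BP_REDUCE, BP_STOP };

/* Evaluated by the memory controller as requests arrive; the returned
 * action is what it would broadcast to the M subsystems. */
enum bp_action check_backpressure(size_t inflight_reqs) {
    if (inflight_reqs <= BACKPRESSURE_THRESHOLD)     return BP_NONE;
    if (inflight_reqs > 2 * BACKPRESSURE_THRESHOLD)  return BP_STOP;
    return BP_REDUCE;            /* a subsystem may also choose to delay */
}
```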
  • In a possible implementation, the AI system in the embodiment of the present application may further include functions such as identifying computing tasks and assigning and carrying a corresponding QoS ID for each computing task. The following describes in detail, in combination with some embodiments provided by this application, how the AI system 01 or 02 assigns and carries corresponding QoS IDs for the computing tasks before distributing the computing tasks to be executed to the target Master.
  • In a possible implementation, the AI system in the embodiment of the present application further includes a host, and the host is configured to: receive a task to be executed; split the task to be executed into one or more computing tasks to be executed; identify the service flow type of each computing task according to a preset service flow label table, where the preset service flow label table includes the mapping relationships between the predefined service flow types of computing tasks and QoS IDs; and, according to the identification result, make the one or more computing tasks to be executed carry the corresponding QoS IDs respectively.
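  • The label-table lookup on the host side can be sketched as a simple mapping; the flow names below follow the examples given later in this description (control traffic corresponds to QoS ID 1, data parallel parameter prefetching to 3, and so on), while the table layout itself is an assumption:

```c
#include <string.h>

typedef struct {
    const char  *flow_type;   /* business flow type                   */
    unsigned int qos_id;      /* predefined QoS ID for that flow type */
} flow_label_t;

/* Hypothetical preset service flow label table (a subset of Table 2). */
static const flow_label_t label_table[] = {
    { "control",              1 },
    { "param_prefetch",       3 },
    { "feature_map_sharing",  4 },
    { "feature_map_prefetch", 5 },
    { "embedding_rw",         6 },
};

/* Return the QoS ID for a flow type, or 0 if the flow is unlabeled. */
unsigned int lookup_qos_id(const char *flow_type) {
    for (size_t i = 0; i < sizeof label_table / sizeof *label_table; i++)
        if (strcmp(label_table[i].flow_type, flow_type) == 0)
            return label_table[i].qos_id;
    return 0;
}
```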
  • That is, the host 10 needs to split a complete task to be executed into computing tasks that each piece of hardware (each Master) can understand, for example splitting an entire AI task into matrix computing tasks, scalar computing tasks, vector computing tasks, and so on, to be executed. Then, the host 10 marks all the split computing tasks with memory access QoS tags, that is, matches each with a suitable QoS ID.
  • For example, the identification module in the host system can be used to assign an appropriate QoS ID to each computing task on the AI system.
  • The basis for the host 10 to assign a QoS ID to a computing task may be the type of service flow to which the computing task belongs (for the classification of service flows in the embodiment of the present application, refer to the relevant descriptions of the embodiment corresponding to Table 2, which are not repeated here); that is, computing tasks in the same service flow carry the same QoS ID, while computing tasks in different service flows carry different QoS IDs.
  • It should be noted that the system scheduler 200 judges the task type rather than the QoS ID carried in the computing task, specifically by identifying the task descriptor carried in the computing task (such as what the task does); according to the task descriptor, the system scheduler 200 can select a suitable subsystem and a suitable Master for each computing task according to preset scheduling principles (for example, when multiple Masters are available, a relatively idle Master can be chosen).
  • In other words, when the host 10 assigns a QoS ID to a computing task, it does so based on the type of service flow to which the computing task belongs, whereas the system scheduler 200 assigns computing tasks to the subsystems based on the task descriptor carried by each computing task, that is, the task type.
  • In the embodiment of the present application, the AI system can further include a host that uniformly receives the various computing tasks issued by users.
  • The host can identify and mark the types of business flows in the AI network model, giving the computing tasks under different business flows different service flow memory access QoS labels, that is, QoS IDs, so that the entire AI system can subsequently use these QoS IDs to perform reasonable and matching memory access QoS control on the computing tasks carrying them, finally realizing load balancing of the entire AI system's memory accesses and improving the comprehensive execution performance and efficiency of the entire AI system.
  • In a possible implementation, the AI SoC also includes a system scheduler; the host is further configured to: send the one or more computing tasks carrying corresponding QoS IDs to the system scheduler.
  • That is, after the host in the AI system identifies the service flows and attaches the QoS IDs, it can send these computing tasks carrying QoS IDs to the system scheduler on the AI SoC for subsequent allocation. In other words, after the host splits, identifies, and tags the tasks to be executed, it sends the processed computing tasks to the system scheduler, so that the system scheduler can then schedule and assign these tagged computing tasks (that is, tasks carrying matching QoS IDs).
  • It should be noted that priority distinctions may be made between competing service flows.
  • Different AI network models may correspond to different types of business flows, so the identification process also differs.
  • Different AI network models may have different business flows running concurrently; therefore, each AI network model can have its own dedicated business flow classification, with the flows classified by business flow and run concurrently.
  • It should be noted that a computing task (subtask) in this application can be a piece of code or a thread; a thread has many steps, and each step may invoke different functions. For example, some computing tasks normalize pictures, some do matrix operations, and some do addition and subtraction; these computations may use the GPU, the NPU, and communication modules respectively.
  • Nevertheless, the above computing tasks can all belong to the same service flow, have the same QoS ID, and have the same memory access QoS priority.
  • FIG. 2A is a schematic diagram of the relationship between business flows, graph nodes, and computing tasks provided by the embodiment of the present application.
  • An execution task may involve multiple business flows; a business flow may involve multiple graph nodes in the graph running phase; a graph node can contain multiple computing tasks; and each computing task is composed of multiple operators.
  • Each computing task is finally distributed, according to its task type (corresponding to the task type described by the task descriptor in this application), to a Master in the same or a different subsystem for execution.
  • The service flow label table contains the access data flows to the storage device issued during the AI model training or inference process.
  • Table 2 shows a possible traffic classification method, provided by the embodiment of the present application, for the currently known AI network models, and the matching relationships between the corresponding QoS IDs and QoS priorities.
  • With the evolution of network models, the service flow types and their matching relationships with QoS IDs and QoS priorities can be adapted and adjusted accordingly, and are not listed exhaustively here. The details are shown in Table 2 below:
  • FIG. 2B is a schematic diagram of service flow directions provided by the embodiment of the present application. As shown in FIG. 2B, H2D traffic refers to business flow data transmitted between the Host and a Device (accelerator card); D2D traffic refers to business flow data transmitted between different Devices (accelerator cards), including between SoCs inside the same Device; local traffic refers to an accelerator accessing business flow data in its local memory.
  • these business flow data in different directions travel through complex on-chip buses, or through interconnect technologies such as RDMA, PCIe, or HiSilicon's self-developed HCCS bus, and finally reach the memory controllers of each SoC, where they become read/write access request transactions issued to the memory cells (DDR/HBM).
  • the control traffic, intra-layer model-parallel feature map communication traffic, data-parallel parameter prefetching traffic, feature map sharing traffic, feature map prefetching traffic, embedding (Embedding) read/write traffic, data-parallel parameter global reduction (All Reduce) traffic, AI CORE computing traffic, CMO operation traffic, general-purpose CPU computing traffic, and image and video accelerator sample traffic correspond to different QoS IDs, namely 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, and 11. The QoS priorities corresponding to these different QoS IDs can be the same or different, depending on whether memory access competition exists between the flows. For example, the QoS ID corresponding to control traffic is 1, and its QoS priority is 1; the QoS IDs corresponding to data-parallel parameter prefetching traffic, feature map sharing traffic, feature map prefetching traffic, and Embedding read/write traffic are 3, 4, 5, and 6 respectively, yet the QoS priorities of these four service flows are all 3. In other words, different service flows have different QoS IDs, but different QoS IDs may correspond to the same or to different QoS priorities, depending on whether memory access competition exists between the service flows: if there is competition, different service flows can be given different QoS priorities, so as to provide differentiated QoS services for the computing tasks under different service flows and avoid excessive competition; if there is no competition, different service flows can share the same QoS priority.
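  • The competition-driven priority assignment described above can be pictured as a small greedy coloring problem. The sketch below is purely illustrative: `competes(a, b)` is a hypothetical predicate saying whether two flows contend for memory bandwidth, and the rule (competing flows get distinct priorities, non-competing flows may share one) is the text's rule, not the platform's actual table.

    def assign_priorities(flows, competes):
        # flows: list of QoS IDs; competes(a, b): hypothetical contention test.
        prio = {}
        for f in flows:
            # Priorities already taken by flows that compete with f.
            taken = {prio[g] for g in prio if competes(f, g)}
            p = 1
            while p in taken:
                p += 1
            prio[f] = p  # non-competing flows can end up sharing a priority
        return prio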
  • FIG. 2C is a schematic diagram of the relationship between service flow types and memory access bandwidth during the operation of a resnet50 network, provided by the embodiment of the present application. It shows the various business flows in the running of the resnet50 network, including forward calculation (FP), backward calculation (BP), gradient aggregation, and parameter update and application. It can be seen from FIG. 2C that after the gradient aggregation traffic of stage 1 (1671GB) is superposed concurrently with the backward calculation traffic (672GB), the sum exceeds the total bandwidth of 2000GB (2TB) that the HBM can provide. The embodiment of this application therefore raises the memory access bandwidth priority of gradient aggregation traffic, to reduce the tail time of the AI calculation process and ensure the shortest delay in each iteration round. If the memory access behavior and priorities of these service flows are not effectively managed and controlled, the various service flows will inevitably compete for precious memory bandwidth, and the resulting performance jitter is difficult to control and optimize.
  • the embodiment of the present application can classify and identify the service flows of an AI network model and then give different service flow types different labels (that is, QoS IDs). Based on the hardware of the AI SoC in the AI system, and according to the user's actual AI network traffic model, it gives customers a simple technical means of adjusting, for the different business flows, a set of QoS configuration parameters matched to the customer's actual AI network model, so as to ensure that the user obtains the best model training performance on the AI platform and to help customers release the maximum computing power of the hardware.
  • the embodiment of this application also provides a framework of a software stack running on the above-mentioned hardware architecture (that is, the Davinci software stack), which can be used to concretely implement the corresponding functions of the AI system described in this application. Note that the Davinci software stack is only one possible AI platform or software architecture (software stack) for implementing the AI system of this application, and is not intended to limit the AI platforms or software stacks applicable to any embodiment of this application.
  • the embodiment of the present application will give an exemplary description of each software module in the software stack framework, as well as the function of each software module or the software process involved therein.
  • Fig. 3A is a schematic framework diagram of a Davinci software stack provided by the embodiment of the present application. The framework of the Davinci software stack is mainly divided into the HOST side (corresponding to the host 10 side in this application) and the DEVICE side (corresponding to the AI SoC 20 side in this application); the software modules involved on the HOST side and the DEVICE side, and their corresponding functions, are described respectively below.
  • the software modules related to the QoS memory access control function in the AI system described in this application may specifically include the following:
  • GE (graph generation engine)
  • the graph generation engine (GE) module can, based on the context information of the model graph, the labels in the training script, and the operator types used by each subgraph, identify the different business flows in the training process, such as forward calculation, backward calculation, and collective communication; the different business flows are marked with labels pre-defined in the framework and then passed to GE/HCCL.
  • GE/FE/HCCL calls the API provided by the QoS Manager module according to the service flow label delivered by the graph switching and model distribution subsystem (Model Distribution Subsystem, MDS), to obtain the QoS ID and QoS priority corresponding to the label, and then passes this information to RUNTIME to assemble each task's send queue element (Send Queue Element, SQE); finally, RUNTIME delivers the SQEs of these tasks to the runtime send queue (RunTime Send Queue, RTSQ).
  • libQoSManager provides, based on the global QoS planning table, an application interface through which services such as GE/HCCL/DVPP apply for QoS resources (QoS configuration items) for their different data flows; GE/HCCL supplies a service flow label and calls this interface to obtain the QoS ID and QoS information corresponding to that label in the global planning table.
  • QoSAutoAdjustTools is a command-line tool running on the Host side. Its main functions include:
  • (1) Query the QoS configuration information of all Devices managed by the Host server, or of a specified Device, consistent with the MATA QoS registers: display in a list the QoS priority corresponding to each partid, the bandwidth high and low waterlines, whether hard limiting (hardlimit) is enabled, the service flow name corresponding to each QoS ID, the service flow label, and other information; the information can be displayed on the command line or saved to a specified file.
  • (2) For real-time monitoring, start a thread that periodically issues commands to the QoS Monitor driver on the device side to obtain real-time data; after the data is obtained, it can be saved to a specified file or displayed on the command line.
  • the implementation of these functions of the tool depends on the new QoS API interface provided by the Device State Manage Interface (DSMI) on the Host side; through this interface, and the forwarding of the Device's DSMI driver framework, the QoS driver on the DEVICE side finally implements the corresponding functions.
  • DSMI is a common application programming interface (API) for D-series chips. At the bottom layer, it communicates with the DSMI driver framework on the Device side through the host-device communication channel (Host Device Communication, HDC), and the DSMI driver framework calls the callback functions pre-registered by the device-side QoS driver to implement the related configuration delivery and status query functions. This mechanism greatly simplifies the management of QoS-related functions on the Device side. Since QoS-related configuration delivery and status query are new functions, new interfaces and function support need to be added to the DSMI module.
  • the DSMI driver framework is a set of general-purpose driver modules running in kernel mode. It provides an easily extensible implementation of the DSMI command forwarding framework and a kernel-mode command registration interface, which makes it convenient for new driver modules to implement DSMI interfaces; the QoS driver only needs to register with the framework the processing functions of the new commands required by QoS. After the framework receives QoS-related configuration and query commands from the HOST side, it automatically forwards them to the callback processing functions registered in advance by the QoS module.
  • the QoS Driver module is a kernel-mode module deployed on the Device side. It establishes a communication channel with the QoS Host Driver on the Host side through the DSMI kernel-mode interface to complete the main QoS management functions. It is mainly composed of the following four sub-modules:
  • the DSMI_HOOK module is responsible for implementing the various QoS configuration and query commands from the host side and registering the implementation interfaces of these commands into the DSMI driver framework. When the DSMI driver framework receives QoS-related commands, it automatically calls these callback functions registered in advance by QoS to complete the query and configuration of the QoS-related commands.
  • the IOCTL module is responsible for encapsulating the QoS driver as a character platform device driver, and for providing an IOCTL interface to the user-mode virtual machine processes on the DEVICE side in virtualization scenarios, through which the QoS ID of a process is configured.
  • the QoS Config module mainly implements two functions: first, it directly configures the QoS corresponding to each QoS ID applied for by the Host side into the MATA registers; second, it processes the QoS query commands of the Host-side QoS tools, returning the QoS configuration of all QoS IDs on the Device to the Host's QoS tools through the DSMI interface for display to the user.
  • FIG. 3B is a schematic diagram of the interaction process between various software modules in a Davinci software stack provided by the embodiment of the present application, which may specifically include the following process:
  • a character device is created to provide an IOCTL interface to the user-mode Device Management Protocol (DMP) program or virtual machine processes, so as to control the QoS driver to execute QoS configuration and query commands; note that in the virtualization scenario, multiple processes must be able to open the QoS device driver at the same time, and IOCTL commands may be concurrent;
  • the QoS driver module registers QoS-related command processing hook functions with the DSMI driver framework
  • the libQoSManager module provides an initialization interface, which is called by the NPUTOOL tool to perform QoS configuration initialization;
  • the libQoSManager module parses the configuration values of each QoS ID in the QoS global configuration table and loads them into memory;
  • the QoS configuration of each QoS ID is packaged into a command message and sent to the device side through the DSMI interface;
  • the DSMI module implements a transfer interface for QoS configuration messages.
  • the bottom layer is implemented on HDC communication; via HDC, the message is transmitted to the DSMI driver framework on the DEVICE side;
  • the DSMI driver framework on the DEVICE side parses the message, recognizes that this is a QoS-related command, and calls the callback hook function provided by the QoS device driver;
  • the QoSHook module interprets the QoS IDs in the configuration message and calls the MataConfig interface one by one to configure the bandwidth and QoS value of each QoS ID into the MATA hardware registers;
  • PCIe supports 48 independent channels, and each channel can be configured with its own QoS ID and QoS value. The QoS ID and QoS of the PCIe channels used on the data plane and on the management plane need to be configured separately, and the multiple PCIe channels on the data plane use the same QoS ID and QoS configuration.
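  • A condensed sketch of this configuration-delivery path (libQoSManager parsing the global table, DSMI/HDC transport, and the device-side hook writing MATA registers) is shown below; all function names and the message format are hypothetical stand-ins for the real interfaces.

    import json

    def load_global_qos_table(path):
        # Parse the QoS global configuration table into {qos_id: config}.
        with open(path) as f:
            return {int(k): v for k, v in json.load(f).items()}

    def host_push_qos_config(table, dsmi_send):
        # Package one command message per QoS ID and push it over DSMI/HDC.
        for qos_id, cfg in table.items():
            dsmi_send({"cmd": "QOS_CONFIG", "qos_id": qos_id,
                       "priority": cfg["priority"],
                       "bw_high": cfg["bw_high"], "bw_low": cfg["bw_low"]})

    def device_qos_hook(msg, mata_write_reg):
        # DEVICE side: the registered callback interprets the message and
        # writes priority and bandwidth waterlines into the MATA registers.
        if msg["cmd"] == "QOS_CONFIG":
            mata_write_reg(msg["qos_id"], msg["priority"],
                           msg["bw_high"], msg["bw_low"])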
  • the MDS module in GE is responsible for identifying the different service flows; the identified service flow labels (that is, QoS IDs), together with the information about which DEVICE and which DIE a service flow will run on, are passed into the GE execution framework during graph loading.
  • the GE execution framework applies for different QoS IDs for the different service flows (collective communication, AI computing, DVPP, etc.) according to the service flow label and the DEVICE ID/DIE ID information delivered by MDS, and saves them in its internal context;
  • GE loads the operator kernels through the interface provided by Runtime and carries these QoS IDs and QoS information in the tasks. After receiving these tasks, Runtime calls the interfaces of the related operators to construct the SQEs, based on the operator type, the QoS ID and QoS priority value, the hardware's SQE format, and other task information.
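  • The following minimal sketch illustrates the idea of folding the QoS fields into a task's SQE before it is pushed to the RTSQ; the byte layout is invented purely for illustration, since the real hardware SQE format is not described here.

    import struct

    def build_sqe(op_type: int, kernel_addr: int, qos_id: int, qos_prio: int) -> bytes:
        # <op_type:u16><qos_id:u8><qos_prio:u8><kernel_addr:u64> -- hypothetical layout
        return struct.pack("<HBBQ", op_type, qos_id, qos_prio, kernel_addr)

    def submit(task, runtime_queue, ge_context):
        # GE saved (qos_id, qos_prio) per service flow in its context earlier.
        qos_id, qos_prio = ge_context[task.flow_label]
        runtime_queue.append(build_sqe(task.op_type, task.kernel_addr, qos_id, qos_prio))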
  • in the UninitQoSLib interface, libQoSManager parses the standard Memory System Resource Partitioning and Monitoring (MPAM) configuration of each destination DEVICE in the QoS global configuration table, and packs it into a message that is sent through DSMI to clear the QoS configuration;
  • DSMI delivers the message, by means of its general mechanism, to the IOCTL submodule in the QoSDriver on the DEVICE side;
  • the IOCTL submodule parses the message, deletes each QoS ID from the Monitor, and clears the statistical data of that QoS ID in memory;
  • the IOCTL submodule likewise clears each QoS ID from the MATA configuration registers.
  • this application also provides a method for identifying service flow types based on the above service flow classification, which can cope with the endless stream of emerging AI network models and the industry's different framework platforms (such as the TensorFlow, Pytorch, and MindSpore frameworks). Accurate and reliable identification of the various service flow types in an AI network model is the basis for configuring appropriate QoS parameters (that is, QoS IDs) for the various service flows in the subsequent stage.
  • the embodiment of the present application takes the TensorFlow framework as an example, and describes the identification methods of various business flow types of the AI network model in combination with the above-mentioned Davinci computing platform software architecture.
  • service flow type identification in the embodiment of this application means, in the graph compilation/graph optimization stage of the AI network model, marking and classifying the data access types that each computing or communication node on the graph will issue in the graph execution stage.
  • the embodiment of the present application separates marking in the compilation stage from the running stage. Figure 4A is a schematic diagram of the graph compilation stage and graph running stage provided by the embodiment of the present application: in the graph generation of the AI network, abstract service flow type labels are used to classify and mark the different service flows; before the graph is loaded onto the device for execution, the abstract service flow labels are converted, by way of table lookup, into QoS IDs recognizable by the physical hardware.
  • for example, a task is composed of graph node A, graph node B, graph node C, graph node D, and graph node E, and each graph node can carry the corresponding abstract service flow type label (QoS Label) and a QoS value, where the QoS value can be the QoS priority of the corresponding service flow; before loading, each service flow type label (QoS Label) is replaced by a QoS ID that the AI system can recognize.
  • model compilation is often completed on the user's general-purpose CPU system (such as the host in this application), while model execution is completed on the dedicated AI computing SoC (such as the AI SoC 20 in this application).
  • in this way, users are presented with easy-to-understand service flow categories without perceiving the hardware QoS IDs tied to the physical implementation, and the platform framework does not need to expose the underlying physical implementation to ordinary users. This improves usability and system security at the same time, and facilitates subsequent evolution of the platform framework without users needing to perceive the changes.
  • the modules involved mainly include the TensorFlow framework, the graph generation engine (GE) inside the Davinci platform, the graph fusion engine (FE), and the graph switching and model distribution subsystem (Model Distribution Subsystem, MDS).
  • the TensorFlow framework provides a user-defined scope attribute capability when the Python scripting language is used to build an AI network model: graph nodes created within the scope automatically carry this attribute value.
  • Figure 4B is a script diagram for building a resnet50 AI model provided by the embodiment of the present application. The calculation process at line 106 of the script uses the QoS Label of 1 specified at line 105, and lines 119 to 124 use the QoS Label of 2 specified at line 118.
  • these QoS Label values are pre-defined enumeration values. The enumeration values of the various service flow types are defined as follows, with the corresponding values of QoSServiceLableType defined by QoSManager:
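  • To make the scope mechanism concrete, here is a toy Python sketch of what such a label enumeration plus a user-facing scope could look like; the member names, values, and the `qos_scope` helper are assumptions for illustration, not the platform's actual API (only the identifier QoSServiceLableType follows the text).

    import enum
    from contextlib import contextmanager

    class QoSServiceLableType(enum.IntEnum):  # identifier spelling follows the text
        CONTROL = 1
        FORWARD_COMPUTE = 2
        BACKWARD_COMPUTE = 3   # example values; the real ones come from QoSManager

    _current_label = [None]

    @contextmanager
    def qos_scope(label: QoSServiceLableType):
        prev = _current_label[0]
        _current_label[0] = label
        try:
            yield
        finally:
            _current_label[0] = prev

    def make_node(op_name):
        # Every node built inside a scope inherits the abstract label; the
        # lookup to a hardware QoS ID happens later, at graph load time.
        return {"op": op_name, "qos_label": _current_label[0]}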
  • at the bottom, the Davinci platform framework calls an internal lookup function to convert the enumeration value into a QoS ID that the hardware can recognize and process; thereafter, all memory access requests carry the QoS ID, so that the bottom-level memory access controller (that is, the memory controller in this application) determines the final execution strategy of each memory access request according to the QoS policy configured by software and other statistical information.
  • the Davinci platform, by providing an advanced Python API, gives users the ability to specify the scope of a given calculation process and its QoS marking.
  • in the framework of the Davinci platform, no matter which AI computing framework is used, after the model is built it is parsed into a front-end expression by an intermediate parser, and then compiled and optimized into the specific computing tasks that are finally executable on the Davinci platform.
  • Figure 4C is a schematic diagram of an executable computing task after graph compilation and optimization provided by the embodiment of the present application.
  • an AI network model is first constructed with a deep learning development framework such as TensorFlow/Pytorch/ME (corresponding to 1 in Figure 4C); the graph generation engine in the AI platform, such as the GE module, then converts the constructed AI network model into general AI graph nodes (corresponding to 4 in Figure 4C), and the conversion process may pass through compute engine plugins (corresponding to 2 or 3 in Figure 4C), such as the neural network computing architecture CANN, the Huawei collective communication library HCCL, and the preprocessing module Process.
  • during compilation and optimization, the internal modules of the Davinci platform merge and split the nodes in the graph, or add communication nodes and other operations, such as cache refresh operations. These nodes are generated by operations the Davinci platform performs in the background, so the platform knows the types of the various operations. For a newly added node, the platform inherits the QoS label of the node's parent; for specific operations, such as cache operations and the model-parallel data communication service flows automatically generated by cross-chip (DIE) deployment, the platform also automatically inserts predefined QoS labels.
  • for the CPU, the standard memory resource partitioning and monitoring capability (MPAM) provided by the ARM CPU can be used, together with the standard API interfaces provided by the Linux operating system, to configure the QoS ID and QoS information of the CPU and its processes.
  • the AI system in the embodiment of the present application may further include a function of continuously updating and optimizing the memory access parameters matched to the QoS priority of each different QoS ID.
  • the host or the target Master is further configured to pre-configure a corresponding second QoS priority for the QoS ID in the computing task, where the second QoS priority is the initial priority corresponding to that QoS ID. The initial QoS priority matched to each QoS ID can be configured by the host (Host), or by a register in the target Master. That is, the host side or the target Master also configures an initial (source) QoS priority, i.e. the second QoS priority, for each computing task, which amounts to configuring a matching QoS priority for each QoS ID, so that the relevant modules in the AI SoC can subsequently adjust the follow-up or final QoS priority based on this initial QoS priority.
  • the host is further configured to update and optimize the second QoS priority corresponding to each QoS ID according to real-time monitoring information of the AI system's memory access performance; that is, the initial priority corresponding to each QoS ID can be continuously tuned and optimized. Specifically, the host side can update and optimize the second QoS priority corresponding to each QoS ID in the system according to the real-time monitoring information of memory access performance, with automatic QoS optimization performed adaptively through an optimization algorithm or an adaptive machine learning algorithm.
  • the Davinci AI computing platform provided by the above embodiments can be mainly applied to AI training scenarios and inference scenarios, and provides memory access QoS control capabilities through the platform.
  • two difficulties are involved. One is how to accurately identify and classify the service flows. The other is how to efficiently obtain a set of optimal QoS configuration parameters that meets the needs of practical applications, so that ordinary users have a general solution for the various business flows and hardware accelerators when facing different training network scales, individualized machine computing power and memory bandwidth configurations, and AI server cluster scales.
  • Fig. 5A is a schematic diagram of a software architecture of QoS automatic optimization provided by the embodiment of the present application.
  • the software architecture mainly includes a system performance real-time monitoring module 501, a system working environment and supporting parameter input module 502, a QoS optimization algorithm module 503, a QoS configuration interface and driver 504, an optimization algorithm termination instruction module 505, and an optimization algorithm output module 506.
  • FIG. 5B is a schematic flowchart of a QoS automatic optimization method provided in the embodiment of the present application. The method flow in FIG. 5B is described in combination with the relevant modules in FIG. 5A, as shown in FIG. 5A and FIG. 5B.
  • the system performance real-time monitoring module 501 can collect, in real time, various key performance data during the operation of the AI network on the device. The key performance data may specifically include:
  • the system working environment and supporting parameter input module 502 can provide the current working environment parameters for the algorithm through configuration files. The current working environment parameters may specifically include:
  • the QoS optimization algorithm module 503 is used to perform multiple rounds of iterative probing based on a Bayesian machine learning algorithm and to collect the system's performance feedback data for the QoS tuning parameters, finally outputting a set of optimal QoS working parameters for the given working environment, which are recorded in a file as the QoS configuration parameters for the subsequent formal operation of the system.
  • the QoS configuration interface and driver 504 is used to provide a user-mode API interface to the QoS optimization algorithm, supporting the algorithm in adjusting, in real time, the bandwidth level and QoS priority configured on the memory access controller for each QoS ID.
  • the optimization algorithm termination instruction module 505 is used to issue an optimization algorithm termination indication under the following circumstances:
  • the jitter of the iteration time of N rounds meets the design requirements: the jitter within the server should not exceed 0.1ms, and the jitter between servers should not exceed 0.5ms;
  • the throughput of the system reaches the set target: for example, resnet50 reaches 9200fps.
  • the optimization algorithm output module 506 is used to output the final set of configurations into the QoS global planning configuration file after the optimal QoS configuration has been reached through continuous iteration, for example in the format of the following table.
  • Step S501: the QoS optimization algorithm module 503 reads the environment parameters from the system working environment and supporting parameter input module 502 and performs algorithm initialization; Step S502: send the QoS parameters of each QoS ID (the initial QoS priority corresponding to the QoS ID, i.e. the second QoS priority);
  • Step S503: the system performance real-time monitoring module 501 collects system performance data according to the performance collection interval, configuration parameters, etc.;
  • Step S504: the QoS optimization algorithm module 503 performs noise filtering on the collected performance data (such as Gaussian filtering or median filtering);
  • Step S505: judge whether the system performance indicators meet the optimization stop condition (for example, the mean square error of the multi-round iteration duration reaches a minimum or falls within a set threshold);
  • Step S506: if the stop condition is reached, the optimization algorithm output module 506 saves the obtained optimal QoS parameters in the result file, and the optimization algorithm termination instruction module 505 issues the optimization algorithm termination indication;
  • Step S507: if the stop condition is not reached, the optimization continues with the next round of iteration using a new set of QoS parameters.
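  • A schematic version of this S501-S507 loop is sketched below. The optimizer object is a stub standing in for the Bayesian algorithm, the measurement hooks are hypothetical, and trimming the extremes is a crude stand-in for the Gaussian/median filtering step.

    import statistics

    def autotune(optimizer, apply_qos_params, measure_iteration_times,
                 jitter_target_ms=0.1, max_rounds=100):
        best = None
        for _ in range(max_rounds):
            params = optimizer.suggest()            # S502: deliver QoS parameters
            apply_qos_params(params)
            samples = measure_iteration_times()     # S503: collect performance data
            samples = sorted(samples)[1:-1]         # S504: crude outlier filtering
            jitter = statistics.pstdev(samples)     # assumes enough samples remain
            optimizer.observe(params, jitter)       # feedback for the next round
            if best is None or jitter < best[1]:
                best = (params, jitter)
            if jitter <= jitter_target_ms:          # S505/S506: stop condition met
                break
        return best[0]                              # saved as the optimal QoS parameters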
  • the AI system in the embodiment of the present application may further include: a function of scheduling and allocating various computing tasks.
  • in combination with some embodiments provided by this application, how each computing task is scheduled to a suitable Master on a suitable subsystem in the AI system 01 or 02 is described in detail below.
  • the system scheduler is configured to: receive the one or more computing tasks to be executed sent by the host, where each computing task to be executed also carries a task descriptor describing the type of the computing task; select, according to the task descriptor carried in each computing task to be executed, a matching subsystem from the M subsystems and a matching Master from the one or more Masters in that subsystem; and schedule each computing task to be executed onto the matching Master in the matching subsystem.
  • the system scheduler can reasonably allocate all the computing tasks sent by the Host. The specific allocation principle can follow the task descriptor carried in each computing task, so that an appropriate subsystem and Master are allocated to each computing task according to the task type described in its task descriptor, to better complete the execution or acceleration of the various computing tasks. For example, an AI matrix calculation task is assigned to an appropriate AI subsystem and placed on an idle Master of that subsystem.
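  • The placement rule just described can be sketched as follows; the descriptor fields, `supported_types`, and the `pending` load counter are illustrative assumptions rather than the scheduler's real data structures.

    def schedule(task, subsystems):
        # task.descriptor["type"] is e.g. "matrix", "vision", or "comm".
        candidates = [s for s in subsystems
                      if task.descriptor["type"] in s.supported_types]
        if not candidates:
            raise RuntimeError("no subsystem matches the task type")
        # Pick the subsystem holding the least-loaded Master, then that Master.
        sub = min(candidates, key=lambda s: min(m.pending for m in s.masters))
        master = min(sub.masters, key=lambda m: m.pending)
        master.enqueue(task)   # the task still carries its QoS ID downstream
        return sub, master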
  • the above-mentioned AI system 03 in FIG. 1C can also be applied to bandwidth isolation between virtual machine tenants in a virtualization scenario; that is, the AI system in the embodiment of this application can further include functions such as memory bandwidth isolation and bandwidth commitment between different virtual machine tenants.
  • when the AI system is applied to a virtual scene, the AI system includes multiple virtual machines, where each virtual machine corresponds to one or more processes and one process includes one or more computing tasks; the one or more processes run on one or more Masters of at least one of the M subsystems; the system scheduler is further configured to allocate a VM ID to each virtual machine, where the VM ID of the corresponding virtual machine is shared in the page tables of the one or more processes corresponding to that virtual machine.
  • that is, a VM ID is assigned per virtual machine, and all processes under a virtual machine are set to correspond to the same VM ID, in order to isolate different virtual machines and ensure security isolation and non-interference between the users corresponding to different virtual machines.
  • one process includes one or more computing tasks; when the system is in a virtual scene, the target subsystem further includes a system memory management unit (SMMU); the target Master is further configured to send the memory access request of the computing task to the SMMU, and the QoS ID carried in the memory access request of the computing task is updated through the SMMU;
  • the SMMU is configured to: receive the memory access request of the computing task sent by the target Master; determine the target process to which the computing task belongs according to the virtual address and the substream identifier (SSID) in the memory access request; determine the VM ID of the target virtual machine corresponding to the target process according to the page table of the target process; and replace the QoS ID carried in the memory access request of the computing task with the VM ID of the target virtual machine.
  • each virtual machine process may include many computing tasks, that is, one process includes multiple computing tasks. The SMMU first determines which process (usually one of 32) the computing task belongs to, then looks up which virtual machine that process corresponds to, and then replaces the QoS ID carried in the computing task with that virtual machine's VM ID. It should be noted that when the AI system is applied to a virtual scene, the QoS ID carried by each computing task is replaced within the subsystem.
  • when the AI system is in a virtual scene, the initial QoS ID allocation and flow process must be replaced: QoS IDs are uniformly reassigned according to the virtual machine to which the process belongs. That is, the SMMU in each Master replaces the QoS ID carried in a received memory access request with the VM ID of the virtual machine corresponding to the process of the computing task that issued the request.
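  • The substitution itself is a pair of table lookups, as the sketch below shows; the two dictionaries stand in for the SMMU's CD-table/page-table state and are, of course, an illustrative simplification.

    def smmu_rewrite_qos(request, ssid_to_process, process_to_vm_id):
        process = ssid_to_process[request["ssid"]]   # which process issued it
        vm_id = process_to_vm_id[process]            # VM ID shared via the page table
        request["qos_id"] = vm_id                    # untrusted SQE value is replaced
        return request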
  • the primary purpose of bandwidth security isolation is to meet the basic needs of virtual machine users for data isolation, computing power resource isolation, and mutual non-interference. Furthermore, the problem of memory bandwidth isolation and bandwidth commitment among users of different virtual machines can also be solved.
  • the AI SoC further includes an L2 Cache; the L2 Cache is configured to receive the memory access requests of each computing task and, according to the QoS ID carried in each request, access the corresponding storage area in the L2 Cache, where memory access requests carrying different QoS IDs correspond to different storage areas in the L2 Cache.
  • in this way, the cache storage area that each memory access request can access is controlled, and the storage areas corresponding to different QoS IDs are securely isolated from one another. Since the virtual machine ID corresponding to the processes under each virtual machine is the VM ID, the VM ID can be carried as the QoS ID in the corresponding memory access requests, so that the cache can be partitioned on this basis to achieve the security isolation effect in the virtual machine scenario.
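  • One common way to realize such per-QoS-ID cache isolation is way partitioning; the scheme below is an assumed illustration, not a description of the actual L2 hardware.

    NUM_WAYS = 16          # assumed L2 associativity
    NUM_PARTITIONS = 4     # assumed number of isolated regions

    def way_mask_for(qos_id):
        ways_per_part = NUM_WAYS // NUM_PARTITIONS
        start = (qos_id % NUM_PARTITIONS) * ways_per_part
        return set(range(start, start + ways_per_part))

    def allowed_victim_ways(request):
        # On a miss, replacement may only evict lines inside the requester's
        # own partition, so tenants cannot evict each other's cache lines.
        return way_mask_for(request["qos_id"])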
  • the powerful computing power of the Davinci AI platform can be divided into multiple independent computing functional units, usually called computing-power- and resource-independent virtual functions (VFs), which, combined with the Single Root I/O Virtualization (SR-IOV) specification and mature software virtualization technologies such as Qemu+KVM, can provide customers with virtual machines having AI computing capabilities through cloud services.
  • the embodiment of the present application introduces the QoS control technology of memory access into a virtualization scenario, which can effectively solve the problems of memory bandwidth isolation and bandwidth commitment among different virtual machine tenants.
  • the application on the virtual machine is not trusted to carry QoS ID information in the SQE; it may carry different per-service-flow QoS information in the SQE, but after the SQE reaches the device (DEVICE) side, the QoS ID information it carries is replaced with the QoS ID configured in the SMMU accessed by the virtual machine, that is, the VM ID.
  • Figure 6A is a schematic diagram of the software architecture of an AI system in a virtual scene provided by the embodiment of the present application. The framework of the software stack shown in Figure 6A is mainly divided into the HOST side (that is, the host 10 side in this application) and the DEVICE side (that is, the AI SoC 20 side in this application); the software modules involved on each side and their corresponding functions are described below.
  • the software modules related to the QoS memory access control function of the AI system in the virtual application scenario in this application may specifically include the following:
  • each virtual machine has its own QoS global configuration table, and the virtual machines do not affect each other. The tool is provided to virtual machine users so that they can adjust the QoS priority of different service flows themselves; after adjustment, the result is saved in the virtual machine's QoS configuration table, and libQoSManager queries the QoS priority in this table and returns it to GE/HCCL for use. In the virtualization scenario, however, the tool does not support the bandwidth and QoS configuration delivery functions based on QoS ID.
  • the main functions of QoSDriver are basically the same as in the bare-metal scenario; however, in the virtualization scenario, the implementation of the provided IOCTL commands differs from bare metal, mainly in the implementation of the QoS ID configuration process, described as follows. QoSDriver also needs to add the QoS ID of each virtual machine to the Monitor in order to collect its actual bandwidth usage data.
  • SVM0 can be responsible for unified and centralized management of accelerator device drivers used by virtual machines.
  • This module is mainly used to realize the sharing of virtual addresses between the kernel and user mode processes. It is a module implemented by the kernel.
  • when the QoS driver calls the interface provided by this module, the interface traverses the kernel-mode device structures of the device drivers of all accelerators used by the virtual machine, calls the QoS ID setting interface provided by SMMU DRV, and configures the QoS ID in the SMMU CD table for each master that uses virtual addresses to access the HBM memory.
  • SMMU DRV (system memory management unit driver)
  • SMMU DRV is the SMMU driver provided by the kernel; it provides the function of configuring the QoS ID in the SMMU CD entries.
  • each virtual machine has only one unique QoS ID on each SoC: no matter how many processes the virtual machine has, the SSIDs corresponding to all of its processes are configured with the same QoS ID in the SMMU CD table.
  • HBM DRV (high bandwidth memory driver)
  • HBM DRV can provide an interface for the QoSDriver module through which the current effective theoretical total bandwidth of the SoC's HBM can be obtained. The HBM driver needs to comprehensively consider the different HBM capacity configurations, the HBM channel enablement states, the operating frequencies, and the PG, FG, and other actual conditions obtained after chip screening, in order to calculate an accurate current theoretical bandwidth.
  • This module is the device side device management driver (Devmm).
  • this module can provide a query interface for QoSDriver through which the computing power ratio of a given VF can be queried. A VM may have multiple VFs; QoSDriver needs to add up the computing power ratios of all the VM's VFs to calculate the HBM bandwidth share that the VM should be allocated, and then configure the bandwidth waterline of the QoS ID corresponding to the VM in MATA according to the total bandwidth returned by the HBM driver.
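  • The waterline calculation reduces to scaling the current theoretical HBM bandwidth by the VM's total compute share, as in the sketch below; the interfaces are hypothetical stand-ins for devmng, HBM DRV, and the MATA configuration call, and the 0.9 low-waterline margin is an assumption.

    def configure_vm_bandwidth(vm_id, vf_ids, devmng_ratio, hbm_total_bw_gbs,
                               mata_set_waterline):
        share = sum(devmng_ratio(vf) for vf in vf_ids)  # e.g. 0.25 for a 1/4 VM
        bw = hbm_total_bw_gbs * share
        mata_set_waterline(qos_id=vm_id, high=bw, low=bw * 0.9)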
  • the resource isolation function based on the MPAM mechanism provided by the Linux OS kernel is required, that is, the RESCTRL function.
  • Linux OS can configure QoS IDs for different processes or process groups; when the OS scheduler switches to a process for execution, it configures the CPU's MPAM register according to the QoS ID assigned to that process, so that the HBM read/write requests issued by the CPU carry the QoS ID assigned in advance to each process.
  • the RESCTRL module can provide a QoS ID setting interface for the corresponding process of the virtual machine in DEVICE. Before calling this interface, the process of the virtual machine on the DEVICE side needs to call the interface provided by the QoSDriver module to obtain the QoS ID of the process.
  • FIG. 6B is a schematic diagram of the interaction flow between various software modules in an AI system in a virtual application scenario provided by the embodiment of the present application. Based on the software architecture in FIG. 6A above, the above software architecture The specific execution process of each module in the virtualization scenario is described, which may specifically include the following process:
  • after the quality of service driver (QoSDriver) is started, it calls the interface provided by the HBM driver to obtain the current total HBM bandwidth of the SoC.
  • the IOCTL module in QoSDriver looks up the devmng driver according to the virtual machine's VM ID and VF ID to obtain the virtual machine's computing power ratio configuration; if the virtual machine has multiple VFs, the computing power ratios of all its VFs are added together, and a bandwidth allocation for the virtual machine is then calculated from the virtual machine's total computing power ratio and the total bandwidth obtained from HBM.
  • the virtual machine process also needs to call the RESCTRL interface provided by the operating system and, through this interface, set the QoS ID obtained from QoSDriver into the operating system, informing the OS of the process's QoS ID; when the OS scheduler schedules the task for execution, it configures the process's QoS ID into the CPU's MPAM register, so that the LOAD/STORE operations of the AI CPU process on the HBM carry the correct QoS ID.
  • after the QoSDriver obtains the SSID and QoS ID, it needs to call the SVM0 interface to find the instances of the kernel-mode driver device data structure (struct device) of all master devices used by the virtual machine.
  • QoSDriver also needs to add the QoS ID of the virtual machine to the Monitor to collect its actual bandwidth usage data.
  • the SMMU driver automatically senses the exit of the virtual machine process, and automatically completes the related cleaning and release work.
  • FIG. 7 is a schematic flow diagram of a memory access control method provided by an embodiment of the present application.
  • the memory access control method is applied to an artificial intelligence (AI) system. The AI system includes an AI system-on-chip (SoC); the AI SoC includes M subsystems and N memory controllers interconnected through the SoC bus; the M subsystems include a target subsystem, which is any one of the M subsystems and includes S Masters, where M, N, and S are all integers greater than or equal to 1. The memory access control method is applicable to any of the AI systems in the above-mentioned FIG. 1A to FIG. 1C, and to devices containing such an AI system (such as mobile phones, computers, and servers).
  • the method may include the following steps S701-S702.
  • Step S701: through the target processing node among the S processing nodes, receive the computing task to be executed, where the computing task carries a quality of service identifier (QoS ID); generate a memory access request for the computing task, where the memory access request carries the QoS ID; and send the memory access request to a target memory controller among the N memory controllers.
  • Step S702: receive the memory access request through the target memory controller and determine the first quality of service (QoS) priority corresponding to the QoS ID; based on the first QoS priority, perform memory access QoS control on the memory access request.
  • the computing task also carries a second QoS priority corresponding to the QoS ID, where the second QoS priority is the initial QoS priority corresponding to the QoS ID in the computing task.
  • the target subsystem further includes a sub-scheduler; sending the memory access request from the target Master to the target memory controller among the N memory controllers includes: sending the memory access request to the sub-scheduler through the target Master, and scheduling it to the target memory controller among the N memory controllers through the sub-scheduler. The method also includes: receiving, through the sub-scheduler, the memory access requests sent by the S Masters in the target subsystem, and scheduling them to the SoC bus according to the second QoS priority, which is the initial QoS priority of the corresponding QoS ID and indicates the priority with which the corresponding memory access request is dispatched to the SoC bus.
  • the sub-scheduler schedules the memory access requests sent by the S Masters to the SoC bus according to the second QoS priorities corresponding to the QoS IDs carried in those requests. Specifically, this includes: establishing task queues for the S Masters through the sub-scheduler, where each task queue contains the memory access requests sent by the corresponding Master and the target Master corresponds to the target task queue; when a target memory access request is currently inserted into the target task queue, raising the second QoS priorities corresponding to the QoS IDs carried in all memory access requests in the target task queue to a third QoS priority, where the target memory access request is a memory access request whose carried QoS ID corresponds to a second QoS priority exceeding a preset priority; and sending the memory access requests in the task queues of the S Masters to the SoC bus according to the second or third QoS priorities corresponding to the QoS IDs carried in those requests.
  • the method further includes: receiving, through the SoC bus, one or more memory access requests in the target task queue sent by the sub-scheduler, where the one or more memory access requests include the aforementioned memory access request; and restoring the third QoS priority corresponding to the QoS IDs carried in the one or more memory access requests in the target task queue to the corresponding second QoS priorities.
  • the method further includes: through the SoC bus, based on the restored second QoS priorities of the one or more memory access requests in the target task queue, dispatching the one or more memory access requests in the target task queue to the corresponding memory controllers among the N memory controllers.
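  • The escalate-and-restore behavior can be pictured with the toy queue below: inserting a request whose initial priority exceeds the preset threshold lifts the whole queue to the third priority, so the urgent request is not blocked behind earlier entries, and the bus restores the original priorities on receipt. Threshold values and field names are assumptions.

    from collections import deque

    PRESET_PRIO = 5      # assumed preset priority threshold
    ESCALATED_PRIO = 7   # assumed "third QoS priority"

    def enqueue(queue: deque, request):
        request["effective_prio"] = request["second_prio"]
        queue.append(request)
        if request["second_prio"] > PRESET_PRIO:
            for r in queue:                  # lift every queued request
                r["effective_prio"] = ESCALATED_PRIO

    def on_soc_bus_receive(request):
        request["effective_prio"] = request["second_prio"]  # restore original
        return request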
  • the AI SoC further includes an advanced memory access agent (MATA); dispatching, through the SoC bus, one or more memory access requests in the target task queue to the corresponding memory controllers among the N memory controllers includes: sending the one or more memory access requests in the target task queue to the MATA through the SoC bus, and dispatching them, through the MATA, to the corresponding memory controllers among the N memory controllers.
  • the AI SoC also includes an advanced memory access agent (MATA); the SoC bus is specifically used to send the memory access requests sent by the S Masters to the MATA, and the memory access requests sent by the S Masters, which include the aforementioned memory access request, are dispatched through the MATA to the corresponding memory controllers among the N memory controllers.
  • the method further includes: receiving the memory access request through the MATA and determining the second QoS priority corresponding to the QoS ID carried in it; and determining the first QoS priority corresponding to the QoS ID based on that second QoS priority, combined with the historical memory bandwidth statistics corresponding to the QoS ID and the memory access policy control parameters corresponding to the QoS ID, where the memory access policy control parameters include one or more of the highest bandwidth allowed for access requests, the lowest bandwidth, and the access priority.
  • the method further includes: through the MATA, presetting the memory access policy control parameters corresponding to each QoS ID, and counting and recording the historical memory bandwidth corresponding to each QoS ID; and updating and optimizing the memory access policy control parameters corresponding to each QoS ID according to real-time monitoring information of the AI system's memory access performance.
  • the AI SoC further includes the MATA; the method further includes: through the MATA, carrying the determined first QoS priority in the memory access request, and dispatching the memory access request to the target memory controller based on the first QoS priority.
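  • A plausible reading of how MATA could fold the waterlines and bandwidth history into the final priority is sketched below; the promote/demote rule is an assumption consistent with the waterline description, not a specification of the hardware algorithm.

    def first_priority(qos_id, second_prio, bw_stats, policy):
        # bw_stats[qos_id]: measured bandwidth over a recent window (GB/s);
        # policy[qos_id]: {"bw_high": ..., "bw_low": ...} waterlines.
        bw = bw_stats[qos_id]
        p = policy[qos_id]
        if bw > p["bw_high"]:
            return max(second_prio - 1, 0)   # over budget: demote
        if bw < p["bw_low"]:
            return second_prio + 1           # starved: promote
        return second_prio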
  • performing memory access QoS control on the memory access request based on the first QoS priority through the target memory controller includes: through the target memory controller, based on the first QoS priority corresponding to the QoS ID and in combination with the memory access service status of the target memory controller, performing memory access QoS control on the memory access request, where the memory access service status includes memory access timing requirements or memory bandwidth bus utilization.
  • the method further includes: when the amount of memory access requests received by the target memory controller is greater than a preset threshold, broadcasting a back-pressure indication to the M subsystems through the target memory controller, where the back-pressure indication is used to instruct one or more of the M subsystems to delay, reduce, or stop sending memory access requests.
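  • As a rough illustration of this back-pressure mechanism (the threshold, broadcast payload, and the halving reaction are all assumed):

    PENDING_THRESHOLD = 1024   # assumed preset threshold

    def on_request_arrival(controller, request, broadcast):
        controller.pending.append(request)
        if len(controller.pending) > PENDING_THRESHOLD:
            broadcast({"type": "BACKPRESSURE", "source": controller.id})

    def on_backpressure(subsystem, msg):
        # A subsystem may delay, reduce, or stop issuing requests;
        # here it simply halves its issue rate.
        subsystem.issue_rate = max(subsystem.issue_rate // 2, 1)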
  • the AI system further includes a host; the method further includes: receiving, through the host, a task to be executed and splitting it into one or more computing tasks to be executed; identifying the service flow types of the split computing tasks according to a preset service flow label table, where the preset service flow label table contains the mapping relationship between predefined service flow types of computing tasks and QoS IDs; and, according to the identification result, carrying the corresponding QoS ID in each of the one or more computing tasks to be executed.
  • the AI SoC further includes a system scheduler; the method further includes: sending, through the host, one or more computing tasks carrying corresponding QoS IDs to the system scheduler.
  • the method further includes: pre-configuring, through the host or through the target Master, the corresponding second QoS priority for the QoS ID in the computing task, where the second QoS priority is the initial priority corresponding to the QoS ID.
  • the method further includes: updating and optimizing the second QoS priority corresponding to each QoS ID through the host according to the real-time monitoring information of the memory access performance of the AI system.
  • the method further includes: receiving, through the system scheduler, the one or more computing tasks to be executed sent by the host, where each computing task to be executed also carries a task descriptor describing the type of the computing task; selecting, according to the task descriptor carried in each computing task to be executed, a matching subsystem from the M subsystems and a matching Master from the one or more Masters in that subsystem; and scheduling each computing task to be executed onto the matching Master in the matching subsystem.
  • when the AI system is applied to a virtual scene, the AI system includes multiple virtual machines, where each virtual machine corresponds to one or more processes and one process includes one or more computing tasks; the one or more processes run on one or more Masters of at least one of the M subsystems; the method also includes: assigning, through the system scheduler, a VM ID to each virtual machine, where the VM ID of the corresponding virtual machine is shared in the page tables of the one or more processes corresponding to that virtual machine.
  • the target subsystem further includes a system memory management unit (SMMU); the method further includes: sending, through the target Master, the memory access request of the computing task to the SMMU, and updating, through the SMMU, the QoS ID carried in the memory access request of the computing task; receiving, through the SMMU, the memory access request of the computing task sent by the target Master; determining, according to the virtual address and the substream identifier (SSID) in the memory access request, the target process to which the computing task belongs; determining the VM ID of the target virtual machine corresponding to the target process according to the page table of the target process; and replacing the QoS ID carried in the memory access request of the computing task with the VM ID of the target virtual machine.
  • the AI SoC also includes an L2 Cache; the method further includes: receiving the memory access requests of each computing task through the L2 Cache and, according to the QoS ID carried in each request, accessing the corresponding storage area in the L2 Cache, where memory access requests carrying different QoS IDs correspond to different storage areas in the L2 Cache.
  • the embodiment of the present application also provides a computer-readable storage medium, where the computer-readable storage medium can store a program, and the execution of the program by the AI system includes some or all of the steps described in any one of the above method embodiments.
  • the embodiment of the present application also provides a computer program, the computer program includes instructions, and when the computer program is executed by the AI system, the AI system can execute some or all steps of any memory access control method.
  • the disclosed devices may be implemented in other ways. The device embodiments described above are merely illustrative: the division into the above units is only a logical function division, and in actual implementation there may be other division methods; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical or other forms.
  • the units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • the above integrated units are realized in the form of software function units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the technical solution of the present application is essentially or part of the contribution to the prior art or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium , including several instructions to make a computer device (which may be a personal computer, server, or network device, etc., specifically, a processor in the computer device) execute all or part of the steps of the above-mentioned methods in various embodiments of the present application.
  • the aforementioned storage medium may include: U disk, mobile hard disk, magnetic disk, optical disc, read-only memory (Read-Only Memory, abbreviated: ROM) or random access memory (Random Access Memory, abbreviated: RAM) and the like.


Abstract

Embodiments of the present application disclose an AI system, a memory access control method, and related devices. The AI system may include an AI SoC comprising M subsystems and N memory controllers. The M subsystems include a target subsystem, which includes S processing nodes. A target processing node among the S processing nodes is configured to receive a to-be-executed computing task carrying a QoS ID, generate a memory access request for the computing task that carries the QoS ID, and send the request to a target memory controller among the N memory controllers. The target memory controller is configured to receive the request, determine a first QoS priority corresponding to the QoS ID, and perform memory access QoS control on the request based on the first QoS priority. The present application can improve the computing performance of an AI system.

Description

AI system, memory access control method, and related devices — Technical Field
The present invention relates to the technical field of electronic devices, and in particular to an AI system, a memory access control method, and related devices.
Background Art
In recent years, artificial intelligence (AI) has become one of the hottest technologies in the information and communication technology (ICT) computing field, with major hardware and software vendors releasing ever more capable products and driving rapid progress in AI hardware and software. On the hardware side, the compute power of a single AI chip has grown from about 10 tera operations per second (TOPS) to nearly 1000 TOPS, and through various interconnect technologies such powerful Systems on Chip (SoC) can be combined into AI computing cluster servers built from thousands or even hundreds of thousands of such SoCs, to meet the high speed and high accuracy demands of training and inference for all kinds of AI networks.
On the other hand, as AI and Internet technologies penetrate every industry, and with the large volumes of data those industries generate in production and daily life, the scale and complexity of problems addressed with AI, such as autonomous driving, natural language processing, and machine learning, have grown enormously, driving a geometric increase in the scale and complexity of AI network models. For example, the Generative Pre-trained Transformer (GPT) language models proposed by OpenAI have grown as shown in Table 1:
Table 1
Model    Release date      Parameters     Pre-training data
GPT      June 2018         117 million    about 5 GB
GPT-2    February 2019     1.5 billion    about 40 GB
GPT-3    May 2020          175 billion    about 45 TB
Generally, AI computing falls into two categories by application: training and inference. Training AI network models of this scale within an acceptable time requires a high-performance AI computing cluster. In the prior art, completing such computing tasks relies on the powerful compute of AI cluster servers and generates large amounts of input/output and intermediate data, such as sample data (text, speech, images, video), the networks' weights/parameters, gradient data, and the feature maps produced during model training. These data are usually stored in high-speed memory on the SoC. For example, an AI computing SoC contains many concurrent compute hardware units, and the AI chip must frequently access on-SoC memory data during computation, e.g., staging data to memory or reading data back, so memory bandwidth is often the key bottleneck limiting AI computing performance. As another example, during training and inference, large-scale models are usually executed cooperatively by an AI cluster; inside and between the cluster's service nodes (servers) there simultaneously exist concurrent communication flows (such as model feature maps and parameter weights), AI compute flows, and the data flows of various dedicated hardware accelerators (e.g., the Davinci Vision Pre-Processor (DVPP), the Audio Signal Processor (ASP), the Image Signal Processor (ISP)). If these highly concurrent data flows are left uncontrolled, they likewise cause severe system performance degradation when accessing memory.
Therefore, how to use the precious memory bandwidth on the SoC efficiently and rationally to improve AI computing performance is one of the problems to be solved urgently.
Summary of the Invention
Embodiments of the present application provide an AI system, a memory access control method, and related devices, so as to improve the computing performance of the AI system.
In a first aspect, an embodiment of the present application provides an artificial intelligence (AI) system, comprising an AI System on Chip (SoC). The AI SoC comprises M subsystems and N memory controllers interconnected by an SoC bus. The M subsystems include a target subsystem, which is any one of the M subsystems and includes S processing nodes, where M, N, and S are integers greater than or equal to 1. A target processing node among the S processing nodes, which is any one of them, is configured to: receive a to-be-executed computing task carrying a quality-of-service identifier (QoS ID), the QoS ID indicating the category to which the computing task belongs; generate a memory access request for the computing task, the request carrying the QoS ID; and send the request to a target memory controller among the N memory controllers. The target memory controller is configured to: receive the request, determine a first QoS priority corresponding to the QoS ID carried in the request, and perform memory access QoS control on the request based on the first QoS priority.
In the AI computing field, the embodiments of the present application introduce an on-chip memory access quality-of-service (QoS) control technique. Each computing task to be dispatched onto the AI SoC is QoS-tagged, and tasks of different categories carry different QoS IDs (categories may be drawn, for example, by the business flow a task belongs to, or by its memory access latency requirement). The QoS priority of each task's memory access requests can then be determined from the QoS ID the task carries, and QoS control is ultimately applied to each request based on that priority. Memory access QoS is thus controlled at the granularity of computing tasks, and different memory access service guarantees can be provided for tasks of different categories (e.g., under different business flows), yielding better AI computing performance from the system's existing compute power and memory bandwidth. This differs from the prior art, which controls memory access only at the level of processing nodes in the SoC (all requests from the same processing node receive uniform service quality) and therefore cannot satisfy the actual access needs of the various task categories (such as tasks belonging to different business flows), degrading AI computing performance. Specifically, when computing tasks are assigned to the processing nodes of the subsystems in the AI SoC (for ease of description, later embodiments may describe a processing node using a Master as the example), each task carries a QoS ID indicating its category, from which a matching memory access QoS priority can ultimately be determined. The rationale is that in AI computing, tasks of different categories (e.g., under different business flows) have different memory access service requirements, and access contention exists between some categories but not others; setting a QoS priority matched to a task's category therefore better satisfies the access needs of the different categories (different categories map to different QoS IDs, but different QoS IDs may map to the same or different QoS priorities). Further, while executing a received task, a processing node generates the task's memory access requests from the memory addresses and data the task needs, and each request continues to carry the task's QoS ID; the QoS ID thus flows with the task into its requests, so that the memory controller receiving a request can apply access control at the corresponding priority. For example, the higher the QoS priority corresponding to a QoS ID, the better the memory access service the controller can provide for requests carrying it, so that tasks with different priority needs receive different access QoS control, avoiding the severe performance degradation caused in the prior art by undifferentiated, random seizure of scarce memory bandwidth. In summary, the embodiments achieve memory access QoS control at task granularity, solving the bandwidth shortage caused when different task categories (e.g., different business-flow types) contend concurrently for memory bandwidth in AI training and inference. Because access service follows the priority of the QoS ID in each request, latency-critical tasks in training and inference can be served first, memory bandwidth is used more fully and efficiently, memory access load across the AI system is balanced, and the overall execution performance and efficiency of the AI system are improved. A minimal sketch of this end-to-end flow is given below.
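The following Python sketch illustrates the end-to-end idea only: a task's QoS ID flows unchanged into its memory access request, and the controller arbitrates by the priority mapped from that QoS ID. The class names, the priority table contents, and the smaller-value-is-higher convention are assumptions for illustration, not the patent's concrete implementation.

import heapq
from dataclasses import dataclass, field

# Hypothetical qos_id -> initial QoS priority table (smaller value = higher priority).
QOS_PRIORITY_TABLE = {1: 1, 2: 2, 3: 3}

@dataclass(order=True)
class MemRequest:
    priority: int                       # drives heap ordering
    addr: int = field(compare=False)
    qos_id: int = field(compare=False)

class MemoryController:
    def __init__(self):
        self.queue = []

    def submit(self, req):
        heapq.heappush(self.queue, req)   # arbitration favors the smaller priority value

    def service_one(self):
        return heapq.heappop(self.queue) if self.queue else None

def master_issue(task_qos_id, addr, mc):
    # The QoS ID carried by the task flows unchanged into its memory access request.
    req = MemRequest(priority=QOS_PRIORITY_TABLE[task_qos_id], addr=addr, qos_id=task_qos_id)
    mc.submit(req)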
In one possible implementation, the computing task further carries a second QoS priority corresponding to the QoS ID, the second QoS priority being the initial QoS priority of the QoS ID in the task.
In this embodiment, besides the QoS ID, a computing task dispatched to a processing node (e.g., a Master) may also carry the initial QoS priority (the second QoS priority) corresponding to its QoS ID. That is, a task can be configured with its QoS ID and initial QoS priority at dispatch time, so that subsequent QoS priority regulation and the corresponding memory QoS access control can be based on them. Optionally, the QoS ID carried in a task's memory access request may remain unchanged as the request travels from the Master to the target memory controller, while the corresponding QoS priority may be adjusted and optimized according to the differing needs and circumstances encountered during scheduling.
In one possible implementation, the target subsystem further includes a sub-scheduler. The target processing node is specifically configured to send the memory access request to the sub-scheduler, which schedules it to the target memory controller among the N memory controllers. The sub-scheduler is configured to: receive the memory access requests sent by the S processing nodes of the target subsystem; and schedule them onto the SoC bus according to the second QoS priorities corresponding to the QoS IDs carried in those requests, the second QoS priority being the initial QoS priority of the corresponding QoS ID and indicating the priority with which the corresponding request is scheduled onto the SoC bus.
In this embodiment, each subsystem of the AI SoC further includes a sub-scheduler that schedules the memory access requests of the computing tasks executing on all of the subsystem's processing nodes (e.g., Masters). The requests produced by these Masters are scheduled by the internal sub-scheduler, sent to the SoC bus for arbitration, address resolution, and routing, and then issued to the corresponding memory controller for memory access. Because each request carries the QoS ID of its task, the sub-scheduler can schedule requests carrying higher-priority QoS IDs onto the SoC bus first and defer those carrying lower-priority QoS IDs, so that QoS priorities are already respected when requests are issued to the SoC bus, providing each task with access control matched to its QoS ID from the very source of the AI system.
In one possible implementation, the sub-scheduler is specifically configured to: establish a task queue for each of the S processing nodes, each queue containing the memory access requests sent by the corresponding node, the target processing node corresponding to a target task queue; when a target memory access request is currently inserted into the target task queue, raise the second QoS priorities of the QoS IDs carried by all requests in that queue to a third QoS priority, the target request being one whose carried QoS ID has a second QoS priority exceeding a preset priority; and send the requests in the S task queues to the SoC bus in order of the second or third QoS priorities of their carried QoS IDs.
In this embodiment, during scheduling, the sub-scheduler creates one task queue per processing node (e.g., per Master) and places all requests produced by that node in the queue, sending them to the SoC bus in order of the QoS priorities of their carried QoS IDs. When a request of higher QoS priority appears in some queue, then, to prevent all requests in the queue from being blocked because a low-priority request sits at the queue front (head-of-line blocking), the sub-scheduler raises the QoS priority of every request in that queue (from the second QoS priority to the third QoS priority). As a result, no request in the queue (in particular the high-priority request just mentioned) can be blocked by a lower-priority request at the head of the queue, improving the overall efficiency and effectiveness of QoS-graded access control, as shown in the sketch below.
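A minimal Python sketch of the sub-scheduler's queue-boosting behavior, under the assumption that request objects expose a mutable priority attribute and that a smaller value means a higher priority; the threshold and boosted value are illustrative, not values from the text.

from collections import deque

PRESET_PRIORITY = 2      # the "preset priority" threshold (smaller value = higher priority)
BOOSTED_PRIORITY = 0     # stands in for the "third QoS priority"

class SubScheduler:
    def __init__(self, n_masters):
        self.queues = [deque() for _ in range(n_masters)]  # one FIFO task queue per Master

    def enqueue(self, master, req):
        q = self.queues[master]
        q.append(req)
        if req.priority < PRESET_PRIORITY:
            # A request exceeding the preset priority arrived: lift every queued
            # request of this Master so it cannot be blocked at the queue head.
            for r in q:
                r.priority = BOOSTED_PRIORITY

    def dispatch_to_soc_bus(self):
        # Among all queue heads, pick the request with the best (smallest) priority.
        heads = [(q[0].priority, i) for i, q in enumerate(self.queues) if q]
        if not heads:
            return None
        _, i = min(heads)
        return self.queues[i].popleft()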
In one possible implementation, the SoC bus is configured to: receive one or more memory access requests of the target task queue sent by the sub-scheduler, including the aforementioned request; and restore the third QoS priorities of the QoS IDs carried by those requests to the corresponding second QoS priorities.
In this embodiment, once the queued requests have had their QoS priorities adjusted inside the sub-schedulers and been scheduled onto the SoC bus accordingly, the risk of queues being blocked by low-priority requests has already been eliminated by the in-subsystem adjustment. Therefore, after the requests reach the SoC bus, their earlier priorities can be restored, i.e., from the third QoS priority back to the second, so that access QoS control proceeds according to the QoS priorities the AI system initially assigned to the QoS IDs of the computing tasks.
In one possible implementation, the SoC bus is further configured to schedule the one or more requests of the target task queue to the corresponding memory controllers among the N memory controllers based on their restored second QoS priorities.
In this embodiment, after restoring the QoS priorities of the requests scheduled out of the subsystems to the initial second QoS priority, the SoC bus can schedule the requests according to the restored priority, dispatching each to its corresponding memory controller for subsequent access QoS control and memory access.
In one possible implementation, the AI SoC further includes a Memory Advanced Technology Agent (MATA). The SoC bus is specifically configured to send the one or more requests of the target task queue to the MATA, which schedules them to the corresponding memory controllers. Optionally, in another possible implementation, the SoC bus sends the requests from the S processing nodes to the MATA, which schedules them to the corresponding memory controllers, those requests including the aforementioned request.
In this embodiment, the AI SoC may further include a memory access agent, the MATA, for memory access control. When the SoC bus schedules the subsystems' requests to the memory controllers, it may do so specifically through the MATA; through the MATA the memory controllers can be controlled and managed in a coordinated way, and received requests can be further regulated, for example by further optimizing the second QoS priorities of their QoS IDs.
In one possible implementation, the MATA is configured to: receive the memory access request and determine the second QoS priority of the QoS ID carried in it; and determine the first QoS priority of the QoS ID based on the second QoS priority together with the historical memory bandwidth statistics for the QoS ID and the access policy control parameters for the QoS ID, where the access policy control parameters include one or more of the maximum bandwidth, the minimum bandwidth, and the access priority allowed for access requests.
In this embodiment, after receiving the requests scheduled over by the SoC bus, the MATA can further optimize and adjust the initial priority (the second QoS priority) carried in each. Specifically, before a request is dispatched to a memory controller, the MATA can generate the QoS ID's final QoS priority (the first QoS priority) from the initial priority combined with the historical memory bandwidth statistics the MATA records for that QoS ID and the QoS ID's access policy control parameters, so that the target memory controller ultimately applies access QoS control according to that final priority. That is, the MATA considers not only the QoS priority the AI system initially configured for each QoS ID, but also each QoS ID's historical bandwidth statistics (e.g., the memory bandwidth already obtained by the class of tasks carrying a given QoS ID) and the access policy parameters that class is configured to receive (e.g., maximum bandwidth, minimum bandwidth, and access priority), weighing them to decide what access QoS control service the current request should receive. This yields a QoS priority matched to the request, more precise access QoS control, and further improved AI computing performance. For instance, if the requests of some QoS ID have already consumed much memory bandwidth, its QoS priority can be lowered for balance across QoS IDs; if they currently consume little bandwidth, its priority can be raised to compensate.
In one possible implementation, the MATA is further configured to: preconfigure the access policy control parameters for each QoS ID and collect and record the historical memory bandwidth of each QoS ID; and update and optimize the access policy control parameters of each QoS ID according to real-time monitoring information on the AI system's memory access performance.
In this embodiment, the MATA also configures access policy control parameters per QoS ID and records each QoS ID's historical memory bandwidth, using the two to decide whether to raise or lower the QoS priority relative to a QoS ID's initial priority, finally determining the ultimate QoS priority of the QoS ID's requests so that the memory controller can perform concrete access QoS control accordingly. For example, the MATA can set the maximum bandwidth, minimum bandwidth, and access priority allowed for requests carrying a given QoS ID. Further, the MATA can update and optimize the access policy parameters of each QoS ID according to real-time monitoring of the AI system's memory access performance, for example through search-based optimization and adaptive machine learning algorithms.
In one possible implementation, the MATA is further configured to carry the first QoS priority in the memory access request and schedule the request to the target memory controller based on the first QoS priority. Optionally, besides the first QoS priority, the MATA may also keep carrying the QoS ID in the request sent to the memory controller. Optionally, in another possible implementation, the AI SoC further includes a MATA configured to carry the determined first QoS priority in the request and schedule the request to the target memory controller based on it.
In this embodiment, once the MATA has determined a request's final priority, it can carry that final priority (the first QoS priority) in the request sent to the corresponding memory controller so that the controller can perform access QoS control according to it, and the MATA can also schedule the request to the target controller based on it. Optionally, if the MATA also keeps the QoS ID in the request, the controller can make access QoS decisions using both the first QoS priority and the QoS ID; for example, the controller can use the QoS ID to compute the historical memory bandwidth that QoS ID's requests occupy on the controller itself and further optimize access QoS control accordingly.
In one possible implementation, the target memory controller is specifically configured to perform access QoS control on the request based on the first QoS priority of the QoS ID combined with the controller's memory access service conditions, which include memory access timing requirements or memory bandwidth bus utilization.
In this embodiment, after requests are scheduled to the memory controllers by the SoC bus through the MATA, a controller can perform access QoS control on a request according to the final QoS priority it carries (the first QoS priority) combined with the controller's current service conditions. That is, the controller considers not only the QoS priority the MATA finally generated for each QoS ID, but also its own current service conditions (e.g., its access timing requirements and memory bandwidth bus utilization), enabling more precise access QoS control and further improving AI computing performance. Optionally, when a received request also carries the QoS ID, the controller can further compute from it the historical memory bandwidth that QoS ID's requests occupy on the controller itself and optimize access QoS control accordingly.
In one possible implementation, the access service conditions include memory access timing requirements or memory bandwidth bus utilization.
In this embodiment, the memory controller ultimately performs memory access control on each request according to current memory service conditions, so that access QoS control per request is more accurate and reasonable: rather than controlling access solely by the task's QoS priority, the controller can additionally weigh its actual current conditions, such as access timing requirements and bandwidth bus utilization, in deciding what access QoS control service to provide for the current request.
In one possible implementation, the target memory controller is further configured to broadcast a backpressure indication to the M subsystems when the volume of memory access requests it receives exceeds a preset threshold, the indication instructing one or more of the M subsystems to delay, reduce, or stop sending requests.
In this embodiment, when a memory controller receives too many requests, it can instruct the relevant subsystems to reduce, delay, or even stop the requests they are sending; upon receiving the indication, a subsystem can adjust its sending according to its own situation, e.g., temporarily suspending or stopping sending requests to the SoC bus.
In one possible implementation, the AI SoC further includes a host configured to: receive a to-be-executed job and split it into one or more to-be-executed computing tasks; identify the business flow types of the split tasks according to a preset business flow label table, which contains predefined mappings between the business flow types of computing tasks and QoS IDs; and, according to the identification result, attach the corresponding QoS ID to each task.
In this embodiment, besides the subsystems that execute computing tasks and the memory controllers, the AI system may further include a host (Host) that uniformly receives the various computing tasks issued by users. The host can identify and mark the categories of business flows in the AI network model, i.e., assign different access QoS labels (QoS IDs) to the computing tasks under different business flows, so that the whole AI system can later apply reasonable, matched access QoS control to tasks carrying those QoS IDs, ultimately balancing the memory access load of the whole AI system and improving its overall execution performance and efficiency.
In one possible implementation, the system further includes a system scheduler, and the host is further configured to send the one or more computing tasks carrying their QoS IDs to the system scheduler.
In this embodiment, after identifying the business flows and attaching QoS IDs, the host can send the labelled tasks to the system scheduler on the AI SoC for subsequent assignment. That is, after the host splits, identifies, and labels the to-be-executed job, it delivers the processed tasks to the system scheduler, which then schedules and assigns these already-labelled tasks (i.e., tasks carrying matched QoS IDs).
In one possible implementation, the host or the target processing node is further configured to preconfigure the second QoS priority for the QoS ID in the computing task, the second QoS priority being the initial priority of the QoS ID.
In this embodiment, the host side, or the target processing node internally, also configures the initial QoS priority (the second QoS priority) for each computing task, i.e., configures a matching QoS priority per QoS ID, so that subsequent modules in the AI SoC can adjust the in-flight or final QoS priority on this basis.
In one possible implementation, the host is further configured to update and optimize the second QoS priority of each QoS ID according to real-time monitoring information on the AI system's memory access performance.
In this embodiment, the host can update and optimize the initial QoS priorities of the QoS IDs in the system according to real-time monitoring of memory access performance, e.g., performing automatic QoS optimization adaptively through search-based optimization and adaptive machine learning algorithms.
In one possible implementation, the system scheduler is configured to: receive the one or more to-be-executed computing tasks sent by the host, each further carrying a task descriptor describing the task type; select, according to the task descriptor of each task, a matching subsystem from the M subsystems and a matching processing node from the one or more processing nodes of that subsystem; and schedule each task to the matching processing node in the matching subsystem.
In this embodiment, after the host has identified the business flows, attached QoS IDs, and sent the tasks to the system scheduler on the AI SoC, the scheduler can allocate all tasks sent by the Host reasonably, e.g., by the task descriptor carried in each task, assigning each task a suitable subsystem and processing node according to the task type the descriptor describes, so that each task is executed or accelerated better. For example, an AI matrix computing task is assigned to a suitable AI subsystem and to an idle processing node on it.
In one possible implementation, when the AI system is applied in a virtualization scenario, it includes multiple virtual machines, each corresponding to one or more processes, a process including one or more computing tasks; the one or more processes run on one or more processing nodes of at least one of the M subsystems. The system scheduler is further configured to assign each virtual machine a VM ID, the page tables of the one or more processes of each virtual machine sharing the VM ID of that virtual machine.
In this embodiment, in a virtualization scenario, a VM ID is assigned per virtual machine and all processes under a machine correspond to that same VM ID, the purpose being to isolate different virtual machines so as to guarantee secure isolation and mutual non-interference between the users of different virtual machines.
In one possible implementation, in the virtualization scenario, the target subsystem further includes a system memory management unit (SMMU). The target processing node is further configured to send the computing task's memory access request to the SMMU, which updates the QoS ID carried in the request. The SMMU is configured to: receive the request sent by the target processing node; determine, from the virtual address and the service set identifier SSID in the request, the target process to which the computing task belongs; determine, from the target process's page table, the VM ID of the target virtual machine corresponding to the process; and replace the QoS ID carried in the request with that VM ID.
In this embodiment, when the AI system is in a virtualization scenario, the original QoS ID assignment and circulation flow is replaced by assigning QoS IDs per the virtual machine a process belongs to: each processing node, through its SMMU, replaces the QoS ID carried in a received request with the VM ID of the virtual machine of the process owning the request's computing task. The purpose is to make bandwidth security isolation the primary goal in this scenario, satisfying the basic needs of virtual machine users for data isolation, compute resource isolation, and mutual non-interference; further, it also solves the problems of memory bandwidth isolation and bandwidth commitment between the users of different virtual machines.
In one possible implementation, the AI SoC further includes an L2 Cache configured to: receive the memory access requests of the computing tasks and access the corresponding storage region in the L2 Cache according to the QoS ID carried in each request, requests carrying different QoS IDs corresponding to different regions of the L2 Cache.
In this embodiment, the QoS ID carried in each request controls which cache region the request may access, i.e., the QoS ID is used to securely isolate the corresponding regions of the cache. Since the processes under each virtual machine correspond to the machine's ID, the VM ID, the VM ID can be carried as the QoS ID in the corresponding requests, and cache isolation can be performed on this basis to achieve secure isolation in the virtual machine scenario.
In a second aspect, an embodiment of the present application provides a memory access control method applied to an artificial intelligence AI system. The AI system includes an AI SoC comprising M subsystems and N memory controllers interconnected by an SoC bus; the M subsystems include a target subsystem, which is any one of the M subsystems and includes S processing nodes, M, N, and S being integers greater than or equal to 1. The method includes: receiving, through a target processing node among the S processing nodes (any one of them), a to-be-executed computing task carrying a QoS ID that indicates the category the task belongs to; generating a memory access request for the task, the request carrying the QoS ID; sending the request to a target memory controller among the N memory controllers; and, through the target memory controller, receiving the request, determining a first QoS priority corresponding to the QoS ID, and performing memory access QoS control on the request based on the first QoS priority. The possible implementations of this aspect mirror those of the first aspect, expressed as method steps:
In one possible implementation, the computing task further carries a second QoS priority corresponding to the QoS ID, the initial (base) QoS priority of that QoS ID.
In one possible implementation, the target subsystem further includes a sub-scheduler; the target processing node sends the request to the sub-scheduler, which schedules it to the target memory controller. Through the sub-scheduler, the requests sent by the S processing nodes are received and scheduled onto the SoC bus according to the second QoS priorities of their carried QoS IDs, the second QoS priority being the initial QoS priority of the corresponding QoS ID and indicating the priority with which a request is scheduled onto the SoC bus.
In one possible implementation, the sub-scheduler establishes a task queue per processing node, each containing the requests sent by that node, the target processing node corresponding to a target task queue; when a target request whose carried QoS ID's second QoS priority exceeds a preset priority is inserted into the target queue, the second QoS priorities of all requests in that queue are raised to a third QoS priority, and the requests of the S queues are sent to the SoC bus in order of their second or third QoS priorities.
In one possible implementation, through the SoC bus, the one or more requests of the target task queue sent by the sub-scheduler (including the aforementioned request) are received, and the third QoS priorities of their carried QoS IDs are restored to the corresponding second QoS priorities.
In one possible implementation, through the SoC bus, the requests of the target task queue are scheduled to the corresponding memory controllers based on the restored second QoS priorities.
In one possible implementation, the AI SoC further includes a Memory Advanced Technology Agent MATA; through the SoC bus, the requests of the target task queue are sent to the MATA, which schedules them to the corresponding memory controllers. Optionally, in another possible implementation, the SoC bus sends the requests of the S processing nodes to the MATA, which schedules them to the corresponding controllers, those requests including the aforementioned request.
In one possible implementation, through the MATA, the request is received, the second QoS priority of its carried QoS ID is determined, and the first QoS priority of the QoS ID is determined based on the second QoS priority combined with the historical memory bandwidth statistics and the access policy control parameters of the QoS ID, the parameters including one or more of the maximum bandwidth, minimum bandwidth, and access priority allowed for requests.
In one possible implementation, through the MATA, the access policy control parameters of each QoS ID are preconfigured, the historical memory bandwidth of each QoS ID is collected and recorded, and the parameters are updated and optimized according to real-time monitoring of the AI system's memory access performance, for example through search-based optimization and adaptive machine learning algorithms.
In one possible implementation, through the MATA, the first QoS priority is carried in the request and the request is scheduled to the target memory controller based on it. Optionally, in another possible implementation, the AI SoC further includes a MATA through which the determined first QoS priority is carried in the request and the request is scheduled to the target memory controller based on it.
In one possible implementation, through the target memory controller, access QoS control is performed on the request based on the first QoS priority of the QoS ID combined with the controller's access service conditions, which include memory access timing requirements or memory bandwidth bus utilization.
In one possible implementation, when the volume of requests received by the target memory controller exceeds a preset threshold, a backpressure indication is broadcast through it to the M subsystems, instructing one or more of them to delay, reduce, or stop sending requests.
In one possible implementation, the AI system further includes a host; through the host, a to-be-executed job is received and split into one or more computing tasks, the business flow types of the split tasks are identified according to a preset business flow label table containing predefined mappings between business flow types and QoS IDs, and the corresponding QoS IDs are attached according to the identification result.
In one possible implementation, the AI SoC further includes a system scheduler; through the host, the tasks carrying their QoS IDs are sent to the system scheduler.
In one possible implementation, through the host or the target Master, the second QoS priority corresponding to the QoS ID in the task is preconfigured, being the initial priority of the QoS ID.
In one possible implementation, through the host, the second QoS priorities of the QoS IDs are updated and optimized according to real-time monitoring of the AI system's memory access performance, for example through search-based optimization and adaptive machine learning algorithms.
In one possible implementation, through the system scheduler, the tasks sent by the host are received, each carrying a task descriptor describing its type; matching subsystems and processing nodes are selected according to the descriptors, and the tasks are scheduled to the matching processing nodes of the matching subsystems.
In one possible implementation, when the AI system is applied in a virtualization scenario with multiple virtual machines, each corresponding to one or more processes (a process including one or more computing tasks) running on one or more processing nodes of at least one of the M subsystems, the system scheduler assigns each virtual machine a VM ID, shared by the page tables of all of that machine's processes.
In one possible implementation, in the virtualization scenario the target subsystem further includes a system memory management unit SMMU; through the target processing node the computing task's request is sent to the SMMU, which updates the QoS ID carried in it; through the SMMU, the request is received, the target process is determined from the virtual address and the service set identifier SSID in the request, the VM ID of the target virtual machine is determined from the target process's page table, and the QoS ID carried in the request is replaced with that VM ID.
In one possible implementation, the AI SoC further includes an L2 Cache; through it, the requests of the computing tasks are received and the corresponding L2 regions are accessed according to the carried QoS IDs, requests with different QoS IDs corresponding to different regions of the L2 Cache.
In a third aspect, the present application provides a semiconductor chip, which may include the AI system provided by any implementation of the first aspect.
In a fourth aspect, the present application provides a semiconductor chip, which may include the AI system provided by any implementation of the first aspect, an internal memory coupled to the AI system, and an external memory.
In a fifth aspect, the present application provides a semiconductor chip, which may include the host (Host) provided by any implementation of the first aspect.
In a sixth aspect, the present application provides a semiconductor chip, which may include at least one AI SoC provided by any implementation of the first aspect.
In a seventh aspect, the present application provides a System on Chip (SoC) chip including the AI system provided by any implementation of the first aspect, and an internal memory and an external memory coupled to the bus system. The SoC chip may consist of a chip, or may include a chip and other discrete devices.
In an eighth aspect, the present application provides a chip system including the AI system provided by any implementation of the first aspect. In one possible design, the AI system further includes a memory for storing the program instructions and data necessary for or related to the operation of the chip system. The chip system may consist of a chip, or may include a chip and other discrete devices.
In a ninth aspect, the present application provides an electronic device, which may include the AI system provided by any implementation of the first aspect.
In a tenth aspect, the present application provides an electronic device having the function of implementing any of the bus communication methods of the first aspect. The function may be implemented by hardware, or by hardware executing corresponding software; the hardware or software includes one or more modules corresponding to the function.
In an eleventh aspect, the present application provides an AI apparatus having the function of implementing any of the AI computing methods of the first aspect. The function may be implemented by hardware, or by hardware executing corresponding software; the hardware or software includes one or more modules corresponding to the function.
In a twelfth aspect, the present application provides a computer-readable storage medium storing a computer program which, when executed by a bus system, implements the AI computing method flow of any item of the second aspect.
In a thirteenth aspect, an embodiment of the present application provides a computer program comprising instructions which, when executed by a bus system, enable the bus system to execute the AI computing method flow of any item of the second aspect.
Brief Description of the Drawings
Fig. 1A is a schematic diagram of the hardware structure of an AI system provided by an embodiment of the present application.
Fig. 1B is a schematic diagram of the hardware structure of another AI system provided by an embodiment of the present application.
Fig. 1C is a schematic diagram of the hardware structure of yet another AI system provided by an embodiment of the present application.
Fig. 2A is a schematic diagram of the relationships among business flows, graph nodes, and computing tasks provided by an embodiment of the present application.
Fig. 2B is a schematic diagram of business flow directions provided by an embodiment of the present application.
Fig. 2C is a schematic diagram of the relationship between the business flow types involved in running a resnet50 network and memory access bandwidth, provided by an embodiment of the present application.
Fig. 3A is a schematic framework diagram of a Davinci software stack provided by an embodiment of the present application.
Fig. 3B is a schematic diagram of the interaction flow among the software modules in the Davinci software stack provided by an embodiment of the present application.
Fig. 4A is a schematic diagram of the graph compilation phase and the graph execution phase provided by an embodiment of the present application.
Fig. 4B is a build script diagram of a resnet50 AI model provided by an embodiment of the present application.
Fig. 4C is a schematic diagram of an executable computing task after graph compilation and optimization provided by an embodiment of the present application.
Fig. 5A is a schematic software architecture diagram of automatic QoS optimization provided by an embodiment of the present application.
Fig. 5B is a schematic flow diagram of an automatic QoS optimization method provided by an embodiment of the present application.
Fig. 6A is a software architecture diagram of an AI system in a virtualization scenario provided by an embodiment of the present application.
Fig. 6B is a schematic diagram of the interaction flow among the software modules of an AI system in a virtualization application scenario provided by an embodiment of the present application.
Fig. 7 is a schematic flow diagram of a memory access control method provided by an embodiment of the present application.
Detailed Description of the Embodiments
The embodiments of the present application are described below with reference to the accompanying drawings. The terms "first", "second", "third", and "fourth" in the specification, claims, and drawings are used to distinguish different objects, not to describe a particular order. Moreover, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion: a process, method, system, product, or device comprising a series of steps or units is not limited to the listed steps or units, but optionally also includes unlisted ones, or ones inherent to the process, method, product, or device. Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application; the phrase appearing in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
The terms "component", "module", "system", etc. used in this specification denote computer-related entities: hardware, firmware, combinations of hardware and software, software, or software in execution. For example, a component may be, but is not limited to, a process running on a processor, a processor, an object, an executable file, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device itself can be components. One or more components may reside within a process and/or thread of execution, and a component may be located on one computer and/or distributed between two or more computers. Moreover, these components can execute from various computer-readable media having various data structures stored thereon, and may communicate by local and/or remote processes, for example according to signals having one or more data packets (e.g., data from two components interacting with another component in a local system, a distributed system, and/or across a network such as the Internet interacting with other systems by signals).
First, some terms used in the present application are explained for the understanding of those skilled in the art.
(1) Quality of Service (QoS). In this application, memory access QoS means providing, under limited memory bandwidth resources, corresponding memory access service control for the various categories of computing tasks (e.g., the various business flows), such as controlling the maximum bandwidth, minimum bandwidth, or access priority of the various memory access requests, thereby guaranteeing memory access service quality for the corresponding computing tasks (e.g., the tasks under a given business flow).
(2) Service Set Identifier (SSID): can divide a wireless LAN into several sub-networks each requiring separate authentication; only authenticated users may enter the corresponding sub-network, preventing unauthorized users from entering the network.
(3) Tera Operations Per Second (TOPS): one trillion (10^12) operations per second performed by a processor; can be used to characterize processor capability.
(4) Virtual Machine (VM): a complete computer system simulated by software, with full hardware system functionality, running in an entirely isolated environment. Work that can be done on a physical computer can be done in a virtual machine. When creating a virtual machine on a computer, part of the physical machine's disk and memory capacity is used as the virtual machine's disk and memory capacity.
(5) Davinci Vision Pre-Processor (DVPP): mainly implements video decoding (VDEC), video encoding (VENC), JPEG encoding/decoding (JPEGD/E), PNG decoding (PNGD), and the vision pre-processing core (VPC).
(6) AI Cube core (AIC): implements the matrix computation in AI operations.
(7) AI Vector core (AIV): implements the vector computation in AI operations.
(8) Graph Engine (GE) of the AI platform: compiles and converts the intermediate representations (IR) of AI models generated by the mainstream AI computing frameworks into computational subgraphs that the AI platform (e.g., the Davinci platform) can understand and execute. GE's main functions include graph preparation, graph splitting, graph optimization, graph compilation, graph loading, graph execution, and graph management (here "graph" means the network model topology).
(9) Fusion Engine (FE) of the AI platform: interfaces GE with Tensor Boost Engine (TBE) operators and provides loading and management of the operator information library, fusion rule management, original-graph fusion, and subgraph optimization. In the subgraph optimization phase GE passes the subgraph to FE; FE pre-compiles it according to the operator information library and FE fusion optimization, e.g., modifying data types and inserting conversion operators, and the subgraph is passed back to GE for subgraph merging and optimization.
(10) Remote Direct Memory Access (RDMA): transfers data over the network directly into a computer's storage area, moving data quickly from one system into remote system memory without involving the operating system, thus consuming little host processing capacity. It eliminates the overhead of external memory copies and context switches, freeing memory bandwidth and CPU cycles to improve application performance.
(11) RDMA over Converged Ethernet Engine (RoCE): a hardware implementation of the RDMA protocol over Ethernet; based on this protocol, large volumes of data can be transferred at high speed between different machines over standard Ethernet.
(12) System Direct Memory Access (SDMA): an SoC-level DMA controller usable for multi-channel, efficient data transfer for DMA data movement among the subsystems inside the SoC.
(13) Peripheral Component Interconnect Express Direct Memory Access (PCIE DMA): PCIE is a high-speed serial computer expansion bus standard, and PCIE DMA is a DMA controller implemented according to the PCIE standard, usable for efficient data transfer between the HOST and the Device.
(14) Huawei Cache-Coherent System (HCCS): Huawei's proprietary protocol standard for maintaining data coherence among multiple sockets; the physical link for multi-socket CPU interconnection in ARM-architecture servers, used for cross-chip communication.
(15) Runtime: the state of a program while it is running (being executed). In some programming languages, certain reusable programs or instances are packaged or rebuilt as "runtime libraries"; these instances can be linked or invoked by any program while they run.
(16) Compute Architecture for Neural Networks (CANN): a heterogeneous computing architecture for AI scenarios that provides multi-level programming interfaces to help users quickly build AI-platform-based applications and services, improving development efficiency and unlocking the compute power of AI processors.
(17) Huawei Collective Communication Library (HCCL): provides collective communication operators externally and supports RoCE transmission between NICs and different cluster nodes, providing efficient data transfer between NPUs in distributed training. The operator library mainly provides collective communication functions such as Broadcast, allreduce, reducescatter, and allgather within a single machine with multiple cards and across machines with multiple cards, providing efficient data transfer in distributed training.
(18) High Bandwidth Memory (HBM): a high-performance DRAM based on 3D-stacking technology, i.e., a memory chip ("RAM"), characterized by high speed and high bandwidth, suitable for applications with high memory bandwidth demands such as graphics processors and network switching and forwarding equipment (e.g., routers and switches).
(19) DRV file: a file in a driver package that can be opened with Notepad or WordPad. DRV files are driver files created by connected and communicating hardware devices (external and internal) used in the Windows operating system, containing the commands and parameters for how the operating system sets up and communicates with the device; they can also be used to install device drivers on a computer.
(20) Input/Output Control (IOCTL): a system call dedicated to device input/output operations in a computer; the call passes in a device-related request code, and the function of the system call depends entirely on that request code.
Refer to Fig. 1A, a schematic diagram of the hardware structure of an AI system provided by an embodiment of the present application. The AI system 01 may reside in any electronic device, such as a computer, a mobile phone, a tablet, or a server. The hardware structure of the AI system 01 may specifically be a chip or chipset, or a circuit board carrying a chip or chipset, which works under the necessary software drivers. The AI system 01 shown in Fig. 1A may include a host 10 and an AI SoC 20. The AI SoC 20 may include M subsystems (as shown in Fig. 1A: subsystem 201-1, ..., subsystem 201-M) and N memory controllers 203 (memory controller 203-1, 203-2, ..., 203-N) connected by the SoC bus 202, each memory controller controlling at least one memory. The M subsystems include a target subsystem, which is any one of the M subsystems (for ease of description, subsequent embodiments use target subsystem 201-1 as the example; this example places no restriction on the target subsystem itself). The target subsystem includes S processing nodes (Master 1, Master 2, ..., Master S). Note that, for convenience, related embodiments hereafter may name or translate a processing node as a Master, or understand a processing node to include Masters and other node types; equating a processing node with a Master, or using a Master as the example, places no limitation on the processing node itself. The S Masters include a target Master, any one of them (subsequent embodiments use Master 1 in target subsystem 201-1 as the example, which likewise places no restriction on the target Master). Different subsystems may contain equal or unequal numbers of Masters; the embodiments do not limit this. M, N, and S are all integers greater than or equal to 1. Optionally, refer to Fig. 1B, a schematic diagram of the hardware structure of another AI system: compared with the AI system of Fig. 1A, the AI system of Fig. 1B may further include a Memory Advanced Technology Agent (MATA) 205 for coordinated management of the N memory controllers (203-1 to 203-N).
With reference to the hardware structure of the AI system 01 shown in Fig. 1A, the components involved in the AI system 01 or AI system 02 of the embodiments and their functions are described from top to bottom by way of example:
Host 10: may include a host CPU and internal memory not shown in Fig. 1A, and optionally physical devices such as a host controller, other input/output controllers, and interfaces. A host system (Host System), such as X86 or ARM, may run on the host CPU. In the embodiments, the host 10 serves as the center of business flow deployment and task management of the AI system 01 or 02, managing multiple hardware accelerators (Devices) such as various SoCs, at least including the AI SoC 20 described herein. Specifically, the host 10's functions include managing tasks, conveying instructions, and providing specific services to the SoCs; it is further used to identify the types of the business flows issued by users (including AI computing framework and model splitting and identification) and to assign suitable QoS IDs to the corresponding computing tasks in those flows, i.e., to attach a suitable QoS label to each task, for example assigning, based on the processing capabilities of the hardware accelerators (Devices), tasks already carrying suitable QoS IDs to suitable Devices. Optionally, the host 10 may also attach to each task the initial QoS priority corresponding to its QoS ID, i.e., the second QoS priority.
AI SoC 20: an artificial intelligence System on Chip. As shown in Fig. 1A or 1B, the AI SoC 20 may specifically include a system scheduler 200, multiple subsystems (subsystem 201-1, ..., subsystem 201-M), the SoC bus 202, and multiple memory controllers (203-1, ..., 203-N); further, it may include multiple memories (204-1, 204-2, ..., 204-N), each memory controller controlling at least one memory, where any memory may be a High Bandwidth Memory (HBM) or a Double Data Rate (DDR) SDRAM, etc. Of these:
System scheduler 200: after a computing task carrying a QoS ID is issued from the host 10 to the AI SoC 20, it may first pass through the system scheduler 200, which can schedule each task, according to its type (e.g., the task descriptor carried in it), to the subsystem suited to executing it and further to a suitable Master.
Subsystems (201-1 to 201-M): each subsystem may be an integrated circuit with a dedicated function or an accelerator for a particular function, e.g., an AI core (AI CORE), a vision pre-processor (DVPP), an image signal processor (ISP), an audio signal processor (ASP), an SoC-level DMA controller (SDMA), a remote direct memory access controller (RDMA), a PCIE DMA controller, an encryption/decryption engine, or a general-purpose CPU, as shown in Fig. 1A or 1B. Inside each subsystem, the Masters are interconnected by the subsystem-internal connect bus 211, and each subsystem may further include a sub-scheduler 212. Note that not all of the M subsystems are AI subsystems; some are systems that cooperate with the AI subsystems, or other subsystems, rather than performing AI computation. Correspondingly, not all computing tasks are AI computing tasks; some cooperate to complete AI computation, and some are general-purpose computing tasks.
Processing nodes (Master 1 to Master S): a processing node can be understood herein as a requester that can initiate memory access requests, the source of such requests, or a data requester; the one or more processing nodes (e.g., Masters) inside a subsystem can represent the subsystem's cores. The Masters inside each subsystem execute computing tasks according to the task descriptions in the tasks; when a compute or communication task needs to access the memory system, each Master sends the QoS ID carried with the task, together with the data and addresses of the memory to be accessed, through the sub-scheduler 212 onto the SoC bus 202. For example, when the subsystem is a general-purpose CPU, its Masters can represent CPU cores; when it is a GPU, GPU cores; when it is an NPU, NPU cores.
Sub-scheduler 212: schedules the memory access requests produced by all Masters of its subsystem while executing computing tasks. For example, it establishes a request queue per Master to store the requests produced by each Master in order of arrival, and schedules the requests of the queues onto the on-chip bus (SoC Connection BUS) 202 of the AI SoC 20 in an order combining queue order with the QoS priorities of the QoS IDs carried in the requests.
SoC bus 202 (SoC Connection BUS): interconnects the subsystems (201-1 to 201-M) and memory controllers (203-1 to 203-N) of the AI SoC 20 and implements data communication between them in a bus fashion. The subsystems' memory access requests, after arbitration, address resolution, and routing by the SoC bus 202, are issued to the corresponding memory access controllers. The bus specification may also define the drive, timing, and policy relationships among the modules during initialization, arbitration, request transmission, response, sending, and receiving.
Memory Advanced Technology Agent (MATA): coordinates the management of the N memory controllers (203-1 to 203-N); it can configure the access policy control parameters for each QoS ID (e.g., one or more of the maximum bandwidth, minimum bandwidth, and access priority allowed for requests) and generate an optimized QoS priority for each request, i.e., optimize and regulate the second QoS priority of the QoS ID carried in each request (e.g., the initial, default, or fallback priority of the QoS ID) into the corresponding first QoS priority (which may be raised or lowered relative to the original priority). For example, it computes a request's final QoS priority by combining the historical bandwidth statistics of its QoS ID with the access policy of the QoS ID. The MATA also collects and records the historical memory bandwidth each QoS ID occupies on each memory controller. In the embodiments, the QoS ID carried in a task may remain unchanged, but the QoS priority corresponding to it may change, and the access policy control parameters of a QoS ID may also change. Optionally, the MATA may be placed outside the N memory controllers or on one of them; i.e., the MATA and the controllers may be independent physical entities, or the MATA may be integrated into one or more controllers; the embodiments do not restrict the physical relationship between the MATA and the memory controllers.
Memory controllers (203-1 to 203-N): control the memories and handle data exchange between the memories and the subsystems. A controller determines, from the address in a request issued by a subsystem, which memory to send data to or read data from and return to the corresponding subsystem; it can also perform virtual-to-physical address mapping, memory access permission control, cache support, etc. Specifically, upon receiving a request scheduled over by the MATA, the target memory access controller parses the read/write address in the request and the QoS priority carried in it (the first QoS priority), and finally performs access QoS control on the request according to that priority and the controller's service conditions (e.g., memory access timing requirements or memory bandwidth bus utilization), e.g., allowing access immediately or holding the command for the next round of scheduling arbitration; if the request wins this round of scheduling, it is issued to the specific memory access unit to perform the read/write, and the bandwidth statistics of the corresponding QoS ID are updated at that controller. Note that the final QoS priority computed by the MATA may be only one of the factors on which the controller bases its memory QoS; timing factors and starvation-avoidance mechanisms, for example, may also apply. Thus, in the AI system of the embodiments, service quality is in some cases decided by QoS priority and in others by actual memory conditions; a higher QoS priority does not necessarily yield better final memory access service.
Memories (204-1 to 204-N): also called internal memory, usually volatile memory that loses its contents on power-off, also called Memory or main memory. The internal memory here includes readable/writable working memory used to temporarily hold the operational data of the subsystems (201-1 to 201-M) and to exchange data with storage external to the AI SoC 20; it serves as the temporary data storage medium of the operating system and other running programs. For example, a processing program, operating program, or operating system executing a computing task on Master 1 of subsystem 201-1 moves the data to be computed from internal memory 204-2 into Master 1 for computation, and Master 1 transfers the result out after the computation completes. Optionally, internal memory may include one or more of dynamic random access memory (DRAM), static random access memory (SRAM), and synchronous dynamic random access memory (SDRAM), where DRAM further includes Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM, DDR for short), second-generation DDR (DDR2), third-generation DDR (DDR3), fourth-generation Low Power Double Data Rate (LPDDR4), and fifth-generation Low Power Double Data Rate (LPDDR5), etc.
It can be understood that the structures illustrated in the embodiments do not constitute specific limitations on the AI system 01 or 02. In other embodiments of the present application, the AI system 01 or 02 may include more or fewer components than shown, combine certain components, split certain components, or arrange components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Refer to Fig. 1C, a schematic diagram of the hardware structure of yet another AI system provided by an embodiment of the present application. The AI system 03 may likewise reside in any electronic device, such as a computer, a mobile phone, or a tablet; its hardware structure may specifically be a chip or chipset, or a circuit board carrying one, working under the necessary software drivers. Functionally, the difference between the AI system 03 of Fig. 1C and AI system 01 of Fig. 1A or AI system 02 of Fig. 1B is that the hardware structure of system 03 can support virtualization scenarios, i.e., bandwidth isolation between virtual machine tenants, including user data isolation and compute resource isolation between different virtual machine users, so that the services of different users do not affect each other. Architecturally, besides the components of systems 01 and 02, the subsystems of the AI SoC in system 03 further include a system memory management unit SMMU 210 and a level-2 cache (L2 Cache) 206; optionally, the AI SoC of system 03 may further include a Memory Advanced Technology Agent MATA 205.
With reference to the hardware structure shown in Fig. 1C, the components of the AI system 03 of the embodiments and their functions are described from top to bottom by way of example; only the components newly added in Fig. 1C are described below, and for the components shared with Fig. 1A or 1B, see the related descriptions above, not repeated here. Of these:
System Memory Management Unit (SMMU) 210: located inside each subsystem, between the Masters and the connect bus 211. The SMMU 210 can perform permission management (e.g., different programs have different address spaces, and the permissions of different programs are controlled), address mapping (e.g., converting between virtual and physical addresses), and physical memory management (e.g., managing the system's physical memory resources and providing user programs with interfaces to request and release physical memory). Further, the SMMU can perform isolation in virtualization scenarios, such as address isolation between processes, physical address space isolation, and memory bandwidth isolation. Since each virtual machine corresponds to one unique VM ID in the embodiments, in a virtualization scenario the address translation performed by the system memory management unit includes translating the QoS ID: the SMMU looks up the page table of the current process and obtains the virtual machine identifier (VM ID) in that page table. That is, the page tables of all processes in the same VM ultimately yield one and the same QoS ID. The embodiments convert the QoS IDs of all processes in the same VM into the same QoS ID, while different VMs have different VM IDs, so the QoS IDs of the processes under different VMs also end up different. Note that the VM ID in a page table is assigned by the system when the virtual machine is created; here the SMMU only queries it. A sketch of this lookup-and-replace step follows.
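A minimal Python sketch of the SMMU-side replacement described above. The table contents, process names, and SSID values are invented purely for illustration; in hardware this is a page-table/CD-table lookup rather than a dictionary.

# (process) -> page-table metadata, including the VM ID shared by all of a VM's processes.
PAGE_TABLES = {
    "proc_a": {"vm_id": 7},
    "proc_b": {"vm_id": 7},   # same VM as proc_a: both share one VM ID
    "proc_c": {"vm_id": 9},
}
SSID_TO_PROCESS = {0x10: "proc_a", 0x11: "proc_b", 0x20: "proc_c"}

def smmu_translate(req):
    proc = SSID_TO_PROCESS[req["ssid"]]       # locate the owning process via the SSID
    vm_id = PAGE_TABLES[proc]["vm_id"]        # VM ID recorded in that process's page table
    req["qos_id"] = vm_id                     # replace the carried QoS ID with the VM ID
    return req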
Note that the system memory management unit (SMMU) in the embodiments is a functional unit inside each subsystem (201-1 to 201-M) whose role is to "translate" program addresses into physical addresses, while the memory controllers (203-1 to 203-N) can be external devices to the subsystems (201-1 to 201-M), responsible for mapping physical addresses to concrete memory locations.
L2 Cache 206: the L2 cache is memory capable of high-speed data exchange; in the present application it exchanges data with the subsystems (201-1 to 201-M) ahead of the memories (204-1 to 204-N) and is therefore faster. A cache inside a subsystem is usually called an L1 cache, whereas the L2 Cache shown in Fig. 1C is outside the subsystems, i.e., an external cache, sitting between the subsystems and the memories (204-1 to 204-N). In the embodiments, the L2 Cache can be applied in virtualization scenarios: when virtual machines come online, a relevant management unit can configure a VM ID for each machine and configure a corresponding storage region in the L2 Cache for it (e.g., the address range and size of the storage space). That is, the memory access requests of each virtual machine are only allowed to access the region configured for their own machine and cannot access the regions of other machines, enabling the AI system to achieve secure cache isolation in virtualization scenarios, as in the sketch below.
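An illustrative Python sketch of the QoS-ID-keyed L2 region check; the region table and its address ranges are assumptions, not values from the text.

# qos_id (here: VM ID) -> (base, limit) of the L2 region configured for that VM.
L2_REGIONS = {7: (0x0000, 0x8000), 9: (0x8000, 0x10000)}

def l2_check(qos_id, line_addr):
    # A request may only touch the L2 region assigned to its own QoS ID.
    base, limit = L2_REGIONS[qos_id]
    if not (base <= line_addr < limit):
        raise PermissionError("L2 access outside the region assigned to this QoS ID")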
It can be understood that the structure illustrated in this embodiment does not constitute a specific limitation on the AI system 03. In other embodiments of the present application, the AI system 03 may include more or fewer components than shown, combine or split certain components, or arrange them differently; the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Note that the AI system architectures of Figs. 1A, 1B, and 1C above are only several exemplary implementations of the embodiments of the present application; the AI system architectures in the embodiments include but are not limited to the above.
Based on the hardware architecture of the AI system in Fig. 1A, 1B, or 1C above, the functions implemented by AI system 01, 02, or 03 in the embodiments may include the following:
The target Master (taking Master 1 as the example) among the S processing nodes Master of the target subsystem (taking subsystem 201-1 as the example) in the AI SoC 20 is configured to: receive a to-be-executed computing task carrying a quality-of-service identifier QoS ID, the target Master being any one of the S Masters and the QoS ID indicating the category the task belongs to; generate the task's memory access request from the memory addresses and data the task needs to access, the request carrying the QoS ID; and send the task's request to the target memory controller among the N memory controllers. The target memory controller (taking memory controller 203-1 as the example) is configured to: receive the request, determine the first QoS priority corresponding to the QoS ID, and perform access QoS control on the request based on the first QoS priority.
Specifically, when computing tasks are assigned to the Masters of the subsystems in the AI SoC, each task carries a QoS ID indicating the category it belongs to, from which the matching access QoS priority can ultimately be determined. The rationale, as above, is that in the AI computing field different task categories (e.g., tasks under different business flows) have different memory access service requirements, and access contention exists between some categories but not others; setting QoS priorities matched to categories therefore better satisfies the differing access needs (different categories correspond to different QoS IDs, but different QoS IDs may correspond to the same or different QoS priorities). Further, while executing a received task, a Master generates the task's requests from the addresses and data required and keeps carrying the task's QoS ID in them, i.e., the QoS ID flows with the task into its requests, so that the memory controller receiving a request can apply access control at the corresponding priority; the higher the priority of the QoS ID, the better the memory access service quality the controller can provide for requests carrying it, so that tasks with different priority needs receive different access QoS control, avoiding the severe system performance degradation caused in the prior art by undifferentiated random seizure of scarce memory bandwidth. Optionally, the first QoS priority corresponding to the QoS ID may be the initial QoS priority of the QoS ID, e.g., one set by the target Master or preconfigured by the host of the AI system; or the first QoS priority may be the final QoS priority of the QoS ID, i.e., the result of adjusting the initial priority during the request's circulation, e.g., temporarily raising or lowering it, restoring a temporarily raised or lowered priority, or deriving the final first QoS priority after a series of temporary adjustments followed by a final adjustment; the embodiments do not specifically limit this.
Optionally, in the embodiments different business flows may correspond to different QoS IDs: the tasks split out of different business flows carry different QoS IDs, while tasks split out of the same flow correspond to the same QoS ID, the QoS ID then indicating the business flow type. In other words, tasks may be classified by business flow type, with tasks of the same category sharing a QoS ID and different categories having different QoS IDs, the classification principle being the type of business flow the task belongs to. Alternatively, the QoS ID may indicate other types or classification schemes (e.g., by compute type, by importance, by execution time window, by access latency requirement, or by the purpose of accessing memory): whenever the AI system can determine that access request contention exists between two categories of tasks, categories can be drawn accordingly, with each category assigned the same QoS ID and different categories different QoS IDs; the invention does not specifically limit how tasks are categorized.
In the AI computing field, the embodiments introduce an on-chip memory access quality-of-service (QoS) control technique: tasks to be dispatched onto the AI SoC are QoS-tagged, different categories carrying different QoS IDs (e.g., categorized by the business flow a task belongs to, or by its access latency requirement), so that the QoS priority of each task's requests can be determined from the QoS ID it carries and QoS control finally applied per request based on that priority. This achieves memory access QoS control at task granularity, provides different memory access service guarantees for the access needs of different task categories (e.g., under different business flows), and finally obtains better AI computing performance from the AI system's existing compute power and memory bandwidth resources. This differs from the prior art, which controls access only at the Master level in the SoC (all requests of one Master receive uniform access service quality) and therefore cannot satisfy the actual access needs of the various task categories (such as tasks belonging to different business flows), ultimately degrading the AI system's computing performance.
In summary, the embodiments finally achieve memory access service quality control at task granularity, solving the bandwidth shortage caused in AI training and inference by concurrent contention for memory bandwidth among different task categories (e.g., different business-flow types); because the access service follows the priority of the QoS ID in each request, latency-critical tasks in training and inference can be guaranteed first, memory bandwidth is used more fully and efficiently, memory access load across the whole AI system is balanced, and the system's overall execution performance and efficiency improve. In addition, when task categories are drawn by business flow type, the embodiments also solve the performance jitter of AI training and inference tasks caused by the lack of flow priority identification, control, and tuning means; for example, latency jitter in training greatly limits the linearity gains of AI cluster scale, leaving much cluster compute power under-utilized, wasting precious AI computing resources, and increasing customers' model training cost and time. The embodiments can avoid such performance jitter (e.g., latency jitter) and improve the linearity of the AI system.
In one possible implementation, the computing task further carries the second QoS priority corresponding to the QoS ID, the initial QoS priority of that QoS ID (e.g., a base, default, or fallback priority). In this embodiment, besides the QoS ID, the tasks assigned to the Masters may carry the corresponding initial QoS priority (the second QoS priority); i.e., a task can be configured with its QoS ID and initial priority at dispatch time for subsequent priority regulation and memory QoS access control. Optionally, the QoS ID carried in a task's request may remain unchanged from Master to target memory controller, while its corresponding QoS priority may be adjusted and optimized per the differing needs and situations during scheduling. The reason is that a task's category does not change during execution (at least not within one execution), so the QoS ID indicating the category need not change; but the priority corresponding to a QoS ID may change as the task's requests encounter different situations during actual scheduling, or as many other factors or conditions must be weighed during the actual access process.
In the present application, since each subsystem of the AI SoC 20 in AI system 01 or 02 contains one or more Masters, when multiple Masters of a subsystem execute computing tasks in parallel, the question arises by what rules the requests of the subsystem's multiple tasks are scheduled onto the SoC bus 202 to finally reach the corresponding memory controllers. The AI system 01 or 02 of the embodiments may therefore further include the function of scheduling, inside each subsystem, the memory access requests generated by its Masters. The following embodiments describe how AI system 01 or 02 schedules, inside a subsystem, the requests produced by the Masters while executing computing tasks.
In one possible implementation, the target subsystem further includes a sub-scheduler; the target Master is specifically configured to send the request to the sub-scheduler, which schedules it to the target memory controller among the N controllers; the sub-scheduler is configured to: receive the requests sent by the S Masters of the target subsystem; and schedule them onto the SoC bus according to the second QoS priorities corresponding to the QoS IDs carried in the requests, the second QoS priority being the initial priority of the corresponding QoS ID and indicating the priority with which the corresponding request is scheduled onto the SoC bus.
In this embodiment, each subsystem of the AI SoC includes a sub-scheduler that schedules the requests of the tasks executing on all the subsystem's Masters; the requests produced by these Masters, after scheduling by the internal sub-scheduler, are sent to the SoC bus for arbitration, address resolution, and routing, then issued to the corresponding memory controllers for memory access. Since each request carries its task's QoS ID, a sub-scheduler can schedule requests with higher-priority QoS IDs onto the bus first and defer lower-priority ones, so that priorities are already respected when requests are issued to the bus, providing access control matched to each task's QoS ID from the very source of the AI system.
In one possible implementation, the sub-scheduler is specifically configured to: establish a task queue per Master, each queue containing the requests sent by the corresponding Master, the target Master corresponding to the target task queue; when a target request is currently inserted into the target queue, raise the second QoS priorities of the QoS IDs of all requests in the queue to the third QoS priority, the target request being one whose carried QoS ID's second priority exceeds the preset priority; and send the requests of the S Masters' queues to the SoC bus in order of the second or third priorities of their carried QoS IDs. The second QoS priority can thus be understood as the source QoS priority of a QoS ID, and the third QoS priority as its in-flight priority.
In this embodiment, during concrete scheduling, the sub-scheduler creates one task queue per Master's tasks, places all requests produced by the Master in it, and sends them to the SoC bus in order of the priorities of their carried QoS IDs. When a higher-priority request appears in a queue, then, to avoid all requests in the queue being blocked because the front request's priority is too low (e.g., head-of-line blocking), the sub-scheduler raises the QoS priority of all requests in the queue (from the second to the third QoS priority), so that no request in the queue (especially the higher-priority request mentioned above) is blocked because a low-priority request sits at the front (or head) of the queue, improving the efficiency and effect of QoS access-grade control overall.
In the present application, after the subsystems of the AI SoC 20 send the requests of their Masters onto the SoC bus, the question arises how the bus should schedule the requests sent over by the multiple subsystems. The AI system of the embodiments may therefore further include the function of scheduling the requests issued by the subsystems to the corresponding memory controllers. The following embodiments describe how AI system 01 or 02 schedules the requests of the subsystems to the corresponding controllers.
In one possible implementation, the SoC bus is configured to: receive the one or more requests of the target task queue sent by the sub-scheduler, including the request; and restore the third QoS priorities of the QoS IDs carried in them to the corresponding second QoS priorities.
In this embodiment, after the priorities of the queued requests have been adjusted in the Masters' sub-schedulers and the requests have been scheduled per the adjusted priorities, these requests have already been scheduled onto the SoC bus, and the blocking risk posed by low-priority requests in the queues has already been removed by the in-subsystem adjustment. Therefore, after the requests are scheduled from the queues onto the bus, they can be restored to the earlier priority, i.e., from the third QoS priority back to the corresponding second, so that access QoS control proceeds per the QoS priorities the AI system initially assigned to the tasks' QoS IDs.
In one possible implementation, the SoC bus is further configured to schedule the one or more requests of the target task queue to the corresponding memory controllers among the N controllers based on their restored second QoS priorities.
In this embodiment, having restored the QoS priorities of the QoS IDs of the requests scheduled out of the subsystems to the initial second QoS priority, the bus can schedule the requests per the restored priority, dispatching each to its corresponding controller so that the controller performs subsequent access QoS control and memory access.
In one possible implementation, the AI SoC further includes the Memory Advanced Technology Agent MATA: the SoC bus is specifically configured to send the one or more requests of the target task queue to the MATA, which schedules them to the corresponding memory controllers. Optionally, in another possible implementation, the bus sends the requests sent by the S Masters to the MATA, which schedules them to the corresponding controllers, those requests including the request.
In this embodiment, the AI SoC may further include the memory access agent MATA for memory access control; when the bus schedules the subsystems' requests to the controllers, it may do so specifically through the MATA, through which the controllers can be controlled and managed in a coordinated way and the received requests further regulated, e.g., further optimizing the second QoS priorities of the QoS IDs in them.
In one possible implementation, the MATA is configured to: receive the request and determine the second QoS priority of the QoS ID carried in it; and determine the first QoS priority of the QoS ID based on the second QoS priority, combined with the historical memory bandwidth statistics for the QoS ID and the access policy control parameters for the QoS ID, the parameters including one or more of the maximum bandwidth, minimum bandwidth, and access priority allowed for access requests. For example, the MATA compares the QoS ID's historical bandwidth statistics (e.g., the sum of the bandwidth occupied on all N memory controllers by all requests carrying that QoS ID) with the access policy parameters (e.g., the maximum bandwidth actually configured for the QoS ID), computes the floating priority of the QoS ID, and then adds the floating priority to or subtracts it from the QoS ID's second priority (adding to raise, subtracting to lower), finally obtaining the first QoS priority of the QoS ID.
In this embodiment, after receiving the requests scheduled over by the SoC bus, the MATA can further optimize and adjust the initial priority (the second QoS priority) carried in each. Specifically, before a request is dispatched by the bus to the controllers, the MATA can generate the final priority of its QoS ID (the first QoS priority) from the initial priority combined with the historical bandwidth statistics the MATA currently records for the QoS ID and the QoS ID's access policy control parameters, so that the target controller finally performs access QoS control per that final priority. That is, during access control the MATA weighs not only the priority the AI system initially configured per QoS ID, but also each QoS ID's historical bandwidth statistics (e.g., the memory bandwidth already obtained by the class of tasks carrying a given QoS ID) and the access policy parameters its requests are configured to receive (e.g., maximum bandwidth, minimum bandwidth, access priority), to decide what access QoS control service to provide for the current request, finally obtaining a priority matched to the request for more precise access QoS control and further optimized AI computing performance. For instance, if a QoS ID's requests already occupy much memory bandwidth, its priority can be lowered for balance across QoS IDs; if they currently occupy little, it can be raised to compensate. A sketch of this adjustment follows.
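A hedged Python sketch of the MATA adjustment. The exact floating-priority formula is not given in the text, so the one-step promotion/demotion below is an assumption; the policy dictionary keys are likewise illustrative.

def mata_final_priority(initial_prio, history_bw, policy):
    # policy: the access policy control parameters of this QoS ID, e.g.
    # {"max_bw": ..., "min_bw": ...} in GB/s (smaller priority value = higher priority).
    if history_bw > policy["max_bw"]:
        floating = +1          # over budget: demote
    elif history_bw < policy["min_bw"]:
        floating = -1          # under budget: promote
    else:
        floating = 0
    return max(0, initial_prio + floating)   # the resulting "first QoS priority"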
In one possible implementation, the MATA is further configured to: preconfigure the access policy control parameters per QoS ID and collect and record each QoS ID's historical memory bandwidth; and update and optimize the parameters per real-time monitoring of the AI system's access performance, optionally through search-based optimization and adaptive machine learning algorithms.
In this embodiment, the MATA also configures the access policy control parameters per QoS ID and records each QoS ID's historical memory bandwidth, using the two kinds of information to decide whether to raise or lower the QoS priority relative to a QoS ID's initial priority, finally determining the ultimate priority of the QoS ID's requests so the controller can perform concrete access QoS control; e.g., the MATA can set the maximum bandwidth, minimum bandwidth, and access priority allowed for requests carrying a given QoS ID.
In one possible implementation, the MATA is further configured to carry the first QoS priority in the request and schedule the request to the target memory controller based on it; optionally, besides the first QoS priority, the MATA may keep carrying the QoS ID in the request sent to the controller.
In this embodiment, once the MATA determines a request's final priority, it can carry that final priority (the first QoS priority) in the request sent to the corresponding controller so the controller can perform access QoS control per it, and the MATA can also schedule the request to the target controller based on it. Optionally, if the MATA also keeps the QoS ID in the request sent to the controller, the controller can decide access QoS control jointly from the first priority and the QoS ID; e.g., the controller can compute from the QoS ID the historical memory bandwidth that QoS ID's requests occupy on the controller itself and further optimize control accordingly.
Optionally, in another possible implementation, the AI SoC further includes the MATA, configured to carry the determined first QoS priority in the request and schedule the request to the target controller based on it. That is, the MATA can carry, in the request, a first priority determined by itself or by other modules of the AI system and schedule the request accordingly: the first priority may be the final priority obtained in the MATA by adjusting and optimizing the initial priority carried in the request, or an initial priority the MATA configures for the request's QoS ID, or a priority configured for the QoS ID by another module (such as the host or the relevant Master) and notified to the MATA, which stores and carries it in the request; the embodiments of the invention do not specifically limit this.
In the present application, after the subsystems' requests in the AI SoC 20 are scheduled by the SoC bus to the corresponding controllers, the question is how a controller concretely performs access control on a received request. The AI system of the embodiments may therefore further include the function of concrete access QoS control. The following describes how AI system 01 or 02 provides suitable access QoS control for different requests.
In one possible implementation, the target memory controller is specifically configured to: receive the request and determine the first QoS priority corresponding to the QoS ID; and perform access QoS control on the request based on the QoS ID's first priority combined with the controller's access service conditions, which include memory access timing requirements or memory bandwidth bus utilization. For example, the service conditions may include the DDR controller's read/write timing requirements (when multiple requests must access one controller simultaneously, the corresponding timing requirements must be satisfied); or DDR bandwidth bus utilization, i.e., access efficiency (e.g., to raise DDR bus utilization, among many concurrent requests the controller may serve data in the same bank and same row first); or certain read/write rules; or the controller's own read/write situation, etc.
In this embodiment, after requests are scheduled to the controllers by the bus through the MATA, a controller can perform access QoS control per the final priority carried in the request (the first QoS priority) combined with its own current service conditions. That is, the controller considers not only the priority the MATA finally generated per QoS ID but also its own current service conditions (e.g., its access timing requirements and memory bandwidth bus utilization), for more precise access QoS control and further optimized and improved AI computing performance. Optionally, when a received request also carries the QoS ID, the controller can further compute from it the historical memory bandwidth the QoS ID's requests occupy on the controller itself and further optimize access QoS control.
In one possible implementation, the target memory controller is further configured to broadcast a backpressure indication to the M subsystems when the volume of requests it receives exceeds a preset threshold, instructing one or more subsystems to delay, reduce, or stop sending requests.
In this embodiment, when a controller receives too many requests, it can instruct the relevant subsystems to reduce, delay, or even stop the requests currently being sent; upon receiving the indication, a subsystem can adjust its sending per its own situation, e.g., suspending or stopping sending requests to the SoC bus. A sketch of this threshold rule follows.
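A minimal Python sketch of the back-pressure broadcast. The threshold value and the on_backpressure callback the subsystems are assumed to expose are illustrative choices, not values or interfaces from the text.

PENDING_THRESHOLD = 64   # preset threshold (assumed)

class BackpressuringController:
    def __init__(self, subsystems):
        self.pending = []
        self.subsystems = subsystems

    def accept(self, req):
        self.pending.append(req)
        if len(self.pending) > PENDING_THRESHOLD:
            # Broadcast: each subsystem may delay, reduce, or stop issuing requests.
            for sub in self.subsystems:
                sub.on_backpressure()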
In the present application, since the task received by a Master in a subsystem of the AI SoC 20 already carries the corresponding QoS ID, the AI system of the embodiments may further include functions such as identifying computing tasks and assigning and attaching the corresponding QoS IDs to them. The following describes how AI system 01 or 02 assigns and attaches QoS IDs to the tasks before distributing them to the target Master.
In one possible implementation, the AI system of the embodiments further includes a host, configured to:
receive a to-be-executed job and split it into one or more to-be-executed computing tasks; identify the business flow types of the split tasks according to a preset business flow label table containing predefined mappings between the business flow types of computing tasks and QoS IDs; and attach the corresponding QoS ID to each task per the identification result.
Specifically, the host 10 first splits a complete to-be-executed job into computing tasks each piece of hardware (Master) can understand, e.g., splitting a whole AI job into matrix computing tasks, scalar computing tasks, vector computing tasks, and so on. Then the host side attaches access QoS labels to all split tasks, i.e., matches suitable QoS IDs; for example, an identification module in the Host System can assign a suitable QoS ID to each task allocated to the AI system. In the embodiments, the basis on which the host 10 assigns QoS IDs can be the business flow type a task belongs to (for flow classification see the description of the embodiment of Table 2 below, not repeated here): tasks in the same business flow carry the same QoS ID, and tasks in different flows carry different QoS IDs. Note that afterwards, the system scheduler 200 assigns tasks by judging the task type, not the carried QoS ID, specifically by identifying the task descriptor carried in a task (what computation it performs, ordinary computation or image processing, etc.); per the descriptor, the scheduler 200 can select suitable subsystems and Masters per preset scheduling rules (e.g., choosing a relatively idle Master when several are available). In short, the host 10 assigns a QoS ID per the task's business flow type, whereas the scheduler 200 assigns tasks to subsystems per the carried task descriptor, i.e., the task type: both involve identifying tasks, but with different rules and criteria. In the embodiments, besides the subsystems that can execute tasks and the memory controllers, the AI system further includes a host (Host) uniformly receiving users' tasks; by identifying and marking the categories of the business flows in the AI network model, i.e., giving the tasks under different flows different access QoS labels (QoS IDs), the whole AI system can later apply reasonable, matched access QoS control to the tasks carrying them, finally balancing the access load of the whole AI system and improving its overall execution performance and efficiency, as in the sketch below.
Further, in one possible implementation, the AI SoC further includes a system scheduler, and the host is further configured to send the one or more tasks carrying their corresponding QoS IDs to the system scheduler. In this embodiment, after the host has identified the business flows and attached QoS IDs, it can send the labelled tasks to the system scheduler on the AI SoC for subsequent assignment; i.e., after splitting, identifying, and labelling the to-be-executed job, the host delivers the processed tasks to the scheduler, which then schedules and assigns these already-labelled tasks (i.e., tasks carrying matched QoS IDs).
Optionally, regarding the preset business flow label table involved in the host 10's identification of a task's business flow type, priorities can be distinguished between business flows that contend with each other. For example, different AI network models may have different flow types, so their identification processes also differ; and different models may run different flows concurrently, so each model can have its own dedicated flow classification and, based on that classification and the concurrency situation, its own mapping between flows (QoS IDs) and QoS priorities. A computing task (sub-task) herein may be a code segment or a thread; a thread has many steps, each possibly calling different functions: e.g., some tasks normalize images, some do matrix operations, and some do additions and subtractions, possibly using the GPU, NPU, and communication modules respectively. These tasks can all belong to the same business flow, share the same QoS ID, and share the same access QoS priority. It can be understood that, before running an AI network model, one can identify from it which business flows exist, i.e., how to classify them, and whether the flows execute concurrently, and then decide, per the impact of the different flows on the system, which flows map to high access QoS priorities and which to low. As shown in Fig. 2A, a schematic of the relationships among business flows, graph nodes, and computing tasks: one execution job may involve multiple business flows, a flow may involve multiple graph nodes of the graph execution phase, a graph node may contain multiple computing tasks, and each task is composed of multiple operators. Each task is ultimately distributed, per its task type (i.e., the task type described by the task descriptor herein), to the Masters in the same or different subsystems for execution.
The following exemplifies the preset business flow label table provided by the embodiments (Table 2), which contains the classification of the business flows of access data streams to storage devices issued during AI model training or inference, together with the corresponding flow direction, QoS ID, initial QoS priority (the second QoS priority), and flow description. Note that Table 2 is one possible flow classification and QoS ID/QoS priority mapping for currently known AI network models; for unknown models, the flow types and their mappings to QoS IDs and priorities can be adapted and adjusted correspondingly, not enumerated here. See Table 2:
Table 2
[Table 2 appears in the original as image PCTCN2022078504-appb-000001: for each business flow type it lists the flow direction (D2D or H2D), the assigned QoS ID (1 to 11), the initial QoS priority, and a flow description; its contents are summarized in the following paragraphs.]
In Table 2 above, the business flows are divided into the following types: control traffic, intra-layer model-parallel feature map communication data flow, data-parallel parameter prefetch traffic, feature map sharing traffic, feature map prefetch traffic, embedding read/write traffic, data-parallel parameter All-Reduce traffic, AI CORE compute traffic, CMO operation traffic, general-purpose CPU compute traffic, image/video accelerator sample traffic, and so on. By flow direction they fall into two classes, D2D and H2D; see Fig. 2B, a schematic of business flow directions: as shown there, H2D traffic is flow data transferred between the Host and a Device (accelerator card), and D2D traffic is flow data transferred between different Devices, including the flow data of the accelerators inside the SoC of one Device accessing local memory. These flows of different directions traverse complex on-chip buses or mutual communication technologies, e.g., RDMA, PCIE, or HiSilicon's own HCCS bus, finally reaching the memory controllers of the SoCs and becoming one read/write access request transaction after another against the memory units (DDR/HBM). Studies of different AI network models show that the types of read/write requests that AI networks of different kinds and deployment scales (e.g., resnet50, yolo, Wide&Deep, GPT2, GPT3) can issue toward the storage system are essentially all summarized in the table above. It can be understood that different models may have different traffic patterns at runtime, but the above classification can still be applied, each flow having its specific role, so the classification method, or its core idea, applies to any network model. Moreover, the flow division can be performed before the corresponding model runs, e.g., by judging which memory access operations its computation involves, which functions it calls, and which APIs it calls, to identify the flow a task belongs to.
Further, Table 2 shows that the different flows above can be assigned different category marks, i.e., QoS IDs: control traffic, intra-layer model-parallel feature map communication, data-parallel parameter prefetch, feature map sharing, feature map prefetch, embedding read/write, data-parallel parameter All-Reduce, AI CORE compute, CMO operation, general CPU compute, and image/video accelerator sample traffic correspond to different QoS IDs, namely 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, and 11, while the QoS priorities corresponding to these different QoS IDs can be the same or different, depending on whether memory access contention exists between the flows. For example, in Table 2, control traffic has QoS ID 1 with QoS priority 1; the intra-layer model-parallel feature map communication flow has QoS ID 2 with priority 2; data-parallel parameter prefetch has QoS ID 3, feature map sharing QoS ID 4, feature map prefetch QoS ID 5, and embedding read/write QoS ID 6, yet these flows with different QoS IDs all have QoS priority 3 (in Table 2, the smaller the priority value, the higher the level). Thus, in the embodiments, different business flows have different QoS IDs, but different QoS IDs can correspond to the same or different QoS priorities, depending on whether memory access contention exists between the flows: with contention, different flows can map to different priorities so that tasks under different flows receive QoS service per their differing needs and excessive contention is avoided; without contention, different flows can also map to the same priority.
The following exemplifies how the corresponding flow classification is obtained in the present application by analyzing an AI network model. See Fig. 2C, a schematic of the relationship between the flow types involved in running a resnet50 network and memory access bandwidth: the figure shows the various flows of a resnet50 run, including forward computation FP, backward computation BP, and the gradient aggregation, parameter update, and application flows. Fig. 2C shows that stage 1 of the gradient aggregation traffic, 1671 GB, superimposed concurrently on the backward computation traffic (672 GB), exceeds the 2000 GB (2 TB) total bandwidth the HBM can provide. In this case, to guarantee the best performance of the whole system, different bandwidths and access QoS priorities must be assigned to the gradient aggregation and backward computation traffic; in the example of Fig. 2C, the embodiments consider raising the memory access bandwidth priority of gradient aggregation to reduce the tail time of the AI computation and keep the latency of each iteration shortest. Conversely, without effective management and control of these flows' memory access behavior and priorities, contention among the flows for precious memory bandwidth is unavoidable, and the resulting performance jitter is hard to control and optimize.
The embodiments can classify and identify the business flows of an AI network model and then give different flow types different flow labels (i.e., QoS IDs); on this basis, given the hardware capability of the AI SoC in the AI system and the user's actual AI network traffic model, they offer customers a simple technical means to tune, for the different flows, a set of QoS configuration parameters fitting the customer's actual AI network model, guaranteeing the user the best model training performance on the AI platform and helping customers release the hardware's maximum compute power.
Based on the hardware architecture of the AI system above and the functions of its components, an embodiment of the present application further provides a software stack framework running on that hardware architecture (the Davinci software stack), usable to concretely implement the corresponding functions of the AI system herein. It can be understood that the Davinci stack is only one possible AI platform or software architecture (software stack) for implementing the AI system herein and does not limit the AI platforms or stacks applicable to the AI system of any embodiment. Below, the embodiments describe by way of example the software modules of the stack framework, their functions, and the software flows involved.
See Fig. 3A, a schematic framework diagram of a Davinci software stack provided by an embodiment of the present application. The framework shown in Fig. 3A is mainly divided into the HOST side (corresponding to the host 10 side herein) and the DEVICE side (corresponding to the AI SoC 20 side herein), and the software modules involved inside the HOST side and the DEVICE side and their corresponding functions are described respectively. The software modules related to the QoS access control function of the AI system described herein may specifically include the following:
(1) HOST side
1. Graph Engine_Model Distribution Subsystem_TensorFlow adapter (GE_MDS_TFD)
These Graph Engine (GE) modules can, based on the model graph context, the labels in the training script, and the operator types used by each subgraph, identify the different business flows in training, such as forward computation, backward computation, and collective communication, attach the framework's predefined labels to the different flows, and then pass them on to GE/HCCL.
2. Graph Engine / Fusion Engine / Huawei Collective Communication Library (GE/FE/HCCL)
GE/FE/HCCL, per the business flow labels passed over by graph switching and the Model Distribution Subsystem (MDS), call the API provided by the QoS Manager module to obtain the QoS ID and QoS corresponding to the flow label, then pass the information to RUNTIME to assemble the tasks' Send Queue Elements (SQE); finally RUNTIME issues these tasks' SQEs into the RunTime Send Queue (RTSQ).
3. QoS management library (libQoSManager)
libQoSManager, as a dynamic library, provides, based on the global QoS planning table, the QoS resource request interface of QoS configuration items for the different data flows of GE/HCCL/DVPP and other services; GE/HCCL supply a flow label and call the interface to obtain the QoS ID and QoS information the label maps to in the global planning table.
4. QoS adaptive tool (QoSAutoAdjustTools)
QoSAutoAdjustTools is a command-line tool running on the Host side whose main functions include:
(1) querying the QoS configuration of all Devices managed by the Host server, or of a specified Device, consistent with the MATA QoS registers; displaying, as a list, the QoS priority corresponding to each partid, the bandwidth high/low watermarks, whether the hardware limit (hardlimit) is enabled, the flow name corresponding to the QoS ID, the flow label, and other information; the information can be shown on the command line or saved to a specified file;
(2) querying the actual measured traffic of all QoS IDs of a specified device or of all devices, which requires starting a thread that periodically issues commands to the device-side QoS Monitor driver to fetch real-time data; after fetching, the data can be saved to a specified file or shown on the command line;
(3) supporting QoS adjustment and configuration of a specified QoS ID; this function is used mainly in the debugging phase and can help users quickly optimize the QoS configuration and try the gains of different QoS schemes; after tuning experience is gained, the automatic QoS adjustment algorithm is also implemented in this tool;
(4) supporting issuing, in one shot, the bandwidth and QoS priority of all QoS IDs in the global QoS planning table to the device-side QoS driver, which configures them into the MATA registers;
(5) based on GA/RL automatic optimization algorithms, adapting to the various NN training networks and to the bandwidth the actual HBM of the device purchased by the user can provide, automatically finding the optimal QoS priority and bandwidth configuration for each business flow.
These functions of the tool all depend on the new QoS APIs added to the Host-side Device State Manage Interface (DSMI); through that interface and the forwarding of the Device's DSMI driver framework, the DEVICE-side QoS driver finally implements the corresponding functions.
5. QoS Device State Manage Interface (QoS_DSMI)
DSMI is the common device management Application Programming Interface (API) of the D-series chips; at the bottom it communicates with the Device-side DSMI driver framework through the Host Device Communication (HDC) channel, and the DSMI driver framework calls the callback functions pre-registered by the device-side QoS driver to implement the related configuration issuing and status query functions. This mechanism greatly simplifies management of the Device-side QoS-related functions. Since QoS-related configuration issuing and status query are new functions, new interfaces and function support must be added in the DSMI module.
(2) DEVICE side
1. DSMI driver framework (DSMI_DRV_FRAMEWORK)
The DSMI driver framework is a set of generic kernel-mode driver modules providing an easily extensible implementation of the DSMI command-forwarding framework and a kernel-mode command registration interface, making it easy for newly added driver modules to implement the DSMI interface. The QoS driver only needs to register the handlers of its newly added commands with the framework; upon receiving HOST-side QoS-related configuration and query commands, the framework automatically forwards them to the callback handlers the QoS module registered in advance.
2. QoS Driver module
The QoS Driver module is deployed as a kernel-mode module on the Device side; it establishes a communication channel with the Host-side QoS Host Driver through the DSMI kernel-mode interface and completes the main QoS management functions, consisting mainly of the following four submodules:
(1) DSMI hook (DSMI_HOOK) module
DSMI_HOOK implements the host side's various QoS configuration and query commands and registers their implementation interfaces with the DSMI driver framework. When the framework receives QoS-related commands, it automatically calls these pre-registered callbacks to complete the query and configuration of the QoS-related commands.
Note in particular that for the command configuring bandwidth for a given QoS ID, in a virtualization scenario the command must return "not supported", because in virtualization the bandwidth configuration is computed and configured automatically by the QoS driver from the compute power of each Virtual Function (VF), not configured by the user through the DSMI interface.
For bandwidth query commands in a virtualization scenario, the Virtual Function ID (VFID) carried in the DSMI command must be converted into the QoS ID corresponding to that VFID, and the bandwidth of that QoS ID is then looked up in the monitor maintained by the QoS driver, because in virtualization the physical QoS ID is invisible to the end virtual machine user.
(2) Input/Output Control (IOCTL) module
The IOCTL module packages the QoS driver as a character platform device driver and provides the IOCTL interface for the user-mode processes of the virtual machine on the DEVICE side in virtualization scenarios to call, e.g., when the QoS ID of such a process needs to be configured.
(3) QoS Config module
The QoS Config module is mainly responsible for directly configuring into the MATA registers the QoS corresponding to each QoS ID requested by the Host side. It is also responsible for handling the QoS query commands of the Host-side QoS tools, returning the QoS configuration of all QoS IDs on the Device through the DSMI interface to the Host's QoS tools for display to the user.
(4) QoS Monitor
This module mainly implements two functions:
① collecting the actual HBM bandwidth traffic of each QoS ID issued to the Device side;
② performing statistical computation on the collected bandwidth traffic data and providing a query interface to the Device adaptive adjustment algorithm and the Host-side QoS tools, facilitating their periodic queries of these bandwidth statistics.
Based on the above introduction to the HOST-side and DEVICE-side software modules of the Davinci software stack framework and their functions, the interaction flow involved among the software modules related to the QoS access control function of the AI system herein is described below. See Fig. 3B, a schematic diagram of the interaction flow among the software modules in the Davinci software stack, which may specifically include the following:
(1) After the QoS driver module is loaded, it creates a character device to give the user-mode Device Manage Protocol (DMP) program or virtual machine processes an IOCTL interface, so as to control the QoS driver to execute QoS configuration and query commands; note that in virtualization scenarios multiple processes must be able to open the QoS device driver simultaneously, with possibly concurrent IOCTL commands;
(2) the kernel-mode DSMI driver framework performs its internal initialization;
(3) the QoS driver module registers the QoS-related command handling hook functions with the DSMI driver framework;
(4) the libQoSManager module provides an initialization interface, and the NPUTOOL tool calls it to perform QoS configuration initialization;
(5) in that initialization function, libQoSManager parses the configuration values of the QoS IDs in the global QoS configuration table and loads them into memory;
(6) it then packs the QoS configuration of each QoS ID into a command message issued to the device side through the DSMI interface;
(7) the DSMI module implements a transfer interface for QoS configuration messages, based at the bottom on HDC communication; after HDC the message is transmitted to the DSMI driver framework on the DEVICE side;
(8) the DEVICE-side DSMI driver framework parses the message, recognizes it as a QoS-related command, and calls the callback hook functions provided by the QoS device driver;
(9) the QoSHook module reads the QoS IDs in the configuration message and calls the MataConfig interfaces one by one to configure the bandwidth and QoS value of each QoS ID into the MATA hardware registers;
(10) and adds each QoS ID to the MATA Monitor so that the actual HBM bandwidth of each QoS ID can be measured;
(11) in the libQoSManager initialization function, the DSMI interface must also be called to configure the QoS ID and QoS of the RoCE engine; RoCE supports configuring one QoS ID and QoS per PF, and the RoCE of the 1981 supports at most two PFs;
(12) in the libQoSManager initialization function, the DSMI interface must also be called to configure the QoS ID and QoS of each PCIE channel; PCIE supports 48 independent channels, each configurable with one QoS ID and QoS; the QoS IDs and QoS of the PCIE channels used by the data plane and the management plane of the 1981 must be configured separately, with the data plane's multiple PCIE channels using the same QoS ID and QoS configuration;
(13) the MDS module in GE is responsible for identifying the different business flows and passes the identified flow labels (i.e., QoS IDs), together with the information of which DEVICE and which DIE the flow will run on, to the GE execution framework during graph loading;
(14) the GE execution framework, per the flow labels and DEVICE ID/DIE ID information passed by MDS, requests different QoS IDs for the different business flows (collective communication, AI computation, DVPP, etc.) and saves them in its internal context;
(15) GE loads operator kernels through the interface provided by Runtime and carries these QoS ID and QoS data in the tasks; upon receiving the tasks, Runtime constructs the SQEs from the operator type, the QoS ID and QoS priority values, the hardware SQE format, and other task information, calling the relevant operators' interfaces;
(16) after constructing the SQEs, the tasks are issued into the RTSQ, then a doorbell is issued to trigger stars to perform task scheduling;
(17) when the QoS-related configuration needs to be reset, the UninitQoSLib interface provided by libQoSManager is required, e.g., called by the NPU TOOL tool;
(18) in the UninitQoSLib interface, libQoSManager parses the standard Memory System Resource Partitioning and Monitoring (MPAM) configuration of each destination DEVICE in the global QoS configuration table and packs a QoS-configuration-clearing message to the DSMI interface;
(19) DSMI passes the message, via its generic mechanism, to the IOCTL submodule in the DEVICE-side QoSDriver;
(20) the IOCTL submodule parses the message, removes each QoS ID from the Monitor, and clears the in-memory statistics for that QoS ID;
(21) the IOCTL submodule parses the message and clears each QoS ID from the MATA configuration registers;
(22) and (23) for the RoCE and PCIE channels, similar DSMI interface implementations must also be provided; the difference is that in the RoCE and PCIE drivers what must be cleared is the QoS ID and QoS configuration given to these masters, not the bandwidth watermarks and QoS priority configuration assigned at the MATA to the QoS IDs corresponding to these masters.
Based on the business flow classification and corresponding QoS matching method provided above, the present application further provides a method for identifying business flow types per the above classification method, able to cope with the endless variety of AI network models and the industry's different framework platforms (e.g., the TensorFlow, Pytorch, and MindSpore frameworks). Accurately and reliably identifying the various flow types in an AI network model is the basis for configuring, in later stages, suitable QoS parameters (i.e., QoS IDs) for the various flows. Below, the embodiments take the TensorFlow framework as an example and, with the Davinci computing platform software architecture above, explain the method for identifying the various flow types of an AI network model.
Business flow type identification in the embodiments means marking and classifying, on each compute or communication node of the graph during the graph compilation/optimization phase of the AI network model, the data access types that will be issued in the graph execution phase. The embodiments adopt a method of separating the marking between the compilation phase and the run phase: as shown in Fig. 4A, a schematic of the graph compilation phase and the graph execution phase, abstract flow-type labels are used to classify and tag the different flows during graph generation, optimization, and compilation of the AI network, and before the graph is loaded onto the device for execution, the abstract flow labels are converted by table lookup into QoS IDs recognizable by the physical hardware. For example, in the graph compilation phase of Fig. 4A, the job consists of graph nodes A, B, C, D, and E, each of which may carry the corresponding abstract flow-type label (QoS Label) during compilation, with the QoS value being the QoS priority of the corresponding flow; in the graph execution phase, after table lookup, the flow-type labels (QoS Labels) carried by the nodes are replaced by the QoS IDs the AI system can recognize.
This method facilitates the goal of separating model compilation from execution, because model compilation is usually done on the user's general-purpose CPU system (e.g., the host Host herein) while model execution is done on the dedicated AI computing SoC (e.g., the AI SoC 20 herein). In addition, this architecture separating abstract labels from physical QoS IDs can present users with easily understood flow categories without their having to perceive the hardware QoS IDs tied to the physical implementation, and the platform framework need not expose the underlying physical implementation to ordinary users, improving usability and system security at the same time and allowing the platform framework to evolve later without user-visible changes. In the flow marking method of the embodiments, after the model is built, the GE module converts the model into generic AI graph nodes, and multiple modules of the graph compilation phase cooperate to complete it; the modules involved mainly include the TensorFlow framework and, inside the Davinci platform, the Graph Engine GE, the Fusion Engine (FE), and the graph-switching and Model Distribution Subsystem (MDS).
When building an AI network model in the Python scripting language, the TensorFlow framework provides a capability called user-defined scope attributes: while writing a script, the user can use it to automatically add the attribute value to all graph nodes of the computation within a specified scope. For example, as shown in Fig. 4B, a build script diagram of a resnet50 AI model: the computation at line 106 of the script uses the QoS Label 1 specified at line 105, while lines 119 to 124 use the QoS Label 2 specified at line 118. These QoS Label values are predefined enumeration values; for example, on the Davinci platform, the enumeration values of the various flow types are defined as follows:
The values of QoSServiceLableType are defined by QoSManager:
[The enumeration definition appears in the original as image PCTCN2022078504-appb-000002; it lists the predefined QoS label enumeration values.]
In the graph execution phase, the Davinci platform framework calls an internal lookup function to convert the enumeration value into the QoS ID the hardware can recognize and process; thereafter all memory access requests carry that QoS ID, so that the lowest-level memory access controller (i.e., the memory controller herein) decides the final execution policy of each memory access request per the software-configured QoS policy and other statistics. A sketch of this label-to-ID lowering follows.
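A minimal Python sketch of the compile/run split: abstract labels at graph-build time, then a table lookup to physical QoS IDs before the graph is loaded onto the device. The enumeration names and the table contents are assumptions for illustration, not the platform's actual definitions.

# Abstract flow-type labels carried on graph nodes at compile time (assumed values).
ABSTRACT_LABELS = {"QOS_LABEL_FP": 1, "QOS_LABEL_ALLREDUCE": 2}
# Platform lookup table: abstract label value -> physical, hardware-visible QoS ID.
LABEL_TO_QOS_ID = {1: 8, 2: 7}

def lower_graph(nodes):
    # nodes: list of dicts like {"op": ..., "qos_label": ...}
    for node in nodes:
        node["qos_id"] = LABEL_TO_QOS_ID[node["qos_label"]]
        del node["qos_label"]      # the hardware sees only the physical QoS ID
    return nodes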
Note that using scope labels in the script language to annotate QoS types for different business flows only suits frameworks like TensorFlow and does not necessarily suit others. To solve this problem and provide a more general approach, the Davinci platform offers high-level Python APIs giving users the ability to annotate QoS for user-specified scopes and specified computations. In addition, in the Davinci platform framework, whichever AI computing framework is used, after model building the model must be parsed by an intermediate parser into a front-end expression and then, through graph compilation and optimization, become a concrete computing task finally executable on the Davinci platform. As shown in Fig. 4C, a schematic of an executable computing task after graph compilation and optimization: after an AI network model is built with a deep learning development framework such as TensorFlow/Pytorch/ME (① in Fig. 4C), the graph engine of the AI platform, such as the GE module, converts the built model into generic AI graph nodes (④ in Fig. 4C); the conversion may require optimization by Compute Engine Plugins (② or ③ in Fig. 4C) such as the CANN neural network computing architecture, the Huawei collective communication library HCCL, and the Process pre-processing module, finally parsing the AI network model into computing tasks executable in the AI system for execution (Runtime). While compiling the original front-end expression into compute operators tied to the Davinci platform hardware, the internal modules of the Davinci platform merge and split the graph's nodes or add communication nodes and other operations, such as cache flush operations. For the nodes produced by these operations the Davinci platform executes in the background, the platform knows the types of the various operations: newly added nodes inherit the QoS label of their parent node, and for particular operations, such as cache operations and the model-parallel data communication flows automatically produced by deploying a model across chips and DIEs, the platform also automatically inserts predefined QoS labels.
For particular hardware accelerators, e.g., ISP, DVPP, ASP, and GPU, the QoS ID is carried directly when they issue memory accesses, by directly configuring hardware registers. For memory access traffic issued by the general-purpose CPU on the AI computing SoC, the standard Memory System Resource Partitioning and Monitoring (MPAM) mechanism provided by ARM CPUs can be used, with the standard API functions provided by the Linux operating system configuring the QoS ID and QoS information of the CPU and its processes.
In the present application, when a Master in a subsystem of the AI SoC 20 receives a to-be-executed task, the task already carries its corresponding QoS ID, and the initial priority corresponding to that QoS ID (the second QoS priority) has already been configured; the AI system of the embodiments may further include the function of continually updating and optimizing the access parameters matched to the QoS priorities of the different QoS IDs. The following describes how AI system 01 or 02 updates and optimizes the access parameters of the QoS IDs while the whole system runs.
In one possible implementation, the host or the target Master is further configured to preconfigure the second QoS priority for the QoS ID in the task, the second QoS priority being the QoS ID's initial priority. For example, the initial QoS priority matched to each QoS ID may be configured by the host (Host) or by a register in the target Master.
In this embodiment, the host side, or the target Master internally, also configures the initial or source QoS priority (the second QoS priority) per task, i.e., configures a matching QoS priority per QoS ID, so that subsequent modules in the AI SoC can adjust the in-flight or final QoS priority on this basis.
In one possible implementation, the host is further configured to update and optimize the second QoS priority of each QoS ID per real-time monitoring of the AI system's access performance; i.e., the initial priorities corresponding to the QoS IDs can all be tuned and optimized continually. Optionally, the updating uses search-based optimization and adaptive machine learning algorithms on the real-time monitoring information.
In this embodiment, the host side of the AI system can update and optimize the initial QoS priorities of the QoS IDs in the system per real-time monitoring of access performance, e.g., adaptively performing automatic QoS optimization through search-based optimization and adaptive machine learning algorithms.
The Davinci AI computing platform provided by the above embodiments is mainly applicable to AI training and inference scenarios, and memory access QoS control capability is provided through the platform. In practice the QoS control function faces two main difficulties: one is how to identify and classify business flows accurately, for which see the embodiments on flow classification and identification above; the other is how to obtain efficiently a set of optimal QoS configuration parameters fitting actual application needs, giving ordinary users a general solution for the various business flows and hardware accelerators in the face of training networks of different scales, individualized machine compute and memory bandwidth configurations, and inconsistent AI server cluster scales. To solve this technical problem, the invention proposes an automatic QoS optimization algorithm based on machine learning. For its software architecture see Fig. 5A, a schematic of the software architecture of automatic QoS optimization: as shown there, the software architecture mainly includes a real-time system performance monitoring module 501, a system working environment and accompanying parameter input module 502, a QoS optimization algorithm module 503, a QoS configuration interface and driver 504, an optimization algorithm termination indication module 505, and an optimization algorithm output module 506. Further, see Fig. 5B, a schematic flow diagram of an automatic QoS optimization method; the method flow of Fig. 5B is described with the related modules of Fig. 5A, as shown in Figs. 5A and 5B.
1. Real-time system performance monitoring module 501: collects in real time the key performance data of the AI network running on the device, which may specifically include:
(1) the duration of each training iteration of the AI network;
(2) the utilization of key hardware resources, e.g., the AI CORE;
(3) the latency of communication data flows during training;
(4) the real-time bandwidth of the memory access controllers.
2. System working environment and accompanying parameter input module 502: provides the algorithm, via configuration files, with the current working environment parameters, which may specifically include:
(1) the hardware platform's current maximum theoretical bandwidth and the actual utilization achievable;
(2) all QoS IDs corresponding to the training network's business flow types, with default bandwidth high/low watermarks and QoS values;
(3) the QoS IDs, default bandwidth high/low watermarks, and QoS values of hardware accelerators such as CPU, DVPP, RoCE, PCIE, ISP, and GPU.
3. QoS optimization algorithm module 503: based on a Bayesian machine learning algorithm, through multiple rounds of iterative probing, and collecting the system's performance feedback after corrections to the QoS tuning parameters, finally outputs a set of optimal QoS working parameters for the given working environment and records them to a file as the QoS configuration parameters for the system's subsequent formal operation.
4. QoS configuration interface and driver 504: provides user-mode API interfaces to the QoS optimization algorithm, supporting real-time online adjustment of the bandwidth high/low watermarks and QoS priorities configured on the memory access controllers for each QoS ID.
5. Optimization algorithm termination indication module 505: indicates termination of the optimization algorithm in the following cases, which may specifically include:
(1) the iteration-time jitter over N rounds reaches the design requirement: jitter within a server must not exceed 0.1 ms and jitter between servers must not exceed 0.5 ms;
(2) the utilization of key compute resources reaches the expected target, e.g., 95%;
(3) the system throughput reaches the set target, e.g., resnet50 reaching 9200 fps.
6. Optimization algorithm output module 506: after continual iteration reaches the optimal QoS configuration, outputs this configuration into the global QoS planning configuration file, e.g., in the format of the table below.
For example, with the automatic QoS optimization software architecture provided in Fig. 5A above, the interaction flow among the software modules of that architecture is described below per the method flow of Fig. 5B. The schematic flow of automatic QoS optimization provided in Fig. 5B may include at least steps S501 to S508. Step S501: the QoS optimization algorithm module 503 reads the environment parameters from the system working environment and accompanying parameter input module 502 and performs algorithm initialization. Step S502: the QoS parameters of each QoS ID (e.g., the initial QoS priority of the QoS ID, i.e., the second QoS priority) are issued to the hardware (e.g., the memory controllers) through the QoS configuration interface and driver 504. Step S503: the real-time system performance monitoring module 501 collects system performance data per the collection interval, configuration parameters, etc. Step S504: the algorithm module 503 applies noise filtering to the collected performance data (e.g., Gaussian filtering, median filtering). Step S505: judge whether the system performance indicators satisfy the condition to stop optimizing (e.g., the mean square deviation of the multi-round iteration durations is minimal or enters a set threshold). Step S506: if the stop condition is met, the output module 506 saves the obtained optimal QoS parameters to the result file, and the termination indication module 505 indicates termination of the optimization algorithm. Step S507: if the stop condition is not met, infeasible solutions of the problem domain are excluded (e.g., QoS values range over 0 to 7, and the bandwidth watermark of any QoS ID cannot exceed the total bandwidth). Step S508: the algorithm module 503 uses a Bayesian prediction algorithm to compute the next group of candidate QoS configuration parameters and re-issues the updated QoS parameters of the QoS IDs to the hardware (e.g., the memory controllers) through the interface and driver 504; that is, the steps from S502 to S503 iterate again, and the relevant QoS parameters herein are optimized continually in this repeated iteration. It can be understood that the above automatic QoS optimization principle and flow likewise apply to the process, herein, of the MATA updating and optimizing the access policy control parameters of the QoS IDs, not repeated here. A condensed sketch of such a tuning loop follows.
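A minimal Python tuning loop in the shape of the S501 to S508 flow. The apply_config, measure, and propose callbacks stand in for modules 504, 501, and 503 respectively and are assumptions; the stop thresholds mirror module 505's example values.

def autotune(apply_config, measure, propose, max_rounds=100):
    best_cfg, best_jitter = None, float("inf")
    cfg = propose(None, None)                        # initial QoS parameters (S501/S502)
    for _ in range(max_rounds):
        apply_config(cfg)                            # write per-QoS-ID bandwidth/priority (S502)
        perf = measure()                             # iteration time, utilization, bandwidth (S503/S504)
        if perf["jitter_ms"] < best_jitter:
            best_cfg, best_jitter = cfg, perf["jitter_ms"]
        if perf["jitter_ms"] <= 0.1 and perf["util"] >= 0.95:
            break                                    # stop condition reached (S505/S506)
        cfg = propose(cfg, perf)                     # next Bayesian candidate within legal ranges (S507/S508)
    return best_cfg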
In the present application, when a to-be-executed computing task is scheduled from the host 10 side of AI system 01 into the AI SoC 20, the question arises how, inside the AI SoC 20, the to-be-executed tasks are assigned to the target Masters suited to executing them; the AI system of the embodiments may therefore further include the function of scheduling and assigning the tasks. The following describes how, in AI system 01 or 02, the tasks are scheduled to suitable Masters on suitable subsystems.
In one possible implementation, the system scheduler is configured to: receive the one or more to-be-executed tasks sent by the host, each further carrying a task descriptor describing the task type; select, per the descriptor carried in each task, a matching subsystem from the M subsystems and a matching Master from the one or more Masters of that subsystem; and schedule each task onto the matching Master of the matching subsystem.
In this embodiment, after the host identifies the business flows, attaches QoS IDs, and sends the tasks to the system scheduler on the AI SoC, the scheduler can allocate all tasks sent by the Host reasonably, e.g., per the task descriptors carried in the tasks, assigning each task a suitable subsystem and Master per the task type the descriptor describes, so that execution or acceleration of each task completes better; e.g., an AI matrix computing task is assigned to a suitable AI subsystem and to an idle Master on that subsystem.
In the present application, the AI system 03 of Fig. 1C above can also be applied to bandwidth isolation between virtual machine tenants in virtualization scenarios; that is, the AI system of the embodiments may further include functions such as isolating memory bandwidth between different virtual machine tenants and bandwidth commitment. The following describes how AI system 03 achieves inter-tenant bandwidth isolation and access control in virtualization scenarios.
In one possible implementation, when the AI system is applied in a virtualization scenario, it includes multiple virtual machines, each corresponding to one or more processes, a process including one or more computing tasks; the processes run on one or more Masters of at least one of the M subsystems; the system scheduler is further configured to assign each virtual machine a VM ID, the page tables of each machine's one or more processes all sharing the machine's VM ID.
In this embodiment, in virtualization a VM ID is assigned per virtual machine and all processes under a machine correspond to that same VM ID, in order to isolate different machines from one another and guarantee secure isolation and mutual non-interference between the users of different machines.
In one possible implementation, when the AI system of the embodiments is in a virtualization scenario, a process includes one or more computing tasks; the target subsystem further includes a system memory management unit SMMU; the target Master is further configured to send the task's memory access request to the SMMU, which updates the QoS ID carried in it; the SMMU is configured to: receive the request sent by the target Master; determine, from the virtual address and the service set identifier SSID in the request, the target process the task belongs to; determine, from the target process's page table, the VM ID of the target virtual machine corresponding to the process; and replace the QoS ID carried in the task's request with the VM ID of the target virtual machine. Specifically, each virtual machine has many processes and each process may include many computing tasks, i.e., a process includes multiple tasks; one can first judge which process a task belongs to (generally 32 processes), then which virtual machine the process corresponds to, and then replace the QoS ID carried in the task with that machine's VM ID. Note that when the AI system is applied in a virtualization scenario, the QoS IDs carried by the tasks are replaced inside the subsystems; once replacement completes, all flows from the subsystems to the SoC bus, from the bus to the MATA, and from the MATA to the memory controllers can remain the same as in the non-virtualized scenario. That is, when the initial QoS ID carried in a task is replaced with the VM ID of the task's virtual machine, the non-virtualized processing can still be used afterwards, e.g., temporarily adjusting the priority of some QoS ID through the sub-scheduler inside the subsystem, restoring the temporarily adjusted priority through the SoC bus, and finally confirming and optimizing the priority of the QoS ID carried in a request through the MATA; see the related descriptions of the embodiments of Figs. 1A to 5B above, not repeated here.
In this embodiment, when the AI system is in a virtualization scenario, the original QoS ID assignment and circulation flow is replaced by uniformly assigning QoS IDs per the virtual machine a process belongs to: each Master, through the SMMU in that Master, replaces the QoS ID carried in received requests uniformly with the VM ID of the virtual machine of the process the request's task belongs to. The purpose is to make bandwidth security isolation the primary goal in this scenario as far as possible, satisfying the basic needs of data isolation, compute resource isolation, and mutual non-interference between virtual machine users; further, it also solves memory bandwidth isolation and bandwidth commitment between the users of different virtual machines.
In a possible implementation, the AI SoC further includes an L2 Cache; the L2 cache is configured to: receive the memory access requests of the computing tasks, and access the corresponding storage regions in the L2 Cache according to the QoS ID carried in each memory access request, where memory access requests carrying different QoS IDs correspond to different storage regions in the L2 Cache.
In the embodiments of the present application, the QoS ID carried in each memory access request controls which storage region of the cache that request may access; that is, the QoS ID in the memory access request is used to securely isolate the corresponding storage regions in the cache. Since the processes under each virtual machine correspond to the ID of that virtual machine, namely the VM ID, the VM ID can be carried as the QoS ID in the corresponding memory access requests, and cache isolation can be performed on this basis to achieve secure isolation in the virtual machine scenario.
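As an illustration only, the following is a minimal Python sketch of one way to realize such region isolation: each QoS ID owns a disjoint slice of the cache ways, so requests from different VMs can never evict each other's lines. The way counts, region split and replacement policy are invented for this example.

```python
# QoS ID -> allowed cache ways (16 ways total in this sketch).
REGION_BY_QOS = {5: range(0, 8), 9: range(8, 16)}

class L2Cache:
    def __init__(self):
        self.lines = {}                          # (way, set_index) -> tag

    def access(self, qos_id, addr, sets=1024):
        ways = REGION_BY_QOS[qos_id]
        idx = (addr >> 6) % sets                 # 64-byte lines
        tag = addr >> 16
        for w in ways:                           # hit check in own region only
            if self.lines.get((w, idx)) == tag:
                return "hit"
        victim = min(ways)                       # trivial replacement policy
        self.lines[(victim, idx)] = tag
        return "miss"

cache = L2Cache()
print(cache.access(qos_id=5, addr=0x12345678))  # miss, fills ways 0..7 only
```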
In the above virtualization scenario, the powerful computing capability of the Davinci AI platform can be partitioned into multiple independent computing function units, usually called virtual functions (VF) with independent computing power and resources. With hardware support for the virtualization specification Single Root I/O Virtualization (SR-IOV) and mature software virtualization technology such as Qemu+KVM, virtual machines with AI computing capability can be offered to customers as a cloud service. Between different virtual machine users, user data isolation, computing resource isolation, and mutual non-interference of exceptions are basic requirements. The embodiments of the present application introduce the QoS control technology for memory access into the virtualization scenario, which can effectively solve the problems of memory bandwidth isolation and bandwidth guarantees between different virtual machine tenants.
In the virtual machine scenario, to prevent malicious attacks, the QoS ID information carried in the SQE by an application on a virtual machine is not trusted; the SQE may still carry per-service-flow QoS information, but once the QoS ID information carried in the SQE reaches the device (DEVICE) side, it is replaced by the QoS ID, namely the VM ID, configured in the SMMU accessed by the virtual machine.
Referring to FIG. 6A, FIG. 6A is a schematic diagram of the software architecture of an AI system in a virtualization scenario according to an embodiment of the present application. The software stack shown in FIG. 6A is mainly divided into the HOST side (that is, the host 10 side in the present application) and the DEVICE side (that is, the AI SoC 20 side in the present application), and the software modules involved on the HOST side and the DEVICE side and their corresponding functions are described respectively. The software modules related to the QoS memory access control function of the AI system in the virtualization application scenario may specifically include the following:
(I) HOST side
1. Graph switching and model deployment subsystem / graph engine / Huawei Collective Communication Library / QoS management library (MDS/GE/HCCL/libQoSManager)
The functions of these modules are exactly the same as in the bare-metal scenario; the only difference is that each virtual machine has its own global QoS configuration table, and the virtual machines do not affect each other.
2. QoS adaptation tool (QoS Adjust Tools)
(1) This tool is provided to virtual machine users to support adjusting the QoS priorities of different service flows by themselves; after adjustment, the result is saved into the QoS configuration table of the local virtual machine, and libQoSManager looks up the QoS values in this table and returns them to GE/HCCL for use.
(2) Users can use this tool to query the bandwidth usage of the local virtual machine user.
(3) This tool does not support delivering bandwidth and QoS configuration based on QoS ID.
(II) DEVICE side
3. Virtual machine process (VM PROCESS)
After each virtual machine user process starts, a corresponding process is started on the DEVICE side. Once started, this process can obtain the SSID on the accelerator to which the process is bound and the VM ID of the virtual machine to which it belongs, and notifies QoSDriver of this SSID and VM ID through the IOCTL interface provided by QoSDriver.
4. QoS driver (QoS Driver)
The main functions of QoSDriver are basically the same as in the bare-metal scenario, but in the virtualization scenario the implementation of the IOCTL commands differs from bare metal, the main difference lying in the implementation of the QoS ID configuration flow, described as follows (a sketch follows this list):
(1) When a virtual machine process starts, a QoS ID is allocated for that virtual machine; the multiple processes of one virtual machine share the same QoS ID;
(2) According to the resource configuration at the time the virtual machine was created, the interface of the device management driver is called to obtain the maximum bandwidth of the virtual machine, and the QoSConfig interface is called to configure the bandwidth of this QoS ID into the MATA registers;
(3) QoSDriver also needs to add the QoS ID of the virtual machine to the Monitor to collect its actual bandwidth usage data;
(4) When a virtual machine process exits, QoSDriver is called through the IOCTL interface to perform the reverse QoS resource release and cleanup, such as reclaiming the QoS ID, removing the statistics for this QoS ID from the Monitor, and clearing the bandwidth and priority allocated to this QoS ID in the MATA.
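As an illustration only, the following is a minimal user-space Python model of this driver bookkeeping: allocate a per-VM QoS ID, configure its bandwidth in a stand-in for the MATA registers, register it with the Monitor, and undo everything on exit. All classes and parameters are invented for this example.

```python
class QoSDriverModel:
    def __init__(self, mata, monitor, max_ids=64):
        self.free_ids = list(range(max_ids))
        self.by_vm = {}                      # VM ID -> QoS ID (shared by procs)
        self.mata, self.monitor = mata, monitor

    def vm_process_start(self, vm_id, max_bw):
        if vm_id not in self.by_vm:          # first process of this VM
            qos_id = self.free_ids.pop(0)    # (1) allocate a QoS ID
            self.by_vm[vm_id] = qos_id
            self.mata[qos_id] = {"bw": max_bw, "prio": 0}   # (2) configure MATA
            self.monitor.add(qos_id)                         # (3) start sampling
        return self.by_vm[vm_id]             # all VM processes share one QoS ID

    def vm_exit(self, vm_id):
        qos_id = self.by_vm.pop(vm_id)       # (4) reverse release and cleanup
        self.monitor.discard(qos_id)
        del self.mata[qos_id]
        self.free_ids.append(qos_id)

drv = QoSDriverModel(mata={}, monitor=set())
qid = drv.vm_process_start(vm_id=7, max_bw=120.0)
assert drv.vm_process_start(vm_id=7, max_bw=120.0) == qid   # shared QoS ID
drv.vm_exit(vm_id=7)
```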
5. Shared virtual memory (Shared Virtual Memory, SVM0) module
SVM0 can centrally manage the accelerator device drivers used by the virtual machines. This module is mainly used to let the kernel and user-mode processes share virtual addresses and is implemented in the kernel. When the QoS driver calls the interface provided by this module, the module internally traverses the kernel-mode device structures of the device drivers of all accelerators used by the virtual machine, and calls the QoS ID setting interface provided by SMMU DRV to configure the QoS ID in the SMMU CD table for each master that accesses HBM memory through virtual addresses.
6. System memory management unit driver (SMMU DRV)
SMMU DRV is the SMMU driver provided by the kernel and offers the function of configuring the QoS ID in the SMMU CD table entries. In this solution, each virtual machine has exactly one unique QoS ID on each SoC; no matter how many processes the virtual machine has, the SSIDs corresponding to all of those processes are configured with one and the same QoS ID in the SMMU CD table.
7. High bandwidth memory driver (HBM DRV)
HBM DRV can provide QoSDriver with an interface through which the current effective theoretical total HBM bandwidth of the SoC can be obtained. The HBM driver needs to take into account the different HBM capacity configurations, which HBM channels are enabled, the operating frequency, and the actual PG/FG conditions obtained after chip binning, to compute an accurate current theoretical bandwidth.
8. Device-side management module (Device Manage Module, Devmm)
This module is the device-side device management driver (Devmm). It can provide QoSDriver with a query interface for the computing power ratio of a given VF. One VM may have multiple VFs; QoSDriver needs to add up the computing power ratios of all VFs of the VM to compute the proportion of HBM bandwidth the VM should be allocated, and then, according to the total bandwidth returned by the HBM driver, configure the bandwidth watermark of the QoS ID corresponding to the VM in the MATA, as sketched below.
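A minimal Python sketch of this calculation, with invented VF ratios: sum the compute ratios of all VFs of a VM, multiply by the total HBM bandwidth from the HBM driver, and use the result as the watermark for the VM's QoS ID.

```python
# (VM, VF index) -> compute power ratio; values are illustrative.
VF_RATIO = {("vm1", 0): 0.125, ("vm1", 1): 0.125, ("vm2", 0): 0.25}

def vm_bandwidth(vm, total_hbm_bw):
    """Bandwidth share of a VM = sum of its VF ratios * total HBM bandwidth."""
    ratio = sum(r for (v, _vf), r in VF_RATIO.items() if v == vm)
    return ratio * total_hbm_bw

print(vm_bandwidth("vm1", total_hbm_bw=400.0))   # 0.25 * 400 = 100.0 GB/s
```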
9. Resource control (Resource Control, RESCTRL) module
For Load/Store accesses to the HBM issued directly by the AI CPU, the resource isolation function based on the MPAM mechanism provided by the Linux OS kernel, namely the RESCTRL function, is required. Based on this functional interface, the Linux OS can configure a QoS ID for different processes or process groups; afterwards, when the OS scheduler switches to such a process for execution, it configures the CPU MPAM registers according to the QoS ID assigned to that process, so that the HBM read/write requests issued by the CPU carry the QoS ID assigned to each process in advance. The RESCTRL module can provide a QoS ID setting interface for the DEVICE-side process corresponding to the virtual machine. Before calling this interface, the DEVICE-side process of the virtual machine needs to first call the interface provided by the QoSDriver module to obtain the QoS ID of the process.
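As an illustration only, the following is a minimal Python sketch of binding a process to a resctrl control group, assuming a Linux kernel with resctrl mounted at /sys/fs/resctrl; the group name and its mapping to a partition ID (QoS ID) are illustrative, and the call requires root privileges.

```python
import os

RESCTRL = "/sys/fs/resctrl"

def bind_pid_to_qos_group(pid, group):
    """Place a process into a resctrl control group so that its memory
    accesses are tagged with that group's partition ID (QoS ID)."""
    path = os.path.join(RESCTRL, group)
    os.makedirs(path, exist_ok=True)             # create the control group
    with open(os.path.join(path, "tasks"), "w") as f:
        f.write(str(pid))                        # the OS now tags this PID

# bind_pid_to_qos_group(os.getpid(), "vm7_qos")  # needs root + resctrl mount
```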
Referring to FIG. 6B, FIG. 6B is a schematic diagram of the interaction flow among the software modules of the AI system in a virtualization application scenario according to an embodiment of the present application. Based on the software architecture in FIG. 6A above, the flow executed by each module of that architecture in the virtualization scenario is described, and may specifically include the following:
(1) After the QoS driver (QoSDriver) starts, it calls the interface provided by the HBM driver to obtain the current total bandwidth of the SoC.
(2) Multiple processes may be started inside one VM. Each time the HOST-side virtual machine starts a process, a corresponding virtual machine process is also started on the DEVICE side. After this process starts, it needs to open the QoS driver device and call its IOCTL command, passing the SSID, VM ID and VF ID to QoSDriver, to request an available QoS ID for the process; note that all processes belonging to the same VM share one QoS ID.
(3) The IOCTL module in QoSDriver looks up the devmng driver according to the VM ID and VF ID of the virtual machine to obtain the computing power ratio configuration of the virtual machine; if the virtual machine has multiple VFs, the computing power ratios of all its VFs must be added up, and then, based on the total computing power ratio of the virtual machine and the total bandwidth obtained from the HBM driver, a bandwidth allocation for the virtual machine is computed.
(4) The virtual machine process also needs to call the RESCTRL interface provided by the operating system to set the QoS ID obtained from QoSDriver into the operating system, so as to inform the OS of the QoS ID of the process; when the OS scheduler schedules this task for execution, it configures the QoS ID of the process into the CPU MPAM registers, so that the LOAD/STORE operations of the AI CPU process on the HBM carry the correct QoS ID.
(5) QoSDriver allocates a QoS ID for the virtual machine.
(6) After obtaining the SSID and QoS ID, QoSDriver needs to call the SVM0 interface to look up the kernel-mode driver device data structure (struct device) instances of all master devices used by the virtual machine.
(7) Then the configuration interface provided by SMMU DRV is called to configure the QoS ID in the CD table entries of the SMMU of each Master.
(8) QoSDriver also needs to add the QoS ID of the virtual machine to the Monitor to collect its actual bandwidth usage data.
(9) Finally, the QoSConfig interface is called to configure the bandwidth of this QoS ID into the MATA registers.
(10) When the virtual machine process exits, QoSDriver is called through the IOCTL interface to perform the reverse QoS resource release and cleanup.
(11) For example, reclaiming the QoS ID, removing the statistics for this QoS ID from the Monitor, and clearing the bandwidth and priority allocated to this QoS ID in the MATA.
(12) As for the QoS ID configuration in the CD table entries corresponding to the SMMU of each Master, the SMMU driver automatically detects the exit of the virtual machine process and automatically completes the related cleanup and release work.
Referring to FIG. 7, FIG. 7 is a schematic flowchart of a memory access control method according to an embodiment of the present application. The memory access control method is applied to an artificial intelligence AI system; the AI system includes an AI system-on-chip SoC; the AI SoC includes M subsystems and N memory controllers, and the M subsystems and the N memory controllers are interconnected through an SoC bus; the M subsystems include a target subsystem, which is any one of the M subsystems, and the target subsystem includes S processing nodes (Masters), where M, N and S are all integers greater than or equal to 1. The memory access control method is applicable to any of the AI systems in FIG. 1A to FIG. 1C above and to devices containing such an AI system (such as mobile phones, computers and servers). The method may include the following steps S701 and S702.
Step S701: receiving, by a target processing node among the S processing nodes, a computing task to be executed, where the computing task carries a quality of service identifier QoS ID; generating a memory access request of the computing task, where the memory access request carries the QoS ID; and sending the memory access request to a target memory controller among the N memory controllers.
Step S702: receiving, by the target memory controller, the memory access request, determining a first quality of service QoS priority corresponding to the QoS ID, and performing memory access QoS control on the memory access request based on the first QoS priority.
In a possible implementation, the computing task further carries a second QoS priority corresponding to the QoS ID, where the second QoS priority is the initial QoS priority corresponding to the QoS ID in the computing task.
In a possible implementation, the target subsystem further includes a sub-scheduler; the sending, by the target Master, of the memory access request to the target memory controller among the N memory controllers includes: sending, by the target Master, the memory access request to the sub-scheduler, and scheduling it through the sub-scheduler to the target memory controller among the N memory controllers; the method further includes: receiving, by the sub-scheduler, the memory access requests respectively sent by the S Masters in the target subsystem; and scheduling the memory access requests respectively sent by the S Masters onto the SoC bus according to the second QoS priorities corresponding to the QoS IDs carried in those requests, where the second QoS priority is the initial QoS priority of the corresponding QoS ID and indicates the priority with which the corresponding memory access request is scheduled onto the SoC bus.
In a possible implementation, the scheduling, by the sub-scheduler, of the memory access requests respectively sent by the S Masters onto the SoC bus according to the second QoS priorities corresponding to the QoS IDs carried in those requests includes: establishing, by the sub-scheduler, a task queue for each of the S Masters, where each task queue includes the memory access requests sent by the corresponding Master, and the target Master corresponds to a target task queue; when a target memory access request is currently inserted into the target task queue, raising the second QoS priorities corresponding to the QoS IDs carried in all memory access requests in the target task queue to a third QoS priority, where the target memory access request is a memory access request whose carried QoS ID corresponds to a second QoS priority exceeding a preset priority; and sending the memory access requests in the task queues of the S Masters to the SoC bus in sequence according to the second QoS priority or the third QoS priority corresponding to the QoS IDs carried in those requests.
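As an illustration only, the following is a minimal Python sketch of this queue-promotion behavior: inserting a high-priority request promotes the whole queue of its Master to the third QoS priority, and the drain order reflects the effective priorities. The preset threshold and request encoding are invented for this example.

```python
import heapq
from itertools import count

PRESET = 5   # second QoS priorities above this trigger queue promotion

class SubScheduler:
    def __init__(self, masters):
        self.queues = {m: [] for m in masters}   # one task queue per Master
        self.boost = {m: None for m in masters}  # third QoS priority, if any
        self._seq = count()

    def enqueue(self, master, request):          # request = (qos_id, prio2)
        self.queues[master].append(request)
        if request[1] > PRESET:                  # target request inserted:
            self.boost[master] = request[1]      # promote the whole queue

    def drain_to_bus(self):
        """Emit all queued requests to the SoC bus, highest priority first."""
        heap = []
        for m, q in self.queues.items():
            for qos_id, prio2 in q:
                eff = self.boost[m] if self.boost[m] is not None else prio2
                heapq.heappush(heap, (-eff, next(self._seq), m, qos_id, prio2))
            q.clear()
            self.boost[m] = None
        while heap:
            _, _, m, qos_id, prio2 = heapq.heappop(heap)
            # the SoC bus later restores prio2 (the second QoS priority)
            yield {"master": m, "qos_id": qos_id, "restore_to": prio2}

s = SubScheduler(masters=["m0", "m1"])
s.enqueue("m0", (3, 2)); s.enqueue("m0", (1, 7)); s.enqueue("m1", (4, 4))
print(list(s.drain_to_bus()))    # m0's whole queue drains at priority 7
```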
In a possible implementation, the method further includes: receiving, through the SoC bus, one or more memory access requests in the target task queue sent by the sub-scheduler, where the one or more memory access requests include the memory access request; and restoring the third QoS priority corresponding to the QoS IDs carried in the one or more memory access requests in the target task queue to the corresponding second QoS priority.
In a possible implementation, the method further includes: scheduling, through the SoC bus and based on the restored second QoS priorities of the one or more memory access requests in the target task queue, the one or more memory access requests in the target task queue to the corresponding memory controllers among the N memory controllers.
In a possible implementation, the AI SoC further includes an advanced memory access agent MATA; the scheduling, through the SoC bus, of the one or more memory access requests in the target task queue to the corresponding memory controllers among the N memory controllers includes: sending, through the SoC bus, the one or more memory access requests in the target task queue to the MATA, and scheduling, through the MATA, the one or more memory access requests to the corresponding memory controllers among the N memory controllers. Optionally, in another possible implementation, the AI SoC further includes an advanced memory access agent MATA; the SoC bus is specifically configured to: send the memory access requests respectively sent by the S Masters to the MATA, and schedule, through the MATA, the memory access requests respectively sent by the S Masters to the corresponding memory controllers among the N memory controllers, where the memory access requests respectively sent by the S Masters include the memory access request.
In a possible implementation, the method further includes: receiving, through the MATA, the memory access request and determining the second QoS priority corresponding to the QoS ID carried in the request; and determining the first QoS priority corresponding to the QoS ID based on the second QoS priority corresponding to the QoS ID, in combination with historical memory bandwidth statistics corresponding to the QoS ID and memory access policy control parameters corresponding to the QoS ID, where the memory access policy control parameters include one or more of the maximum bandwidth allowed for access requests, the minimum bandwidth, and the access priority.
In a possible implementation, the method further includes: presetting, through the MATA, the memory access policy control parameters corresponding to each QoS ID, and collecting and recording the historical memory bandwidth corresponding to each QoS ID; and updating and optimizing the memory access policy control parameters corresponding to each QoS ID according to the real-time monitoring information on the memory access performance of the AI system.
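As an illustration only, the following is a minimal Python sketch of one plausible MATA decision rule: start from the request's second QoS priority, demote traffic that has exceeded its maximum bandwidth, and promote traffic starved below its minimum. The policy table, thresholds and step sizes are invented for this example.

```python
POLICY = {3: {"max_bw": 100.0, "min_bw": 20.0, "prio": 4}}   # per QoS ID
HISTORY = {3: 120.0}                   # measured bandwidth in GB/s, per QoS ID

def mata_first_priority(qos_id, second_prio):
    pol, used = POLICY[qos_id], HISTORY[qos_id]
    prio = max(second_prio, pol["prio"])          # policy priority floor
    if used > pol["max_bw"]:
        prio = max(0, prio - 2)                   # over budget: demote
    elif used < pol["min_bw"]:
        prio = min(7, prio + 2)                   # starved: promote
    return prio                                   # the first QoS priority

print(mata_first_priority(3, second_prio=5))      # 120 > 100, demoted to 3
```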
In a possible implementation, the method further includes: carrying, through the MATA, the first QoS priority in the memory access request, and scheduling the memory access request to the target memory controller based on the first QoS priority. Optionally, in another possible implementation, the AI SoC further includes a MATA; the method further includes: carrying, through the MATA, the determined first QoS priority in the memory access request, and scheduling the memory access request to the target memory controller based on the first QoS priority.
In a possible implementation, the performing, by the target memory controller, memory access QoS control on the memory access request based on the first QoS priority includes: performing, by the target memory controller, memory access QoS control on the memory access request based on the first QoS priority corresponding to the QoS ID and in combination with the memory access service conditions of the target memory controller, where the memory access service conditions include memory access timing requirements or memory bandwidth bus utilization.
In a possible implementation, the method further includes: when the volume of memory access requests received by the target memory controller is greater than a preset threshold, broadcasting, by the target memory controller, a backpressure indication to the M subsystems, where the backpressure indication instructs one or more of the M subsystems to delay, reduce, or stop sending memory access requests.
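As an illustration only, the following is a minimal Python sketch of this threshold-based backpressure: when the controller's pending queue exceeds a preset depth it broadcasts a slow-down hint to all subsystems. The threshold and the action taken are invented for this example.

```python
THRESHOLD = 64   # preset pending-queue depth that trips backpressure

class MemoryController:
    def __init__(self, subsystems):
        self.pending = []
        self.subsystems = subsystems

    def receive(self, request):
        self.pending.append(request)
        if len(self.pending) > THRESHOLD:
            for sub in self.subsystems:          # broadcast to all M subsystems
                sub.on_backpressure("delay")     # delay / reduce / stop

class Subsystem:
    def on_backpressure(self, action):
        print(f"backpressure: {action} sending memory access requests")

mc = MemoryController([Subsystem()])
for i in range(65):
    mc.receive({"id": i})                        # the 65th request trips it
```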
In a possible implementation, the AI system further includes a host; the method further includes: receiving, by the host, a task to be executed, and splitting it into one or more computing tasks to be executed; identifying the service flow types of the split one or more computing tasks according to a preset service flow label table, where the preset service flow label table includes predefined mappings between the service flow types to which computing tasks belong and QoS IDs; and attaching, according to the identification result, the corresponding QoS ID to each of the one or more computing tasks to be executed.
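A minimal Python sketch of this host-side labelling step, with an invented label table: a preset mapping from service flow types to QoS IDs tags every split-out computing task.

```python
FLOW_LABEL_TABLE = {          # service flow type -> QoS ID (illustrative)
    "collective_comm": 0,     # e.g. HCCL gradient synchronization
    "ai_compute": 1,          # matrix/vector kernels
    "preprocess": 2,          # DVPP/ISP data ingest
}

def tag_tasks(tasks):
    """Attach the QoS ID matching each task's recognized flow type."""
    for t in tasks:
        t["qos_id"] = FLOW_LABEL_TABLE[t["flow_type"]]
    return tasks

job = [{"name": "allreduce", "flow_type": "collective_comm"},
       {"name": "conv_fwd", "flow_type": "ai_compute"}]
print(tag_tasks(job))
```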
In a possible implementation, the AI SoC further includes a system scheduler; the method further includes: sending, by the host, the one or more computing tasks carrying the corresponding QoS IDs to the system scheduler.
In a possible implementation, the method further includes: configuring in advance, by the host or by the target Master, the corresponding second QoS priority for the QoS ID in the computing task, where the second QoS priority is the initial priority corresponding to the QoS ID.
In a possible implementation, the method further includes: updating and optimizing, by the host, the second QoS priority corresponding to each QoS ID according to the real-time monitoring information on the memory access performance of the AI system.
In a possible implementation, the method further includes: receiving, by the system scheduler, the one or more computing tasks to be executed sent by the host, where each computing task to be executed further carries a task descriptor describing the type of the computing task; selecting, according to the task descriptor carried in each computing task to be executed, a matching subsystem from the M subsystems for each computing task, and selecting a matching Master from the one or more Masters in the matching subsystem; and scheduling each computing task to be executed onto the matching Master in the matching subsystem.
In a possible implementation, when the AI system is applied to a virtualization scenario, the AI system includes multiple virtual machines, where each of the multiple virtual machines corresponds to one or more processes, and one process includes one or more computing tasks; the one or more processes run on one or more Masters of at least one of the M subsystems; the method further includes: assigning, by the system scheduler, a VM ID to each virtual machine, where the page tables of the one or more processes corresponding to each virtual machine all share the VM ID of the corresponding virtual machine.
In a possible implementation, when the system is in a virtualization scenario, the target subsystem further includes a system memory management unit SMMU; the method further includes: sending, by the target Master, the memory access request of the computing task to the SMMU, and updating, through the SMMU, the QoS ID carried in the memory access request of the computing task; receiving, by the SMMU, the memory access request of the computing task sent by the target Master; determining, according to the virtual address and the substream identifier SSID in the memory access request, the target process to which the computing task belongs; and determining, according to the page table of the target process, the VM ID of the target virtual machine corresponding to the target process, and replacing the QoS ID carried in the memory access request of the computing task with the VM ID of the target virtual machine.
In a possible implementation, the AI SoC further includes an L2 Cache; the method further includes: receiving, through the L2 cache, the memory access requests of the computing tasks, and accessing the corresponding storage regions in the L2 Cache according to the QoS ID carried in each memory access request, where memory access requests carrying different QoS IDs correspond to different storage regions in the L2 Cache.
It should be noted that, for the specific flow of the memory access control method described in the embodiments of the present application, reference may be made to the related descriptions of the application embodiments in FIG. 1A to FIG. 6B above, which are not repeated here.
An embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium may store a program which, when executed by an AI system, includes some or all of the steps of any one of the methods recorded in the above method embodiments.
An embodiment of the present application further provides a computer program, where the computer program includes instructions which, when the computer program is executed by an AI system, enable the AI system to perform some or all of the steps of any one of the memory access control methods.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, reference may be made to the related descriptions of other embodiments.
It should be noted that, for brevity of description, the foregoing method embodiments are all expressed as a series of action combinations; however, those skilled in the art should know that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units is merely a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses or units, and may be electrical or in other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like, and specifically may be a processor in a computer device) to perform all or some of the steps of the methods of the embodiments of the present application. The aforementioned storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a read-only memory (ROM), or a random access memory (RAM).
The above embodiments are merely intended to describe the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements to some of the technical features therein, and such modifications or replacements do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (37)

  1. An artificial intelligence AI system, comprising an AI system-on-chip SoC, wherein the AI SoC comprises M subsystems and N memory controllers, the M subsystems and the N memory controllers being interconnected through an SoC bus; a target subsystem comprises S processing nodes, the target subsystem being any one of the M subsystems, and M, N and S are all integers greater than or equal to 1; wherein
    a target processing node among the S processing nodes is configured to:
    receive a computing task to be executed, the computing task carrying a quality of service identifier QoS ID; the target processing node being any one of the S processing nodes; the QoS ID indicating the category to which the computing task belongs;
    generate a memory access request of the computing task, the memory access request carrying the QoS ID;
    send the memory access request to a target memory controller among the N memory controllers;
    the target memory controller is configured to:
    receive the memory access request and determine a first quality of service QoS priority corresponding to the QoS ID;
    perform memory access QoS control on the memory access request based on the first QoS priority.
  2. The AI system according to claim 1, wherein the computing task further carries a second QoS priority corresponding to the QoS ID, the second QoS priority being the initial QoS priority corresponding to the QoS ID in the computing task.
  3. The AI system according to claim 1 or 2, wherein the target subsystem further comprises a sub-scheduler;
    the target processing node is specifically configured to: send the memory access request to the sub-scheduler, which schedules it to the target memory controller among the N memory controllers;
    the sub-scheduler is configured to:
    receive the memory access requests respectively sent by the S processing nodes in the target subsystem;
    schedule the memory access requests respectively sent by the S processing nodes onto the SoC bus according to the second QoS priorities corresponding to the QoS IDs carried in those memory access requests, the second QoS priority being the initial QoS priority of the corresponding QoS ID; wherein the second QoS priority indicates the priority with which the corresponding memory access request is scheduled onto the SoC bus.
  4. The AI system according to claim 3, wherein the sub-scheduler is specifically configured to:
    establish a task queue for each of the S processing nodes, each task queue comprising the memory access requests sent by the corresponding processing node; wherein the target processing node corresponds to a target task queue;
    when a target memory access request is currently inserted into the target task queue, raise the second QoS priorities corresponding to the QoS IDs carried in all memory access requests in the target task queue to a third QoS priority, the target memory access request being a memory access request whose carried QoS ID corresponds to a second QoS priority exceeding a preset priority;
    send the memory access requests in the task queues of the S processing nodes to the SoC bus in sequence according to the second QoS priority or the third QoS priority corresponding to the QoS IDs carried in those memory access requests.
  5. The AI system according to claim 4, wherein the SoC bus is configured to:
    receive one or more memory access requests in the target task queue sent by the sub-scheduler, the one or more memory access requests comprising the memory access request;
    restore the third QoS priority corresponding to the QoS IDs carried in the one or more memory access requests in the target task queue to the corresponding second QoS priority.
  6. The AI system according to claim 5, wherein the SoC bus is further configured to:
    schedule, based on the restored second QoS priorities of the one or more memory access requests in the target task queue, the one or more memory access requests in the target task queue to the corresponding memory controllers among the N memory controllers.
  7. The AI system according to any one of claims 1 to 5, wherein the AI SoC further comprises an advanced memory access agent MATA; the SoC bus is configured to:
    send the memory access requests respectively sent by the S processing nodes to the MATA, which schedules the memory access requests respectively sent by the S processing nodes to the corresponding memory controllers among the N memory controllers, the memory access requests respectively sent by the S processing nodes comprising the memory access request.
  8. The AI system according to claim 7, wherein the MATA is configured to:
    receive the memory access request and determine the second QoS priority corresponding to the QoS ID carried in the memory access request;
    determine the first QoS priority corresponding to the QoS ID based on the second QoS priority corresponding to the QoS ID, in combination with historical memory bandwidth statistics corresponding to the QoS ID and memory access policy control parameters corresponding to the QoS ID, the memory access policy control parameters comprising one or more of a maximum bandwidth allowed for access requests, a minimum bandwidth, and an access priority.
  9. The AI system according to claim 8, wherein the MATA is further configured to:
    preset the memory access policy control parameters corresponding to each QoS ID, and collect and record the historical memory bandwidth corresponding to each QoS ID;
    update and optimize the memory access policy control parameters corresponding to each QoS ID according to real-time monitoring information on the memory access performance of the AI system.
  10. The AI system according to any one of claims 1 to 9, wherein the AI SoC further comprises a MATA; the MATA is configured to:
    carry the determined first QoS priority in the memory access request, and schedule the memory access request to the target memory controller based on the first QoS priority.
  11. The AI system according to claim 10, wherein the target memory controller is specifically configured to:
    perform memory access QoS control on the memory access request based on the first QoS priority corresponding to the QoS ID and in combination with the memory access service conditions of the target memory controller, the memory access service conditions comprising memory access timing requirements or memory bandwidth bus utilization.
  12. The AI system according to any one of claims 1 to 11, wherein the target memory controller is further configured to:
    when the volume of memory access requests received by the target memory controller is greater than a preset threshold, broadcast a backpressure indication to the M subsystems, the backpressure indication instructing one or more of the M subsystems to delay, reduce, or stop sending memory access requests.
  13. The AI system according to any one of claims 1 to 12, wherein the AI system further comprises a host; the host is configured to:
    receive a task to be executed and split the task to be executed into one or more computing tasks to be executed;
    identify the service flow types of the split one or more computing tasks to be executed according to a preset service flow label table, the preset service flow label table comprising predefined mappings between the service flow types to which computing tasks belong and QoS IDs;
    attach, according to the identification result, the corresponding QoS ID to each of the one or more computing tasks to be executed.
  14. The AI system according to claim 13, wherein the AI SoC further comprises a system scheduler; the host is further configured to:
    send the one or more computing tasks carrying the corresponding QoS IDs to the system scheduler.
  15. The AI system according to claim 14, wherein the host or the target processing node is further configured to:
    configure in advance the corresponding second QoS priority for the QoS ID in the computing task, the second QoS priority being the initial priority corresponding to the QoS ID.
  16. The AI system according to claim 15, wherein the host is further configured to:
    update and optimize the second QoS priority corresponding to each QoS ID according to real-time monitoring information on the memory access performance of the AI system.
  17. The AI system according to any one of claims 14 to 16, wherein the system scheduler is configured to:
    receive the one or more computing tasks to be executed sent by the host; wherein each computing task to be executed further carries a task descriptor describing the type of the computing task;
    select, according to the task descriptor carried in each computing task to be executed, a matching subsystem from the M subsystems for each computing task to be executed, and select a matching processing node from the one or more processing nodes in the matching subsystem;
    schedule each computing task to be executed onto the matching processing node in the matching subsystem.
  18. The AI system according to any one of claims 1 to 17, wherein, when the AI system is applied to a virtualization scenario, the AI system comprises multiple virtual machines, each of the multiple virtual machines corresponding to one or more processes, one process comprising one or more computing tasks; the one or more processes run on one or more processing nodes of at least one of the M subsystems; the system scheduler is further configured to:
    assign a VM ID to each virtual machine; wherein the page tables of the one or more processes corresponding to each virtual machine all share the VM ID of the corresponding virtual machine.
  19. The AI system according to any one of claims 1 to 18, wherein, when the system is in a virtualization scenario, the target subsystem further comprises a system memory management unit SMMU;
    the target processing node is further configured to: send the memory access request of the computing task to the SMMU, which updates the QoS ID carried in the memory access request of the computing task;
    the SMMU is configured to:
    receive the memory access request of the computing task sent by the target processing node;
    determine, according to the virtual address and the substream identifier SSID in the memory access request, the target process to which the computing task belongs;
    determine, according to the page table of the target process, the VM ID of the target virtual machine corresponding to the target process, and replace the QoS ID carried in the memory access request of the computing task with the VM ID of the target virtual machine.
  20. The AI system according to any one of claims 18 to 19, wherein the AI SoC further comprises an L2 Cache; the L2 cache is configured to:
    receive the memory access requests of the computing tasks, and access the corresponding storage regions in the L2 Cache according to the QoS ID carried in each memory access request, wherein memory access requests carrying different QoS IDs correspond to different storage regions in the L2 Cache.
  21. A memory access control method, applied to an artificial intelligence AI system, wherein the AI system comprises an AI system-on-chip SoC; the AI SoC comprises M subsystems and N memory controllers, the M subsystems and the N memory controllers being interconnected through an SoC bus; a target subsystem comprises S processing nodes, the target subsystem being any one of the M subsystems, and M, N and S are all integers greater than or equal to 1; the method comprising:
    receiving, by a target processing node among the S processing nodes, a computing task to be executed, the computing task carrying a quality of service identifier QoS ID, the target processing node being any one of the S processing nodes, and the QoS ID indicating the category to which the computing task belongs; generating a memory access request of the computing task, the memory access request carrying the QoS ID; and sending the memory access request to a target memory controller among the N memory controllers;
    receiving, by the target memory controller, the memory access request, determining a first quality of service QoS priority corresponding to the QoS ID, and performing memory access QoS control on the memory access request based on the first QoS priority.
  22. The method according to claim 21, wherein the computing task further carries a second QoS priority corresponding to the QoS ID, the second QoS priority being the initial QoS priority corresponding to the QoS ID in the computing task.
  23. The method according to claim 21 or 22, wherein the target subsystem further comprises a sub-scheduler;
    the sending, by the target processing node, of the memory access request to the target memory controller among the N memory controllers comprises: sending, by the target processing node, the memory access request to the sub-scheduler, which schedules it to the target memory controller among the N memory controllers;
    the method further comprises: receiving, by the sub-scheduler, the memory access requests respectively sent by the S processing nodes in the target subsystem; and scheduling the memory access requests respectively sent by the S processing nodes onto the SoC bus according to the second QoS priorities corresponding to the QoS IDs carried in those memory access requests, the second QoS priority being the initial QoS priority of the corresponding QoS ID, wherein the second QoS priority indicates the priority with which the corresponding memory access request is scheduled onto the SoC bus.
  24. The method according to claim 23, wherein the scheduling, by the sub-scheduler, of the memory access requests respectively sent by the S processing nodes onto the SoC bus according to the second QoS priorities corresponding to the QoS IDs carried in those memory access requests comprises:
    establishing, by the sub-scheduler, a task queue for each of the S processing nodes, each task queue comprising the memory access requests sent by the corresponding processing node, wherein the target processing node corresponds to a target task queue; when a target memory access request is currently inserted into the target task queue, raising the second QoS priorities corresponding to the QoS IDs carried in all memory access requests in the target task queue to a third QoS priority, the target memory access request being a memory access request whose carried QoS ID corresponds to a second QoS priority exceeding a preset priority; and sending the memory access requests in the task queues of the S processing nodes to the SoC bus in sequence according to the second QoS priority or the third QoS priority corresponding to the QoS IDs carried in those memory access requests.
  25. The method according to claim 24, wherein the method further comprises:
    receiving, through the SoC bus, one or more memory access requests in the target task queue sent by the sub-scheduler, the one or more memory access requests comprising the memory access request; and restoring the third QoS priority corresponding to the QoS IDs carried in the one or more memory access requests in the target task queue to the corresponding second QoS priority.
  26. The method according to claim 25, wherein the method further comprises:
    scheduling, through the SoC bus and based on the restored second QoS priorities of the one or more memory access requests in the target task queue, the one or more memory access requests in the target task queue to the corresponding memory controllers among the N memory controllers.
  27. The method according to any one of claims 21 to 25, wherein the AI SoC further comprises an advanced memory access agent MATA; the method further comprises:
    sending, through the SoC bus, the memory access requests respectively sent by the S processing nodes to the MATA, which schedules the memory access requests respectively sent by the S processing nodes to the corresponding memory controllers among the N memory controllers, the memory access requests respectively sent by the S processing nodes comprising the memory access request.
  28. The method according to claim 27, wherein the method further comprises:
    receiving, through the MATA, the memory access request and determining the second QoS priority corresponding to the QoS ID carried in the memory access request; and determining the first QoS priority corresponding to the QoS ID based on the second QoS priority corresponding to the QoS ID, in combination with historical memory bandwidth statistics corresponding to the QoS ID and memory access policy control parameters corresponding to the QoS ID, the memory access policy control parameters comprising one or more of a maximum bandwidth allowed for access requests, a minimum bandwidth, and an access priority.
  29. The method according to any one of claims 21 to 28, wherein the AI SoC further comprises a MATA; the method further comprises:
    carrying, through the MATA, the determined first QoS priority in the memory access request, and scheduling the memory access request to the target memory controller based on the first QoS priority.
  30. The method according to claim 29, wherein the performing, by the target memory controller, memory access QoS control on the memory access request based on the first QoS priority comprises:
    performing, by the target memory controller, memory access QoS control on the memory access request based on the first QoS priority corresponding to the QoS ID and in combination with the memory access service conditions of the target memory controller, the memory access service conditions comprising memory access timing requirements or memory bandwidth bus utilization.
  31. The method according to any one of claims 21 to 30, wherein the AI system further comprises a host; the method further comprises:
    receiving, by the host, a task to be executed, and splitting the task to be executed into one or more computing tasks to be executed; identifying the service flow types of the split one or more computing tasks to be executed according to a preset service flow label table, the preset service flow label table comprising predefined mappings between the service flow types to which computing tasks belong and QoS IDs; and attaching, according to the identification result, the corresponding QoS ID to each of the one or more computing tasks to be executed.
  32. The method according to claim 31, wherein the AI SoC further comprises a system scheduler; the method further comprises:
    sending, by the host, the one or more computing tasks carrying the corresponding QoS IDs to the system scheduler;
    receiving, by the system scheduler, the one or more computing tasks to be executed sent by the host, wherein each computing task to be executed further carries a task descriptor describing the type of the computing task; selecting, according to the task descriptor carried in each computing task to be executed, a matching subsystem from the M subsystems for each computing task to be executed, and selecting a matching processing node from the one or more processing nodes in the matching subsystem; and scheduling each computing task to be executed onto the matching processing node in the matching subsystem.
  33. The method according to any one of claims 21 to 32, wherein, when the AI system is applied to a virtualization scenario, the AI system comprises multiple virtual machines, each of the multiple virtual machines corresponding to one or more processes, one process comprising one or more computing tasks; the one or more processes run on one or more processing nodes of at least one of the M subsystems; the method further comprises:
    assigning, by the system scheduler, a VM ID to each virtual machine, wherein the page tables of the one or more processes corresponding to each virtual machine all share the VM ID of the corresponding virtual machine.
  34. The method according to any one of claims 21 to 33, wherein, when the system is in a virtualization scenario, the target subsystem further comprises a system memory management unit SMMU; the method further comprises:
    sending, by the target processing node, the memory access request of the computing task to the SMMU, which updates the QoS ID carried in the memory access request of the computing task;
    receiving, by the SMMU, the memory access request of the computing task sent by the target processing node; determining, according to the virtual address and the substream identifier SSID in the memory access request, the target process to which the computing task belongs; and determining, according to the page table of the target process, the VM ID of the target virtual machine corresponding to the target process, and replacing the QoS ID carried in the memory access request of the computing task with the VM ID of the target virtual machine.
  35. The method according to any one of claims 33 to 34, wherein the AI SoC further comprises an L2 Cache; the method further comprises:
    receiving, through the L2 cache, the memory access requests of the computing tasks, and accessing the corresponding storage regions in the L2 Cache according to the QoS ID carried in each memory access request, wherein memory access requests carrying different QoS IDs correspond to different storage regions in the L2 Cache.
  36. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a multi-core processor, implements the method according to any one of claims 21 to 35.
  37. A computer program, wherein the computer program comprises instructions which, when the computer program is executed by a multi-core processor, cause the multi-core processor to perform the method according to any one of claims 21 to 35.
PCT/CN2022/078504 2022-02-28 2022-02-28 AI system, memory access control method, and related device WO2023159652A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/078504 WO2023159652A1 (zh) AI system, memory access control method, and related device

Publications (1)

Publication Number Publication Date
WO2023159652A1 true WO2023159652A1 (zh) 2023-08-31

Family

ID=87764530

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/078504 WO2023159652A1 (zh) AI system, memory access control method, and related device

Country Status (1)

Country Link
WO (1) WO2023159652A1 (zh)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103150216A (zh) * 2013-02-27 2013-06-12 东南大学 SoC-integrated multi-port DDR2/3 scheduler and scheduling method
CN106330494A (zh) * 2015-06-23 2017-01-11 大唐半导体设计有限公司 SoC resource arbitration method and apparatus
US20190245924A1 (en) * 2018-02-06 2019-08-08 Alibaba Group Holding Limited Three-stage cost-efficient disaggregation for high-performance computation, high-capacity storage with online expansion flexibility
CN111813536A (zh) * 2019-04-11 2020-10-23 华为技术有限公司 Task processing method, apparatus, terminal, and computer-readable storage medium
CN112236755A (zh) * 2018-09-30 2021-01-15 华为技术有限公司 Memory access method and apparatus
CN113515473A (zh) * 2020-04-09 2021-10-19 珠海全志科技股份有限公司 QoS control method, bus system, computing apparatus, and storage medium
CN113568734A (zh) * 2020-04-29 2021-10-29 安徽寒武纪信息科技有限公司 Virtualization method and system based on multi-core processor, multi-core processor, and electronic device



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22927933

Country of ref document: EP

Kind code of ref document: A1