WO2022205224A1 - Synchronization method and apparatus - Google Patents

Synchronization method and apparatus

Info

Publication number
WO2022205224A1
WO2022205224A1 · PCT/CN2021/084747 · CN2021084747W
Authority
WO
WIPO (PCT)
Prior art keywords
synchronization
processor
value
register
event
Prior art date
Application number
PCT/CN2021/084747
Other languages
English (en)
French (fr)
Inventor
朱湘毅
林灏勋
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority to KR1020237035925A priority Critical patent/KR20230157503A/ko
Priority to EP21933885.2A priority patent/EP4296906A4/en
Priority to CN202180001205.XA priority patent/CN113227975B/zh
Priority to PCT/CN2021/084747 priority patent/WO2022205224A1/zh
Publication of WO2022205224A1 publication Critical patent/WO2022205224A1/zh
Priority to US18/477,117 priority patent/US20240028423A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/542Event management; Broadcasting; Multicasting; Notifications
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/52Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17306Intercommunication techniques
    • G06F15/17325Synchronisation; Hardware support therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5066Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/52Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • G06F9/522Barrier synchronisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Definitions

  • the embodiments of the present application relate to the field of artificial intelligence, and in particular, to a synchronization method and device.
  • artificial intelligence (AI) scenarios often demand more computing power than a single AI accelerator (for example, a neural-network processing unit, NPU) or even a single AI server containing multiple AI accelerators can provide, so multiple AI servers must be combined into a cluster to supply the computing power AI scenarios require.
  • when multiple AI servers form a cluster for AI training, a reasonable synchronization mechanism is essential to reduce synchronous transmission and synchronization waiting time within an AI accelerator, between different AI accelerators within an AI server, and between AI servers.
  • Embodiments of the present application provide a synchronization method and apparatus, which can realize synchronization within one AI accelerator, among different AI accelerators within one AI server, and among AI servers.
  • a first aspect of the embodiments of the present application provides a synchronization method. The method includes: a first processor creates a first synchronization object for a first synchronization event; the first synchronization object includes an identifier of a first synchronization register, whose value is either a first value, indicating that the first synchronization event has not occurred, or a second value, indicating that the first synchronization event has occurred; a second processor determines, based on the value of the first synchronization register, whether the first synchronization event has occurred.
  • the above-mentioned first processor includes a first central processing unit CPU, and the second processor includes a first neural network processor NPU.
  • the first processor may be a CPU in the AI server, and the second processor may be an AI accelerator in the AI server. This CPU and AI accelerator reside in the same AI server.
  • the second processor is an AI accelerator waiting for the first synchronization event to occur.
  • the first synchronization event may occur in one NPU, or between different NPUs in one AI server, or between different AI servers.
  • the AI accelerator can determine, based on the value of a synchronization register, whether the synchronization event corresponding to that register has occurred, so that synchronization can be realized within one AI accelerator, between different AI accelerators within one AI server, and between AI servers.
  • the above-mentioned first processor creates the first synchronization object for the first synchronization event as follows: the first processor calls a first application program interface (API) to allocate, among the plurality of synchronization registers included in the second processor, the first synchronization register for the first synchronization event, and stores the identifier of the first synchronization register in the first synchronization object.
  • the first API is used to create a synchronization object for the synchronization event.
  • the first API may be NotifyCreat(deviceID, notify), wherein the input deviceID is the ID of the AI accelerator, the output notify is the synchronization object, and the NotifyCreat interface is used to create the synchronization object.
  • the deviceID is the ID of the AI accelerator waiting for the synchronization event to occur.
  • the CPU allocates the first synchronization register for the first synchronization event among the plurality of synchronization registers included in the AI accelerator that waits for the event; therefore, once the value of the first synchronization register changes, that AI accelerator detects the change immediately and can quickly determine whether the first synchronization event has occurred.
  • the API interface provided by the solution of the embodiment of the present application is relatively simple, and the synchronization overhead is small, so the efficiency of AI training can be improved.
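As a concrete illustration of the allocation step described above, the following C sketch models NotifyCreat as picking a free register from the accelerator's pool and initializing it to the first value. All names, sizes, and data structures here are hypothetical; the patent does not specify an implementation.

```c
#include <stdint.h>

#define NUM_SYNC_REGS 8   /* hypothetical size of the accelerator's register pool */
#define FIRST_VALUE   0   /* synchronization event has not occurred */
#define SECOND_VALUE  1   /* synchronization event has occurred */

typedef struct {
    int device_id;  /* ID of the AI accelerator that will wait for the event */
    int reg_id;     /* identifier of the allocated synchronization register */
} Notify;

static uint8_t sync_regs[NUM_SYNC_REGS];   /* simulated register values */
static uint8_t reg_in_use[NUM_SYNC_REGS];  /* simulated allocation bitmap */

/* Allocate a free synchronization register for a new synchronization event
 * and record its identifier in the synchronization object. */
int NotifyCreat(int device_id, Notify *notify) {
    for (int i = 0; i < NUM_SYNC_REGS; i++) {
        if (!reg_in_use[i]) {
            reg_in_use[i] = 1;
            sync_regs[i] = FIRST_VALUE;  /* event not yet occurred */
            notify->device_id = device_id;
            notify->reg_id = i;
            return 0;
        }
    }
    return -1;  /* no free synchronization register */
}
```

The pool size and the `Notify` layout are illustrative assumptions; the key point matching the text is that creation binds a register, initialized to the first value, to the synchronization object.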
  • the above method further includes: by calling a second API, the first processor sends to the second processor a waiting task corresponding to the first synchronization event; the waiting task is used to wait for the first synchronization event to occur and includes a first queue identifier (the identifier of the queue where the waiting task is located) and the identifier of the first synchronization register; the second processor receives the waiting task corresponding to the first synchronization event.
  • the CPU can issue a waiting task to the AI accelerator through a simple API and carry the identifier of the synchronization register in that task, so that the AI accelerator can determine from the value of the synchronization register whether the synchronization event has occurred; synchronization can thus be achieved within an AI accelerator, between different AI accelerators within an AI server, and between AI servers.
  • the second API is used to deliver the waiting task corresponding to the synchronization event.
  • the second API may be a NotifyWait(notify, stream) interface, which is used to wait for the synchronization event corresponding to the synchronization object to occur in the stream.
  • the second processor determines whether the first synchronization event has occurred based on the value of the first synchronization register as follows: if the value of the first synchronization register is the first value, the second processor determines that the first synchronization event has not occurred and continues to wait; once the value of the first synchronization register becomes the second value, the second processor determines that the first synchronization event has occurred and resets the value of the first synchronization register to the first value.
  • the AI accelerator can keep waiting while the first synchronization event has not occurred; once it has occurred, the accelerator resets the first synchronization register to the first value and continues with the next task. Synchronization can therefore be achieved within one AI accelerator, between different AI accelerators within one AI server, and between different AI servers.
  • because the first synchronization register is a synchronization register in the second processor, the controller of the second processor detects any modification of its value immediately; the second processor then determines that the first synchronization event has occurred and resets the value of the first synchronization register to the first value, so that the register can be used for the next synchronization operation.
  • the second processor determines whether the first synchronization event has occurred based on the value of the first synchronization register, further including: if the value of the first synchronization register is the second value, the second processor determines that the first synchronization event has occurred and resets the value of the first synchronization register to the first value.
  • when the second processor finds that the value of the first synchronization register is the second value, it determines that the first synchronization event has occurred and resets the value of the first synchronization register to the first value.
  • the second processor can then continue to perform subsequent tasks, thereby ensuring proper synchronization, enabling synchronization within one AI accelerator, among different AI accelerators within one AI server, and among different AI servers.
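The decision logic in the preceding paragraphs — poll while the register holds the first value, then reset it once the second value appears — can be sketched as a simple C routine. This is an illustrative model of the accelerator controller's behavior, not actual firmware; the poll counter is added only so the sketch is observable.

```c
#include <stdint.h>

#define FIRST_VALUE  0  /* synchronization event has not occurred */
#define SECOND_VALUE 1  /* synchronization event has occurred */

/* Poll the synchronization register until the second value appears,
 * then reset it to the first value so it can be reused.
 * Returns the number of polls performed (illustrative only). */
unsigned execute_wait_task(volatile uint8_t *sync_reg) {
    unsigned polls = 0;
    while (*sync_reg == FIRST_VALUE) {
        polls++;  /* first value: keep waiting for the event */
    }
    /* second value observed: the synchronization event has occurred */
    *sync_reg = FIRST_VALUE;  /* reset for the next synchronization */
    return polls;
}
```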
  • the above method further includes: the first processor sends a recording task corresponding to the first synchronization event to the second processor by calling a third API.
  • the recording task corresponding to the first synchronization event is used to indicate that the first synchronization event has occurred and includes a second queue identifier (the identifier of the queue where the recording task is located) and the identifier of the first synchronization register; the second processor receives the recording task and, based on the identifier of the first synchronization register, resets the value of the first synchronization register to the second value.
  • the CPU can issue a recording task to the AI accelerator (the second processor) through a simple API to indicate that a synchronization event has occurred, carrying the identifier of the synchronization register in the task, so that the AI accelerator can write the second value into the register so identified; the value of the synchronization register therefore tracks the occurrence state of the first synchronization event. Because the first synchronization register is a synchronization register in the second processor, the controller of the second processor detects the change of its value immediately, determines that the first synchronization event has occurred, and can continue with subsequent tasks, ensuring proper synchronization within the second processor.
  • the third API is used to deliver the recording task corresponding to the synchronization event.
  • the third API may be a NotifyRecord(notify, stream) interface, which is used to set the synchronization event occurrence corresponding to the synchronization object in the stream.
  • the above-mentioned second processor performs both the Wait task and the Record task.
  • the Wait task and the Record task can be tasks in two streams respectively.
  • alternatively, the second processor executes the Wait task and the third processor executes the Record task.
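One way to picture the Wait/Record pairing across two streams is the following C sketch, which stands in for the two task queues with two threads sharing one simulated synchronization register. This is a hypothetical host-side model (real streams execute on the accelerator's controller), using C11 atomics for the shared register.

```c
#include <pthread.h>
#include <stdatomic.h>

#define FIRST_VALUE  0  /* event has not occurred */
#define SECOND_VALUE 1  /* event has occurred */

static atomic_int sync_reg = FIRST_VALUE;  /* simulated synchronization register */

/* Stream A: after the producer's work finishes, its Record task
 * writes the second value to mark the event as having occurred. */
static void *record_stream(void *arg) {
    (void)arg;
    atomic_store(&sync_reg, SECOND_VALUE);  /* NotifyRecord */
    return NULL;
}

/* Stream B: the Wait task polls until the second value appears,
 * then resets the register to the first value for reuse. */
static void *wait_stream(void *arg) {
    (void)arg;
    while (atomic_load(&sync_reg) == FIRST_VALUE) { }  /* NotifyWait */
    atomic_store(&sync_reg, FIRST_VALUE);  /* reset for the next event */
    return NULL;
}

/* Run both streams concurrently; returns the final register value. */
int run_streams(void) {
    pthread_t a, b;
    pthread_create(&b, NULL, wait_stream, NULL);
    pthread_create(&a, NULL, record_stream, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return atomic_load(&sync_reg);
}
```

Whatever the interleaving, the waiter can only observe the second value after the recorder stores it, so the register always ends back at the first value, ready for the next synchronization.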
  • the above method further includes: the first processor sends a recording task corresponding to the first synchronization event to a third processor by calling the third API; the recording task is used to indicate that the first synchronization event has occurred and includes the second queue identifier (the identifier of the queue where the recording task is located) and the identifier of the first synchronization register; the third processor includes a second NPU; the third processor receives the recording task and, based on the identifier of the first synchronization register, resets the value of the first synchronization register to the second value.
  • the third processor and the foregoing second processor may be different NPUs in an AI server.
  • the CPU can issue a recording task to the AI accelerator (the third processor) through a simple API to indicate that a synchronization event has occurred, carrying the identifier of the synchronization register in the task, so that the AI accelerator can write the second value into the register so identified; the value of the synchronization register therefore tracks the occurrence state of the first synchronization event. Because the first synchronization register is a synchronization register in the second processor, the controller of the second processor detects the change of its value immediately, determines that the first synchronization event has occurred, and can continue with subsequent tasks, ensuring proper synchronization between the second processor and the third processor within the AI server.
  • the synchronization overhead is the overhead of the controller of the AI accelerator writing the register through the bus, and the synchronization overhead is relatively small.
  • the synchronization overhead is less than 50ns for synchronization within one NPU, and the synchronization overhead is less than 1us for synchronization between different NPUs within an AI server.
  • this solution provides a simple API interface, similar to the semaphore interface of general-purpose OS, which can greatly facilitate developers to use AI accelerators.
  • the above-mentioned method further includes: the first processor sets the name of the first synchronization object to a preset name by calling a fourth API of a first application program; the first processor then obtains the identifier of the first synchronization register corresponding to the preset name by calling a fifth API of a second application program.
  • the synchronization event is an inter-process synchronization event
  • the synchronization objects of different processes can thus correspond to the same synchronization register, and then, by calling the second API and the third API, synchronization between processes can be achieved.
  • the fourth API is used to set the global name of the synchronization object.
  • the fourth API may be IpcSetNotifyName(notify, name), which is used to set the global name of the synchronization object notify.
  • the fifth API is used to obtain the identifier of the register corresponding to the preset name.
  • the fifth API may be IpcOpenNotify(notify, name), which is used to open the synchronization object according to the global name name of the synchronization object notify.
  • the above-mentioned first synchronization event is a synchronization event between the first application program and the second application program, and the preset name is a name agreed on in advance by the first application program and the second application program.
  • the global name of the synchronization object is agreed on in advance by the different application programs, so that the synchronization objects of different processes correspond to the same synchronization register, thereby realizing synchronization between processes.
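The name-based lookup can be pictured with a small table mapping a pre-agreed global name to a register identifier. Everything below — the table size, the signatures, the simplification that both processes share one address space — is an illustrative assumption; the patent's IpcSetNotifyName/IpcOpenNotify would operate across real process boundaries.

```c
#include <string.h>

#define MAX_NAMED 4
#define NAME_LEN  32

struct named_notify {
    char name[NAME_LEN];  /* pre-agreed global name of the synchronization object */
    int  reg_id;          /* identifier of the underlying synchronization register */
};

static struct named_notify name_table[MAX_NAMED];
static int named_count = 0;

/* Fourth API (sketch): publish a synchronization object under a global name. */
int IpcSetNotifyName(int reg_id, const char *name) {
    if (named_count >= MAX_NAMED) return -1;
    strncpy(name_table[named_count].name, name, NAME_LEN - 1);
    name_table[named_count].name[NAME_LEN - 1] = '\0';
    name_table[named_count].reg_id = reg_id;
    named_count++;
    return 0;
}

/* Fifth API (sketch): look up the register identifier for a pre-agreed name,
 * so another process's synchronization object maps to the same register. */
int IpcOpenNotify(const char *name) {
    for (int i = 0; i < named_count; i++)
        if (strcmp(name_table[i].name, name) == 0)
            return name_table[i].reg_id;
    return -1;  /* unknown name */
}
```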
  • the first synchronization event can occur within one AI accelerator or between different AI accelerators within one AI server.
  • the above method further includes: the first processor obtains the virtual address of a second synchronization register by calling a sixth API; the second synchronization register is the register corresponding to a second synchronization event, and the different values of the second synchronization register indicate whether the second synchronization event has occurred; the first processor sends the virtual address of the second synchronization register to a fourth processor. The first processor and the fourth processor are processors in different AI servers, and the fourth processor includes a second CPU.
  • the first processor and the fourth processor may be the CPUs of the two AI servers, respectively.
  • the embodiment of the present application provides a simple API interface, which is similar to the semaphore interface of a general-purpose OS, which greatly facilitates developers to use the AI accelerator.
  • the sixth API is used to obtain the virtual address of the register corresponding to the synchronization object.
  • the sixth API may be NotifyGetAddr(notify, addr), wherein the input is the synchronization object notify, and the output is the virtual address of the synchronization register corresponding to the synchronization object notify.
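A minimal model of the address lookup: map a register identifier to the (virtual) address at which that register is visible to software, so the address can be handed to a peer for RDMA writes. The mapping below is invented for illustration; on real hardware the register file would be a mapped device region, not a C array.

```c
#include <stdint.h>

#define NUM_SYNC_REGS 8

static uint8_t sync_regs[NUM_SYNC_REGS];  /* simulated register file */

/* Sixth API (sketch): return the virtual address of the synchronization
 * register backing the given synchronization object, or a null pointer
 * for an out-of-range identifier. */
volatile uint8_t *NotifyGetAddr(int reg_id) {
    if (reg_id < 0 || reg_id >= NUM_SYNC_REGS) return 0;
    return &sync_regs[reg_id];
}
```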
  • the above method further includes: the first processor cancels the correspondence between the first synchronization register and the first synchronization event by calling a seventh API, and resets the value of the first synchronization register to the first value; the seventh API is used to release the first synchronization register.
  • the first synchronization register can thus be recycled: when synchronization is required later, the register can be allocated to another synchronization object, which improves synchronization-register utilization.
  • the seventh API is used to release the first synchronization register.
  • the seventh API may be NotifyDestroy(notify), and this interface may be used to destroy the synchronization object notify and release the synchronization register corresponding to the synchronization object.
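The release step can be sketched as the inverse of allocation: break the register/event binding and reset the register so it is clean for the next synchronization object. As before, the pool layout and the signature are hypothetical.

```c
#include <stdint.h>

#define NUM_SYNC_REGS 8
#define FIRST_VALUE   0  /* synchronization event has not occurred */

static uint8_t sync_regs[NUM_SYNC_REGS];   /* simulated register values */
static uint8_t reg_in_use[NUM_SYNC_REGS];  /* simulated allocation bitmap */

/* Seventh API (sketch): cancel the register/event correspondence and
 * return the register to the pool so later objects can reuse it. */
int NotifyDestroy(int reg_id) {
    if (reg_id < 0 || reg_id >= NUM_SYNC_REGS || !reg_in_use[reg_id])
        return -1;                    /* not an allocated register */
    reg_in_use[reg_id] = 0;           /* cancel the correspondence */
    sync_regs[reg_id] = FIRST_VALUE;  /* reset to the first value */
    return 0;
}
```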
  • the physical address of the first synchronization register is addressed in a global addressing manner.
  • the controller of each AI accelerator can know the physical addresses of the synchronization registers in the other AI accelerators of the AI server and can access those synchronization registers through their physical addresses, so synchronization within an AI accelerator and between AI accelerators can be realized.
  • a synchronization method includes: a fourth processor receives the virtual address of a second synchronization register from a first processor; the second synchronization register is the register corresponding to a second synchronization event, and its value is either a first value, indicating that the second synchronization event has not occurred, or a second value, indicating that it has occurred; the first processor and the fourth processor are processors in different AI servers, the first processor includes a first central processing unit (CPU), and the fourth processor includes a second CPU; the fourth processor sends to a fifth processor a remote direct memory access (RDMA) task corresponding to the second synchronization event; the RDMA task is used to indicate that the second synchronization event has occurred and includes the virtual address of the second synchronization register; the fifth processor receives the RDMA task and, based on the virtual address of the second synchronization register, resets the value of the second synchronization register to the second value.
  • the first processor and the fourth processor may be CPUs in different AI servers, respectively.
  • the fourth processor and the fifth processor are different processors in the same AI server.
  • the fourth processor is a CPU in that AI server.
  • the fifth processor is an NPU in that AI server.
  • the AI accelerator in one AI server obtains the virtual address of the synchronization register, so that when a synchronization event occurs it can write a value into the synchronization register at that virtual address through the RDMA device to indicate that the event has occurred; the AI accelerator in the other AI server then detects the changed register value immediately, determines that the synchronization event has occurred, and synchronization between different AI accelerators is realized.
  • the fourth processor may send the RDMA task corresponding to the second synchronization event to the fifth processor by calling the eighth application program interface API.
  • the eighth API is used to deliver the RDMA task corresponding to the synchronization event.
  • the eighth API may be RDMAsend(addr, 1), which is used to instruct to write the second value 1 to the virtual address addr.
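Modeled locally, the RDMA write described above is just a store of the second value through the virtual address obtained earlier. A real RDMAsend would post the write to an RDMA NIC targeting the remote server's register; this sketch only simulates the effect the remote side observes, and the signature is an assumption.

```c
#include <stdint.h>

#define SECOND_VALUE 1  /* synchronization event has occurred */

/* Eighth API (sketch): write `value` to the synchronization register mapped
 * at virtual address `addr`. Here the "remote" write is simulated by a plain
 * store; a real implementation would go through an RDMA device. */
void RDMAsend(volatile uint8_t *addr, uint8_t value) {
    *addr = value;
}
```

A caller would combine this with the address obtained from the sixth API: the local CPU receives the peer register's virtual address, and `RDMAsend(addr, 1)` marks the event as having occurred on the remote side.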
  • a synchronization method includes: a fourth processor receives the virtual address of a second synchronization register from a first processor, where the second synchronization register is the register corresponding to a second synchronization event, and its value is either a first value, indicating that the second synchronization event has not occurred, or a second value, indicating that it has occurred; the first processor and the fourth processor are processors in different AI servers; the first processor includes a first central processing unit (CPU), and the fourth processor includes a second CPU; the fourth processor, based on the virtual address of the second synchronization register, resets the value of the second synchronization register to the second value through a remote direct memory access (RDMA) device.
  • the first processor and the fourth processor may be the CPUs of the two AI servers, respectively.
  • the CPU in one AI server obtains the virtual address of the synchronization register, so that when a synchronization event occurs it can write a value into the synchronization register at that virtual address through RDMA to indicate that the event has occurred; the AI accelerator in the other AI server then detects the changed register value immediately, determines that the synchronization event has occurred, and synchronization between different AI accelerators is realized.
  • a fourth aspect of the embodiments of the present application provides a synchronization apparatus. The synchronization apparatus includes a second processor that contains a plurality of synchronization registers, each corresponding to one synchronization event; the value of each synchronization register is either a first value, indicating that the corresponding synchronization event has not occurred, or a second value, indicating that it has occurred; the second processor includes a first neural network processor (NPU).
  • the above synchronization apparatus further includes a first processor; the first processor is configured to create a first synchronization object for the first synchronization event; the first synchronization object includes the identifier of the first synchronization register, whose different values indicate whether the first synchronization event has occurred; the second processor is configured to determine, based on the value of the first synchronization register, whether the first synchronization event has occurred; the first processor includes a first central processing unit (CPU).
  • the above-mentioned first processor is specifically configured to call the first application program interface (API) to allocate, among the plurality of synchronization registers included in the second processor, the first synchronization register for the first synchronization event, and to store the identifier of the first synchronization register in the first synchronization object.
  • the above-mentioned first processor is further configured to send, by calling the second API, the waiting task corresponding to the first synchronization event to the second processor; the waiting task is used to wait for the first synchronization event to occur and includes the first queue identifier (the identifier of the queue where the waiting task is located) and the identifier of the first synchronization register; the second processor is further configured to receive the waiting task corresponding to the first synchronization event.
  • the second processor is specifically configured to determine that the first synchronization event has not occurred while the value of the first synchronization register is the first value, and to continue waiting; once the value of the first synchronization register is the second value, the second processor determines that the first synchronization event has occurred and resets the value of the first synchronization register to the first value.
  • the above-mentioned second processor is further configured to determine, when the value of the first synchronization register is the second value, that the first synchronization event has occurred, and to reset the value of the first synchronization register to the first value.
  • the above-mentioned first processor is further configured to send a recording task corresponding to the first synchronization event to the second processor by calling the third API; the recording task is used to indicate that the first synchronization event has occurred and includes the second queue identifier (the identifier of the queue where the recording task is located) and the identifier of the first synchronization register; the second processor is further configured to receive the recording task and, based on the identifier of the first synchronization register, reset the value of the first synchronization register to the second value.
  • the synchronization apparatus further includes a third processor, and the third processor includes a second NPU; the first processor is further configured to send the recording task corresponding to the first synchronization event to the third processor by calling the third API; the recording task is used to indicate that the first synchronization event has occurred and includes the second queue identifier (the identifier of the queue where the recording task is located) and the identifier of the first synchronization register; the third processor is configured to receive the recording task and, based on the identifier of the first synchronization register, reset the value of the first synchronization register to the second value.
  • the above-mentioned first synchronization event is an inter-process synchronization event; the above-mentioned first processor is further configured to set the name of the first synchronization object to a preset name by calling the fourth API of the first application program; the first processor is further configured to obtain the identifier of the first synchronization register corresponding to the preset name by calling the fifth API of the second application program.
  • the above-mentioned first synchronization event is a synchronization event between the above-mentioned first application program and the above-mentioned second application program, and the above-mentioned preset name is a name pre-agreed by the first application and the second application.
  • the above-mentioned first processor is further configured to obtain the virtual address of the second synchronization register by calling the sixth API;
  • the second synchronization register is a register corresponding to the second synchronization event, and the different values of the second synchronization register are used to indicate whether the second synchronization event occurs;
  • the first processor is also used to send the virtual address of the second synchronization register to the fourth processor;
  • the first processor and the fourth processor are processors in different AI servers, and the fourth processor includes the second CPU.
  • the above-mentioned first processor is further configured to release the correspondence relationship between the above-mentioned first synchronization register and the above-mentioned first synchronization event by calling the seventh API, and to reset the value of the first synchronization register to the first value; the seventh API is used to release the first synchronization register.
  • the physical address of the first synchronization register is addressed in a global addressing manner.
  • a fifth aspect of the embodiments of the present application provides a synchronization apparatus, the synchronization apparatus includes a fourth processor and a fifth processor; the fourth processor is configured to receive a virtual address of a second synchronization register from the first processor;
  • the second synchronization register is the register corresponding to the second synchronization event, the value of the second synchronization register includes the first numerical value or the second numerical value, the first numerical value is used to indicate that the second synchronization event does not occur, and the second numerical value is used to indicate The second synchronization event has occurred;
  • the first processor and the fourth processor are processors in different AI servers; the first processor includes the first central processing unit CPU, and the fourth processor includes the second CPU; the fourth processor is also used to send the remote direct memory access RDMA task corresponding to the second synchronization event to the fifth processor; the RDMA task corresponding to the second synchronization event is used to indicate that the second synchronization event has occurred, and the RDMA task corresponding to the second synchronization event includes the virtual address of the second synchronization register; the fifth processor includes a third NPU; the fifth processor is used for receiving the RDMA task corresponding to the second synchronization event and, based on the virtual address of the second synchronization register, resetting the value of the second synchronization register to the second value through the RDMA apparatus.
  • the fourth processor may send the RDMA task corresponding to the second synchronization event to the fifth processor by calling the eighth application program interface API.
  • a sixth aspect of the embodiments of the present application provides a synchronization apparatus, the synchronization apparatus includes a fourth processor; the fourth processor is configured to receive a virtual address of a second synchronization register from the first processor, and the second synchronization register is a register corresponding to the second synchronization event.
  • the value of the second synchronization register includes a first value or a second value.
  • the first value is used to indicate that the second synchronization event has not occurred, and the second value is used to indicate that the second synchronization event has occurred.
  • the first processor and the fourth processor are processors in different AI servers; the first processor includes the first central processing unit CPU, and the fourth processor includes the second CPU; the fourth processor is further configured to, based on the virtual address of the second synchronization register, reset the value of the second synchronization register to the second value through the remote direct memory access RDMA device.
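The cross-server path described above can be sketched with a toy model in which a virtual address maps to a synchronization register on the remote AI server, and the RDMA write is a plain store through that mapping. The address value and dictionary layout here are assumptions for illustration only.

```python
FIRST_VALUE, SECOND_VALUE = 0, 1

# Registers on the remote AI server, keyed by virtual address (assumed layout).
remote_registers = {0x1000: FIRST_VALUE}

def rdma_send(addr, value):
    # Models an RDMA write: store `value` at virtual address `addr` on the
    # remote server, with no involvement of the remote computing units.
    remote_registers[addr] = value

virtual_addr = 0x1000                  # obtained from the peer's first processor
rdma_send(virtual_addr, SECOND_VALUE)  # marks the second synchronization event
print(remote_registers[virtual_addr])  # 1
```

The point of the mechanism is visible even in this sketch: marking the event on the remote server is a single memory write to a known address, so no remote CPU or NPU cycles are consumed to deliver the synchronization signal.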
  • a first processor is provided, and the first processor is configured to create a first synchronization object for a first synchronization event; the first synchronization object includes an identifier of the first synchronization register; the value of the first synchronization register includes a first value or a second value, the first value is used to indicate that the synchronization event has not occurred, and the second value is used to indicate that the synchronization event has occurred; the first processor includes a first central processing unit CPU.
  • the first processor is further configured to reset the value of the first synchronization register to the first value.
  • the above-mentioned first processor is specifically configured to allocate, by calling the first application program interface API, the first synchronization register for the first synchronization event among a plurality of synchronization registers included in the second processor, and to store the identifier of the first synchronization register in the first synchronization object.
  • the above-mentioned first processor is further configured to send the waiting task corresponding to the first synchronization event to the second processor by calling the second API; the waiting task corresponding to the first synchronization event is used to wait for the first synchronization event to occur, and includes the first queue identification and the identification of the first synchronization register; the first queue identification is the identifier of the queue where the waiting task is located.
  • the above-mentioned first processor is further configured to send a recording task corresponding to the first synchronization event to the above-mentioned second processor by calling a third API; the recording task corresponding to the first synchronization event is used to indicate that the first synchronization event has occurred, and includes the second queue identification and the identification of the first synchronization register; the second queue identification is the identifier of the queue where the recording task corresponding to the first synchronization event is located.
  • the above-mentioned first processor is further configured to send the recording task corresponding to the first synchronization event to the third processor by calling the third API; the recording task corresponding to this first synchronization event is used to indicate that the first synchronization event has occurred, and includes the second queue identification and the identification of the first synchronization register; the second queue identification is the identifier of the queue where the recording task corresponding to the first synchronization event is located.
  • the above-mentioned first synchronization event is an inter-process synchronization event; the above-mentioned first processor is further configured to set the name of the first synchronization object to a preset name by calling the fourth API of the first application program; the first processor is also used to obtain the identifier of the first synchronization register corresponding to the preset name by calling the fifth API of the second application program.
  • the first synchronization event is a synchronization event between the first application and the second application, and the preset name is a name pre-agreed by the first application and the second application.
  • the above-mentioned first processor is further configured to obtain the virtual address of the second synchronization register by calling the sixth API;
  • the second synchronization register is a register corresponding to the second synchronization event, and different values of the second synchronization register are used to indicate whether the second synchronization event occurs;
  • the first processor is also used to send the virtual address of the second synchronization register to the fourth processor;
  • the first processor and the fourth processor are processors in different AI servers, and the fourth processor includes the second CPU.
  • the above-mentioned first processor is further configured to release the correspondence relationship between the above-mentioned first synchronization register and the above-mentioned first synchronization event by calling the seventh API, and to reset the value of the first synchronization register to the first value; the seventh API is used to release the first synchronization register.
  • the physical address of the first synchronization register is addressed in a global addressing manner.
  • An eighth aspect of the embodiments of the present application provides a second processor, the second processor includes a plurality of synchronization registers, each synchronization register is used to correspond to a synchronization event, and the value of each synchronization register includes a first value or a second value; the first value is used to indicate that the synchronization event corresponding to the synchronization register has not occurred, and the second value is used to indicate that the synchronization event corresponding to the synchronization register has occurred; the second processor includes the first neural network processor NPU.
  • the above-mentioned second processor is configured to determine whether the first synchronization event occurs based on the value of the first synchronization register.
  • the above-mentioned second processor is specifically configured to determine that the first synchronization event has not occurred when the value of the first synchronization register is the first value; the second processor continues to wait for the first synchronization event to occur until the value of the first synchronization register is the second value, at which point the second processor determines that the first synchronization event has occurred and resets the value of the first synchronization register to the first value.
  • the above-mentioned second processor is further configured to determine, when the value of the first synchronization register is the second numerical value, that the first synchronization event has occurred, and to reset the value of the first synchronization register to the first value.
  • the above-mentioned second processor is further configured to receive the waiting task corresponding to the first synchronization event; the waiting task corresponding to the first synchronization event is used to wait for the first synchronization event to occur, and includes the first queue identifier and the identifier of the first synchronization register; the first queue identifier is the identifier of the queue where the waiting task is located.
  • the second processor is further configured to receive the recording task corresponding to the first synchronization event, and based on the identifier of the first synchronization register, reset the value of the first synchronization register to the second value; the recording task corresponding to the first synchronization event is used to indicate that the first synchronization event has occurred, and includes the second queue identifier and the identifier of the first synchronization register; the identifier of the second queue is the identifier of the queue where the recording task corresponding to the first synchronization event is located.
  • a fourth processor is provided, and the fourth processor is configured to receive a virtual address of a second synchronization register from the first processor; the second synchronization register corresponds to the second synchronization event
  • the value of the second synchronization register includes a first value or a second value, the first value is used to indicate that the second synchronization event has not occurred, and the second value is used to indicate that the second synchronization event has occurred;
  • the first processor and the fourth processor are processors in different AI servers;
  • the first processor includes the first central processing unit (CPU), and the fourth processor includes the second CPU; the fourth processor is also used to send the RDMA task corresponding to the second synchronization event to the fifth processor; the fifth processor includes a third NPU.
  • a tenth aspect of the embodiments of the present application provides a fifth processor, where the fifth processor is configured to receive an RDMA task corresponding to a second synchronization event and, based on the virtual address of the second synchronization register, reset the value of the second synchronization register to the second value through an RDMA device; the RDMA task corresponding to the second synchronization event is used to indicate that the second synchronization event has occurred, and the RDMA task corresponding to the second synchronization event includes the virtual address of the second synchronization register;
  • the fifth processor includes a third NPU; the value of the second synchronization register includes a first value or a second value, the first value is used to indicate that the second synchronization event has not occurred, and the second value is used to indicate that the second synchronization event has occurred.
  • An eleventh aspect of the embodiments of the present application provides an electronic device, the electronic device includes a memory, and the synchronization apparatus according to any one of the fourth, fifth, and sixth aspects.
  • a twelfth aspect of the embodiments of the present application provides a chip, the chip includes an interface circuit and the first processor according to the first aspect above, where the first processor is configured to communicate with other devices through the interface circuit to implement the method described in the first aspect above.
  • a thirteenth aspect of the embodiments of the present application provides a chip, the chip includes an interface circuit, and the first processor and the second processor according to the above-mentioned first aspect; the first processor and the second processor communicate through the interface circuit to implement the method described in the first aspect above.
  • a fourteenth aspect of the embodiments of the present application provides a chip, the chip includes an interface circuit, and the first processor, the second processor, and the third processor as described in the first aspect above; the first processor, the second processor, and the third processor communicate through the interface circuit to implement the method described in the first aspect.
  • a fifteenth aspect of the embodiments of the present application provides a chip, the chip includes an interface circuit, and the fourth processor and the fifth processor according to the second aspect or the third aspect; the fourth processor and the fifth processor communicate through the interface circuit to implement the method described in any one of the above aspects.
  • a sixteenth aspect of the embodiments of the present application provides an AI server, where the AI server includes a CPU and one or more AI accelerators, the CPU is the first processor described in any one of the preceding aspects, and the one or more AI accelerators include at least one of the second processor or the third processor described in any aspect above.
  • a seventeenth aspect of the embodiments of the present application provides an AI server, where the AI server includes a CPU and one or more AI accelerators, the CPU is the fourth processor described in any one of the preceding aspects, and the AI accelerator is the fifth processor according to any one of the above aspects.
  • An eighteenth aspect of the embodiments of the present application provides an AI cluster, where the AI cluster includes multiple AI servers, the AI servers include a CPU and one or more AI accelerators, the CPU includes the first processor described in any of the above aspects, and the AI accelerator includes at least one of the second processor or the third processor described in any aspect above.
  • a nineteenth aspect of the embodiments of the present application provides an AI cluster, where the AI cluster includes multiple AI servers, the AI servers include a CPU and one or more AI accelerators, the CPU includes the fourth processor described in any of the above aspects, and the AI accelerator includes the fifth processor described in any one of the preceding aspects.
  • a twentieth aspect of the embodiments of the present application provides a communication system, where the communication system includes at least one of an AI accelerator, the AI server according to the eleventh aspect, the AI server according to the twelfth aspect, the AI cluster according to the thirteenth aspect, or the AI cluster described in the fourteenth aspect above.
  • the AI accelerator includes at least one of the second processor, the third processor, and the fifth processor described in any aspect above.
  • a twenty-first aspect of the embodiments of the present application provides an application program interface API, where the API is deployed in a processor, and the API is used to create a synchronization object for a synchronization event.
  • the API can be NotifyCreat(deviceID, notify), where the input deviceID is the ID of the AI accelerator, and the output notify is the synchronization object.
  • a twenty-second aspect of the embodiments of the present application provides an application program interface API, where the API is deployed in a processor, and the API is used to deliver a waiting task corresponding to a synchronization event.
  • the API may be the NotifyWait(notify, stream) interface, which is used to wait for the synchronization event corresponding to the synchronization object to occur in the stream.
  • a twenty-third aspect of the embodiments of the present application provides an application program interface API, where the API is deployed in a processor, and the API is used to issue a recording task corresponding to a synchronization event.
  • the API can be the NotifyRecord(notify, stream) interface, which is used to set the synchronization event occurrence corresponding to the synchronization object in the stream.
  • a twenty-fourth aspect of the embodiments of the present application provides an application program interface API, where the API is deployed in a processor, and the API is used to set a global name of a synchronization object.
  • the API can be IpcSetNotifyName(notify, name), which is used to set the global name of the synchronization object notify.
  • a twenty-fifth aspect of the embodiments of the present application provides an application program interface API, where the API is deployed in a processor, and the API is used to open a synchronization object.
  • the API can be IpcOpenNotify(notify, name), which is used to open the synchronization object according to the global name name of the synchronization object notify.
  • a twenty-sixth aspect of the embodiments of the present application provides an application program interface API, where the API is deployed in a processor, and the API is used to obtain a virtual address of a register corresponding to a synchronization object.
  • the API may be NotifyGetAddr(notify, addr), where the input is the synchronization object notify, and the output is the virtual address of the synchronization register corresponding to the synchronization object notify.
  • a twenty-seventh aspect of the embodiments of the present application provides an application program interface API, where the API is deployed in a processor, and the API is used to release a synchronization register.
  • the API can be NotifyDestroy(notify), which can be used to destroy the synchronization object notify and release the synchronization register corresponding to the synchronization object.
  • a twenty-eighth aspect of the embodiments of the present application provides an application program interface API, where the API is deployed in a processor, and the API is used to deliver an RDMA task corresponding to a synchronization event.
  • the API may be RDMAsend(addr, 1), which is used to instruct to write the second value 1 to the virtual address addr.
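Taken together, the interfaces named in the twenty-first through twenty-eighth aspects suggest a usage flow like the following. This is a hedged, self-contained Python mock: the dictionary-based synchronization objects, the return-value conventions (the real `NotifyCreat` appears to use an output parameter rather than a return value), and the address layout in `NotifyGetAddr` are all assumptions for illustration, not the actual runtime implementation.

```python
FIRST_VALUE, SECOND_VALUE = 0, 1

registers = {}   # synchronization register id -> value
names = {}       # global name -> synchronization object
_next_id = [0]

def NotifyCreat(device_id):
    # Create a synchronization object; allocate a synchronization register.
    reg_id = _next_id[0]
    _next_id[0] += 1
    registers[reg_id] = FIRST_VALUE
    return {"device": device_id, "register": reg_id}

def NotifyRecord(notify, stream=None):
    # Recording task: mark the synchronization event as having occurred.
    registers[notify["register"]] = SECOND_VALUE

def NotifyWait(notify, stream=None):
    # Waiting task: on real hardware this blocks in the stream; here we
    # just test-and-reset the register.
    if registers[notify["register"]] == SECOND_VALUE:
        registers[notify["register"]] = FIRST_VALUE
        return True
    return False

def IpcSetNotifyName(notify, name):
    # Publish the synchronization object under a pre-agreed global name.
    names[name] = notify

def IpcOpenNotify(name):
    # Open the synchronization object by its global name (another process).
    return names[name]

def NotifyGetAddr(notify):
    # Return the virtual address of the synchronization register.
    return 0x1000 + notify["register"]   # assumed address layout

def NotifyDestroy(notify):
    # Destroy the object and release the synchronization register.
    registers.pop(notify["register"], None)

# Usage flow: create, publish by name, record in one "process", wait in another.
n = NotifyCreat(device_id=0)
IpcSetNotifyName(n, "grad_ready")   # "grad_ready" is a hypothetical pre-agreed name
peer = IpcOpenNotify("grad_ready")
NotifyRecord(peer)
print(NotifyWait(n))                # True: event observed, register reset
print(registers[n["register"]])     # 0
NotifyDestroy(n)
```

The mock keeps the key invariant from the aspects above: a successful wait consumes the event by resetting the register to the first value, and destroying the object releases the register for reuse.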
  • FIG. 1A is a schematic diagram of an AI training process provided by an embodiment of the application;
  • FIG. 1B is a schematic structural diagram of a Ring algorithm in a single AI server provided by an embodiment of the application;
  • FIG. 1C is a schematic diagram of the calculation process of the reduce-scatter stage in the Ring algorithm in a single AI server provided by an embodiment of the application;
  • FIG. 1D is a schematic diagram of the calculation process of the all-gather stage in the Ring algorithm in a single AI server provided by an embodiment of the present application;
  • FIG. 2A is a schematic structural diagram of an AI accelerator provided by an embodiment of the present application.
  • FIG. 2B is a schematic structural diagram of a computing architecture provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a synchronization method provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a computing architecture of an AI server according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a computing task provided by an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of another synchronization method provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a computing architecture for inter-process synchronization provided by an embodiment of the present application.
  • FIG. 8 is a schematic flowchart of another synchronization method provided by an embodiment of the present application.
  • FIG. 9 is a schematic flowchart of another synchronization method provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a computing architecture for synchronization between AI servers according to an embodiment of the present application
  • FIG. 11 is a schematic flowchart of another synchronization method provided by an embodiment of the present application.
  • At least one of a, b, or c may represent: a; b; c; a and b; a and c; b and c; or a, b, and c; where a, b, and c may each be singular or plural.
  • words such as "first" and "second" are used to distinguish identical or similar items that have substantially the same function and effect; those skilled in the art can understand that words such as "first" and "second" do not limit the quantity or the execution order.
  • the "first” in the first processor and the "second” in the second processor in the embodiments of the present application are only used to distinguish different processors.
  • the descriptions of "first", "second", etc. appearing in the embodiments of the present application are only used to illustrate and distinguish the objects being described; they do not imply any order and do not impose any limitation on the embodiments.
  • an AI server can include one or more AI accelerators.
  • an AI accelerator, as a computing device, can be a type of microprocessor that accelerates special tasks such as machine learning processes or algorithms for intelligent computing or other data-intensive or sensor-driven tasks, and can also include an instruction set related to this type of microprocessor.
  • Dedicated tasks can include AI processing, such as artificial neural networks, machine learning (ML) training, ML optimization/learning, inference, classification and other operations, visual data processing, network data processing, object detection, rule analysis, content processing operation, etc.
  • the AI accelerator can be a neural network processor NPU, which can include one or more of a graphics processing unit (GPU), a digital signal processor (DSP), a system on chip (SoC), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), etc.
  • AI accelerators can run relevant AI instruction sets by loading weights, biases, training data or code to complete specialized tasks.
  • the embodiments of the present application do not limit the specific form of the AI accelerator. The following embodiments are described by taking the AI accelerator as an NPU as an example.
  • the training process of a neural network generally includes multiple iterations, and each iteration includes three stages: forward calculation, backward calculation, and gradient convergence.
  • Each AI accelerator performs forward calculation and reverse calculation independently, and the calculated gradients need to be aggregated across multiple AI accelerators. Since the reverse calculation is generally the back-propagation of the error, after the error (the difference between the recognition value of the neural network and the supervision data) is obtained, the weights of the neural network are adjusted based on the gradient descent method. Therefore, the reverse calculation includes the processes of "obtaining the error value" and "backpropagating based on the error value", and the latter process (error backpropagation) includes the process of adjusting the weights of the layers of the neural network based on the gradient.
  • the back propagation process propagates the error value backward, layer by layer, from the output layer of the neural network.
  • the gradient of each weight parameter is calculated based on the error, and then the weights of the neural network are updated according to the direction of the gradient of the weight parameter. Therefore, in the reverse calculation process, once the gradient values of some neuron layers have been calculated, gradient convergence can be started. For example, for a 100-layer neural network, once the gradient calculation of layers 100 to 80 is completed, gradient convergence for those layers can be started. In this way, after all the reverse calculations are completed, the time for gradient aggregation of the remaining data is shorter than the time for gradient aggregation of all data, which can improve training efficiency.
  • Gradient aggregation mainly includes data transmission between multiple AI accelerators in the AI server, network transmission between AI servers, synchronization waiting between AI accelerators, gradient data accumulation between AI accelerators, etc.
  • Gradient convergence does not require the participation of the computing unit of the AI accelerator, so the computing unit of the AI accelerator is in an idle state during gradient convergence.
  • the AI accelerator performs forward calculation from time T0 to time T1, and the AI accelerator performs reverse calculation from time T1 to time T2.
  • gradient convergence can be performed from time T4; that is, from time T4 to time T2, reverse calculation and gradient convergence 1 are performed at the same time.
  • After the reverse calculation is completed, only the gradient convergence of the remaining data is performed from time T2 to time T3. Therefore, time T0 to time T2 is the calculation time, and from time T2 to time T3 the computing unit of the AI accelerator neither performs forward calculation nor reverse calculation and is in an idle state. Since the AI cluster only performs gradient convergence during this period, time T2 to time T3 may also be referred to as the gradient convergence time.
  • the All-reduce algorithm may be used for the above-mentioned gradient convergence.
  • the All-reduce algorithm is a type of algorithm used to efficiently integrate data in different AI accelerators, and then distribute the results to each AI accelerator.
  • the performance of gradient convergence is a key factor that reflects the performance of cluster training.
  • the linearity L of this cluster can be calculated by the following formula:
  • T_idle is the time when the computing unit of the AI accelerator is in an idle state, that is, T_idle is the gradient convergence time (for example, the All-reduce time).
  • the shorter the gradient convergence time, the shorter the time that the computing unit of the AI accelerator is in an idle state, and the higher the cluster linearity L.
  • the computing unit of the AI accelerator is in an idle state from time T2 to time T3, that is, the time from T2 to T3 is the gradient convergence time. Therefore, the shorter the time from T2 to T3, the higher the cluster linearity; the longer the time from T2 to T3, the lower the cluster linearity. The cluster linearity can therefore be improved by reducing the synchronization transmission and synchronization waiting time during gradient convergence.
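The formula referenced above ("the linearity L of this cluster can be calculated by the following formula") is not reproduced in this excerpt. A common form that is consistent with the surrounding definitions, offered here only as an assumption and not as the patent's exact expression, is:

```latex
L = \frac{T_{\text{total}} - T_{\text{idle}}}{T_{\text{total}}}
```

where, in the notation of the timeline above, the total iteration time would be $T_{\text{total}} = T3 - T0$ and the idle (gradient convergence) time would be $T_{\text{idle}} = T3 - T2$, so that a shorter gradient convergence time yields a linearity closer to 1, matching the qualitative statements in the text.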
  • Suppose the gradient convergence algorithm of the cluster is the Ring algorithm within a single AI server, and the AI server includes five AI accelerators, for example, GPU0 to GPU4.
  • the Ring algorithm includes two stages, namely reduce-scatter stage and all-gather stage.
  • In the reduce-scatter stage, data is exchanged between GPUs so that each GPU eventually obtains a part of the final result.
  • In the all-gather stage, the GPUs swap these blocks so that all GPUs end up with the full final result.
  • each GPU has a left neighbor and a right neighbor, and each GPU will only send data to its right neighbor and receive data from its left neighbor.
  • each GPU has a left neighbor and a right neighbor; for example, GPU0 will only send data to its right neighbor GPU1, and receive data from its left neighbor GPU4.
  • GPU1 will only send data to its right neighbor GPU2 and receive data from its left neighbor GPU0.
  • GPU2 will only send data to its right neighbor GPU3 and receive data from its left neighbor GPU1.
  • GPU3 will only send data to its right neighbor GPU4 and receive data from its left neighbor GPU2.
  • GPU4 will only send data to its right neighbor GPU0 and receive data from its left neighbor GPU3.
  • Taking the case where each GPU divides the data into 5 smaller data blocks as an example, combined with Figure 1B and as shown in Figure 1C, in the reduce-scatter stage each GPU will perform 4 iterations of reduce-scatter; in each iteration, each GPU sends one of the data blocks to its right neighbor, receives a data block from its left neighbor, and accumulates it into the corresponding block.
  • the chunks of data sent and received are different for each iteration.
  • for example, GPU0 sends data block a0 to its right neighbor GPU1, receives data block e4 from its left neighbor GPU4, and accumulates it into data block e0.
  • GPU1 sends data block b1 to its right neighbor GPU2, receives data block a0 from its left neighbor GPU0, accumulates it into data block a1, and so on.
  • as shown in Figure 1C, in the reduce-scatter stage, after 4 iterations by GPU0 to GPU4, one data block on each GPU holds its final value.
  • in the all-gather stage, GPU0 to GPU4 again perform 4 iterations, except that in each iteration each GPU sends one of its data blocks to its right neighbor, receives a data block from its left neighbor, and overwrites the corresponding local data block with it.
  • for example, GPU0 sends data block b2+b1+b3+b4+b0 to its right neighbor GPU1, receives data block a1+a0+a2+a3+a4 from its left neighbor GPU4, and overwrites data block a0 with a1+a0+a2+a3+a4.
  • GPU1 sends data block c3+c2+c4+c0+c1 to its right neighbor GPU2, receives data block b2+b1+b3+b4+b0 from its left neighbor GPU0, and overwrites data block b1 with b2+b1+b3+b4+b0, and so on.
  • as shown in Figure 1D, in the all-gather stage, after 4 iterations by GPU0 to GPU4, every GPU holds the fully accumulated values for the entire array.
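The two stages above can be sketched in code. The following is a minimal, illustrative Python simulation (not from the patent); for simplicity each "data block" is a single number, and sending/receiving between neighbors is modeled with plain list operations:

```python
def ring_allreduce(arrays):
    """Simulate Ring all-reduce: n GPUs, each with n blocks.

    reduce-scatter: n-1 iterations in which each GPU sends one block to
    its right neighbor and accumulates the block received from its left
    neighbor, so each GPU ends up with one fully reduced block.
    all-gather: n-1 iterations in which received blocks overwrite local
    blocks, so every GPU ends up with the complete result.
    """
    n = len(arrays)                      # number of GPUs == number of blocks
    blocks = [list(a) for a in arrays]   # blocks[g][b] is block b on GPU g

    # reduce-scatter stage
    for step in range(n - 1):
        # snapshot of the block each GPU sends in this iteration
        sent = [blocks[g][(g - step) % n] for g in range(n)]
        for g in range(n):
            left = (g - 1) % n
            # accumulate the block received from the left neighbor
            blocks[g][(left - step) % n] += sent[left]

    # all-gather stage
    for step in range(n - 1):
        sent = [blocks[g][(g + 1 - step) % n] for g in range(n)]
        for g in range(n):
            left = (g - 1) % n
            # overwrite the local block with the received one
            blocks[g][(left + 1 - step) % n] = sent[left]

    return blocks
```

With 5 GPUs, after the 4+4 iterations every GPU's list equals the element-wise sum of all five input lists, matching the description of Figures 1C and 1D.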
  • the embodiments of the present application take the Ring algorithm in a single AI server as an example to illustrate that a synchronization mechanism is required in an AI training scenario to ensure the normal operation of the algorithm.
  • the embodiments of the present application do not limit the specific application scenarios of the synchronization mechanism.
  • A synchronization mechanism is therefore essential.
  • One synchronization mechanism is to ensure mutual exclusion of synchronization within and between processes through a semaphore mechanism.
  • however, this method only supports synchronization on general-purpose processor architectures (e.g., x86 or ARM); it supports neither synchronization on chips such as AI accelerators nor synchronization between AI servers.
  • Another synchronization method is the event synchronization mechanism provided by NVIDIA's Compute Unified Device Architecture (CUDA), which is used for synchronization within a process, between processes, within a graphics processing unit (GPU) chip, and between GPU chips.
  • the event mechanism does not support synchronization between AI servers.
  • moreover, the overhead of synchronization within GPU chips and between GPU chips is large, on the order of 10 us, and when the event mechanism is used for inter-process synchronization, the application program interface (API) design is complex and inconvenient for developers to use.
  • therefore, the embodiment of the present application provides a synchronization method, which can realize synchronization within one AI accelerator, among different AI accelerators within one AI server, and among AI servers, with low synchronization overhead and a simple API design that is convenient for developers to use.
  • the synchronization method provided by the embodiment of the present application may be applied to a computing architecture, and the computing architecture may be a computing architecture of an AI server.
  • the computing architecture of the AI server is a hardware architecture of heterogeneous computing, and the architecture includes a central processing unit (CPU) and one or more AI accelerators.
  • the CPU can send an AI computing task to the AI accelerator.
  • the AI accelerator executes the AI computing task and reports the execution result to the CPU.
  • FIG. 2A shows an AI accelerator provided by an embodiment of the present application.
  • the AI accelerator includes a controller, an arithmetic logic unit, and a plurality of synchronization registers.
  • the controller is used to receive the AI computing task sent by the CPU, and report the execution result of the computing task to the CPU.
  • the arithmetic logic unit is used to execute the calculation task issued by the controller, and return the execution result of each calculation task to the controller.
  • the AI accelerator includes multiple synchronization registers, and the multiple synchronization registers are Reg0, Reg1 to Regn, respectively.
  • Each synchronization register corresponds to a synchronization event, and different values of the synchronization register indicate whether the corresponding synchronization event has occurred.
  • the multiple synchronization registers may be set in the controller of the AI accelerator.
  • the value of each synchronization register may include a first value and a second value.
  • the first value is used to indicate that the synchronization event corresponding to the synchronization register has not occurred
  • the second value is used to indicate that the synchronization event corresponding to the synchronization register has occurred.
  • the first numerical value and the second numerical value are different numerical values. The specific values of the first numerical value and the second numerical value are not limited in the embodiments of the present application. In the following embodiments, the first numerical value is 0 and the second numerical value is 1 for exemplary description.
  • the synchronization event corresponding to the synchronization register can occur in one AI accelerator, between different AI accelerators in one AI server, or between different AI servers (each AI server includes at least one AI accelerator).
  • AI accelerator can determine whether the synchronization event occurs based on the value of the synchronization register, thereby realizing synchronization in the AI accelerator.
  • alternatively, the AI accelerator can determine whether the synchronization event occurs based on the value of the synchronization register, so as to realize synchronization between different AI accelerators in an AI server.
  • the AI accelerator of one AI server can determine whether the synchronization event occurs based on the value of the synchronization register, thereby realizing synchronization between AI accelerators.
  • this embodiment of the present application does not limit the specific number of synchronization registers set in each AI accelerator.
  • 1024 synchronization registers can be set in the AI accelerator, and one synchronization register can correspond to one synchronization event.
  • multiple synchronization registers are set in the AI accelerator, and each synchronization register is used to correspond to a synchronization event, so that the AI accelerator can be based on the value of the synchronization register, Determine whether a synchronization event corresponding to the synchronization register occurs, so as to realize synchronization within an AI accelerator, among different AI accelerators within an AI server, and among AI servers.
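As an illustration of the register semantics described above (the first value 0 meaning the event has not occurred, the second value 1 meaning it has, and the reset back to 0 once the event is consumed), the following Python sketch models a single synchronization register; the class and method names are invented for illustration and do not appear in the patent:

```python
class SyncRegister:
    """Illustrative model of one synchronization register."""

    def __init__(self):
        self.value = 0              # first value: event has not occurred

    def record(self):
        self.value = 1              # second value: event has occurred

    def consume(self):
        """Return True if the event occurred, clearing the register so it
        can be reused for the next occurrence of the same event."""
        if self.value == 1:
            self.value = 0          # reset to the first value
            return True
        return False
```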
  • the AI server may include a CPU and multiple AI accelerators, and each AI accelerator includes a set of synchronization registers.
  • Each synchronization register may correspond to a synchronization event, and different values of the synchronization register may be used to indicate whether the corresponding synchronization event occurs.
  • the driver in the CPU is used to provide the driver function for the AI accelerator.
  • the user-mode driver layer runtime (runtime) is deployed in the application (application, App), and the runtime is used to provide the user-mode driver function of the AI accelerator.
  • the runtime includes multiple APIs.
  • the CPU runs the APP, the interaction between software and hardware can be realized by calling different API interfaces.
  • the CPU can send AI computing tasks to the AI accelerator by calling APIs. After receiving the AI computing tasks sent by the CPU, the controller in the AI accelerator executes the AI computing tasks and reports the execution results to the CPU.
  • the runtime of the user-mode driver layer of the APP provides an API.
  • the upper-layer business APP can split the AI model (computation graph), convert it into tasks such as stream, task, and event that the AI accelerator can process, and send it to the AI accelerator for processing through the API provided by the runtime.
  • a task is a computing task, which is generally processed by an arithmetic logic unit in an AI accelerator.
  • Event is an event synchronization mechanism, which is generally handled by the controller.
  • the controller in the AI accelerator can schedule the execution of tasks of multiple streams concurrently, but the tasks in the same stream can only be executed sequentially.
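The scheduling rule above — multiple streams may run concurrently while tasks within one stream execute strictly in order — can be illustrated with an ordinary threading sketch; the stream and task names here are invented:

```python
import threading

log = []                            # global record of executed tasks
lock = threading.Lock()

def run_stream(name, tasks):
    # tasks within one stream execute sequentially, in submission order
    for t in tasks:
        with lock:
            log.append((name, t))

# two streams may be scheduled concurrently by the controller
s0 = threading.Thread(target=run_stream, args=("stream0", ["t01", "t02"]))
s1 = threading.Thread(target=run_stream, args=("stream1", ["t11", "t12"]))
s0.start(); s1.start()
s0.join(); s1.join()

# however the streams interleave, per-stream order is preserved
order0 = [t for s, t in log if s == "stream0"]
order1 = [t for s, t in log if s == "stream1"]
assert order0 == ["t01", "t02"] and order1 == ["t11", "t12"]
```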
  • the number of synchronization registers set in different AI accelerators may be the same or different, which is not limited in the embodiment of the present application.
  • the case where the AI server includes m+1 AI accelerators, with n synchronization registers set in each of AI accelerator 0 through AI accelerator m, is taken as an example for illustration.
  • multiple synchronization registers can be set in each AI accelerator, and the physical addresses of the synchronization registers set in different AI accelerators in one AI server can be addressed by global addressing.
  • global addressing of synchronization registers in an AI server may be implemented according to an AI accelerator's identity (ID) plus offset or other methods.
  • in this way, the controller of each AI accelerator can know the physical addresses of the synchronization registers in the other AI accelerators in the AI server, and can also access the synchronization registers of other AI accelerators through these physical addresses.
  • the AI accelerator and the CPU can be integrated on one chip, or can be integrated on different chips respectively.
  • the multiple AI accelerators can be integrated on one or more chips, the CPU can be integrated on another chip, or the CPU and AI accelerator can be integrated on one chip.
  • This embodiment of the present application does not limit the hardware form of the heterogeneous computing composed of the CPU and the AI accelerator in the AI server, which is exemplified here.
  • a group of synchronization registers are set in the AI accelerator in the AI server, and each synchronization register can correspond to a synchronization event, so that the AI accelerator can determine the value of the synchronization register based on the value of the synchronization register. Whether the synchronization event corresponding to the synchronization register occurs can realize synchronization within an AI accelerator, among different AI accelerators within an AI server, and among AI servers.
  • a synchronization method is provided in an embodiment of the present application, and the method includes the following steps:
  • the first processor creates a first synchronization object for the first synchronization event.
  • the first processor may be a central control unit in the AI server, eg, a CPU.
  • the first processor includes a first CPU.
  • the first processor creating a first synchronization object for the first synchronization event may include: the first processor calls the first API to allocate, among the plurality of synchronization registers included in the second processor, the first synchronization register for the first synchronization event, and stores the identifier of the first synchronization register in the first synchronization object.
  • the second processor includes an NPU, and the second processor is the NPU waiting for the first synchronization event to occur. That is, the synchronization register allocated for a synchronization event in the embodiment of the present application is a synchronization register in the NPU that waits for the synchronization event to occur.
  • the first API is used to create synchronization objects for synchronization events.
  • the first API may be NotifyCreat(deviceID, notify), where the input deviceID is the ID of the AI accelerator, and the output notify is a synchronization object, and the NotifyCreat interface is used to create a synchronization object.
  • the deviceID in the above NotifyCreat interface is the ID of the second processor.
  • when the first processor allocates the first synchronization register for the first synchronization event, it may also reset the value of the first synchronization register to the first value, so that the value of the first synchronization register corresponds to the current state of the first synchronization event.
  • the above-mentioned resetting the value of the first synchronization register to the first value may also be to set the value of the first synchronization register to the first value, which is not limited in this embodiment of the present application. In practical applications, the setting method can be adopted, or the reset (Reset) method can be adopted to change the value of the synchronization register.
  • the first processor may be a CPU in an AI server
  • the second processor may be an AI accelerator in the AI server.
  • the first processor and the second processor form a heterogeneous computing architecture.
  • the AI server can be a heterogeneous server.
  • the first processor may be the host CPU in the AI server
  • the second processor may be the NPU in the AI server
  • the host CPU may call the first API to register in multiple synchronization registers included in the NPU waiting for the synchronization event to occur , allocate the first synchronization register for the first synchronization event.
  • the above-mentioned first synchronization event may occur in one NPU, or between different NPUs in one AI server, or between different AI servers, which is not limited in this embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a computing architecture of an AI server.
  • taking the case where the two NPUs are NPU0 and NPU1 as an example, the CPU can send computing tasks, recording tasks and waiting tasks to NPU0 and NPU1.
  • the computing task (Task) is a computing task processed by the arithmetic logic unit, the recording task (record) is used to indicate that a synchronization event has occurred, and the waiting task (wait) is used to wait for the synchronization event to occur.
  • queue 1 of NPU0 can execute computing task 12 .
  • the synchronization requires queue 1 of NPU0 to wait for a synchronization event 1 to occur.
  • the synchronization event 1 is that queue 0 of NPU0 completes computing task 01 and sends the execution result.
  • the queue 1 of the NPU0 keeps waiting after the computing task 11 is executed.
  • When synchronization event 1 has occurred (queue 0 of NPU0 finishes executing computing task 01 and sends the execution result), queue 1 of NPU0 can continue to execute computing task 12. Understandably, synchronization event 1 occurs between two different queues of AI accelerator NPU0.
  • synchronization event 2 occurs between different AI accelerators (between NPU0 and NPU1) within an AI server.
  • because queue 1 of NPU0 waits for synchronization event 1 to occur, the CPU can allocate a synchronization register for synchronization event 1 among the multiple synchronization registers included in NPU0, and store the identifier of the synchronization register in synchronization object 1, which can be recorded as notify1.
  • because queue 1 of NPU1 waits for synchronization event 2 to occur, the CPU can allocate a synchronization register for synchronization event 2 among the multiple synchronization registers included in NPU1, and store the identifier of the synchronization register in synchronization object 2, which can be recorded as notify2.
  • because the embodiment of the present application sets a set of synchronization registers in each NPU, when the APP determines that synchronization is required, it can call the NotifyCreat(deviceID, notify) interface to allocate, in the NPU waiting for a synchronization event to occur, a synchronization register for each such synchronization event.
  • the APP calls the NotifyCreate API to create the synchronization object notify1 on NPU0, and the Runtime calls the NPU driver's interface to request the NPU driver to allocate a synchronization register for the synchronization event on NPU0. As shown in FIG. 4, the NPU driver may allocate a synchronization register Reg0 among the multiple synchronization registers in NPU0, record the identifier of the synchronization register Reg0, and reset the value of the synchronization register to the first value 0.
  • the NPU driver returns the id of the synchronization register Reg0 to the Runtime.
  • the Runtime builds the synchronization object notify1, saves the id of the synchronization register Reg0 in notify1, and returns notify1 to the APP.
  • the APP calls the NotifyCreate API to create the synchronization object notify2 on NPU1, and the Runtime calls the NPU driver's interface to request the NPU driver to allocate a synchronization register for the synchronization event on NPU1. As shown in FIG. 4, the NPU driver may allocate a synchronization register Reg1 among the multiple synchronization registers in NPU1, record the identifier of the synchronization register Reg1, and reset the value of the synchronization register Reg1 to the first value 0.
  • the NPU driver returns the id of the synchronization register Reg1 to the Runtime.
  • the Runtime builds the synchronization object notify2, saves the id of the synchronization register Reg1 in notify2, and returns notify2 to the APP.
  • when the NPU driver allocates a synchronization register for a synchronization event, it may allocate a synchronization register in an idle state in the NPU to the synchronization event.
  • the synchronization register in the idle state in the NPU refers to a synchronization register that has not been associated with any synchronization event, or one that was associated with another synchronization event but has since been reclaimed (that is, disassociated from that synchronization event or synchronization object).
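A hedged sketch of this allocation flow, using invented names (Notify, NpuDriver, notify_create — not the patent's actual implementation): the driver picks an idle register on the target accelerator, resets it to the first value 0, and wraps its identifier in the synchronization object returned to the APP:

```python
class Notify:
    """Synchronization object holding the allocated register's identity."""
    def __init__(self, device_id, reg_id):
        self.device_id = device_id
        self.reg_id = reg_id

class NpuDriver:
    def __init__(self, num_regs):
        self.regs = [0] * num_regs          # register values, all 0 initially
        self.free = set(range(num_regs))    # ids of idle registers

    def notify_create(self, device_id):
        """Allocate an idle register on `device_id` for a new sync event."""
        reg_id = self.free.pop()            # pick any idle register
        self.regs[reg_id] = 0               # reset to the first value
        return Notify(device_id, reg_id)    # object carries the register id
```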
  • the synchronization event in this embodiment of the present application may occur in one NPU, between different NPUs in one AI server, or between NPUs of different AI servers (each AI server includes at least one NPU) .
  • This embodiment is described by taking, as examples, (a) of FIG. 5, in which synchronization event 1 occurs within one NPU, and (b) of FIG. 5, in which synchronization event 2 occurs between different NPUs in an AI server.
  • the second processor determines whether the first synchronization event occurs based on the value of the first synchronization register.
  • the second processor determines whether the first synchronization event occurs based on the value of the first synchronization register, which can be divided into the following two implementation manners.
  • in one implementation manner, the above step S302 may include: when the value of the first synchronization register is the first value, the second processor determines that the first synchronization event has not occurred and continues to wait for the first synchronization event to occur; once the value of the first synchronization register is the second value, the second processor determines that the first synchronization event has occurred and resets the value of the first synchronization register to the first value.
  • that is, if the first synchronization event has not occurred, the second processor will continue to wait for it until the value of the first synchronization register is the second value; the second processor then resets the value of the first synchronization register to the first value and executes the next task, thereby ensuring correct synchronization.
  • during the waiting period, the controller of the second processor keeps checking the value of the first synchronization register. When the value is modified, the controller of the second processor detects the modification immediately, the second processor determines that the first synchronization event has occurred, and the second processor clears the first synchronization register to 0 so that the first synchronization register can be used for subsequent synchronization operations.
  • in another implementation manner, step S302 may include: when the value of the first synchronization register is the second value, the second processor determines that the first synchronization event has occurred, and the second processor resets the value of the first synchronization register to the first value.
  • the second processor determines that the first synchronization event has occurred, and the second processor resets the value of the first synchronization register to the first value. The second processor can then proceed with subsequent tasks, thereby ensuring proper synchronization.
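The two implementation manners above differ only in whether the register already holds the second value when the wait task arrives. A minimal Python sketch (the wait function and poll callback are illustrative, not the patent's API); the register is a one-element list so it can be mutated in place:

```python
def wait(register, poll):
    """Spin until the register holds the second value (1), then clear it.
    `poll` stands in for whatever sets the register while we wait,
    e.g. a recording task issued on another queue."""
    while register[0] == 0:      # first value: event not yet occurred
        poll()
    register[0] = 0              # event occurred: reset to first value

# Case 1: the event has already occurred when the wait task arrives.
reg = [1]
wait(reg, poll=lambda: None)
assert reg[0] == 0               # register cleared for reuse

# Case 2: the event occurs while waiting (poll simulates the record task).
reg = [0]
wait(reg, poll=lambda: reg.__setitem__(0, 1))
assert reg[0] == 0
```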
  • the synchronization method provided by the embodiment of the present application creates a first synchronization object for the first synchronization event, so that the first synchronization event can correspond to the first synchronization register, and the AI accelerator can determine the synchronization register based on the value of the synchronization register Whether the corresponding synchronization event occurs, thereby enabling synchronization within an AI accelerator, among different AI accelerators within an AI server, and among AI servers.
  • FIG. 6 is a synchronization method provided by an embodiment of the present application. As shown in FIG. 6 , the method may include the following steps:
  • the first processor creates a first synchronization object for the first synchronization event.
  • the first synchronization event may occur within one NPU, or may occur between different NPUs within an AI server.
  • step S601 reference may be made to step S301, and details are not repeated here.
  • the first processor sends the waiting task corresponding to the first synchronization event to the second processor by calling the second API.
  • the second API is used to deliver the waiting task corresponding to the synchronization event.
  • the second API may be a NotifyWait(notify, stream) interface, which is used to wait for the synchronization event corresponding to the synchronization object to occur in the stream.
  • the waiting task corresponding to the first synchronization event is used to wait for the first synchronization event to occur, and the waiting task corresponding to the first synchronization event includes the first queue identifier and the identifier of the first synchronization register.
  • the first queue identifier is the identifier of the queue where the waiting task is located. That is, the waiting task corresponding to the first synchronization event is the task in the first queue.
  • the first queue identifier may be the identifier of the stream where the waiting task is located.
  • the CPU sends waiting task 1 to NPU0 by calling NotifyWait(notify1, queue 1), instructing NPU0 to wait in queue 1 for synchronization event 1 corresponding to notify1 to occur.
  • the CPU sends waiting task 2 to NPU1 by calling NotifyWait(notify2, queue 1), instructing NPU1 to wait in queue 1 for synchronization event 2 corresponding to notify2 to occur.
  • the second processor receives the waiting task corresponding to the first synchronization event.
  • the second processor determines whether the first synchronization event occurs based on the value of the first synchronization register.
  • after receiving the waiting task, the second processor may read the value of the first synchronization register based on the identifier of the first synchronization register carried in the waiting task; because different values of the first synchronization register indicate whether the first synchronization event has occurred, the second processor can determine whether the first synchronization event has occurred based on the value of the first synchronization register.
  • step S604 reference may be made to step S302, and details are not described herein again.
  • CPU sends waiting task 1 to NPU0 through NotifyWait.
  • NPU0 receives waiting task 1 and, based on the identifier of synchronization register Reg0 in waiting task 1, reads the value of Reg0. If the value of Reg0 is 0, indicating that synchronization event 1 corresponding to notify1 has not occurred, NPU0 continues to wait for synchronization event 1 to occur, and the controller of NPU0 keeps checking the value of Reg0. When the value of Reg0 changes from 0 to 1, it means that synchronization event 1 corresponding to notify1 has occurred; the controller of NPU0 immediately detects the modification of Reg0, determines that synchronization event 1 has occurred, and clears the value of Reg0 to zero.
  • CPU sends waiting task 1 to NPU0 through NotifyWait.
  • NPU0 receives waiting task 1 and, based on the identifier of synchronization register Reg0 in waiting task 1, reads the value of Reg0. If the value of Reg0 is 1, NPU0 determines that synchronization event 1 corresponding to notify1 has occurred, and the controller of NPU0 clears the value of Reg0 to zero.
  • the second processor resets the value of the first synchronization register to the first value, so that the first synchronization register can continue to perform other synchronization operations. For example, if the synchronization event corresponding to the first synchronization object occurs periodically, the second processor may perform synchronization based on the value of the first synchronization register the next time the synchronization event corresponding to the first synchronization object occurs.
  • the first processor sends the recording task corresponding to the first synchronization event to the third processor by calling the third API.
  • the third processor may be an NPU, and the third processor and the second processor may be the same NPU, or may be different NPUs in the same AI server.
  • the third API is used to deliver the recording task corresponding to the synchronization event.
  • the third API may be a NotifyRecord(notify, stream) interface, which is used to set the synchronization event occurrence corresponding to the synchronization object in the stream.
  • the recording task corresponding to the first synchronization event is used to indicate that the first synchronization event has occurred.
  • the recording task corresponding to the first synchronization event includes the second queue identifier and the identifier of the first synchronization register, where the second queue identifier is the identifier of the queue in which the recording task corresponding to the first synchronization event is located.
  • the second queue identifier may be the identifier of the stream where the recording task is located.
  • in one case, the above-mentioned second processor and third processor are the same AI accelerator (for example, the same NPU); that is, the same AI accelerator executes both the Wait task and the Record task.
  • the second processor and the third processor are two different AI accelerators in the AI server. That is, one AI accelerator performs the Wait task, and the other AI accelerator performs the Record task.
  • when one AI accelerator executes both the Wait task and the Record task, the Wait task and the Record task can be tasks in two different streams.
  • the synchronization event 1 occurs in one NPU.
  • in this case, the above-mentioned second processor and third processor are both NPU0; that is, NPU0 performs both the Wait task and the Record task.
  • the CPU sends record task 1 to NPU0 by calling NotifyRecord(notify1, queue 0), indicating that synchronization event 1 corresponding to synchronization object notify1 in queue 0 of NPU0 has occurred.
  • after NPU0 finishes executing computing task 01 and sends the execution result of computing task 01 to the CPU, the CPU sends recording task 1 to NPU0, indicating that synchronization event 1 corresponding to notify1 has occurred.
  • the synchronization event 2 occurs between different NPUs in the AI server.
  • the above-mentioned second processor is NPU1
  • the third processor is NPU0.
  • the CPU sends recording task 2 to NPU0 by calling NotifyRecord(notify2, queue 2), indicating that synchronization event 2 corresponding to synchronization object notify2 in queue 2 of NPU0 has occurred.
  • the CPU sends the recording task 2 to the NPU0, indicating that the synchronization event 2 corresponding to notify2 has occurred.
  • the third processor receives the recording task corresponding to the first synchronization event.
  • the third processor receives the recording task corresponding to the first synchronization event, and can learn that the first synchronization event has occurred.
  • the third processor resets the value of the first synchronization register to the second value based on the identifier of the first synchronization register.
  • the third processor may reset the value of the first synchronization register to the second value based on the identifier of the first synchronization register in the recording task corresponding to the first synchronization event, so that the first synchronization The value of the synchronization register corresponds to the occurrence state of the first synchronization event.
  • NPU0 can reset the value of Reg0 in NPU0 to 1 based on the identifier of Reg0, so that the controller of NPU0 can immediately detect that the value of Reg0 has been modified; NPU0 then determines that synchronization event 1 has occurred and clears the value of Reg0.
  • NPU0 can reset the value of Reg1 in NPU1 to 1 based on the identifier of Reg1, so that the controller of NPU1 can immediately detect that the value of Reg1 has been modified; NPU1 then determines that synchronization event 2 has occurred and clears the value of Reg1.
  • NotifyWait and NotifyRecord are in one-to-one correspondence.
  • through the recording task, the third processor learns that the synchronization event corresponding to the synchronization object has occurred, and sets the value of the synchronization register corresponding to the synchronization object to 1.
  • after the second processor receives the waiting task, it reads the value of the synchronization register corresponding to the synchronization object. If the value of the synchronization register is 0, it determines that the synchronization event has not occurred and continues to wait, until the third processor sets the value of the synchronization register corresponding to the synchronization object to 1; the second processor immediately detects that the value of the synchronization register is 1, determines that the synchronization event has occurred, and resets the value of the synchronization register to 0 so that the synchronization register can be used for subsequent synchronization operations.
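The one-to-one pairing of NotifyRecord and NotifyWait can be sketched with two threads sharing one register value: the recording side sets it to 1 after its work completes, and the waiting side spins until it sees 1, clears it, and only then runs its dependent task. The threading and all names here are illustrative, not the patent's implementation:

```python
import threading

reg = [0]                          # shared synchronization register (0/1)
results = []

def record_task():
    # work whose completion is the synchronization event
    results.append("task01 done")
    reg[0] = 1                     # NotifyRecord: event has occurred

def wait_task():
    while reg[0] == 0:             # NotifyWait: event not yet occurred
        pass                       # controller keeps checking the register
    reg[0] = 0                     # clear for the next synchronization
    results.append("task12 ran after task01")

w = threading.Thread(target=wait_task)
r = threading.Thread(target=record_task)
w.start(); r.start()
w.join(); r.join()

# the dependent task never runs before the recorded work finishes
assert results == ["task01 done", "task12 ran after task01"]
assert reg[0] == 0
```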
  • in the embodiment of the present application, the synchronization overhead is the overhead of the controller of the AI accelerator writing a register through the bus, which is relatively small.
  • the synchronization overhead is less than 50ns, and for synchronization between different NPUs within an AI server, the synchronization overhead is less than 1us.
  • the embodiment of the present application provides a simple API interface, which is similar to the semaphore interface of a general-purpose OS, which can greatly facilitate developers to use the AI accelerator.
  • FIG. 6 is only an exemplary illustration.
  • the above method may further include step S608.
  • the first processor releases the correspondence between the first synchronization register and the first synchronization object by calling the seventh API, and resets the value of the first synchronization register to the first value.
  • the seventh API is used to release the first synchronization register.
  • the seventh API may be NotifyDestroy(notify), and this interface may be used to destroy the synchronization object notify and release the synchronization register corresponding to the synchronization object.
  • the APP calls the NotifyDestroy API to destroy the created synchronization object notify1; the runtime calls the NPU driver interface to release notify1 on NPU0, and the NPU driver reclaims notify1 of NPU0 and resets the value of the synchronization register Reg0 corresponding to notify1 to 0.
  • the synchronization register corresponding to the synchronization object can be reclaimed, so that when synchronization is subsequently required, the register can be allocated to other synchronization events.
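The create/destroy lifecycle just described (allocate a register to a synchronization object, then clear and reclaim it on NotifyDestroy) can be sketched as a driver-side free pool; the pool structure and names are illustrative assumptions, not the patent's implementation:

```python
class RegisterPool:
    """Driver-side pool of synchronization registers for one AI accelerator."""
    def __init__(self, count):
        self.values = [0] * count          # register values: 0 = event not occurred
        self.free = list(range(count))     # register ids available for allocation

    def create(self):
        # NotifyCreate: allocate a free register to a new synchronization object.
        return self.free.pop(0)

    def destroy(self, reg):
        # NotifyDestroy: reset the register's value and recycle it.
        self.values[reg] = 0
        self.free.append(reg)

pool = RegisterPool(4)
notify1 = pool.create()
pool.values[notify1] = 1   # some event was recorded on this register
pool.destroy(notify1)
assert pool.values[notify1] == 0   # register cleared
assert notify1 in pool.free        # and reusable for other synchronization events
```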
  • each register may correspond to a synchronization event, and different values of the register are used to indicate whether the corresponding synchronization event occurs.
  • when the AI accelerator receives the waiting task, by reading the value of the corresponding synchronization register, it can wait for the synchronization event when the event has not occurred, and reset the value of the synchronization register to the first value when the event has occurred.
  • when the AI accelerator receives the recording task, it writes a value into the corresponding synchronization register to indicate that the synchronization event has occurred, so that the AI accelerators that need to be synchronized can be synchronized accurately.
  • the synchronization method provided by the embodiment of the present application can not only realize the synchronization within an AI accelerator through the synchronization register, but also realize the synchronization between different AI accelerators in an AI server. Moreover, it provides a simple API interface, and the synchronization overhead is small, which can improve the efficiency of AI training.
  • the above-mentioned first synchronization event may be a synchronization event of one APP, or a synchronization event between different APPs. In either case, the synchronization event can occur in one AI accelerator or between different AI accelerators in an AI server. However, when the first synchronization event is a synchronization event between multiple APPs, in order to realize synchronization between processes, the multiple APPs need to agree on the names of the synchronization objects in advance.
  • APP1 and APP3 can agree on the names of the synchronization objects in advance, so as to achieve synchronization between different processes.
  • An embodiment of the present application further provides a synchronization method.
  • the first synchronization event is an inter-process synchronization event, and the method includes the following steps:
  • the first processor creates a first synchronization object for the first synchronization event.
  • the first synchronization event is an inter-process synchronization event, and the first synchronization event may occur in one AI accelerator or between different AI accelerators in an AI server, which is not limited in this embodiment of the present application.
  • for a specific implementation of step S801, reference may be made to the foregoing step S301; details are not described herein again.
  • the first processor sets the name of the first synchronization object as a preset name by calling the fourth API of the first application.
  • the fourth API is used to set the global name of the synchronization object.
  • the fourth API may be IpcSetNotifyName(notify, name), which is used to set the global name of the synchronization object notify.
  • the first synchronization event may be synchronization between a first application program and a second application program, and the above-mentioned preset name is a name pre-agreed by the first application program and the second application program.
  • APP1 can create a synchronization object A by calling the NotifyCreate interface.
  • the synchronization object A can be recorded as notifyA
  • the NPU driver allocates the synchronization register Regn of NPU1 to the runtime of APP1, and notifyA saves the identification of the synchronization register Regn.
  • in this example, the identifier of the synchronization register Regn is taken to be 1-n.
  • APP1 calls the IpcSetNotifyName interface to set notifyA as a synchronization object for inter-process communication (IPC), and the NPU driver marks the name of the synchronization object notifyA as NotifyForTest1.
  • the first processor acquires the identifier of the first synchronization register corresponding to the preset name by calling the fifth API of the second application.
  • the fifth API is used to obtain the identifier of the register corresponding to the preset name.
  • the fifth API may be IpcOpenNotify(notify, name), which is used to open the synchronization object according to the global name name of the synchronization object notify.
  • APP3 calls IpcOpenNotify
  • runtime calls the NPU driver interface, and passes in NotifyForTest1
  • the NPU driver finds the synchronization object notifyA according to NotifyForTest1
  • the runtime creates a synchronization object B for APP3.
  • the synchronization object B can be recorded as notifyB, and notifyB saves the identifier of the synchronization register Reg1-n.
  • the same synchronization register Reg1-n can correspond to notifyA of APP1 and notifyB of APP3 respectively, and then APP1 and APP3 can use the NotifyRecord and NotifyWait interfaces for synchronization.
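The inter-process flow above boils down to a name-to-register registry kept by the NPU driver: IpcSetNotifyName publishes the register under the pre-agreed name, and IpcOpenNotify hands the second process a new synchronization object backed by the same register. A minimal sketch, with the driver state modeled as a plain dictionary and all identifiers illustrative:

```python
class NpuDriver:
    def __init__(self):
        self._next_reg = 0
        self._by_name = {}                   # global name -> register id

    def notify_create(self, device_id):
        # Allocate a synchronization register on the given device.
        reg_id = (device_id, self._next_reg)
        self._next_reg += 1
        return {"register": reg_id}          # the synchronization object

    def ipc_set_notify_name(self, notify, name):
        # Publish the register under a name both processes agreed on in advance.
        self._by_name[name] = notify["register"]

    def ipc_open_notify(self, name):
        # The second process gets its own object backed by the same register.
        return {"register": self._by_name[name]}

driver = NpuDriver()
notify_a = driver.notify_create(device_id=1)           # created by APP1
driver.ipc_set_notify_name(notify_a, "NotifyForTest1")
notify_b = driver.ipc_open_notify("NotifyForTest1")    # opened by APP3
assert notify_a["register"] == notify_b["register"]    # same Reg1-n underneath
```

Because both objects resolve to the same register, APP1's NotifyWait and APP3's NotifyRecord then operate on the same 0/1 value, which is what makes the cross-process handshake work.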
  • the first processor sends the waiting task corresponding to the first synchronization event to the second processor by calling the second API.
  • APP1 can call the NotifyWait(notifyA, queue 1) interface, and send a waiting task to NPU1, instructing NPU1 to wait in queue 1 for the synchronization event corresponding to notifyA to occur.
  • the second processor receives the waiting task corresponding to the first synchronization event.
  • the second processor determines whether the first synchronization event occurs based on the value of the first synchronization register.
  • the CPU sends a waiting task to NPU1 through NotifyWait.
  • NPU1 reads the value of Reg1-n based on the identifier of the synchronization register Reg1-n in the waiting task. If the value of Reg1-n is 0, indicating that the synchronization event corresponding to notifyA has not occurred, NPU1 keeps waiting, and the controller of NPU1 keeps checking the value of Reg1-n. When the value of Reg1-n changes from 0 to 1, meaning that the synchronization event corresponding to notifyA has occurred, the controller of NPU1 immediately detects the change, ends the wait, and clears the value of Reg1-n to zero.
  • the first processor sends the recording task corresponding to the first synchronization event to the third processor by calling the third API.
  • the third processor and the second processor are the same AI accelerator.
  • the third processor and the second processor are two different AI accelerators in one AI server. The following embodiments are described by taking an example in which the first synchronization event occurs between different AI accelerators in one AI server.
  • APP2 can call the NotifyRecord(notifyB, queue 0) interface to issue a recording task to NPU0, indicating that the synchronization event corresponding to the synchronization object notifyB in queue 0 has occurred.
  • the third processor receives the recording task corresponding to the first synchronization event.
  • the third processor resets the value of the first synchronization register to the second value based on the identifier of the first synchronization register.
  • NPU0 can reset the value of Reg1-n in NPU1 to 1 based on the identifier of Reg1-n, so that the controller of NPU1 immediately detects the change in the value of Reg1-n, ends the wait, and clears the value of Reg1-n.
  • FIG. 8 is only an exemplary illustration.
  • the above method may further include step S810.
  • the first processor releases the correspondence between the first synchronization register and the first synchronization event by calling the seventh API, and resets the value of the first synchronization register to the first value.
  • for a specific implementation of the foregoing step S810, reference may be made to step S608; details are not described herein again.
  • each register can correspond to a synchronization event, and different values of the register are used to indicate whether its corresponding synchronization event has occurred. If the synchronization event is an inter-process synchronization event, by presetting a global name for the synchronization event, the synchronization events of different processes can be made to correspond to the same register, thereby realizing synchronization between the processes.
  • An embodiment of the present application further provides a method for synchronizing chips.
  • a second synchronization event occurs between different AI servers.
  • the method includes the following steps:
  • the first processor creates a second synchronization object for the second synchronization event.
  • the second synchronization event is a synchronization event between different AI servers.
  • AI server 1 runs APP1
  • AI server 2 runs APP2
  • APP1 and APP2 need to be synchronized
  • APP1 waits for APP2 to transmit data to it.
  • APP2 notifies APP1 that the transmission is completed, indicating that APP1 can perform subsequent tasks.
  • the CPU1 can allocate a synchronization register Regm for the synchronization event in the multiple synchronization registers included in the NPU0 of the AI server 1, and save the identification Reg0-m of the synchronization register Regm in the synchronization object K.
  • the synchronization object K can be recorded as notifyK.
  • the CPU1 of the AI server 1 can create a synchronization object notifyK by calling the NotifyCreat interface, and the synchronization object notifyK stores the identifier Reg0-m of the synchronization register allocated by the NPU driver for the synchronization event.
  • the first processor sends the waiting task corresponding to the second synchronization event to the second processor by calling the second API.
  • the first processor and the second processor are processors in the same AI server.
  • the first processor may be CPU1 in AI server 1
  • the second processor may be AI accelerator NPU0 in AI server 1 .
  • the second processor receives the waiting task corresponding to the second synchronization event.
  • the second processor determines whether a second synchronization event occurs based on the value of the second synchronization register.
  • the first processor acquires the virtual address of the second synchronization register by calling the sixth API.
  • the sixth API is used to obtain the virtual address of the register corresponding to the synchronization object.
  • the sixth API may be NotifyGetAddr(notify, addr), wherein the input is the synchronization object notify, and the output is the virtual address of the synchronization register corresponding to the synchronization object notify.
  • when synchronizing between AI servers, APP1 calls the NotifyGetAddr interface to map the physical address of the synchronization register Reg0-m corresponding to the synchronization object notifyK to a virtual address (Virtual Address, VA) of APP1, denoted as VA1.
  • APP1 calls the NotifyGetAddr interface of Runtime, and passes in the synchronization object notifyK.
  • the Runtime obtains the identifier of the synchronization register Reg0-m according to the synchronization object notifyK, and the NPU driver obtains the physical address of the synchronization register according to that identifier and maps the physical address to a virtual address of APP1; the NPU driver returns the virtual address to the Runtime, and the Runtime returns it to the APP, completing the virtual-address mapping of the synchronization register.
  • the embodiment of the present application does not limit the specific implementation manner of mapping the physical address of the synchronization register to the virtual address.
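The NotifyGetAddr flow (APP → Runtime → NPU driver and back) can be sketched as the runtime resolving the register identifier and the driver performing the physical-to-virtual mapping. The mapping itself is faked with a dictionary here, since the embodiment deliberately leaves the real mapping mechanism open; all addresses and names are illustrative:

```python
# Driver's register map: register id -> physical address (illustrative values).
PHYS_ADDR = {("NPU0", "Reg0-m"): 0x4000_1000}

class Driver:
    def __init__(self):
        self._mappings = {}   # virtual address -> physical address

    def map_to_user(self, reg_id):
        # Map the register's physical address into the app's address space.
        phys = PHYS_ADDR[reg_id]
        va = 0x7f00_0000_0000 + len(self._mappings) * 0x1000  # fake VA
        self._mappings[va] = phys
        return va

class Runtime:
    def __init__(self, driver):
        self.driver = driver

    def notify_get_addr(self, notify):
        # Runtime resolves the register id; the driver performs the mapping.
        return self.driver.map_to_user(notify["register"])

rt = Runtime(Driver())
va1 = rt.notify_get_addr({"register": ("NPU0", "Reg0-m")})
assert rt.driver._mappings[va1] == 0x4000_1000   # VA1 resolves to Regm's physical address
```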
  • the first processor sends the virtual address of the second synchronization register to the fourth processor.
  • the fourth processor may be a central control unit in the AI server, e.g., a CPU.
  • the fourth processor includes the second CPU.
  • the first processor and the fourth processor are processors in different AI servers.
  • the first processor and the fourth processor may be CPUs in different AI servers.
  • the first processor may be CPU1 in AI server 1
  • the fourth processor may be CPU2 in AI server 2
  • CPU1 in AI server 1 sends a synchronization object to CPU2 in AI server 2
  • the fourth processor receives the virtual address of the second synchronization register.
  • the fourth processor sends a remote direct memory access (Remote Direct Memory Access, RDMA) task corresponding to the second synchronization event to the fifth processor.
  • the RDMA task corresponding to the second synchronization event is used to indicate that the second synchronization event has occurred, and the RDMA task corresponding to the second synchronization event includes the virtual address of the second synchronization register.
  • the fourth processor and the fifth processor are processors in the same AI server, the fourth processor may be a CPU in the AI server, and the fifth processor may be an AI accelerator (eg, NPU) in the AI server.
  • the fourth processor may be a CPU in the AI server
  • the fifth processor may be an AI accelerator (eg, NPU) in the AI server.
  • the fourth processor is the CPU2 in the AI server 2
  • the fifth processor may be the NPU1 in the AI server 2 .
  • the CPU2 can issue the RDMA task to the NPU1 by calling RDMAsend(VA1, 1).
  • the fourth processor may send the RDMA task corresponding to the second synchronization event to the fifth processor by calling the eighth API.
  • the eighth API is used to deliver the RDMA task corresponding to the synchronization event.
  • the eighth API is RDMAsend(addr, 1), which is used to instruct to write the second value 1 to the virtual address addr.
  • the fifth processor receives the RDMA task corresponding to the second synchronization event.
  • the fifth processor resets the value of the second synchronization register to the second value through the RDMA device based on the virtual address of the second synchronization register.
  • the fifth processor may reset the value of the second synchronization register to the second value based on the virtual address of the second synchronization register in the RDMA task corresponding to the second synchronization event, so that the value of the second synchronization register corresponds to the occurrence state of the second synchronization event.
  • NPU1 in AI server 2 can reset the value of Reg0-m in NPU0 to 1 based on VA1, so that the controller of NPU0 immediately detects the change in the value of Reg0-m, ends the wait, and clears the value of Reg0-m.
  • NotifyWait and RDMAsend are in one-to-one correspondence.
  • when the fifth processor receives the RDMAsend task, it learns that the synchronization event corresponding to the synchronization object has occurred, and uses the RDMA device to reset the value of the synchronization register corresponding to the synchronization object to 1.
  • when the second processor receives the waiting task, it reads the value of the synchronization register corresponding to the synchronization object.
  • if the value of the synchronization register is 0, the second processor determines that the synchronization event has not occurred and keeps waiting,
  • until the fifth processor sets the value of the synchronization register corresponding to the synchronization object to 1.
  • when the second processor detects that the value of the synchronization register is 1, it determines that the synchronization event has occurred, ends the wait, and resets the value of the synchronization register to 0, so that the synchronization register can be used for other subsequent synchronization operations.
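End to end, the cross-server handshake pairs a NotifyWait polling loop on the local accelerator with an RDMAsend from the remote one that stores 1 through the mapped virtual address. A thread-based sketch, with the RDMA write modeled as a direct store into a shared address map (real RDMA verbs are omitted, and the address value is illustrative):

```python
import threading
import time

memory = {}                 # address -> value; stands in for mapped register memory
VA1 = 0x7f00_0000_0000      # virtual address of Reg0-m obtained via NotifyGetAddr
memory[VA1] = 0             # Reg0-m starts at 0: event not occurred

def notify_wait(addr, poll_interval=0.001):
    # NPU0's controller in AI server 1 polls the register.
    while memory[addr] == 0:
        time.sleep(poll_interval)
    memory[addr] = 0        # event seen: clear the register for reuse

def rdma_send(addr, value):
    # NPU1 in AI server 2 writes through the RDMA device to the remote VA.
    memory[addr] = value

waiter = threading.Thread(target=notify_wait, args=(VA1,))
waiter.start()
rdma_send(VA1, 1)           # data transfer finished: signal the event
waiter.join()
assert memory[VA1] == 0     # register cleared, ready for the next synchronization
```

The only difference from the intra-server case is who performs the store: here the writer reaches the register through an RDMA write to the mapped virtual address rather than through the local bus.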
  • FIG. 9 is only an exemplary illustration.
  • the synchronization overhead is only the time overhead of network communication, and there is no other additional overhead, so the synchronization overhead is relatively small.
  • the embodiment of the present application provides a simple API interface, similar to the semaphore interface of a general-purpose OS, which makes it much easier for developers to use the AI accelerator.
  • the above method may further include step S911.
  • the first processor releases the correspondence between the second synchronization register and the second synchronization event by calling the seventh API, and resets the value of the second synchronization register to the first value.
  • for a specific implementation of the foregoing step S911, reference may be made to step S608; details are not described herein again.
  • each register may correspond to a synchronization object, and different values of the register are used to indicate whether a synchronization event corresponding to the synchronization object occurs.
  • the AI accelerator receives the waiting task, by reading the value of the corresponding synchronization register, it can keep waiting when the synchronization event does not occur, and end the waiting when the synchronization event has occurred.
  • when the AI accelerator receives the RDMA task, it writes a value into the synchronization register corresponding to the virtual address to indicate that the synchronization event has occurred, so that the AI accelerators that need to be synchronized can be synchronized accurately.
  • this solution can realize synchronization between different nodes (AI servers) by converting the physical address of the synchronization register into a virtual address, and writing a value in the virtual address through RDMA.
  • it provides a simple API interface, and the synchronization overhead is small, which improves the efficiency of AI training.
  • An embodiment of the present application further provides a method for synchronizing chips. As shown in FIG. 11 , in this embodiment, a second synchronization event occurs between AI servers, and the method includes the following steps:
  • the first processor creates a second synchronization object for the second synchronization event.
  • the first processor sends the waiting task corresponding to the second synchronization event to the second processor by calling the second API.
  • the second processor receives the waiting task corresponding to the second synchronization event.
  • the second processor determines whether a second synchronization event occurs based on the value of the second synchronization register.
  • the first processor acquires the virtual address of the second synchronization register by calling the sixth API.
  • the first processor sends the virtual address of the second synchronization register to the fourth processor.
  • the fourth processor receives the virtual address of the second synchronization register.
  • the fourth processor resets the value of the second synchronization register to the second value through the RDMA device based on the virtual address of the second synchronization register.
  • FIG. 11 is only an exemplary illustration.
  • the above method may further include step S1109.
  • the first processor releases the correspondence between the second synchronization register and the second synchronization event by calling the seventh API, and resets the value of the second synchronization register to the first value.
  • for a specific implementation of step S1109, reference may be made to step S608; details are not described herein again.
  • each register may correspond to a synchronization object, and different values of the register are used to indicate whether a synchronization event corresponding to the synchronization object occurs.
  • the AI accelerator receives the waiting task, by reading the value of the corresponding synchronization register, it can keep waiting when the synchronization event does not occur, and end the waiting when the synchronization event has occurred.
  • the processor directly writes a value to the synchronization register based on the virtual address of the synchronization register, indicating that the synchronization event has occurred, thereby enabling accurate synchronization between AI servers that need to be synchronized.
  • each APP can call one or more of the above APIs according to its own service requirements, to achieve synchronization within an AI accelerator, between different AI accelerators within an AI server, or between AI servers.
  • An embodiment of the present application further provides a chip, where the chip includes the above-mentioned first processor and an interface circuit, and the first processor is configured to communicate with other devices through the interface circuit, so as to implement the synchronization method shown in FIG. 3, FIG. 6, FIG. 8, FIG. 9 or FIG. 11.
  • the chip may further include a memory for storing computer instructions.
  • An embodiment of the present application further provides a chip, where the chip includes the above-mentioned second processor and an interface circuit, and the second processor is configured to communicate with other devices through the interface circuit, so as to implement the synchronization method shown in FIG. 3, FIG. 6, FIG. 8, FIG. 9 or FIG. 11.
  • An embodiment of the present application further provides a chip, where the chip includes the above-mentioned third processor and an interface circuit, where the third processor is configured to communicate with other devices through the interface circuit, so as to implement the synchronization method shown in FIG. 6 or FIG. 8 .
  • An embodiment of the present application further provides a chip, where the chip includes the above-mentioned fourth processor and an interface circuit, where the fourth processor is configured to communicate with other devices through the interface circuit, so as to implement the synchronization method shown in FIG. 9 or FIG. 11 .
  • An embodiment of the present application further provides a chip, where the chip includes the aforementioned fifth processor and an interface circuit, where the fifth processor is configured to communicate with other devices through the interface circuit, so as to implement the synchronization method shown in FIG. 9.
  • Embodiments of the present application further provide an AI server, where the AI server includes the first processor, the second processor, and an interface circuit, and the first processor and the second processor communicate through the interface circuit, so as to implement the synchronization method shown in FIG. 3, FIG. 6, FIG. 8, FIG. 9 or FIG. 11.
  • Embodiments of the present application further provide an AI server, where the AI server includes the first processor, the second processor, the third processor, and an interface circuit, and the first processor, the second processor, and the third processor communicate through the interface circuit, so as to implement the synchronization method shown in FIG. 6 or FIG. 8.
  • Embodiments of the present application further provide an AI server, where the AI server includes the fourth processor, the fifth processor, and an interface circuit, and the fourth processor and the fifth processor communicate through the interface circuit, so as to implement the synchronization method shown in FIG. 9.
  • An embodiment of the present application provides an AI cluster, where the AI cluster includes multiple AI servers, each AI server includes a CPU and one or more AI accelerators, the CPU may include the above-mentioned first processor, and the AI accelerator may include at least one of the above-mentioned second processor or third processor.
  • An embodiment of the present application provides an AI cluster, where the AI cluster includes multiple AI servers, the AI server includes a CPU and one or more AI accelerators, the CPU may include the above-mentioned fourth processor, and the AI accelerator may include the above-mentioned fifth processor .
  • An embodiment of the present application provides a communication system, where the communication system includes at least one of the above-mentioned AI accelerator, the above-mentioned AI server, or the above-mentioned AI cluster.
  • the embodiment of the present application provides an application program interface API, where the API is deployed in a processor, and the API is used to create a synchronization object for a synchronization event.
  • the API can be NotifyCreat(deviceID, notify), where the input deviceID is the ID of the AI accelerator, and the output notify is the synchronization object.
  • the embodiment of the present application provides an application program interface API, the API is deployed in the processor, and the API is used to deliver a waiting task corresponding to a synchronization event.
  • the API may be the NotifyWait(notify, stream) interface, which is used to wait for the synchronization event corresponding to the synchronization object to occur in the stream.
  • the embodiment of the present application provides an application program interface API, where the API is deployed in a processor, and the API is used to issue a recording task corresponding to a synchronization event.
  • the API can be the NotifyRecord(notify, stream) interface, which is used to set the synchronization event occurrence corresponding to the synchronization object in the stream.
  • the embodiment of the present application provides an application program interface API, the API is deployed in the processor, and the API is used to set the global name of the synchronization object.
  • the API can be IpcSetNotifyName(notify, name), which is used to set the global name of the synchronization object notify.
  • the embodiment of the present application provides an application program interface API, where the API is deployed in a processor, and the API is used to open a synchronization object.
  • the API can be IpcOpenNotify(notify, name), which is used to open the synchronization object according to the global name name of the synchronization object notify.
  • the embodiment of the present application provides an application program interface API, where the API is deployed in a processor, and the API is used to obtain a virtual address of a register corresponding to a synchronization object.
  • the API may be NotifyGetAddr(notify, addr), where the input is the synchronization object notify, and the output is the virtual address of the synchronization register corresponding to the synchronization object notify.
  • the embodiment of the present application provides an application program interface API, the API is deployed in the processor, and the API is used to release the synchronization register.
  • the API can be NotifyDestroy(notify), which can be used to destroy the synchronization object notify and release the synchronization register corresponding to the synchronization object.
  • the embodiment of the present application provides an application program interface API, where the API is deployed in a processor, and the API is used to deliver an RDMA task corresponding to a synchronization event.
  • the API may be RDMAsend(addr, 1), which is used to instruct to write the second value 1 to the virtual address addr.
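Taken together, the eight interfaces above form a small semaphore-like API surface. The stub below consolidates their signatures as sketched from the descriptions; Python is used only for illustration, the parameter names follow the text rather than any real SDK, and the bodies are intentionally left empty:

```python
class Notify:
    """Synchronization object holding the identifier of a synchronization register."""
    def __init__(self, device_id, register_id):
        self.device_id = device_id
        self.register_id = register_id

def notify_creat(device_id):            # 1st API: create a synchronization object
    ...

def notify_wait(notify, stream):        # 2nd API: wait in `stream` for the event
    ...

def notify_record(notify, stream):      # 3rd API: mark the event in `stream` as occurred
    ...

def ipc_set_notify_name(notify, name):  # 4th API: set the object's global name
    ...

def ipc_open_notify(name):              # 5th API: open an object by its global name
    ...

def notify_get_addr(notify):            # 6th API: return the register's virtual address
    ...

def notify_destroy(notify):             # 7th API: destroy the object, release its register
    ...

def rdma_send(addr, value):             # 8th API: write `value` to virtual address via RDMA
    ...
```

The first three cover synchronization within one AI accelerator or AI server, the fourth and fifth add inter-process use, and the last three extend the mechanism across AI servers.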
  • Embodiments of the present application further provide a computer-readable storage medium, where computer program codes are stored in the computer-readable storage medium.
  • when the computer program code runs on a processor, the electronic device executes the synchronization method shown in FIG. 3, FIG. 6, FIG. 8, FIG. 9 or FIG. 11.
  • Embodiments of the present application also provide a computer program product, which, when the computer program product runs on a computer, causes the computer to execute the synchronization method shown in FIG. 3 , FIG. 6 , FIG. 8 , FIG. 9 or FIG. 11 .
  • the steps of the methods or algorithms described in conjunction with the disclosure of the present application may be implemented in a hardware manner, or may be implemented in a manner in which a processor executes software instructions.
  • Software instructions may be composed of corresponding software modules, and the software modules may be stored in random access memory (RAM), flash memory, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, a hard disk, a removable hard disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium known in the art.
  • An exemplary storage medium is coupled to the processor, such that the processor can read information from, and write information to, the storage medium.
  • the storage medium can also be an integral part of the processor.
  • the processor and storage medium may reside in an ASIC.
  • the ASIC may be located in the core network interface device.
  • the processor and the storage medium may also exist in the core network interface device as discrete components.
  • the functions described in the present invention may be implemented in hardware, software, firmware, or any combination thereof.
  • the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
  • Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another.
  • a storage medium can be any available medium that can be accessed by a general purpose or special purpose computer.

Abstract

The embodiments of this application disclose a synchronization method and apparatus, relating to the field of artificial intelligence, which address problems in the prior art such as the lack of support for synchronization between AI servers. The specific solution is: a first processor creates a first synchronization object for a first synchronization event; the first synchronization object includes an identifier of a first synchronization register, the value of the first synchronization register is a first value or a second value, the first value indicating that the first synchronization event has not occurred, and the second value indicating that the first synchronization event has occurred; the first processor includes a first central processing unit (CPU); a second processor determines, based on the value of the first synchronization register, whether the first synchronization event has occurred; the second processor includes a first neural-network processing unit (NPU).

Description

Synchronization method and apparatus. Technical field
The embodiments of this application relate to the field of artificial intelligence, and in particular, to a synchronization method and apparatus.
Background
Artificial intelligence (AI) scenarios generally require massive computing power. Because the computing power of a single AI accelerator (for example, a neural-network processing unit (NPU)) or of a single AI server (for example, an AI server including multiple AI accelerators) is limited and usually cannot meet the computing-power demand of AI scenarios, multiple AI servers need to form a cluster to provide the required computing power. When multiple AI servers form a cluster for AI training, in order to reduce the synchronization transmission and synchronization waiting time within one AI accelerator, between different AI accelerators within one AI server, and between AI servers, it is very necessary to provide a reasonable synchronization mechanism.
发明内容
本申请实施例提供一种同步方法及装置,能够实现一个AI加速器内、一个AI服务器内的不同AI加速器间,以及AI服务器间的同步。
为达到上述目的,本申请实施例采用如下技术方案:
本申请实施例的第一方面,提供一种同步方法,所述方法包括:第一处理器为第一同步事件创建第一同步对象;该第一同步对象中包括第一同步寄存器的标识,该第一同步寄存器的值包括第一数值或第二数值,第一数值用于指示该第一同步事件未发生,第二数值用于指示该第一同步事件已经发生;第二处理器基于该第一同步寄存器的值,确定第一同步事件是否发生。
可选的,上述第一处理器包括第一中央处理器CPU,第二处理器包括第一神经网络处理器NPU。例如,第一处理器可以为AI服务器内的CPU,第二处理器可以为AI服务器内的AI加速器。该CPU和AI加速器位于同一个AI服务器内。该第二处理器为等待第一同步事件发生的AI加速器。
可选的,第一同步事件可以发生在一个NPU内,也可以发生在一个AI服务器内的不同NPU之间,还可以发生在不同AI服务器间。
基于本方案,通过为同步事件创建同步对象,而且每个同步对象与一个同步寄存器相对应,AI加速器可以基于该同步寄存器的值,确定该同步寄存器对应的同步事件是否发生,从而能够实现一个AI加速器内、一个AI服务器内的不同AI加速器间,以及AI服务器间的同步。
结合第一方面,在一种可能的实现方式中,上述第一处理器为第一同步事件创建第一同步对象,包括:第一处理器通过调用第一应用程序接口API,在上述第二处理器包括的多个同步寄存器中为第一同步事件分配第一同步寄存器,并在第一同步对象中保存第一同步寄存器的标识。
可选的,第一API用于为同步事件创建同步对象。该第一API可以为NotifyCreat (deviceID,notify),其中,输入deviceID为AI加速器的ID,输出notify为同步对象,该NotifyCreat接口用于创建同步对象。该deviceID为等待同步事件发生的AI加速器的ID。
基于本方案,通过在AI加速器中设置一组同步寄存器,从而在需要进行同步时,CPU可以在等待同步事件发生的AI加速器包括的多个同步寄存器中为第一同步事件分配第一同步寄存器,从而使得第一同步寄存器的值一旦发生变化,该AI加速器可以立刻检查到第一同步寄存器的取值修改,能够较快的确定第一同步事件是否发生,实现一个AI加速器内、一个AI服务器内的不同AI加速器间,以及不同AI服务器间的同步。而且本申请实施例的方案提供的API接口较简单,同步开销较小,因此能够提升AI训练的效率。
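上述"在等待方NPU的同步寄存器组中分配寄存器并保存其标识"的流程,可以用如下Python片段示意。这只是一个基于本方案描述的假设性简化模拟,并非本申请的实际实现;其中Device、Notify等类名与字段名均为本示例的假设:

```python
class Notify:
    """同步对象:保存设备ID与同步寄存器的标识。"""
    def __init__(self, device_id, reg_id):
        self.device_id = device_id
        self.reg_id = reg_id

class Device:
    """AI加速器的简化模型:持有一组同步寄存器(取值为0或1)。"""
    def __init__(self, device_id, num_regs=1024):
        self.device_id = device_id
        self.regs = [0] * num_regs          # 同步寄存器组,初值为第一数值0
        self.free = list(range(num_regs))   # 空闲寄存器的标识列表

def notify_create(device):
    """模拟NotifyCreat(deviceID,notify)语义:在等待同步事件发生的
    AI加速器上分配一个空闲同步寄存器,重置为0并构建同步对象返回。"""
    reg_id = device.free.pop(0)
    device.regs[reg_id] = 0                 # 重置为第一数值,表示事件未发生
    return Notify(device.device_id, reg_id)
```

调用时传入等待同步事件发生的那个加速器,例如 `notify_create(npu0)`,返回的同步对象中即保存了所分配寄存器的标识。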
结合第一方面和上述可能的实现方式,在另一种可能的实现方式中,上述方法还包括:上述第一处理器通过调用第二API,向上述第二处理器发送第一同步事件对应的等待任务;该第一同步事件对应的等待任务用于等待第一同步事件发生,第一同步事件对应的等待任务包括第一队列标识,以及第一同步寄存器的标识;该第一队列标识为等待任务所在的队列的标识;第二处理器接收该第一同步事件对应的等待任务。
基于本方案,CPU可以通过简单的API向AI加速器下发用于等待同步事件发生的等待任务,并在该等待任务中携带同步寄存器的标识,从而使得AI加速器根据该同步寄存器的不同取值可以确定同步事件是否发生,因此能够实现一个AI加速器内、一个AI服务器内的不同AI加速器间,以及AI服务器间的同步。
可选的,第二API用于下发同步事件对应的等待任务。该第二API可以为NotifyWait(notify,stream)接口,该接口用于在stream等待同步对象对应的同步事件发生。
结合第一方面和上述可能的实现方式,在另一种可能的实现方式中,上述第二处理器基于上述第一同步寄存器的值,确定第一同步事件是否发生,包括:在第一同步寄存器的值为第一数值的情况下,第二处理器确定第一同步事件未发生,第二处理器继续等待该第一同步事件发生,直至第一同步寄存器的值为第二数值,第二处理器确定该第一同步事件已经发生,第二处理器将第一同步寄存器的值重置为第一数值。
基于本方案,AI加速器可以在第一同步事件未发生的情况下,一直等待第一同步事件发生,直至第一同步事件已经发生,再将第一同步寄存器的值重置为第一数值,并继续执行接下来的任务。因此能够实现一个AI加速器内,一个AI服务器内的不同AI加速器间,以及不同AI服务器间的同步。
可以理解的,当第一同步事件发生时,第一寄存器的值从第一数值变为第二数值,由于第一同步寄存器为第二处理器中的同步寄存器,因此第二处理器的控制器可以立刻检查到第一同步寄存器的取值修改,第二处理器确定第一同步事件已经发生,第二处理器将第一同步寄存器的值重置为第一数值,以便第一同步寄存器可以继续进行同步操作。
结合第一方面和上述可能的实现方式,在另一种可能的实现方式中,上述第二处理器基于第一同步寄存器的值,确定第一同步事件是否发生,还包括:在第一同步寄存器的值为第二数值的情况下,第二处理器确定第一同步事件已经发生,第二处理器将第一同步寄存器的值重置为第一数值。
基于本方案,当第二处理器检查到第一同步寄存器的值为第二数值时,第二处理器确定第一同步事件已经发生,第二处理器将第一同步寄存器的值重置为第一数值。然后第二处理器可以继续执行后续的任务,从而能够确保正确同步,实现一个AI加速器内,一个AI服务器内的不同AI加速器间,以及不同AI服务器间的同步。
结合第一方面和上述可能的实现方式,在另一种可能的实现方式中,上述方法还包括:第一处理器通过调用第三API,向第二处理器发送第一同步事件对应的记录任务;该第一同步事件对应的记录任务用于指示上述第一同步事件已经发生,第一同步事件对应的记录任务中包括第二队列标识,以及第一同步寄存器的标识,第二队列标识为第一同步事件对应的记录任务所在的队列的标识;第二处理器接收第一同步事件对应的记录任务,并基于第一同步寄存器的标识,将第一同步寄存器的值重置为第二数值。
基于本方案,CPU可以通过简单的API向AI加速器(第二处理器)下发用于指示同步事件已经发生的记录任务,并在该记录任务中携带同步寄存器的标识,使得AI加速器根据该同步寄存器的标识写入第二数值,从而同步寄存器的值可以与第一同步事件的发生状态相对应。由于第一同步寄存器为第二处理器中的同步寄存器,因此第二处理器的控制器可以立刻检查到第一同步寄存器的取值修改,第二处理器确定第一同步事件已经发生,第二处理器可以继续执行后续的任务,从而确保第二处理器内能够正确同步。
可选的,第三API用于下发同步事件对应的记录任务。该第三API可以为NotifyRecord(notify,stream)接口,该接口用于在stream设置同步对象对应的同步事件发生。
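NotifyRecord与NotifyWait的配对语义可以用如下Python片段示意。这是一个假设性的简化模拟(寄存器组用字典表示,轮询代替硬件控制器的检查),仅用于说明"记录方写1、等待方检测到1后清零"这一机制:

```python
regs = {0: 0}   # 同步寄存器组:标识 -> 取值(0表示事件未发生,1表示已发生)

def notify_record(reg_id):
    """记录任务:指示同步事件已经发生,将同步寄存器重置为第二数值1。"""
    regs[reg_id] = 1

def notify_wait(reg_id):
    """等待任务:等待同步寄存器变为1(事件发生),随后清零以便复用。"""
    while regs[reg_id] != 1:
        pass            # 真实硬件中由AI加速器的控制器持续检查寄存器取值
    regs[reg_id] = 0    # 重置为第一数值,该寄存器可继续用于后续同步操作
```

例如先执行 `notify_record(0)` 再执行 `notify_wait(0)`,等待立即结束且寄存器被清零。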
可选的,当第一同步事件发生在一个AI加速器内时,上述第二处理器既执行Wait任务,又执行Record任务。第二处理器既执行Wait任务,又执行Record任务时,Wait任务和Record任务可以分别为两个Stream中的任务。
可选的,当第一同步事件发生在一个AI服务器内的两个AI加速器之间时,第二处理器执行Wait任务,第三处理器执行Record任务。
结合第一方面和上述可能的实现方式,在另一种可能的实现方式中,上述方法还包括:上述第一处理器通过调用第三API,向第三处理器发送第一同步事件对应的记录任务;第一同步事件对应的记录任务用于指示第一同步事件已经发生,第一同步事件对应的记录任务中包括第二队列标识,以及第一同步寄存器的标识,第二队列标识为所述第一同步事件对应的记录任务所在的队列的标识;该第三处理器包括第二NPU;第三处理器接收该第一同步事件对应的记录任务,并基于第一同步寄存器的标识,将第一同步寄存器的值重置为所述第二数值。
可选的,该第三处理器与上述第二处理器可以为一个AI服务器内的不同NPU。
基于本方案,CPU可以通过简单的API向AI加速器(第三处理器)下发用于指示同步事件已经发生的记录任务,并在该记录任务中携带同步寄存器的标识,使得AI加速器根据该同步寄存器的标识写入第二数值,从而同步寄存器的值可以与第一同步事件的发生状态相对应。由于第一同步寄存器为第二处理器中的同步寄存器,因此第二处理器的控制器可以立刻检查到第一同步寄存器的取值修改,第二处理器确定第一同步事件已经发生,第二处理器可以继续执行后续的任务,从而确保AI服务器内的第二处理器和第三处理器间能够正确同步。
可以理解的,本方案提供的同步方法中,同步开销为AI加速器的控制器通过总线写寄存器的开销,该同步开销较小。例如,采用本方案提供的同步方法,对于一个NPU内的同步而言,同步开销小于50ns,对于一个AI服务器内的不同NPU间的同步,同步开销小于1us。而且本方案提供了简单的API接口,和通用OS的semaphore接口类似,可以大大方便开发者使用AI加速器。
结合第一方面和上述可能的实现方式,在另一种可能的实现方式中,若上述第一同步事件为进程间的同步事件,上述方法还包括:上述第一处理器通过调用第一应用程序的第四API,将第一同步对象的名称设置为预设名称;第一处理器通过调用第二应用程序的第五API,获取预设名称对应的第一同步寄存器的标识。
基于本方案,如果同步事件为进程间的同步事件,通过预先设置该同步对象的全局名称,从而能够将不同进程间的同步对象与同一个同步寄存器对应,再通过调用第二API和第三API,能够实现进程间的同步。
可选的,第四API用于设置同步对象的全局名称。该第四API可以为IpcSetNotifyName(notify,name),用于设置同步对象notify的全局名称。第五API用于获取预设名称对应的寄存器的标识。该第五API可以为IpcOpenNotify(notify,name),用于根据同步对象notify的全局名称name,打开同步对象。
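上述通过全局名称打通进程间同步对象的做法,可以用如下Python片段示意。这是假设性的简化模拟(全局名称表用字典表示,真实实现中由NPU驱动维护),仅说明"两个应用的同步对象对应同一个同步寄存器"的概念:

```python
name_table = {}   # 全局名称 -> 同步寄存器标识(模拟NPU驱动维护的表)

def ipc_set_notify_name(notify, name):
    """模拟IpcSetNotifyName语义:将同步对象标记为IPC对象并登记全局名称。"""
    name_table[name] = notify["reg_id"]

def ipc_open_notify(name):
    """模拟IpcOpenNotify语义:按预先约定的全局名称找到同一个同步寄存器,
    为另一个进程构建指向该寄存器的同步对象。"""
    return {"reg_id": name_table[name]}
```

两个进程各持有一个同步对象,但二者保存同一个寄存器标识,于是可以直接配合NotifyRecord/NotifyWait完成进程间同步。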
结合第一方面和上述可能的实现方式,在另一种可能的实现方式中,上述第一同步事件为上述第一应用程序和上述第二应用程序之间的同步事件,上述预设名称为该第一应用程序和该第二应用程序预先约定的名称。
基于本方案,在同步事件为进程间同步的情况下,通过不同应用程序预先设置该同步对象的全局名称,从而能够将不同进程间的同步对象与同一个同步寄存器对应,进而实现进程间的同步。
可选的,无论第一同步事件是一个APP的同步事件,还是多个APP之间的同步事件,该第一同步事件可以发生在一个AI加速器内,也可以发生在一个AI服务器内的不同AI加速器间。
结合第一方面和上述可能的实现方式,在另一种可能的实现方式中,上述方法还包括:上述第一处理器通过调用第六API,获取第二同步寄存器的虚拟地址;该第二同步寄存器为第二同步事件对应的寄存器,该第二同步寄存器的不同值用于指示第二同步事件是否发生;第一处理器向第四处理器发送该第二同步寄存器的虚拟地址;该第一处理器和第四处理器为不同AI服务器中的处理器,该第四处理器包括第二CPU。
可选的,第一处理器和第四处理器可以分别为两个AI加速器中的CPU。
基于本方案,通过将同步寄存器的物理地址转换为虚拟地址,从而通过在虚拟地址对应的同步寄存器中写入数值,指示同步事件已经发生,从而能够实现AI加速器间的同步。而且本方案在进行AI服务器间的同步时,同步开销仅是网络通讯的时间开销,没有其它额外开销,因此同步开销较小。而且本申请实施例提供了简单的API接口,和通用OS的semaphore接口类似,大大方便开发者使用AI加速器。
可选的,第六API用于获取同步对象对应的寄存器的虚拟地址。该第六API可以为NotifyGetAddr(notify,addr),其中输入为同步对象notify,输出为同步对象notify对应的同步寄存器的虚拟地址。
结合第一方面和上述可能的实现方式,在另一种可能的实现方式中,上述方法还包括:第一处理器通过调用第七API,解除上述第一同步寄存器与上述第一同步事件的对应关系,并将上述第一同步寄存器的值重置为上述第一数值;该第七API用于释放第一同步寄存器。
基于本方案,通过解除第一同步寄存器与第一同步事件的对应关系,可以将该第一同步寄存器回收,从而在后续需要进行同步时,可以将该同步寄存器分配给其他同步对象,提升了同步寄存器的利用率。
可选的,第七API用于释放第一同步寄存器。第七API可以为NotifyDestroy(notify),该接口可以用于销毁同步对象notify,释放同步对象对应的同步寄存器。
结合第一方面和上述可能的实现方式,在另一种可能的实现方式中,上述第一同步寄存器的物理地址采用全局编址方式编址。
基于本方案,通过将同步寄存器采用全局编址方式编址,从而每个AI加速器的控制器,可以获知AI服务器内其他AI加速器中的同步寄存器的物理地址,同时也可以通过物理地址,访问其他AI加速器的同步寄存器,可以实现AI加速器内及AI加速器间的同步。
本申请实施例的第二方面,提供一种同步方法,该方法包括:第四处理器接收来自第一处理器的第二同步寄存器的虚拟地址;该第二同步寄存器为第二同步事件对应的寄存器,该第二同步寄存器的值包括第一数值或第二数值,第一数值用于指示第二同步事件未发生,第二数值用于指示第二同步事件已经发生;第一处理器和第四处理器为不同AI服务器中的处理器,第一处理器包括第一中央处理器CPU,第四处理器包括第二CPU;第四处理器向第五处理器发送第二同步事件对应的远程直接内存存取RDMA任务;第二同步事件对应的RDMA任务用于指示第二同步事件已经发生,第二同步事件对应的RDMA任务中包括所述第二同步寄存器的虚拟地址;第五处理器接收第二同步事件对应的RDMA任务,并基于第二同步寄存器的虚拟地址,通过RDMA装置将第二同步寄存器的值重置为所述第二数值,第五处理器包括第三NPU。
可选的,第一处理器和第四处理器可以分别为不同AI加速器中的CPU。第四处理器和第五处理器为同一个AI加速器内不同处理器,例如,第四处理器为AI加速器内的CPU,第五处理器为该AI加速器内的NPU。
基于本方案,一个AI服务器内的AI加速器通过获取同步寄存器的虚拟地址,从而该AI加速器可以在同步事件发生时,通过RDMA装置在虚拟地址对应的同步寄存器中写入数值,指示同步事件已经发生,使得另一个AI服务器内的AI加速器可以立刻检查到该同步寄存器的数值发生变化,从而确定该同步事件发生,能够实现不同AI加速器间的同步。
可选的,第四处理器可以通过调用第八应用程序接口API,向第五处理器发送第二同步事件对应的RDMA任务。该第八API用于下发同步事件对应的RDMA任务。该第八API可以为RDMAsend(addr,1),用于指示向虚拟地址addr写入第二数值1。
本申请实施例的第三方面,提供一种同步方法,该方法包括:第四处理器接收来自第一处理器的第二同步寄存器的虚拟地址,该第二同步寄存器为第二同步事件对应的寄存器,第二同步寄存器的值包括第一数值或第二数值,第一数值用于指示第二同步事件未发生,第二数值用于指示第二同步事件已经发生;第一处理器和第四处理器为不同AI服务器中的处理器;第一处理器包括第一中央处理器CPU,第四处理器包括第二CPU;第四处理器基于该第二同步寄存器的虚拟地址,通过远程直接内存存取RDMA装置将第二同步寄存器的值重置为第二数值。
可选的,第一处理器和第四处理器可以分别为两个AI加速器中的CPU。
基于本方案,一个AI服务器内的CPU通过获取同步寄存器的虚拟地址,从而该CPU可以在同步事件发生时,通过RDMA在虚拟地址对应的同步寄存器中写入数值,指示同步事件已经发生,使得另一个AI服务器内的AI加速器可以立刻检查到该同步寄存器的数值发生变化,从而确定该同步事件发生,能够实现不同AI加速器间的同步。
本申请实施例的第四方面,提供一种同步装置,该同步装置包括第二处理器,该第二处理器包括多个同步寄存器,每个同步寄存器用于与一个同步事件相对应,每个同步寄存器的值包括第一数值或第二数值,该第一数值用于指示同步寄存器对应的同步事件未发生,该第二数值用于指示同步寄存器对应的同步事件已经发生;该第二处理器包括第一神经网络处理器NPU。
结合第四方面,在一种可能的实现方式中,上述同步装置还包括第一处理器;该第一处理器,用于为第一同步事件创建第一同步对象;该第一同步对象中包括第一同步寄存器的标识;该第一同步寄存器的不同值用于指示第一同步事件是否发生;第二处理器,用于基于该第一同步寄存器的值,确定第一同步事件是否发生;第一处理器包括第一中央处理器CPU。
结合第四方面和上述可能的实现方式,在另一种可能的实现方式中,上述第一处理器,具体用于通过调用第一应用程序接口API,在上述第二处理器包括的多个同步寄存器中为第一同步事件分配第一同步寄存器,并在第一同步对象中保存第一同步寄存器的标识。
结合第四方面和上述可能的实现方式,在另一种可能的实现方式中,上述第一处理器,还用于通过调用第二API,向上述第二处理器发送第一同步事件对应的等待任务;第一同步事件对应的等待任务用于等待上述第一同步事件发生,该第一同步事件对应的等待任务包括第一队列标识,以及上述第一同步寄存器的标识;该第一队列标识为上述等待任务所在的队列的标识;第二处理器,还用于接收该第一同步事件对应的等待任务。
结合第四方面和上述可能的实现方式,在另一种可能的实现方式中,上述第二处理器,具体用于在上述第一同步寄存器的值为第一数值的情况下,确定第一同步事件未发生,第二处理器继续等待该第一同步事件发生,直至第一同步寄存器的值为第二数值,第二处理器确定该第一同步事件已经发生,将第一同步寄存器的值重置为第一数值。
结合第四方面和上述可能的实现方式,在另一种可能的实现方式中,上述第二处理器,具体还用于在上述第一同步寄存器的值为上述第二数值的情况下,确定上述第一同步事件已经发生,将上述第一同步寄存器的值重置为上述第一数值。
结合第四方面和上述可能的实现方式,在另一种可能的实现方式中,上述第一处理器,还用于通过调用第三API,向上述第二处理器发送第一同步事件对应的记录任务;该第一同步事件对应的记录任务用于指示第一同步事件已经发生,该第一同步事件对应的记录任务中包括第二队列标识,以及第一同步寄存器的标识,该第二队列标识为第一同步事件对应的记录任务所在的队列的标识;第二处理器,还用于接收第一同步事件对应的记录任务,并基于第一同步寄存器的标识,将第一同步寄存器的值重置为第二数值。
结合第四方面和上述可能的实现方式,在另一种可能的实现方式中,上述同步装置还包括第三处理器,第三处理器包括第二NPU;上述第一处理器,还用于通过调用第三API,向该第三处理器发送第一同步事件对应的记录任务;该第一同步事件对应的记录任务用于指示上述第一同步事件已经发生,该第一同步事件对应的记录任务中包括第二队列标识,以及第一同步寄存器的标识,该第二队列标识为该第一同步事件对应的记录任务所在的队列的标识;第三处理器,用于接收该第一同步事件对应的记录任务,并基于第一同步寄存器的标识,将第一同步寄存器的值重置为第二数值。
结合第四方面和上述可能的实现方式,在另一种可能的实现方式中,若上述第一同步事件为进程间的同步事件;上述第一处理器,还用于通过调用第一应用程序的第四API,将第一同步对象的名称设置为预设名称;第一处理器,还用于通过调用第二应用程序的第五API,获取该预设名称对应的第一同步寄存器的标识。
结合第四方面和上述可能的实现方式,在另一种可能的实现方式中,上述第一同步事件为上述第一应用程序和上述第二应用程序之间的同步事件,上述预设名称为该第一应用程序和该第二应用程序预先约定的名称。
结合第四方面和上述可能的实现方式,在另一种可能的实现方式中,上述第一处理器,还用于通过调用第六API,获取第二同步寄存器的虚拟地址;该第二同步寄存器为第二同步事件对应的寄存器,该第二同步寄存器的不同值用于指示第二同步事件是否发生;第一处理器,还用于向第四处理器发送第二同步寄存器的虚拟地址;第一处理器和第四处理器为不同AI服务器中的处理器,第四处理器包括第二CPU。
结合第四方面和上述可能的实现方式,在另一种可能的实现方式中,上述第一处理器,还用于通过调用第七API,解除上述第一同步寄存器与上述第一同步事件的对应关系,并将上述第一同步寄存器的值重置为上述第一数值;该第七API用于释放上述第一同步寄存器。
结合第四方面和上述可能的实现方式,在另一种可能的实现方式中,上述第一同步寄存器的物理地址采用全局编址方式编址。
本申请实施例的第五方面,提供一种同步装置,该同步装置包括第四处理器和第五处理器;第四处理器,用于接收来自第一处理器的第二同步寄存器的虚拟地址;该第二同步寄存器为第二同步事件对应的寄存器,该第二同步寄存器的值包括第一数值或第二数值,第一数值用于指示第二同步事件未发生,第二数值用于指示第二同步事件已经发生;该第一处理器和该第四处理器为不同AI服务器中的处理器;第一处理器包括第一中央处理器CPU,第四处理器包括第二CPU;第四处理器,还用于向第五处理器发送第二同步事件对应的远程直接内存存取RDMA任务;第二同步事件对应的RDMA任务用于指示第二同步事件已经发生,第二同步事件对应的RDMA任务中包括第二同步寄存器的虚拟地址;第五处理器包括第三NPU;第五处理器,用于接收第二同步事件对应的RDMA任务,并基于第二同步寄存器的虚拟地址,通过RDMA装置将第二同步寄存器的值重置为第二数值。
可选的,第四处理器可以通过调用第八应用程序接口API,向第五处理器发送第二同步事件对应的RDMA任务。
本申请实施例的第六方面,提供一种同步装置,该同步装置包括第四处理器;该第四处理器,用于接收来自第一处理器的第二同步寄存器的虚拟地址,第二同步寄存器为第二同步事件对应的寄存器,第二同步寄存器的值包括第一数值或第二数值,第一数值用于指示第二同步事件未发生,第二数值用于指示第二同步事件已经发生;第一处理器和第四处理器为不同AI服务器中的处理器;第一处理器包括第一中央处理器CPU,第四处理器包括第二CPU;第四处理器,还用于基于第二同步寄存器的虚拟地址,通过远程直接内存存取RDMA装置将第二同步寄存器的值重置为第二数值。
上述第四方面的效果描述可以参考第一方面的效果描述,上述第五方面的效果描述可以参考第二方面的效果描述,上述第六方面的效果描述可以参考第三方面的效果描述,在此不再赘述。
本申请实施例的第七方面,提供一种第一处理器,该第一处理器,用于为第一同步事件创建第一同步对象;该第一同步对象中包括第一同步寄存器的标识;该第一寄存器的值包括第一数值或第二数值,第一数值用于指示同步事件未发生,第二数值用于指示同步事件已经发生;该第一处理器包括第一中央处理器CPU。
可选的,第一处理器还用于将上述第一寄存器的值重置为第一数值。
结合第七方面,在一种可能的实现方式中,上述第一处理器,具体用于通过调用第一应用程序接口API,在第二处理器包括的多个同步寄存器中为第一同步事件分配第一同步寄存器,并在第一同步对象中保存第一同步寄存器的标识。
结合第七方面和上述可能的实现方式,在另一种可能的实现方式中,上述第一处理器,还用于通过调用第二API,向第二处理器发送第一同步事件对应的等待任务;该第一同步事件对应的等待任务用于等待第一同步事件发生,该第一同步事件对应的等待任务包括第一队列标识,以及第一同步寄存器的标识;第一队列标识为等待任务所在的队列的标识。
结合第七方面和上述可能的实现方式,在另一种可能的实现方式中,上述第一处理器,还用于通过调用第三API,向上述第二处理器发送第一同步事件对应的记录任务;该第一同步事件对应的记录任务用于指示该第一同步事件已经发生,第一同步事件对应的记录任务中包括第二队列标识,以及第一同步寄存器的标识,第二队列标识为第一同步事件对应的记录任务所在的队列的标识。
结合第七方面和上述可能的实现方式,在另一种可能的实现方式中,上述第一处理器,还用于通过调用第三API,向第三处理器发送第一同步事件对应的记录任务;该第一同步事件对应的记录任务用于指示所述第一同步事件已经发生,该第一同步事件对应的记录任务中包括第二队列标识,以及第一同步寄存器的标识,该第二队列标识为第一同步事件对应的记录任务所在的队列的标识。
结合第七方面和上述可能的实现方式,在另一种可能的实现方式中,若上述第一同步事件为进程间的同步事件;上述第一处理器,还用于通过调用第一应用程序的第四API,将上述第一同步对象的名称设置为预设名称;该第一处理器,还用于通过调用第二应用程序的第五API,获取预设名称对应的第一同步寄存器的标识。
结合第七方面和上述可能的实现方式,在另一种可能的实现方式中,上述第一同步事件为上述第一应用程序和上述第二应用程序之间的同步事件,上述预设名称为第一应用程序和第二应用程序预先约定的名称。
结合第七方面和上述可能的实现方式,在另一种可能的实现方式中,上述第一处理器,还用于通过调用第六API,获取第二同步寄存器的虚拟地址;该第二同步寄存器为第二同步事件对应的寄存器,该第二同步寄存器的不同值用于指示第二同步事件是否发生;第一处理器,还用于向第四处理器发送该第二同步寄存器的虚拟地址;第一处理器和第四处理器为不同AI服务器中的处理器,第四处理器包括第二CPU。
结合第七方面和上述可能的实现方式,在另一种可能的实现方式中,上述第一处理器,还用于通过调用第七API,解除上述第一同步寄存器与上述第一同步事件的对应关系,并将上述第一同步寄存器的值重置为第一数值;第七API用于释放第一同步寄存器。
结合第七方面和上述可能的实现方式,在另一种可能的实现方式中,上述第一同步寄存器的物理地址采用全局编址方式编址。
本申请实施例的第八方面,提供一种第二处理器,该第二处理器包括多个同步寄存器,每个同步寄存器用于与一个同步事件相对应,每个同步寄存器的值包括第一数值或第二数值,第一数值用于指示同步寄存器对应的同步事件未发生,第二数值用于指示同步寄存器对应的同步事件已经发生;第二处理器包括第一神经网络处理器NPU。
结合第八方面,在一种可能的实现方式中,上述第二处理器,用于基于第一同步寄存器的值,确定第一同步事件是否发生。
结合第八方面和上述可能的实现方式,在另一种可能的实现方式中,上述第二处理器,具体用于在第一同步寄存器的值为第一数值的情况下,确定第一同步事件未发生,第二处理器继续等待第一同步事件发生,直至第一同步寄存器的值为第二数值,第二处理器确定第一同步事件已经发生,将第一同步寄存器的值重置为第一数值。
结合第八方面和上述可能的实现方式,在另一种可能的实现方式中,上述第二处理器,具体还用于在第一同步寄存器的值为第二数值的情况下,确定第一同步事件已经发生,将第一同步寄存器的值重置为第一数值。
结合第八方面和上述可能的实现方式,在另一种可能的实现方式中,上述第二处理器,还用于接收第一同步事件对应的等待任务;该第一同步事件对应的等待任务用于等待第一同步事件发生,第一同步事件对应的等待任务包括第一队列标识,以及第一同步寄存器的标识;第一队列标识为等待任务所在的队列的标识。
结合第八方面和上述可能的实现方式,在另一种可能的实现方式中,第二处理器,还用于接收第一同步事件对应的记录任务,并基于第一同步寄存器的标识,将第一同步寄存器的值重置为第二数值;该第一同步事件对应的记录任务用于指示第一同步事件已经发生,第一同步事件对应的记录任务中包括第二队列标识,以及第一同步寄存器的标识,第二队列标识为第一同步事件对应的记录任务所在的队列的标识。
本申请实施例的第九方面,提供一种第四处理器,该第四处理器,用于接收来自第一处理器的第二同步寄存器的虚拟地址;第二同步寄存器为第二同步事件对应的寄存器,第二同步寄存器的值包括第一数值或第二数值,第一数值用于指示第二同步事件未发生,第二数值用于指示所述第二同步事件已经发生;第一处理器和第四处理器为不同AI服务器中的处理器;第一处理器包括第一中央处理器CPU,第四处理器包括第二CPU;该第四处理器,还用于向第五处理器发送第二同步事件对应的远程直接内存存取RDMA任务;第二同步事件对应的RDMA任务用于指示第二同步事件已经发生,第二同步事件对应的RDMA任务中包括第二同步寄存器的虚拟地址;第五处理器包括第三NPU。
本申请实施例的第十方面,提供一种第五处理器,该第五处理器,用于接收第二同步事件对应的RDMA任务,并基于第二同步寄存器的虚拟地址,通过RDMA装置将第二同步寄存器的值重置为第二数值;该第二同步事件对应的RDMA任务用于指示第二同步事件已经发生,第二同步事件对应的RDMA任务中包括第二同步寄存器的虚拟地址;第五处理器包括第三NPU;第二同步寄存器的值包括第一数值或第二数值,第一数值用于指示第二同步事件未发生,第二数值用于指示第二同步事件已经发生。
本申请实施例的第十一方面,提供一种电子设备,该电子设备包括存储器,以及如上述第四方面、第五方面、第六方面中任一所述的同步装置。
本申请实施例的第十二方面,提供一种芯片,所述芯片包括接口电路,以及如上述第一方面所述的第一处理器,所述第一处理器用于通过所述接口电路与其它装置通信,以实现上述第一方面所述的方法。
本申请实施例的第十三方面,提供一种芯片,所述芯片包括接口电路,以及如上述第一方面所述的第一处理器和第二处理器,所述第一处理器和所述第二处理器通过所述接口电路通信,以实现上述第一方面所述的方法。
本申请实施例的第十四方面,提供一种芯片,所述芯片包括接口电路,以及如上述第一方面所述的第一处理器、第二处理器和第三处理器,所述第一处理器、所述第二处理器和所述第三处理器通过所述接口电路通信,以实现上述第一方面所述的方法。
本申请实施例的第十五方面,提供一种芯片,所述芯片包括接口电路,以及如上述第二方面或第三方面所述的第四处理器和第五处理器,所述第四处理器和所述第五处理器通过所述接口电路通信,以实现上述任一方面所述的方法。
本申请实施例的第十六方面,提供一种AI服务器,所述AI服务器包括CPU和一个或多个AI加速器,所述CPU为上述任一方面所述的第一处理器,所述一个或多个AI加速器包括上述任一方面所述的第二处理器或第三处理器中的至少一种。
本申请实施例的第十七方面,提供一种AI服务器,所述AI服务器包括CPU和一个或多个AI加速器,所述CPU为上述任一方面所述的第四处理器,所述AI加速器为上述任一方面所述的第五处理器。
本申请实施例的第十八方面,提供一种AI集群,该AI集群包括多个AI服务器,所述AI服务器包括CPU和一个或多个AI加速器,所述CPU包括上述任一方面所述的第一处理器,所述AI加速器包括上述任一方面所述的第二处理器或第三处理器中的至少一种。
本申请实施例的第十九方面,提供一种AI集群,该AI集群包括多个AI服务器,所述AI服务器包括CPU和一个或多个AI加速器,所述CPU包括上述任一方面所述的第四处理器,所述AI加速器包括上述任一方面所述的第五处理器。
本申请实施例的第二十方面,提供一种通信系统,该通信系统包括AI加速器、上述第十一方面所述的AI服务器、上述第十二方面所述的AI服务器、上述第十三方面所述的AI集群,或上述第十四方面所述的AI集群中的至少一种。该AI加速器包括上述任一方面所述的第二处理器、第三处理器、第五处理器中的至少一种。
本申请实施例的第二十一方面,提供一种应用程序接口API,该API部署在处理器中,该API用于为同步事件创建同步对象。可选的,该API可以为NotifyCreat(deviceID,notify),其中,输入deviceID为AI加速器的ID,输出notify为同步对象。
本申请实施例的第二十二方面,提供一种应用程序接口API,该API部署在处理器中,该API用于下发同步事件对应的等待任务。可选的,该API可以为NotifyWait(notify,stream)接口,该接口用于在stream等待同步对象对应的同步事件发生。
本申请实施例的第二十三方面,提供一种应用程序接口API,该API部署在处理器中,该API用于下发同步事件对应的记录任务。可选的,该API可以为NotifyRecord(notify,stream)接口,该接口用于在stream设置同步对象对应的同步事件发生。
本申请实施例的第二十四方面,提供一种应用程序接口API,该API部署在处理器中,该API用于设置同步对象的全局名称。可选的,该API可以为IpcSetNotifyName(notify,name),用于设置同步对象notify的全局名称。
本申请实施例的第二十五方面,提供一种应用程序接口API,该API部署在处理器中,该API用于打开同步对象。可选的,该API可以为IpcOpenNotify(notify,name),用于根据同步对象notify的全局名称name,打开同步对象。
本申请实施例的第二十六方面,提供一种应用程序接口API,该API部署在处理器中,该API用于获取同步对象对应的寄存器的虚拟地址。可选的,该API可以为NotifyGetAddr(notify,addr),其中输入为同步对象notify,输出为同步对象notify对应的同步寄存器的虚拟地址。
本申请实施例的第二十七方面,提供一种应用程序接口API,该API部署在处理器中,该API用于释放同步寄存器。可选的,该API可以为NotifyDestroy(notify),该接口可以用于销毁同步对象notify,释放同步对象对应的同步寄存器。
本申请实施例的第二十八方面,提供一种应用程序接口API,该API部署在处理器中,该API用于下发同步事件对应的RDMA任务。可选的,该API可以为RDMAsend(addr,1),用于指示向虚拟地址addr写入第二数值1。
附图说明
图1A为本申请实施例提供的一种AI训练过程的示意图;
图1B为本申请实施例提供的一种单AI服务器内的Ring算法的结构示意图;
图1C为本申请实施例提供的一种单AI服务器内的Ring算法中reduce-scatter阶段的计算过程示意图;
图1D为本申请实施例提供的一种单AI服务器内的Ring算法中all-gather阶段的计算过程示意图;
图2A为本申请实施例提供的一种AI加速器的结构示意图;
图2B为本申请实施例提供的一种计算架构的结构示意图;
图3为本申请实施例提供的一种同步方法的流程示意图;
图4为本申请实施例提供的一种AI服务器的计算架构的结构示意图;
图5为本申请实施例提供的一种计算任务的示意图;
图6为本申请实施例提供的另一种同步方法的流程示意图;
图7为本申请实施例提供的一种进程间同步的计算架构的结构示意图;
图8为本申请实施例提供的另一种同步方法的流程示意图;
图9为本申请实施例提供的另一种同步方法的流程示意图;
图10为本申请实施例提供的一种AI服务器间同步的计算架构的结构示意图;
图11为本申请实施例提供的另一种同步方法的流程示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行描述。在本申请中,“至少一个”是指一个或者多个,“多个”是指两个或两个以上。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B的情况,其中A,B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。“以下至少一项(个)”或其类似表达,是指的这些项中的任意组合,包括单项(个)或复数项(个)的任意组合。例如,a,b或c中的至少一项(个),可以表示:a,b,c,a和b,a和c,b和c,或,a和b和c,其中a、b和c可以是单个,也可以是多个。另外,为了便于清楚描述本申请实施例的技术方案,在本申请的实施例中,采用了“第一”、“第二”等字样对功能和作用基本相同的相同项或相似项进行区分,本领域技术人员可以理解“第一”、“第二”等字样并不对数量和执行次序进行限定。比如,本申请实施例中的第一处理器中的“第一”和第二处理器中的“第二”仅用于区分不同的处理器。本申请实施例中出现的第一、第二等描述,仅作示意与区分描述对象之用,没有次序之分,也不表示本申请实施例中对设备个数的特别限定,不能构成对本申请实施例的任何限制。
需要说明的是,本申请中,“示例性的”或者“例如”等词用于表示作例子、例证或说明。本申请中被描述为“示例性的”或者“例如”的任何实施例或设计方案不应被解释为比其他实施例或设计方案更优选或更具优势。确切而言,使用“示例性的”或者“例如”等词旨在以具体方式呈现相关概念。
人工智能场景(例如对神经网络进行训练)往往需要多个AI服务器组成集群来提供所需的算力。通常一个AI服务器可以包括一个或多个AI加速器。其中,AI加速器作为一种计算设备,可以是加速用于智能计算或其它数据密集或传感器驱动任务的机器学习过程或算法等专用任务的一类微处理器,还可以包括与该类微处理器相关的指令集。专用任务可以包括AI处理,例如人工神经网络,机器学习(machine learning,ML)训练,ML优化/学习,推断,分类等操作,可视数据处理,网络数据处理,对象检测,规则分析,内容处理操作等。AI加速器可以为神经网络处理器NPU,可包括图形处理单元GPU,数字信号处理器(digital signal processor,DSP),片上系统(system on chip,SOC),现场可编程门阵列(Field-Programmable Gate Array,FPGA)、专用集成电路(application specific integrated circuit,ASIC)等中的一个或多个。AI加速器可以通过加载权值,偏置,训练数据或代码等运行相关的AI指令集以完成专用任务。本申请实施例对于AI加速器的具体形式并不限定。下述实施例以AI加速器为NPU为例进行说明。
如图1A所示,神经网络的训练过程一般包括多个迭代,每个迭代包括三个阶段:前向计算、反向计算和梯度汇聚。每个AI加速器分别独立的进行前向计算和反向计算,计算出的梯度需要在多个AI加速器上做汇聚。由于反向计算一般是进行误差的反向传播,在获取了误差(神经网络的识别值和监督数据之间的区别)后,基于梯度下降的方法来调整神经网络的权重。所以,反向计算包括了“获取误差值”和“基于误差值进行反向传播”的过程,而后一个过程(误差反向传播的过程)包括了基于梯度来调整神经网络的层与层之间权重的过程。由于目前的主流AI模型(例如用于图像识别的卷积神经网络)一般都包括多个神经元“Layer(层)”,而反向传播过程是将误差值从神经网络的输出层依次、反向地向神经网络的输入层传播过去,在反向传播的过程中,基于误差计算权重参数的梯度,并且进而依据权重参数的梯度的方向对神经网络的权重进行更新。因此,在反向计算过程中,当部分神经元层的梯度数值计算出来后,就可以开始梯度汇聚,例如:对于一个100层的神经网络而言,可以在第100-80层的梯度计算完成后即可开始梯度汇聚。如此一来,反向计算全部完成后,剩下的数据进行梯度汇聚的时间,较所有数据进行梯度汇聚的时间短,能够提高训练效率。
上述前向计算和反向计算过程由每个AI加速器完成,梯度汇聚主要包括AI服务器内的多个AI加速器间的数据传输、AI服务器之间的网络传输、AI加速器之间的同步等待、AI加速器之间的梯度数据累加等,梯度汇聚不需要AI加速器的计算单元参与,因此梯度汇聚时AI加速器的计算单元处于空闲状态。
例如,如图1A所示,在T0时刻至T1时刻AI加速器进行前向计算,在T1时刻至T2时刻AI加速器进行反向计算。在反向计算的过程中,为了提高训练效率,可以在T1时刻至T4时刻计算出部分数据后,从T4时刻开始进行梯度汇聚。即T4时刻至T2时刻同时进行反向计算和梯度汇聚1,待反向计算完成后,T2时刻至T3时刻仅进行剩余数据的梯度汇聚。因此,T0时刻至T2时刻为计算耗时,而T2时刻至T3时刻AI加速器的计算单元既不进行正向计算,也不进行反向计算,处于空闲态,由于AI集群仅进行梯度汇聚,该T2时刻至T3时刻也可以称为梯度汇聚时间。
示例性的,上述梯度汇聚时,可以采用All-reduce算法,该All-reduce算法为一类算法,用于高效地将不同AI加速器中的数据整合之后,再把结果分发给各个AI加速器。
梯度汇聚性能是体现集群训练性能的关键因素,梯度汇聚的时间越短,集群的线性度就越高。该集群的线性度L可以通过以下公式计算:
L=(计算耗时)/(计算耗时+T_idle);
其中,T_idle为AI加速器的计算单元处于空闲态的时间,即T_idle为梯度汇聚时间(例如,All-reduce时间)。该梯度汇聚时间越长,AI加速器的计算单元处于空闲态的时间就越长,集群线性度L越低。该梯度汇聚时间越短,AI加速器的计算单元处于空闲态的时间就越短,集群线性度L越高。
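按上述公式,可以用一小段Python计算集群线性度(其中的耗时数值仅为假设的示例,非实测数据):

```python
def linearity(compute_time, t_idle):
    """集群线性度 L = 计算耗时 / (计算耗时 + T_idle),
    T_idle为AI加速器计算单元空闲(梯度汇聚)的时间,单位一致即可。"""
    return compute_time / (compute_time + t_idle)

# 例如:计算耗时90、梯度汇聚时间10时 L = 0.9;
# 梯度汇聚时间增至30时,L 降为 0.75,可见缩短汇聚时间可提高线性度。
```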
例如,如图1A所示,由于T2时刻至T3时刻进行的梯度汇聚2不需要AI加速器的计算单元的参与,AI加速器的计算单元处于空闲态,即T2时刻至T3时刻为梯度汇聚时间。因此T2时刻至T3时刻的时长越短,集群线性度越高,T2至T3的时长越长,集群线性度越低,故可以通过减少梯度汇聚时的同步传输及同步等待时间,提高集群线性度。
示例性的,以集群的梯度汇聚算法为单AI服务器内的Ring算法为例,若AI服务器包括5块AI加速器,例如为GPU0至GPU4。该Ring算法包括2个阶段,分别为reduce-scatter阶段和all-gather阶段。在reduce-scatter阶段,GPU之间交换数据,使得每个GPU最终得到最终结果的一部分。在all-gather阶段,GPU将交换这些块,以便所有GPU最终得到完整的最终结果。
可选的,在Ring算法中,每个GPU都有一个左邻居和一个右邻居,而且每个GPU只会向它的右邻居发送数据,并从它的左邻居接收数据。例如,如图1B所示,以AI服务器包括5个GPU,分别为GPU0至GPU4为例,每个GPU都有一个左邻居和一个右邻居,GPU0只会向它的右邻居GPU1发送数据,并从它的左邻居GPU4接收数据。GPU1只会向它的右邻居GPU2发送数据,并从它的左邻居GPU0接收数据。GPU2只会向它的右邻居GPU3发送数据,并从它的左邻居GPU1接收数据。GPU3只会向它的右邻居GPU4发送数据,并从它的左邻居GPU2接收数据。GPU4只会向它的右邻居GPU0发送数据,并从它的左邻居GPU3接收数据。
例如,以AI服务器包括5个GPU,分别为GPU0至GPU4,每个GPU将数据分成5个较小的数据块为例,结合图1B,如图1C所示,在reduce-scatter阶段,每个GPU将进行4次迭代的reduce-scatter,在每次迭代中,每个GPU都会将其中一个数据块发送到其右邻居,并从其左邻居接收一个数据块并累积到该数据块中。每次迭代发送和接收的数据块均不同的。例如,GPU0将数据块a0发送至它的右邻居GPU1,并从它的左邻居GPU4接收数据块e4,并累积到数据块e0中。GPU1将数据块b1发送至它的右邻居GPU2,并从它的左邻居GPU0接收数据块a0,并累积到数据块a1中,以此类推。如图1C所示,在reduce-scatter阶段,GPU0至GPU4经过4次迭代后,每个GPU的一个数据块可以得到一个最终值。
为了实现All-reduce,结合图1B,如图1D所示,在all-gather阶段,GPU0至GPU4再次进行4次迭代,只是在每次迭代中,GPU都会将其中一个数据块发送到其右邻居,并从其左邻居接收一个数据块并覆盖到该数据块中。例如,GPU0将数据块b2+b1+b3+b4+b0发送至它的右邻居GPU1,并从它的左邻居GPU4接收数据块a1+a0+a2+a3+a4,并用数据块a1+a0+a2+a3+a4覆盖数据块a0。GPU1将数据块c3+c2+c4+c0+c1发送至它的右邻居GPU2,并从它的左邻居GPU0接收数据块b2+b1+b3+b4+b0,并用数据块b2+b1+b3+b4+b0覆盖数据块b1,以此类推。如图1D所示,在all-gather阶段,GPU0至GPU4经过4次迭代后,所有GPU都具有整个数组的完全累积值。
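上述reduce-scatter与all-gather两个阶段可以用如下Python模拟。这是一个示意性的实现(非图1C/图1D所示系统的真实代码):用每步的快照模拟同一迭代内各GPU的并发收发,数据块取标量以便观察结果:

```python
def ring_allreduce(chunks):
    """chunks[g][c]为GPU g持有的第c块数据(此处为标量),块数等于GPU数;
    原地完成环形All-reduce,每个GPU只向右邻居发送、从左邻居接收。"""
    n = len(chunks)
    # reduce-scatter阶段:n-1次迭代后,每个GPU持有一块完整的累加结果
    for step in range(n - 1):
        sent = [row[:] for row in chunks]       # 快照:本次迭代开始时的数据
        for g in range(n):
            c = (g - step - 1) % n              # 本次迭代从左邻居接收的块
            chunks[g][c] += sent[(g - 1) % n][c]   # 接收并累加
    # all-gather阶段:再进行n-1次迭代,接收到的完整块直接覆盖本地块
    for step in range(n - 1):
        sent = [row[:] for row in chunks]
        for g in range(n):
            c = (g - step) % n                  # 本次迭代从左邻居接收的块
            chunks[g][c] = sent[(g - 1) % n][c]    # 接收并覆盖
    return chunks
```

例如 `ring_allreduce([[1, 2], [3, 4]])` 结束后两个"GPU"都持有逐块求和的结果 `[4, 6]`。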
结合图1B、图1C和图1D可知,该梯度汇聚算法中GPU之间需要有同步机制,才能确保All-reduce得到的结果是正确的。例如,GPU1必须在GPU0把a0传给GPU1后,GPU1才能把a0+a1传给GPU2,如果GPU1提前把结果传给GPU2,则All-reduce结果不正确,如果GPU1延迟将结果传给GPU2,则All-reduce时间较长,浪费GPU的计算算力,因此需要合理的同步机制既确保AI算法正确运行,又能提高AI加速器的算力。
可以理解的,本申请实施例以单AI服务器内的Ring算法为例,说明在AI训练场景中需要同步机制才能确保算法的正常运行,本申请实施例对于同步机制的具体适用场景并不进行限定。实际应用中,当多个AI服务器组成集群进行AI训练时,为了减少一个AI加速器内、一个AI服务器内的不同AI加速器间,以及AI服务器间的同步传输及同步等待时间,提供一种合理的同步机制是非常有必要的。
一种同步机制是通过信号量(semaphore)机制,确保进程内和进程间的同步互斥。但是该方法只支持在通用处理器架构(例如,X86或ARM)上进行同步,不支持在AI加速器等芯片上的同步,而且不支持AI服务器间的同步。
另一种同步方法是英伟达NVIDIA的统一计算设备架构(compute unified device architecture,CUDA)提供的事件event同步机制,该event同步机制用于进程内、进程间,图形处理器(graphics processing unit,GPU)片内、GPU片间的同步。但是,event机制不支持AI服务器间的同步,在GPU片内、GPU片间进行同步时的开销较大,在10us量级,而且event机制用于进程间同步时,应用程序接口(application program interface,API)设计比较复杂,不方便开发者使用。
本申请实施例提供一种同步方法,该方法能够实现一个AI加速器内、一个AI服务器内的不同AI加速器间,以及AI服务器间的同步,而且同步开销较小,API设计较简单,方便开发者使用。
本申请实施例提供的同步方法,可以应用于一种计算架构,该计算架构可以为AI服务器的计算架构。该AI服务器的计算架构为异构计算的硬件架构,该架构中包括中央处理器(central processing unit,CPU)以及一个或多个AI加速器。其中,CPU可以向AI加速器发送AI计算任务,AI加速器接收CPU发送的AI计算任务后,执行该AI计算任务,并将执行结果上报给CPU。
图2A为本申请实施例提供的一种AI加速器,如图2A所示,该AI加速器包括控制器、运算逻辑单元和多个同步寄存器。
控制器用于接收CPU发送的AI计算任务,并将该计算任务的执行结果上报给CPU。
运算逻辑单元用于执行控制器下发的计算任务,并向控制器返回每个计算任务的执行结果。
如图2A所示,AI加速器中包括多个同步寄存器,该多个同步寄存器分别为Reg0、Reg1至Regn。每个同步寄存器用于与一个同步事件相对应,该同步寄存器的不同值可以用于指示其对应的同步事件是否发生。可选的,如图2A所示,该多个同步寄存器可以设置在AI加速器的控制器中。
示例性的,每个同步寄存器的值可以包括第一数值和第二数值。第一数值用于指示该同步寄存器对应的同步事件未发生,该第二数值用于指示该同步寄存器对应的同步事件已经发生。该第一数值和第二数值为不同的数值。本申请实施例对于第一数值和第二数值的具体取值并不限定,下述实施例以第一数值为0,第二数值为1进行示例性说明。
可选的,同步寄存器对应的同步事件可以发生在一个AI加速器内,也可以发生在一个AI服务器内的不同AI加速器之间,还可以发生在不同AI服务器之间(每个AI服务器包括至少一个AI加速器)。可以理解的,当同步寄存器对应的同步事件发生在一个AI加速器内时,该AI加速器可以基于该同步寄存器的值,确定该同步事件是否发生,从而实现AI加速器内的同步。当同步寄存器对应的同步事件发生在一个AI服务器内的不同AI加速器之间时,AI加速器可以基于该同步寄存器的值,确定该同步事件是否发生,从而实现一个AI服务器内的不同AI加速器间的同步。当同步寄存器对应的同步事件发生在不同AI服务器之间时,一个AI服务器的AI加速器可以基于该同步寄存器的值,确定该同步事件是否发生,从而实现AI加速器间的同步。
可选的,本申请实施例对于每个AI加速器内设置的同步寄存器的具体数量并不做限定。例如,以AI加速器同时最多支持1024个同步事件为例,可以在AI加速器中设置1024个同步寄存器,一个同步寄存器可以与一个同步事件相对应。
可以理解的,本申请实施例提供的AI加速器,通过在AI加速器中设置多个同步寄存器,而且每个同步寄存器用于与一个同步事件相对应,从而使得AI加速器可以基于该同步寄存器的值,确定该同步寄存器对应的同步事件是否发生,以实现一个AI加速器内、一个AI服务器内的不同AI加速器间,以及AI服务器间的同步。
本申请实施例提供的同步方法,可以应用于图2B所示的AI服务器,如图2B所示,该AI服务器可以包括CPU和多个AI加速器,每个AI加速器中包括一组同步寄存器,每个同步寄存器可以与一个同步事件相对应,该同步寄存器的不同值可以用于指示其对应的同步事件是否发生。
如图2B所示,CPU中的驱动用于为AI加速器提供驱动功能。应用程序(application,App)中部署用户态驱动层运行时(runtime),runtime用于提供AI加速器的用户态驱动功能。例如,runtime中包括多个API,CPU运行APP时,可以通过调用不同的API接口实现软件与硬件间的交互。CPU可以通过调用API向AI加速器发送AI计算任务,AI加速器中的控制器接收CPU发送的AI计算任务后,执行该AI计算任务,并将执行结果上报给CPU。
可选的,APP的用户态驱动层runtime提供API。上层业务APP可以将AI模型(计算图)进行分拆,转换成AI加速器能够处理的stream、task、event等任务,通过runtime提供的API分别下发给AI加速器处理。示例性的,task是计算任务,一般由AI加速器中的运算逻辑单元处理。event是事件同步机制,一般由控制器处理。AI加速器中的控制器可以并发调度多个stream的task执行,但是同一个stream里的task只能顺序执行。
可选的,当AI服务器包括多个AI加速器时,不同AI加速器内设置的同步寄存器的数量可以相同,也可以不同,本申请实施例对此并不限定,图2B中以AI服务器包括m+1个AI加速器,AI加速器0和AI加速器m中均设置n个同步寄存器为例进行示意。
可选的,当AI服务器包括多个AI加速器时,每个AI加速器中可以设置多个同步寄存器,一个AI服务器内的不同AI加速器中设置的同步寄存器的物理地址可以采用全局编址方式编址。例如,可以根据AI加速器的标识(identity,ID)加偏移或其他方式,实现一个AI服务器内的同步寄存器的全局编址。可以理解的,由于AI服务器内的多个AI加速器中的同步寄存器采用全局编址,因此每个AI加速器的控制器,可以获知AI服务器内其他AI加速器中的同步寄存器的物理地址,同时也可以通过物理地址,访问其他AI加速器的同步寄存器。
示例性的,当AI服务器中仅包括一个AI加速器时,该AI加速器和CPU可以集成在一个芯片上,也可以分别集成在不同的芯片上。当计算架构中包括多个AI加速器时,该多个AI加速器可以集成在一个或多个芯片上,CPU可以集成在另一个芯片上,也可以将CPU和AI加速器集成在一个芯片上。本申请实施例对于AI服务器中CPU和AI加速器组成的异构计算的硬件形态并不进行限定,在此示例性说明。
可以理解的,本申请实施例通过在AI服务器内的AI加速器中设置一组同步寄存器,而且每个同步寄存器可以与一个同步事件相对应,从而使得AI加速器可以基于该同步寄存器的值,确定该同步寄存器对应的同步事件是否发生,能够实现一个AI加速器内、一个AI服务器内的不同AI加速器间,以及AI服务器间的同步。
结合上述图2A、图2B,如图3所示,为本申请实施例提供的一种同步方法,该方法包括以下步骤:
S301、第一处理器为第一同步事件创建第一同步对象。
第一处理器可以为AI服务器中的中央控制单元,例如,CPU。第一处理器包括第一CPU。
可选的,上述步骤S301中第一处理器为第一同步事件创建第一同步对象,可以包括:第一处理器通过调用第一API,在第二处理器包括的多个同步寄存器中,为第一同步事件分配第一同步寄存器,并在第一同步对象中保存第一同步寄存器的标识。该第二处理器包括第一NPU,而且该第二处理器为等待第一同步事件发生的NPU。即本申请实施例中为同步事件分配的同步寄存器为等待同步事件发生的NPU中的同步寄存器。
该第一API用于为同步事件创建同步对象。例如,该第一API可以为NotifyCreat(deviceID,notify),其中输入deviceID为AI加速器的ID,输出notify为同步对象,该NotifyCreat接口用于创建同步对象。上述NotifyCreat接口中的deviceID为第二处理器的ID。
可选的,第一处理器为第一同步事件分配第一同步寄存器时,还可以将该第一同步寄存器的值重置为第一数值,使得第一同步寄存器的值与第一同步事件的当前状态相对应。上述将第一同步寄存器的值重置为第一数值,也可以是将第一同步寄存器的值设置为第一数值,本申请实施例对此并不限定。实际应用中,可以采用设置方式,也可以采用重置(Reset)方式改变同步寄存器的值。
可选的,上述第一处理器可以为AI服务器中的CPU,第二处理器可以为该AI服务器中的AI加速器,该第一处理器和第二处理器组成了异构的计算架构,该AI服务器可以为异构服务器。例如,第一处理器可以为AI服务器中的host CPU,第二处理器可以为AI服务器中的NPU,host CPU可以通过调用第一API,在等待同步事件发生的NPU包括的多个同步寄存器中,为第一同步事件分配第一同步寄存器。
可选的,上述第一同步事件可以发生在一个NPU内,也可以发生在一个AI服务器内的不同NPU之间,还可以发生在不同AI服务器之间,本申请实施例对此并不限定。
示例性的,图4为一种AI服务器的计算架构的结构示意图。如图4所示,以AI加速器为NPU,AI服务器包括CPU和两个NPU,两个NPU分别为NPU0和NPU1为例,CPU可以向NPU0和NPU1下发计算任务、记录任务和等待任务。其中,计算任务(Task)是由运算逻辑单元处理的计算任务,记录任务(record)用于指示同步事件已经发生,等待任务(wait)用于等待同步事件发生。
例如,以同步事件发生在AI加速器内为例,结合图4,如图5中的(a)所示,NPU0的队列0执行完计算任务01后,NPU0的队列1才可以执行计算任务12。该同步需要NPU0的队列1等待同步事件1发生,该同步事件1为NPU0的队列0执行完计算任务01并发送执行结果。在同步事件1未发生时,NPU0的队列1执行完计算任务11后,一直保持等待。在同步事件1已经发生(NPU0的队列0执行完计算任务01,并发送执行结果)时,NPU0的队列1可以继续执行计算任务12。可以理解的,同步事件1发生在AI加速器NPU0的两个不同队列之间。
再例如,以同步事件发生在一个AI服务器内的不同AI加速器间为例,结合图4,如图5中的(b)所示,NPU0的队列2执行完计算任务3n后,NPU1的队列1才可以执行计算任务2n。该同步需要NPU1的队列1等待同步事件2发生,该同步事件2为NPU0的队列2执行完计算任务3n,并发送执行结果。在同步事件2未发生时,NPU1的队列1一直保持等待。在同步事件2已经发生时,NPU1的队列1可以继续执行计算任务2n。可以理解的,同步事件2发生在一个AI服务器内的不同AI加速器之间(NPU0和NPU1之间)。
示例性的,对于上述同步事件1,由NPU0的队列1等待同步事件1发生,因此CPU可以在NPU0包括的多个同步寄存器中为同步事件1分配同步寄存器,并在同步对象1中保存该同步寄存器的标识,该同步对象1可以记为notify1。对于上述同步事件2,由NPU1的队列1等待同步事件2发生,因此CPU可以在NPU1包括的多个同步寄存器中为同步事件2分配同步寄存器,并在同步对象2中保存该同步寄存器的标识,该同步对象2可以记为notify2。
可选的,由于本申请实施例在每个NPU内设置一组同步寄存器,因此在APP确定需要进行同步时,可以通过调用NotifyCreat(deviceID,notify)接口,在等待同步事件发生的NPU上为每个同步事件分配一个同步寄存器。
例如,对于图5中的(a)所示的同步事件1,APP下发NotifyCreate的API,在NPU0上创建同步对象notify1,Runtime调用NPU驱动的接口,请求NPU驱动在NPU0上为同步事件分配一个同步寄存器。如图4所示,NPU驱动可以在NPU0中的多个同步寄存器中分配一个同步寄存器Reg0,记录该同步寄存器Reg0的标识,并将该同步寄存器的值重置为第一数值0。NPU驱动返回该同步寄存器Reg0的id给Runtime,Runtime构建好同步对象notify1,将同步寄存器Reg0的id保存在notify1中,将notify1返回给APP。
再例如,对于图5中的(b)所示的同步事件2,APP下发NotifyCreate的API,在NPU1上创建同步对象notify2,Runtime调用NPU驱动的接口,请求NPU驱动在NPU1上为同步事件分配一个同步寄存器。如图4所示,NPU驱动可以在NPU1中的多个同步寄存器中分配一个同步寄存器Reg1,记录该同步寄存器Reg1的标识,并将该同步寄存器Reg1的值重置为第一数值0。NPU驱动返回该同步寄存器Reg1的id给Runtime,Runtime构建好同步对象notify2,将同步寄存器Reg1的id保存在notify2中,将notify2返回给APP。
可选的,NPU驱动在为同步事件分配同步寄存器时,可以将NPU中处于空闲状态的同步寄存器分配给该同步事件。可以理解的,该NPU中处于空闲状态的同步寄存器是指未与其他同步事件关联过的同步寄存器,或者,虽与其他同步事件关联过但已经被回收(即与其他同步事件或同步对象解除关联关系)的同步寄存器。
本申请实施例中的同步事件可以发生在一个NPU内,也可以发生在一个AI服务器内的不同NPU之间,还可以发生在不同AI服务器的NPU(每个AI服务器包括至少一个NPU)之间。本实施例以图5中的(a)以同步事件1发生在一个NPU内,图5中的(b)以同步事件2发生在一个AI服务器内的不同NPU之间为例进行说明。
S302、第二处理器基于第一同步寄存器的值,确定第一同步事件是否发生。
可选的,由于该第一同步寄存器的不同值用于指示第一同步事件是否发生。因此第二处理器基于第一同步寄存器的值,确定第一同步事件是否发生,可以分为以下两种实现方式。
第一种实现方式,上述步骤S302可以包括:在第一同步寄存器的值为第一数值的情况下,第二处理器确定第一同步事件未发生,第二处理器继续等待第一同步事件发生,直至第一同步寄存器的值为第二数值,第二处理器确定第一同步事件已经发生,第二处理器将第一同步寄存器的值重置为第一数值。
示例性的,如果第一同步寄存器的值为第一数值,表示第一同步事件未发生,那么第二处理器将继续等待第一同步事件发生,直到第一同步寄存器的值为第二数值时,第二处理器将第一同步寄存器的值重置为第一数值,再执行接下来的任务,从而能够确保正确同步。
可选的,在第一同步寄存器的值为第一数值的情况下,第二处理器的控制器会一直检查第一同步寄存器的取值。当第一同步寄存器的值从0变为1时,由于第一同步寄存器为第二处理器中的同步寄存器,因此第二处理器的控制器可以立刻检查到第一同步寄存器的取值修改,第二处理器确定第一同步事件已经发生,第二处理器将第一同步寄存器清0,以便第一同步寄存器可以继续进行同步操作。
第二种实现方式,上述步骤S302可以包括:在第一同步寄存器的值为第二数值的情况下,第二处理器确定第一同步事件已经发生,第二处理器将第一同步寄存器的值重置为第一数值。
示例性的,如果第一同步寄存器的值为第二数值,那么第二处理器确定第一同步事件已经发生,第二处理器将第一同步寄存器的值重置为第一数值。然后第二处理器可以继续执行后续的任务,从而能够确保正确同步。
本申请实施例提供的同步方法,通过为第一同步事件创建第一同步对象,从而使得第一同步事件可以与第一同步寄存器相对应,AI加速器可以基于该同步寄存器的值,确定该同步寄存器对应的同步事件是否发生,从而能够实现一个AI加速器内、一个AI服务器内的不同AI加速器间,以及AI服务器间的同步。
图6为本申请实施例提供的一种同步方法,如图6所示,该方法可以包括以下步骤:
S601、第一处理器为第一同步事件创建第一同步对象。
可选的,该第一同步事件可以发生在一个NPU内,也可以发生在一个AI服务器内的不同NPU间。
可以理解的,步骤S601的具体实现方式可以参考步骤S301,在此不再赘述。
S602、第一处理器通过调用第二API,向第二处理器发送第一同步事件对应的等待任务。
第二API用于下发同步事件对应的等待任务。例如,该第二API可以为NotifyWait(notify,stream)接口,该接口用于在stream等待同步对象对应的同步事件发生。
第一同步事件对应的等待任务用于等待第一同步事件发生,该第一同步事件对应的等待任务中包括第一队列标识,以及第一同步寄存器的标识。第一队列标识为等待任务所在的队列的标识。即第一同步事件对应的等待任务为第一队列中的任务。可选的,该第一队列标识可以为等待任务所在的stream的标识。
例如,结合图4,如图5中的(a)所示,对于同步事件1,CPU通过调用NotifyWait(notify1,队列1),向NPU0下发Wait等待任务1,指示NPU0在队列1等待notify1对应的同步事件1发生。
再例如,结合图4,如图5中的(b)所示,对于同步事件2,CPU通过调用NotifyWait(notify2,队列1),向NPU1下发等待任务2,指示NPU1在队列1等待notify2对应的同步事件2发生。
S603、第二处理器接收第一同步事件对应的等待任务。
S604、第二处理器基于第一同步寄存器的值,确定第一同步事件是否发生。
可选的,第二处理器接收第一同步事件对应的等待任务后,可以基于该等待任务中携带的第一同步寄存器的标识,读取该第一同步寄存器的值,由于该第一同步寄存器的不同值用于指示第一同步事件是否发生。因此第二处理器基于第一同步寄存器的值,确定第一同步事件是否发生。
可以理解的,步骤S604的具体实现方式可以参考步骤S302,在此不再赘述。
例如,结合图4和图5中的(a),对于同步事件1,CPU通过NotifyWait向NPU0下发等待任务1,NPU0接收等待任务1后,基于等待任务1中的同步寄存器Reg0的标识,读取该Reg0的值。如果Reg0的值为0,说明notify1对应的同步事件1未发生,那么NPU0继续等待同步事件1发生,NPU0的控制器一直检查Reg0的取值。当Reg0的值从0变为1时,说明notify1对应的同步事件1已经发生,NPU0的控制器立刻检查到Reg0的取值修改,确定同步事件1已经发生,NPU0的控制器将Reg0的值清零。
再例如,结合图4和图5中的(a),对于同步事件1,CPU通过NotifyWait向NPU0下发等待任务1,NPU0接收等待任务1后,基于等待任务1中的同步寄存器Reg0的标识,读取该Reg0的值。如果Reg0的值为1,NPU0确定notify1对应的同步事件1已经发生,NPU0的控制器将Reg0的值清零。
可选的,在第一同步事件发生后,第二处理器通过将第一同步寄存器的值重置为第一数值,从而使得该第一同步寄存器可以继续进行其他同步操作。例如,如果第一同步对象对应的同步事件周期性发生,那么可以在下次该第一同步对象对应的同步事件发生时,第二处理器基于该第一同步寄存器的值进行同步。
S605、第一处理器通过调用第三API,向第三处理器发送第一同步事件对应的记录任务。
该第三处理器可以为NPU,该第三处理器与上述第二处理器可以为同一个NPU,也可以为同一个AI服务器内的不同NPU。
第三API用于下发同步事件对应的记录任务。例如,该第三API可以为NotifyRecord(notify,stream)接口,该接口用于在stream设置同步对象对应的同步事件发生。
第一同步事件对应的记录任务用于指示第一同步事件已经发生,第一同步事件对应的记录任务中包括第二队列标识,以及第一同步寄存器的标识,第二队列标识为第一同步事件对应的记录任务所在的队列的标识。即第一同步事件对应的记录任务为第二队列中的任务。可选的,该第二队列标识可以为记录任务所在的stream的标识。
可选的,当第一同步事件发生在一个AI加速器内时,上述第二处理器和第三处理器为同一个AI加速器(例如,NPU),即同一个AI加速器既执行Wait任务,又执行Record任务。当第一同步事件发生在一个AI服务器内的两个AI加速器之间时,上述第二处理器和第三处理器为AI服务器内的两个不同的AI加速器。即,一个AI加速器执行Wait任务,另一个AI加速器执行Record任务。可选的,AI加速器既执行Wait任务,又执行Record任务时,Wait任务和Record任务可以分别为两个Stream中的任务。
例如,结合图4,如图5中的(a)所示,以第一同步事件为同步事件1为例,该同步事件1发生在一个NPU内。对于同步事件1,上述第二处理器和第三处理器相同,均为NPU0,即NPU0既执行Wait任务,又执行Record任务。CPU通过调用NotifyRecord(notify1,队列0),向NPU0下发记录任务1,指示NPU0的队列0中同步对象notify1对应的同步事件1已经发生。可选的,CPU可以在NPU0执行完计算任务01,并将计算任务01的执行结果发送给CPU后,CPU向NPU0下发记录任务1,指示notify1对应的同步事件1已经发生。
再例如,结合图4,如图5中的(b)所示,以第一同步事件为同步事件2为例,该同步事件2发生在AI服务器内的不同NPU间。对于同步事件2,上述第二处理器为NPU1,第三处理器为NPU0,CPU通过调用NotifyRecord(notify2,队列2),向NPU0下发记录任务2,指示NPU0在队列2中同步对象notify2对应的同步事件2已经发生。可选的,CPU可以在NPU0执行完计算任务3n,并将计算任务3n的执行结果发送给CPU后,CPU向NPU0下发记录任务2,指示notify2对应的同步事件2已经发生。
S606、第三处理器接收第一同步事件对应的记录任务。
示例性的,第三处理器接收第一同步事件对应的记录任务,可以获知该第一同步事件已经发生。
S607、第三处理器基于第一同步寄存器的标识,将第一同步寄存器的值重置为第二数值。
由于第一同步事件已经发生,第三处理器可以基于第一同步事件对应的记录任务中的第一同步寄存器的标识,将该第一同步寄存器的值重置为第二数值,以使得第一同步寄存器的值与第一同步事件的发生状态相对应。
例如,结合图4和图5中的(a)所示,对于同步事件1,NPU0可以基于Reg0的标识将NPU0中的Reg0的值重置为1,从而NPU0的控制器可以立刻检查到Reg0的取值修改,NPU0确定同步事件1已经发生,将Reg0的值清零。
再例如,结合图4和图5中的(b)所示,对于同步事件2,NPU0可以基于Reg1的标识将NPU1中的Reg1的值重置为1,从而NPU1的控制器可以立刻检查到Reg1的取值修改,NPU1确定同步事件2已经发生,将Reg1的值清零。
可以理解的,在本申请实施例中,NotifyWait和NotifyRecord是一一对应的,第三处理器接收记录任务后,获知同步对象对应的同步事件已经发生,将该同步对象对应的同步寄存器的值重置为1。第二处理器接收等待任务后,读取该同步对象对应的同步寄存器的值,如果同步寄存器的值为0,确定同步事件未发生,第二处理器将继续等待同步事件发生,直到第三处理器将该同步对象对应的同步寄存器的值置为1,第二处理器立刻检查到同步寄存器的值为1,那么第二处理器确定同步事件已经发生,第二处理器将该同步寄存器的值重置为0,以便该同步寄存器可以继续进行后续的其他同步操作。
需要说明的是,本实施例提供的同步方法中,同步开销为AI加速器的控制器通过总线写寄存器的开销,该同步开销较小。例如,采用本实施例提供的同步方法,对于一个NPU内的同步而言,同步开销小于50ns,对于一个AI服务器内的不同NPU间的同步,同步开销小于1us。而且本申请实施例提供了简单的API接口,和通用OS的semaphore接口类似,可以大大方便开发者使用AI加速器。
可以理解的,本申请实施例对于上述步骤S601-S607的具体执行顺序并不进行限定,图6仅是示例性说明。
可选的,上述方法还可以包括步骤S608。
S608、第一处理器通过调用第七API,解除第一同步寄存器与第一同步对象的对应关系,并将第一同步寄存器的值重置为第一数值。
第七API用于释放第一同步寄存器。例如,第七API可以为NotifyDestroy(notify),该接口可以用于销毁同步对象notify,释放同步对象对应的同步寄存器。
例如,如图4所示,APP下发NotifyDestroy的API,销毁创建的同步对象notify1,runtime调用NPU驱动的接口,释放NPU0上的notify1,NPU驱动回收NPU0的notify1,同时将notify1对应的同步寄存器Reg0的值重置为0。
可以理解的,通过NotifyDestroy销毁同步对象,能够将该同步对象对应的同步寄存器回收,从而在后续需要进行同步时,可以将该同步寄存器分配给其他同步事件。
本申请实施例提供的同步方法,通过在AI加速器中设置一组同步寄存器,每个寄存器可以与一个同步事件相对应,该寄存器的不同取值用于指示其对应的同步事件是否发生。在AI加速器接收等待任务时,通过读取相应的同步寄存器的值,能够在同步事件未发生时,一直等待同步事件发生,在同步事件已经发生时,将该同步寄存器的值重置为第一数值。在AI加速器接收到记录任务时,通过在相应的同步寄存器中写入数值,指示同步事件已经发生,从而能够使得需要进行同步的AI加速器准确的实现同步。可以理解的,本申请实施例提供的同步方法,不仅可以通过同步寄存器实现一个AI加速器内的同步,也可以实现一个AI服务器内的不同AI加速器间的同步。而且提供了简单的API接口、同步开销较小,能够提升AI训练的效率。
可选的,上述第一同步事件可以是一个APP的同步事件,也可以是不同APP之间的同步事件。无论同步事件是一个APP的同步事件,还是多个APP之间的同步事件,该同步事件可以发生在一个AI加速器内,也可以发生在一个AI服务器内的不同AI加速器间。但是,当第一同步事件为多个APP之间的同步事件时,为了实现进程间的同步,需要该多个APP预先约定好同步对象的名称。例如,如图7所示,以第一同步事件为APP1和APP3之间的同步为例,在APP1和APP3之间需要进行同步时,APP1和APP3可以预先约定好同步对象的名称,从而实现不同进程间的同步。
本申请实施例还提供一种同步方法,如图8所示,在本实施例中第一同步事件为进程间的同步事件,该方法包括以下步骤:
S801、第一处理器为第一同步事件创建第一同步对象。
该第一同步事件为进程间的同步事件,该第一同步事件可以发生在一个AI加速器内,也可以发生在一个AI服务器内的不同AI加速器间,本申请实施例对此并不限定。
可以理解的,步骤S801的具体实现方式可以参考前述步骤S301的具体实现方式,在此不再赘述。
S802、第一处理器通过调用第一应用程序的第四API,将第一同步对象的名称设置为预设名称。
第四API用于设置同步对象的全局名称。例如,第四API可以为IpcSetNotifyName(notify,name),用于设置同步对象notify的全局名称。
可选的,第一同步事件可以为第一应用程序和第二应用程序之间的同步,上述预设名称为该第一应用程序和第二应用程序预先约定的名称。
例如,以第一同步事件为APP1和APP3之间的同步,APP1和APP3预先约定的同步对象的名称为NotifyForTest1为例,如图7所示,APP1可以通过调用NotifyCreate接口,创建同步对象A,该同步对象A可以记为notifyA,NPU驱动将NPU1的同步寄存器Regn分配给APP1的runtime,notifyA中保存同步寄存器Regn的标识,图7中以该同步寄存器Regn的标识为1-n为例进行示例。APP1调用IpcSetNotifyName接口,将notifyA设置为进程间通信(inter process communication,IPC)的同步对象,NPU驱动将同步对象notifyA的名称标记为NotifyForTest1。
S803、第一处理器通过调用第二应用程序的第五API,获取预设名称对应的第一同步寄存器的标识。
第五API用于获取预设名称对应的寄存器的标识。例如,该第五API可以为IpcOpenNotify(notify,name),用于根据同步对象notify的全局名称name,打开同步对象。
例如,以第一同步事件为APP1和APP3之间的同步,APP1和APP3预先约定的同步对象的名称为NotifyForTest1为例,如图7所示,APP3调用IpcOpenNotify,runtime调用NPU驱动接口,传入NotifyForTest1,NPU驱动根据NotifyForTest1找到同步对象notifyA,返回notifyA中的同步寄存器的标识Reg1-n给runtime,runtime创建同步对象B给APP3,该同步对象B可以记为notifyB,notifyB中保存同步寄存器的标识Reg1-n。如此一来,同一个同步寄存器Reg1-n可以分别对应APP1的notifyA和APP3的notifyB,然后APP1和APP3就可以使用NotifyRecord和NotifyWait接口进行同步。
S804、第一处理器通过调用第二API,向第二处理器发送第一同步事件对应的等待任务。
例如,如图7所示,APP1可以调用NotifyWait(notifyA,队列1)接口,向NPU1下发等待任务,指示NPU1在队列1等待notifyA对应的同步事件发生。
S805、第二处理器接收第一同步事件对应的等待任务。
S806、第二处理器基于第一同步寄存器的值,确定第一同步事件是否发生。
例如,结合图7所示,CPU通过NotifyWait向NPU1下发等待任务,NPU1接收等待任务后,基于等待任务中的同步寄存器Reg1-n的标识,读取该Reg1-n的值。如果Reg1-n的值为0,说明notifyA对应的同步事件未发生,那么NPU1一直保持等待,NPU1的控制器一直检查Reg1-n的取值,当Reg1-n的值从0变为1时,说明notifyA对应的同步事件已经发生,NPU1的控制器立刻检查到Reg1-n的取值修改,NPU1的控制器结束等待,并将Reg1-n的值清零。
S807、第一处理器通过调用第三API,向第三处理器发送第一同步事件对应的记录任务。
可选的,当第一同步事件发生在一个AI加速器内时,第三处理器和第二处理器为同一个AI加速器。当第一同步事件发生在一个AI服务器内的不同AI加速器间时,第三处理器和第二处理器为一个AI服务器内的两个不同的AI加速器。下述实施例以第一同步事件发生在一个AI服务器内的不同AI加速器间为例进行说明。
例如,如图7所示,APP2可以调用NotifyRecord(notifyB,队列0)接口,向NPU0下发记录任务,指示NPU0在队列0中同步对象notifyB对应的同步事件已经发生。
S808、第三处理器接收第一同步事件对应的记录任务。
S809、第三处理器基于第一同步寄存器的标识,将第一同步寄存器的值重置为第二数值。
例如,如图7所示,NPU0可以基于Reg1-n的标识将NPU1中的Reg1-n的值重置为1,从而NPU1的控制器可以立刻检查到Reg1-n的取值修改,NPU1的控制器结束等待,并将Reg1-n的值清零。
可以理解的,上述步骤S804-S809的具体实现方式可以参考前述实施例中的步骤S602-S607的实现方式,在此不再赘述。
可以理解的,本申请实施例对于上述步骤S801-S809的具体执行顺序并不进行限定,图8仅是示例性说明。
可选的,上述方法还可以包括步骤S810。
S810、第一处理器通过调用第七API,解除第一同步寄存器与第一同步事件的对应关系,并将第一同步寄存器的值重置为第一数值。
可以理解的,上述步骤S810的具体实现方式可以参考步骤S608,在此不再赘述。
本申请实施例提供的同步方法,通过在AI加速器中设置一组用于同步的寄存器,每个寄存器可以用于与一个同步事件相对应,该寄存器的不同取值用于指示其对应的同步事件是否发生,而且在同步事件为进程间同步的情况下,通过预先设置同步事件的全局名称,从而能够将不同进程间的同步事件与同一个寄存器对应,以实现进程间的同步。
本申请实施例还提供一种芯片的同步方法,在本实施例中第二同步事件发生在不同AI服务器之间,如图9所示,该方法包括以下步骤:
S901、第一处理器为第二同步事件创建第二同步对象。
可以理解的,上述步骤S901的具体实现方式可以参考前述步骤S301,在此不再赘述。
第二同步事件为不同AI服务器间的同步事件。
例如,以第二同步事件为AI服务器1与AI服务器2之间的同步为例,如图10所示,AI服务器1中运行APP1,AI服务器2中运行APP2,APP1和APP2需要进行同步,APP1等待APP2传数据给APP1,APP2在数据传输完成后,通知APP1传输完成,指示APP1可以执行后续任务。对于该同步,可以由CPU1在AI服务器1的NPU0包括的多个同步寄存器中为该同步事件分配同步寄存器Regm,并在同步对象K中保存该同步寄存器Regm的标识Reg0-m,该同步对象K可以记为notifyK。比如,AI服务器1的CPU1可以通过调用NotifyCreate接口,创建同步对象notifyK,该同步对象notifyK中保存了NPU驱动为该同步事件分配的同步寄存器的标识Reg0-m。
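上述"在AI加速器包括的多个同步寄存器中为同步事件分配一个寄存器并在同步对象中保存其标识"的过程,可示意如下(以列表模拟NPU0内的一组同步寄存器,空闲列表等管理细节均为假设):

```python
# 示意性草图:模拟NPU设备内的一组同步寄存器及驱动的分配逻辑(均为假设)
class NPUDevice:
    def __init__(self, num_regs):
        self.regs = [0] * num_regs          # 一组同步寄存器,初值为第一数值0
        self._free = list(range(num_regs))  # 空闲寄存器标识列表

    def alloc_reg(self):
        return self._free.pop(0)            # 驱动取出一个空闲寄存器的标识

npu0 = NPUDevice(num_regs=1024)
reg_m = npu0.alloc_reg()                    # 为第二同步事件分配同步寄存器Regm
notify_k = {"reg": ("0", reg_m)}            # 同步对象notifyK中保存标识"Reg0-m"
assert npu0.regs[reg_m] == 0                # 初值0指示第二同步事件未发生
```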
S902、第一处理器通过调用第二API,向第二处理器发送第二同步事件对应的等待任务。
第一处理器和第二处理器为同一个AI服务器内的处理器。例如,如图10所示,该第一处理器可以为AI服务器1中的CPU1,第二处理器可以为AI服务器1中的AI加速器NPU0。
S903、第二处理器接收第二同步事件对应的等待任务。
S904、第二处理器基于第二同步寄存器的值,确定第二同步事件是否发生。
可以理解的,上述步骤S902-S904的具体实现方式可以参考前述步骤S602-S604的具体实现方式,在此不再赘述。
S905、第一处理器通过调用第六API,获取第二同步寄存器的虚拟地址。
该第六API用于获取同步对象对应的寄存器的虚拟地址。例如,该第六API可以为NotifyGetAddr(notify,addr),其中输入为同步对象notify,输出为同步对象notify对应的同步寄存器的虚拟地址。
例如,如图10所示,AI服务器间进行同步时,APP1通过调用NotifyGetAddr接口,将同步对象notifyK对应的同步寄存器Reg0-m的物理地址映射为APP1的虚拟地址(Virtual Address,VA),记为VA1。比如,APP1调用Runtime的NotifyGetAddr接口,传入同步对象notifyK,Runtime根据同步对象notifyK获取同步寄存器Reg0-m的标识,NPU驱动根据同步寄存器Reg0-m的标识,获取同步寄存器的物理地址,并将该物理地址映射出APP1的虚拟地址,NPU驱动将该虚拟地址返回给Runtime,Runtime将虚拟地址返回给APP,完成同步寄存器的虚拟地址映射流程。
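寄存器标识到虚拟地址的映射流程可用如下草图示意(物理基址、寄存器宽度、映射函数均为示意性假设,实际映射方式参考现有技术):

```python
# 假设:每个同步寄存器占4字节,物理基址0x8000_0000仅为示意值
REG_BASE_PA = 0x80000000
REG_SIZE = 4

def reg_pa(reg_index):
    """寄存器标识 -> 物理地址(假设寄存器在物理地址空间连续排布)。"""
    return REG_BASE_PA + reg_index * REG_SIZE

def map_to_va(pa, process_offset):
    """物理地址 -> 进程虚拟地址(以固定偏移示意映射关系)。"""
    return pa + process_offset

pa = reg_pa(7)                                  # 同步寄存器Reg0-m(以索引7示意)
va1 = map_to_va(pa, process_offset=0x10000000)  # 映射为APP1的虚拟地址VA1
assert va1 - 0x10000000 == pa                   # VA1与物理地址一一对应
```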
可选的,本申请实施例对于将同步寄存器的物理地址映射为虚拟地址的具体实现方式并不进行限定,具体可以参考现有技术,在此不再赘述。
S906、第一处理器向第四处理器发送第二同步寄存器的虚拟地址。
第四处理器可以为AI服务器中的中央控制单元,例如,CPU。第四处理器包括第二CPU。
第一处理器和第四处理器为不同AI服务器中的处理器。可选的,第一处理器和第四处理器可以为不同AI服务器中的CPU。
例如,如图10所示,第一处理器可以为AI服务器1中的CPU1,第四处理器可以为AI服务器2中的CPU2,AI服务器1中的CPU1向AI服务器2中的CPU2发送同步对象notifyK对应的同步寄存器Reg0-m的虚拟地址VA1。
S907、第四处理器接收第二同步寄存器的虚拟地址。
S908、第四处理器向第五处理器发送第二同步事件对应的远程直接内存存取(Remote Direct Memory Access,RDMA)任务。
第二同步事件对应的RDMA任务用于指示第二同步事件已经发生,第二同步事件对应的RDMA任务中包括第二同步寄存器的虚拟地址。
该第四处理器和第五处理器为同一个AI服务器内的处理器,第四处理器可以为AI服务器内的CPU,第五处理器可以为AI服务器内的AI加速器(比如,NPU)。
例如,如图10所示,第四处理器为AI服务器2中的CPU2,第五处理器可以为AI服务器2中的NPU1。CPU2可以通过调用RDMAsend(VA1,1),向NPU1下发RDMA任务。
可选的,第四处理器可以通过调用第八API,向第五处理器发送第二同步事件对应的RDMA任务。该第八API用于下发同步事件对应的RDMA任务。例如,第八API为RDMAsend(addr,1),用于指示向虚拟地址addr写入第二数值1。
S909、第五处理器接收第二同步事件对应的RDMA任务。
S910、第五处理器基于第二同步寄存器的虚拟地址,通过RDMA装置将第二同步寄存器的值重置为第二数值。
由于第二同步事件已经发生,第五处理器可以基于第二同步事件对应的RDMA任务中的第二同步寄存器的虚拟地址,将该第二同步寄存器的值重置为第二数值,以使得第二同步寄存器的值与第二同步事件的发生状态相对应。
例如,结合图10所示,AI服务器2中的NPU1可以基于VA1将NPU0中的Reg0-m的值重置为1,从而NPU0的控制器可以立刻检查到Reg0-m的取值修改,NPU0的控制器结束等待,并将Reg0-m的值清零。
可以理解的,在本申请实施例中,NotifyWait和RDMAsend是一一对应的,第五处理器接收RDMAsend任务后,获知同步对象对应的同步事件发生,通过RDMA装置将该同步对象对应的同步寄存器的值重置为1。第二处理器接收等待任务后,读取该同步对象对应的同步寄存器的值,如果同步寄存器的值为0,确定同步事件未发生,第二处理器将一直保持等待,直到第五处理器将该同步对象对应的同步寄存器的值置为1,第二处理器检查到同步寄存器的值为1,确定同步事件已经发生,那么第二处理器结束等待,并将该同步寄存器的值重置为0,以便该同步寄存器可以继续进行后续的其他同步操作。
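NotifyWait与RDMAsend一一对应的跨服务器握手过程,可用如下基于线程的示意性草图模拟(以对共享整数的直接写入模拟RDMA装置对远端同步寄存器的写操作,均为假设性简化):

```python
import threading
import time

reg = {"value": 0}                 # 模拟NPU0中的同步寄存器Reg0-m
done = {"waited": False}

def npu0_wait():
    """模拟第二处理器执行等待任务:轮询寄存器直至其值为1。"""
    while reg["value"] != 1:       # 值为0:同步事件未发生,保持等待
        time.sleep(0.001)
    reg["value"] = 0               # 检查到值为1:结束等待并将寄存器清零
    done["waited"] = True

def rdma_send(va, value):
    """模拟第五处理器通过RDMA装置向虚拟地址对应的寄存器写入数值。"""
    reg["value"] = value

t = threading.Thread(target=npu0_wait)
t.start()
rdma_send(va=0x10000000, value=1)  # RDMAsend(VA1, 1):指示同步事件已发生
t.join(timeout=2)
assert done["waited"] and reg["value"] == 0
```

该草图中等待方与写入方分属两个执行流,对应正文中位于不同AI服务器的处理器。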
需要说明的是,本申请实施例对于上述步骤S901-S910的先后执行顺序并不限定,图9仅是示例性说明。
本申请实施例的同步方法,对于AI服务器间的同步,同步开销仅是网络通讯的时间开销,没有其它额外开销,因此同步开销较小。而且本申请实施例提供了简单的API接口,和通用OS的semaphore接口类似,可以大大方便开发者使用AI加速器。
可选的,上述方法还可以包括步骤S911。
S911、第一处理器通过调用第七API,解除第二同步寄存器与第二同步事件的对应关系,并将第二同步寄存器的值重置为第一数值。
可以理解的,上述步骤S911的具体实现方式可以参考步骤S608,在此不再赘述。
本申请实施例提供的同步方法,通过在AI加速器中设置一组同步寄存器,每个寄存器都可以与一个同步对象对应,该寄存器的不同取值用于指示同步对象对应的同步事件是否发生。在AI加速器接收等待任务时,通过读取相应的同步寄存器的值,能够在同步事件未发生时保持等待,在同步事件已经发生时结束等待。在AI加速器接收到RDMA任务时,通过在虚拟地址对应的同步寄存器中写入数值,指示同步事件已经发生,从而能够使得需要进行同步的AI加速器准确的实现同步。而且该方案通过将同步寄存器的物理地址转换成虚拟地址,并通过RDMA在虚拟地址写入数值,能够实现不同节点(AI服务器)间的同步。而且提供了简单的API接口、同步开销较小,提升了AI训练的效率。
本申请实施例还提供一种芯片的同步方法,如图11所示,在本实施例中第二同步事件发生在AI服务器之间,该方法包括以下步骤:
S1101、第一处理器为第二同步事件创建第二同步对象。
S1102、第一处理器通过调用第二API,向第二处理器发送第二同步事件对应的等待任务。
S1103、第二处理器接收第二同步事件对应的等待任务。
S1104、第二处理器基于第二同步寄存器的值,确定第二同步事件是否发生。
S1105、第一处理器通过调用第六API,获取第二同步寄存器的虚拟地址。
S1106、第一处理器向第四处理器发送第二同步寄存器的虚拟地址。
S1107、第四处理器接收第二同步寄存器的虚拟地址。
可以理解的,上述步骤S1101-S1107的具体实现方式可以参考前述步骤的实现方式,在此不再赘述。
S1108、第四处理器基于第二同步寄存器的虚拟地址,通过RDMA装置将第二同步寄存器的值重置为第二数值。
示例性的,如图10所示,当第二同步事件发生在AI服务器1和AI服务器2之间时,AI服务器2的CPU2接收同步寄存器Reg0-m的虚拟地址VA1后,CPU2可以在第二同步事件发生时,直接通过RDMA装置将第二同步寄存器的值重置为第二数值,而不需要像图9所示的实施例由AI服务器2的CPU2向NPU1发送RDMA任务,再由NPU1通过RDMA装置将第二同步寄存器的值重置为第二数值。可以理解的,CPU2在第二同步事件发生时,通过RDMA装置将同步寄存器Reg0-m的值重置为1后,NPU0的控制器可以立刻检查到Reg0-m的取值修改,NPU0的控制器结束等待,并将Reg0-m的值清零,从而实现AI服务器间的准确同步。
需要说明的是,本申请实施例对于上述步骤S1101-S1108的先后执行顺序并不限定,图11仅是示例性说明。
可选的,上述方法还可以包括步骤S1109。
S1109、第一处理器通过调用第七API,解除第二同步寄存器与第二同步事件的对应关系,并将第二同步寄存器的值重置为第一数值。
可以理解的,上述步骤S1109的具体实现方式可以参考步骤S608,在此不再赘述。
本申请实施例提供的同步方法,通过在AI加速器中设置一组同步寄存器,每个寄存器都可以与一个同步对象对应,该寄存器的不同取值用于指示同步对象对应的同步事件是否发生。在AI加速器接收等待任务时,通过读取相应的同步寄存器的值,能够在同步事件未发生时保持等待,在同步事件已经发生时结束等待。处理器在同步事件发生时,直接基于同步寄存器的虚拟地址,对同步寄存器写入数值,指示同步事件已经发生,从而能够使得需要进行同步的AI服务器之间准确的实现同步。
需要说明的是,本申请实施例对于上述第一API至第八API具体属于哪个APP的API并不进行限定,实际应用中,每个APP可以根据自己的业务需求,调用上述一个或多个API,以实现一个AI加速器内、一个AI服务器内的不同AI加速器间,或AI服务器间的同步。
本申请实施例还提供一种芯片,该芯片包括上述第一处理器和接口电路,第一处理器用于通过接口电路与其它装置通信,以实现图3、图6、图8、图9或图11所示的同步方法。可选的,该芯片还可以包括存储器,该存储器用于存储计算机指令。
本申请实施例还提供一种芯片,该芯片包括上述第二处理器和接口电路,第二处理器用于通过接口电路与其它装置通信,以实现图3、图6、图8、图9或图11所示的同步方法。
本申请实施例还提供一种芯片,该芯片包括上述第三处理器和接口电路,第三处理器用于通过接口电路与其它装置通信,以实现图6或图8所示的同步方法。
本申请实施例还提供一种芯片,该芯片包括上述第四处理器和接口电路,第四处理器用于通过接口电路与其它装置通信,以实现图9或图11所示的同步方法。
本申请实施例还提供一种芯片,该芯片包括上述第五处理器和接口电路,第五处理器用于通过接口电路与其它装置通信,以实现图11所示的同步方法。
本申请实施例还提供一种AI服务器,该AI服务器包括上述第一处理器、第二处理器,以及接口电路,该第一处理器和第二处理器通过接口电路通信,以实现图3、图6、图8、图9或图11所示的同步方法。
本申请实施例还提供一种AI服务器,该AI服务器包括上述第一处理器、第二处理器、第三处理器,以及接口电路,该第一处理器、第二处理器以及第三处理器通过接口电路通信,以实现图6或图8所示的同步方法。
本申请实施例还提供一种AI服务器,该AI服务器包括上述第四处理器、第五处理器,以及接口电路,该第四处理器和第五处理器通过接口电路通信,以实现图9所示的同步方法。
本申请实施例提供一种AI集群,该AI集群包括多个AI服务器,该AI服务器包括CPU和一个或多个AI加速器,CPU可以包括上述第一处理器,AI加速器可以包括上述第二处理器或第三处理器中的至少一种。
本申请实施例提供一种AI集群,该AI集群包括多个AI服务器,该AI服务器包括CPU和一个或多个AI加速器,CPU可以包括上述第四处理器,AI加速器可以包括上述第五处理器。
本申请实施例提供一种通信系统,该通信系统包括上述AI加速器、上述AI服务器,或上述AI集群中的至少一种。
本申请实施例提供一种应用程序接口API,该API部署在处理器中,该API用于为同步事件创建同步对象。可选的,该API可以为NotifyCreate(deviceID,notify),其中,输入deviceID为AI加速器的ID,输出notify为同步对象。
本申请实施例提供一种应用程序接口API,该API部署在处理器中,该API用于下发同步事件对应的等待任务。可选的,该API可以为NotifyWait(notify,stream)接口,该接口用于在stream等待同步对象对应的同步事件发生。
本申请实施例提供一种应用程序接口API,该API部署在处理器中,该API用于下发同步事件对应的记录任务。可选的,该API可以为NotifyRecord(notify,stream)接口,该接口用于在stream设置同步对象对应的同步事件发生。
本申请实施例提供一种应用程序接口API,该API部署在处理器中,该API用于设置同步对象的全局名称。可选的,该API可以为IpcSetNotifyName(notify,name),用于设置同步对象notify的全局名称。
本申请实施例提供一种应用程序接口API,该API部署在处理器中,该API用于打开同步对象。可选的,该API可以为IpcOpenNotify(notify,name),用于根据同步对象notify的全局名称name,打开同步对象。
本申请实施例提供一种应用程序接口API,该API部署在处理器中,该API用于获取同步对象对应的寄存器的虚拟地址。可选的,该API可以为NotifyGetAddr(notify,addr),其中输入为同步对象notify,输出为同步对象notify对应的同步寄存器的虚拟地址。
本申请实施例提供一种应用程序接口API,该API部署在处理器中,该API用于释放同步寄存器。可选的,该API可以为NotifyDestroy(notify),该接口可以用于销毁同步对象notify,释放同步对象对应的同步寄存器。
本申请实施例提供一种应用程序接口API,该API部署在处理器中,该API用于下发同步事件对应的RDMA任务。可选的,该API可以为RDMAsend(addr,1),用于指示向虚拟地址addr写入第二数值1。
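上述各API的一种典型调用顺序可示意如下(伪实现,仅体现调用关系与寄存器取值的变化,函数体内的实现细节均为假设,接口语义以正文描述为准):

```python
# 示意性伪实现:串联正文所述API的典型调用顺序
calls = []

def NotifyCreate(device_id):       # 创建同步对象,分配同步寄存器(初值0)
    calls.append("NotifyCreate")
    return {"reg": (device_id, 0), "value": 0}

def NotifyWait(notify, stream):    # 在stream等待同步对象对应的同步事件发生
    calls.append("NotifyWait")

def NotifyRecord(notify, stream):  # 在stream设置同步对象对应的同步事件发生
    calls.append("NotifyRecord")
    notify["value"] = 1            # 将同步寄存器的值置为第二数值1

def NotifyDestroy(notify):         # 销毁同步对象,释放同步寄存器
    calls.append("NotifyDestroy")
    notify["value"] = 0            # 将同步寄存器的值重置为第一数值0

n = NotifyCreate(device_id=1)
NotifyWait(n, stream="队列1")
NotifyRecord(n, stream="队列0")
NotifyDestroy(n)
assert calls == ["NotifyCreate", "NotifyWait", "NotifyRecord", "NotifyDestroy"]
```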
本申请实施例还提供一种计算机可读存储介质,该计算机可读存储介质中存储有计算机程序代码,当上述处理器执行该计算机程序代码时,电子设备执行图3、图6、图8、图9或图11所示的同步方法。
本申请实施例还提供了一种计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行图3、图6、图8、图9或图11所示的同步方法。
结合本申请公开内容所描述的方法或者算法的步骤可以硬件的方式来实现,也可以是由处理器执行软件指令的方式来实现。软件指令可以由相应的软件模块组成,软件模块可以被存放于随机存取存储器(random access memory,RAM)、闪存、可擦除可编程只读存储器(erasable programmable ROM,EPROM)、电可擦可编程只读存储器(electrically EPROM,EEPROM)、寄存器、硬盘、移动硬盘、只读光盘(CD-ROM)或者本领域熟知的任何其它形式的存储介质中。一种示例性的存储介质耦合至处理器,从而使处理器能够从该存储介质读取信息,且可向该存储介质写入信息。当然,存储介质也可以是处理器的组成部分。处理器和存储介质可以位于ASIC中。另外,该ASIC可以位于核心网接口设备中。当然,处理器和存储介质也可以作为分立组件存在于核心网接口设备中。
本领域技术人员应该可以意识到,在上述一个或多个示例中,本发明所描述的功能可以用硬件、软件、固件或它们的任意组合来实现。当使用软件实现时,可以将这些功能存储在计算机可读介质中或者作为计算机可读介质上的一个或多个指令或代码进行传输。计算机可读介质包括计算机存储介质和通信介质,其中通信介质包括便于从一个地方向另一个地方传送计算机程序的任何介质。存储介质可以是通用或专用计算机能够存取的任何可用介质。
以上所述的具体实施方式,对本发明的目的、技术方案和有益效果进行了进一步详细说明,所应理解的是,以上所述仅为本发明的具体实施方式而已,并不用于限定本发明的保护范围,凡在本发明的技术方案的基础之上,所做的任何修改、等同替换、改进等,均应包括在本发明的保护范围之内。

Claims (30)

  1. 一种同步方法,其特征在于,所述方法包括:
    第一处理器为第一同步事件创建第一同步对象;所述第一同步对象中包括第一同步寄存器的标识,所述第一同步寄存器的值包括第一数值或第二数值,所述第一数值用于指示所述第一同步事件未发生,所述第二数值用于指示所述第一同步事件已经发生;所述第一处理器包括第一中央处理器CPU;
    第二处理器基于所述第一同步寄存器的值,确定所述第一同步事件是否发生;所述第二处理器包括第一神经网络处理器NPU。
  2. 根据权利要求1所述的方法,其特征在于,所述第一处理器为第一同步事件创建第一同步对象,包括:
    所述第一处理器通过调用第一应用程序接口API,在所述第二处理器包括的多个同步寄存器中为所述第一同步事件分配所述第一同步寄存器,并在所述第一同步对象中保存所述第一同步寄存器的标识。
  3. 根据权利要求1或2所述的方法,其特征在于,所述方法还包括:
    所述第一处理器通过调用第二API,向所述第二处理器发送第一同步事件对应的等待任务;所述第一同步事件对应的等待任务用于等待所述第一同步事件发生,所述第一同步事件对应的等待任务包括第一队列标识,以及所述第一同步寄存器的标识;所述第一队列标识为所述等待任务所在的队列的标识;
    所述第二处理器接收所述第一同步事件对应的等待任务。
  4. 根据权利要求1-3中任一项所述的方法,其特征在于,所述第二处理器基于所述第一同步寄存器的值,确定所述第一同步事件是否发生,包括:
    在所述第一同步寄存器的值为所述第一数值的情况下,所述第二处理器确定所述第一同步事件未发生,所述第二处理器继续等待所述第一同步事件发生,直至所述第一同步寄存器的值为所述第二数值,所述第二处理器确定所述第一同步事件已经发生,所述第二处理器将所述第一同步寄存器的值重置为所述第一数值。
  5. 根据权利要求1-3中任一项所述的方法,其特征在于,所述第二处理器基于所述第一同步寄存器的值,确定所述第一同步事件是否发生,还包括:
    在所述第一同步寄存器的值为所述第二数值的情况下,所述第二处理器确定所述第一同步事件已经发生,所述第二处理器将所述第一同步寄存器的值重置为所述第一数值。
  6. 根据权利要求1-5中任一项所述的方法,其特征在于,所述方法还包括:
    所述第一处理器通过调用第三API,向所述第二处理器发送所述第一同步事件对应的记录任务;所述第一同步事件对应的记录任务用于指示所述第一同步事件已经发生,所述第一同步事件对应的记录任务中包括第二队列标识,以及所述第一同步寄存器的标识,所述第二队列标识为所述第一同步事件对应的记录任务所在的队列的标识;
    所述第二处理器接收所述第一同步事件对应的记录任务,并基于所述第一同步寄存器的标识,将所述第一同步寄存器的值重置为所述第二数值。
  7. 根据权利要求1-5中任一项所述的方法,其特征在于,所述方法还包括:
    所述第一处理器通过调用第三API,向第三处理器发送第一同步事件对应的记录任务;所述第一同步事件对应的记录任务用于指示所述第一同步事件已经发生,所述第一同步事件对应的记录任务中包括第二队列标识,以及所述第一同步寄存器的标识,所述第二队列标识为所述第一同步事件对应的记录任务所在的队列的标识;所述第三处理器包括第二NPU;
    所述第三处理器接收所述第一同步事件对应的记录任务,并基于所述第一同步寄存器的标识,将所述第一同步寄存器的值重置为所述第二数值。
  8. 根据权利要求1-7中任一项所述的方法,其特征在于,若所述第一同步事件为进程间的同步事件,所述方法还包括:
    所述第一处理器通过调用第一应用程序的第四API,将所述第一同步对象的名称设置为预设名称;
    所述第一处理器通过调用第二应用程序的第五API,获取所述预设名称对应的所述第一同步寄存器的标识。
  9. 根据权利要求8所述的方法,其特征在于,所述第一同步事件为所述第一应用程序和所述第二应用程序之间的同步事件,所述预设名称为所述第一应用程序和所述第二应用程序预先约定的名称。
  10. 根据权利要求1-9中任一项所述的方法,其特征在于,所述方法还包括:
    所述第一处理器通过调用第六API,获取第二同步寄存器的虚拟地址;所述第二同步寄存器为第二同步事件对应的寄存器,所述第二同步寄存器的不同值用于指示所述第二同步事件是否发生;
    所述第一处理器向第四处理器发送所述第二同步寄存器的虚拟地址;所述第一处理器和所述第四处理器为不同AI服务器中的处理器,所述第四处理器包括第二CPU。
  11. 根据权利要求1-10中任一项所述的方法,其特征在于,所述方法还包括:
    所述第一处理器通过调用第七API,解除所述第一同步寄存器与所述第一同步事件的对应关系,并将所述第一同步寄存器的值重置为所述第一数值;所述第七API用于释放所述第一同步寄存器。
  12. 根据权利要求1-11中任一项所述的方法,其特征在于,所述第一同步寄存器的物理地址采用全局编址方式编址。
  13. 一种同步方法,其特征在于,所述方法包括:
    第四处理器接收来自第一处理器的第二同步寄存器的虚拟地址;所述第二同步寄存器为第二同步事件对应的寄存器,所述第二同步寄存器的值包括第一数值或第二数值,所述第一数值用于指示所述第二同步事件未发生,所述第二数值用于指示所述第二同步事件已经发生;所述第一处理器和所述第四处理器为不同AI服务器中的处理器;所述第一处理器包括第一中央处理器CPU,所述第四处理器包括第二CPU;
    所述第四处理器向第五处理器发送第二同步事件对应的远程直接内存存取RDMA任务;所述第二同步事件对应的RDMA任务用于指示所述第二同步事件已经发生,所述第二同步事件对应的RDMA任务中包括所述第二同步寄存器的虚拟地址;所述第五处理器包括第三NPU;
    所述第五处理器接收所述第二同步事件对应的RDMA任务,并基于所述第二同步寄存器的虚拟地址,通过RDMA装置将所述第二同步寄存器的值重置为所述第二数值。
  14. 一种同步方法,其特征在于,所述方法包括:
    第四处理器接收来自第一处理器的第二同步寄存器的虚拟地址,所述第二同步寄存器为第二同步事件对应的寄存器,所述第二同步寄存器的值包括第一数值或第二数值,所述第一数值用于指示所述第二同步事件未发生,所述第二数值用于指示所述第二同步事件已经发生;所述第一处理器和所述第四处理器为不同AI服务器中的处理器;所述第一处理器包括第一中央处理器CPU,所述第四处理器包括第二CPU;
    所述第四处理器基于所述第二同步寄存器的虚拟地址,通过远程直接内存存取RDMA装置将所述第二同步寄存器的值重置为所述第二数值。
  15. 一种同步装置,其特征在于,所述同步装置包括第二处理器,所述第二处理器包括多个同步寄存器,每个所述同步寄存器用于与一个同步事件相对应,每个所述同步寄存器的值包括第一数值或第二数值,所述第一数值用于指示所述同步寄存器对应的同步事件未发生,所述第二数值用于指示所述同步寄存器对应的同步事件已经发生;所述第二处理器包括第一神经网络处理器NPU。
  16. 根据权利要求15所述的装置,其特征在于,所述同步装置还包括第一处理器;
    所述第一处理器,用于为第一同步事件创建第一同步对象;所述第一同步对象中包括第一同步寄存器的标识;所述第一同步寄存器的不同值用于指示所述第一同步事件是否发生;所述第一处理器包括第一中央处理器CPU;
    所述第二处理器,用于基于所述第一同步寄存器的值,确定所述第一同步事件是否发生。
  17. 根据权利要求16所述的装置,其特征在于,所述第一处理器,具体用于通过调用第一应用程序接口API,在所述第二处理器包括的所述多个同步寄存器中为所述第一同步事件分配所述第一同步寄存器,并在所述第一同步对象中保存所述第一同步寄存器的标识。
  18. 根据权利要求16或17所述的装置,其特征在于,
    所述第一处理器,还用于通过调用第二API,向所述第二处理器发送第一同步事件对应的等待任务;所述第一同步事件对应的等待任务用于等待所述第一同步事件发生,所述第一同步事件对应的等待任务包括第一队列标识,以及所述第一同步寄存器的标识;所述第一队列标识为所述等待任务所在的队列的标识;
    所述第二处理器,还用于接收所述第一同步事件对应的等待任务。
  19. 根据权利要求16-18中任一项所述的装置,其特征在于,所述第二处理器,具体用于在所述第一同步寄存器的值为所述第一数值的情况下,确定所述第一同步事件未发生,所述第二处理器继续等待所述第一同步事件发生,直至所述第一同步寄存器的值为所述第二数值,所述第二处理器确定所述第一同步事件已经发生,将所述第一同步寄存器的值重置为所述第一数值。
  20. 根据权利要求16-18中任一项所述的装置,其特征在于,所述第二处理器,具体还用于在所述第一同步寄存器的值为所述第二数值的情况下,确定所述第一同步事件已经发生,将所述第一同步寄存器的值重置为所述第一数值。
  21. 根据权利要求16-20中任一项所述的装置,其特征在于,
    所述第一处理器,还用于通过调用第三API,向所述第二处理器发送所述第一同步事件对应的记录任务;所述第一同步事件对应的记录任务用于指示所述第一同步事件已经发生,所述第一同步事件对应的记录任务中包括第二队列标识,以及所述第一同步寄存器的标识,所述第二队列标识为所述第一同步事件对应的记录任务所在的队列的标识;
    所述第二处理器,还用于接收所述第一同步事件对应的记录任务,并基于所述第一同步寄存器的标识,将所述第一同步寄存器的值重置为所述第二数值。
  22. 根据权利要求16-20中任一项所述的装置,其特征在于,所述同步装置还包括第三处理器,所述第三处理器包括第二NPU;
    所述第一处理器,还用于通过调用第三API,向所述第三处理器发送第一同步事件对应的记录任务;所述第一同步事件对应的记录任务用于指示所述第一同步事件已经发生,所述第一同步事件对应的记录任务中包括第二队列标识,以及所述第一同步寄存器的标识,所述第二队列标识为所述第一同步事件对应的记录任务所在的队列的标识;
    所述第三处理器,用于接收所述第一同步事件对应的记录任务,并基于所述第一同步寄存器的标识,将所述第一同步寄存器的值重置为所述第二数值。
  23. 根据权利要求16-20中任一项所述的装置,其特征在于,若所述第一同步事件为进程间的同步事件;
    所述第一处理器,还用于通过调用第一应用程序的第四API,将所述第一同步对象的名称设置为预设名称;
    所述第一处理器,还用于通过调用第二应用程序的第五API,获取所述预设名称对应的所述第一同步寄存器的标识。
  24. 根据权利要求23所述的装置,其特征在于,所述第一同步事件为所述第一应用程序和所述第二应用程序之间的同步事件,所述预设名称为所述第一应用程序和所述第二应用程序预先约定的名称。
  25. 根据权利要求16-24中任一项所述的装置,其特征在于,
    所述第一处理器,还用于通过调用第六API,获取第二同步寄存器的虚拟地址;所述第二同步寄存器为第二同步事件对应的寄存器,所述第二同步寄存器的不同值用于指示所述第二同步事件是否发生;
    所述第一处理器,还用于向第四处理器发送所述第二同步寄存器的虚拟地址;所述第一处理器和所述第四处理器为不同AI服务器中的处理器,所述第四处理器包括第二CPU。
  26. 根据权利要求16-25中任一项所述的装置,其特征在于,所述第一处理器,还用于通过调用第七API,解除所述第一同步寄存器与所述第一同步事件的对应关系,并将所述第一同步寄存器的值重置为所述第一数值;所述第七API用于释放所述第一同步寄存器。
  27. 根据权利要求16-26中任一项所述的装置,其特征在于,所述第一同步寄存器的物理地址采用全局编址方式编址。
  28. 一种同步装置,其特征在于,所述同步装置包括第四处理器和第五处理器;
    第四处理器,用于接收来自第一处理器的第二同步寄存器的虚拟地址;所述第二同步寄存器为第二同步事件对应的寄存器,所述第二同步寄存器的值包括第一数值或第二数值,所述第一数值用于指示所述第二同步事件未发生,所述第二数值用于指示所述第二同步事件已经发生;所述第一处理器和所述第四处理器为不同AI服务器中的处理器;所述第一处理器包括第一中央处理器CPU,所述第四处理器包括第二CPU;
    所述第四处理器,还用于向所述第五处理器发送第二同步事件对应的远程直接内存存取RDMA任务;所述第二同步事件对应的RDMA任务用于指示所述第二同步事件已经发生,所述第二同步事件对应的RDMA任务中包括所述第二同步寄存器的虚拟地址;所述第五处理器包括第三NPU;
    所述第五处理器,用于接收所述第二同步事件对应的RDMA任务,并基于所述第二同步寄存器的虚拟地址,通过RDMA装置将所述第二同步寄存器的值重置为所述第二数值。
  29. 一种同步装置,其特征在于,所述同步装置包括第四处理器;
    所述第四处理器,用于接收来自第一处理器的第二同步寄存器的虚拟地址,所述第二同步寄存器为第二同步事件对应的寄存器,所述第二同步寄存器的值包括第一数值或第二数值,所述第一数值用于指示所述第二同步事件未发生,所述第二数值用于指示所述第二同步事件已经发生;所述第一处理器和所述第四处理器为不同AI服务器中的处理器;所述第一处理器包括第一中央处理器CPU,所述第四处理器包括第二CPU;
    所述第四处理器,还用于基于所述第二同步寄存器的虚拟地址,通过远程直接内存存取RDMA装置将所述第二同步寄存器的值重置为所述第二数值。
  30. 一种电子设备,其特征在于,所述电子设备包括存储器,以及如权利要求15-29中任一项所述的同步装置。