CN117850705B - Artificial intelligent chip and data synchronization method thereof - Google Patents

Artificial intelligent chip and data synchronization method thereof

Info

Publication number
CN117850705B
CN117850705B
Authority
CN
China
Prior art keywords
synchronization
circuit
computing
memory
lookup table
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410194532.7A
Other languages
Chinese (zh)
Other versions
CN117850705A (en)
Inventor
Name withheld upon request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd, Beijing Bilin Technology Development Co ltd filed Critical Shanghai Bi Ren Technology Co ltd
Priority to CN202410194532.7A priority Critical patent/CN117850705B/en
Publication of CN117850705A publication Critical patent/CN117850705A/en
Application granted granted Critical
Publication of CN117850705B publication Critical patent/CN117850705B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Multi Processors (AREA)

Abstract

The present disclosure provides an artificial intelligence chip and a data synchronization method thereof. The artificial intelligence chip includes a memory circuit and a plurality of computing circuits coupled to the memory circuit. At least one of the plurality of computing circuits is selectively organized into a computing circuit group to jointly perform an operation task. Based on the operation task, the computing circuit group sends an access request carrying synchronization information to the memory circuit. The memory circuit checks the synchronization information of the access request to determine whether to return the target data corresponding to the access request to the computing circuit group.

Description

Artificial intelligent chip and data synchronization method thereof
Technical Field
The present disclosure relates to an artificial intelligence chip and a data synchronization method thereof.
Background
Computing devices such as artificial intelligence (AI) chips can provide significant computing power. The tremendous computing power of an AI chip comes from the large number of hardware execution units (EUs, or execution cores) inside it. One AI chip typically contains multiple stream processor clusters (SPCs). Each stream processor cluster typically contains multiple compute units (CUs, or compute cores), such as at least one of an integer (INT) compute core, a floating-point (FP) compute core, a tensor core, and a vector core, and each compute core typically contains multiple execution cores. The stream processor clusters can support general-purpose computing, scientific computing, and neural network computing by programmatically organizing the various types of compute cores. In many computing task scenarios, different compute cores within the same stream processor cluster may jointly perform the same computing task, or different stream processor clusters may jointly perform the same computing task. Accordingly, data synchronization issues such as read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW) hazards are among the many technical issues in the art.
Disclosure of Invention
The present disclosure is directed to an artificial intelligence (AI) chip and a data synchronization method thereof for performing operation tasks.
In an embodiment according to the present disclosure, the artificial intelligence chip includes a memory circuit and a plurality of computing circuits. The plurality of computing circuits are coupled to the memory circuit. At least one of the plurality of computing circuits is selectively organized into a computing circuit group to jointly perform an operation task. Based on the operation task, the computing circuit group sends an access request carrying synchronization information to the memory circuit. The memory circuit checks the synchronization information to determine whether to return the target data block corresponding to the access request to the computing circuit group.
In an embodiment according to the present disclosure, the data synchronization method of the artificial intelligence chip includes: selectively organizing at least one of a plurality of computing circuits into a computing circuit group to jointly perform an operation task; sending, by the computing circuit group, an access request with synchronization information to a memory circuit based on the operation task; and checking the synchronization information by the memory circuit to determine whether to return the target data block corresponding to the access request to the computing circuit group.
Based on the above, the access request sent by the computing circuit group to the memory circuit carries synchronization information. For example, in the application scenario where different compute cores within the same stream processor cluster serve as the computing circuit group, the compute core group may issue an access request with synchronization information to a memory within the same stream processor cluster (e.g., at least one of a level one cache and an input buffer). In the application scenario where different stream processor clusters serve as the computing circuit group, a stream processor cluster may issue an access request with synchronization information to a memory shared by the different stream processor clusters (e.g., a secondary cache or other shared memory). The memory circuit may check the synchronization information carried by the access request to determine whether to return the target data block corresponding to the access request to the computing circuit group. Thus, data synchronization between different computing circuits in the computing circuit group can be ensured.
Drawings
Fig. 1 is a schematic block diagram of an Artificial Intelligence (AI) chip in accordance with at least one embodiment of the present disclosure.
FIG. 2 is a flow diagram of a method for data synchronization of an artificial intelligence chip in accordance with at least one embodiment of the present disclosure.
FIG. 3 is a diagram illustrating a data structure of an access request with synchronization information according to at least one embodiment of the present disclosure.
Fig. 4 is a circuit block diagram of an AI chip according to at least one embodiment of the disclosure.
Fig. 5 is a circuit block diagram of an AI chip according to another embodiment of the disclosure.
Description of the reference numerals
100, 400, 500: artificial intelligence (AI) chip
110_1, 110_n: computing circuit
120, 413: memory circuit
121: memory
122, 414, 422, SCU51, SCU52_3: synchronization checking circuit
410, SPC0, SPC1, SPC2, SPC3: stream processor cluster
411: scheduler
412: compute core
414a, 422a: synchronization lookup table
414b, 422b: check circuit
420, 520: shared memory
421: secondary cache
D1_1: cache line
LR1_1: load request
R1_1: synchronization check result
S1_1: synchronization information
S210, S220, S230: steps
Detailed Description
Reference will now be made in detail to the exemplary embodiments of the present disclosure, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers are used in the drawings and the description to refer to the same or like parts.
The term "coupled" as used throughout this specification (including the claims) may refer to any direct or indirect connection means. For example, if a first device is coupled (or connected) to a second device, the connection may be a direct connection, or an indirect connection through other devices and connection means. The terms "first", "second", and the like in the description (including the claims) are used to name elements or to distinguish between different embodiments or ranges, and are not intended to limit the maximum or minimum number of elements or the order of elements. In addition, wherever possible, the same reference numbers are used in the drawings and the description to refer to the same or like parts. Elements, components, or steps in different embodiments that use the same reference numerals or the same terminology may refer to the related descriptions of one another.
Fig. 1 is a schematic block diagram of an artificial intelligence (AI) chip 100 according to one embodiment of the present disclosure. The AI chip 100 shown in FIG. 1 includes a plurality of computing circuits (e.g., 110_1, …, 110_n shown in FIG. 1) and a memory circuit 120. The number n of computing circuits may be determined according to the actual design. The computing circuits 110_1 to 110_n can access the memory circuit 120. For example, in the application scenario where different compute cores within the same stream processor cluster serve as the computing circuits 110_1 to 110_n, the computing circuits 110_1 to 110_n may access the memory circuit 120 within the same stream processor cluster, such as at least one of a level one cache (L1 cache) and an input buffer. The different compute cores include, for example, an integer (INT) compute core, a floating-point (FP) compute core, a tensor core, a vector core, or other compute cores. The stream processor clusters of the AI chip 100 can support general-purpose computing, scientific computing, and neural network computing by programmatically organizing various types of compute cores. In the application scenario where different stream processor clusters serve as the computing circuits 110_1 to 110_n, the computing circuits 110_1 to 110_n can access the memory circuit 120 shared by the different stream processor clusters, such as a secondary cache (L2 cache) or other shared memory.
The computing circuits 110_1 to 110_n are coupled to the memory circuit 120. At least one of the computing circuits 110_1 to 110_n may be selectively organized into a computing circuit group to jointly perform an operation task. In some embodiments, depending on the design, the computing circuits 110_1 to 110_n and/or the memory circuit 120 may be implemented as hardware circuits. In other embodiments, the computing circuits 110_1 to 110_n and/or the memory circuit 120 may be implemented by a combination of hardware, firmware, and/or software (i.e., programs).
In hardware, the computing circuits 110_1 to 110_n and/or the memory circuit 120 may be implemented as logic circuits on an integrated circuit. For example, the related functions of the computing circuits 110_1 to 110_n and/or the memory circuit 120 may be implemented in various logic blocks, modules, and circuits in one or more controllers, hardware controllers, microcontrollers, hardware processors, microprocessors, application-specific integrated circuits (ASICs), digital signal processors (DSPs), field programmable gate arrays (FPGAs), central processing units (CPUs), or other processing units. The related functions of the computing circuits 110_1 to 110_n and/or the memory circuit 120 may be implemented as hardware circuits, such as various logic blocks, modules, and circuits in an integrated circuit, using a hardware description language (e.g., Verilog HDL or VHDL) or another suitable programming language.
In software or firmware, the related functions of the computing circuits 110_1 to 110_n and/or the memory circuit 120 can be implemented as programming code. For example, the computing circuits 110_1 to 110_n and/or the memory circuit 120 may be implemented using a general programming language (e.g., C, C++, or assembly language) or another suitable programming language. The programming code may be recorded or stored on a non-transitory machine-readable storage medium. In some embodiments, the non-transitory machine-readable storage medium includes, for example, a semiconductor memory and/or a storage device. The semiconductor memory includes a memory card, a read-only memory (ROM), a flash memory, a programmable logic circuit, or other semiconductor memory. The storage device includes a hard disk drive (HDD), a solid-state drive (SSD), or another storage device. An electronic device (e.g., a CPU, a hardware controller, a microcontroller, a hardware processor, or a microprocessor) may read and execute the programming code from the non-transitory machine-readable storage medium to implement the related functions of the computing circuits 110_1 to 110_n and/or the memory circuit 120.
FIG. 2 is a flow chart of a data synchronization method of an artificial intelligence chip according to an embodiment of the disclosure. In some embodiments, the data synchronization method shown in fig. 2 may be implemented in firmware or software (i.e., a program). For example, the operations associated with the data synchronization method shown in FIG. 2 may be implemented as non-transitory machine-readable instructions (programming code or a program) that may be stored on a machine-readable storage medium. The data synchronization method shown in fig. 2 may be carried out when the non-transitory machine-readable instructions are executed by a computer. In other embodiments, the data synchronization method of FIG. 2 may be implemented in hardware, such as the artificial intelligence chip 100 of FIG. 1.
Referring to fig. 1 and fig. 2, in step S210, at least one of the computing circuits 110_1 to 110_n may be selectively organized into a computing circuit group to jointly perform an operation task. At each memory access, the computing circuit group issues an access request with synchronization information to the memory circuit 120 based on the operation task (step S220). In step S230, the memory circuit 120 checks the synchronization information of the access request to determine whether to return the target data block corresponding to the access request to the computing circuit group.
For example, assume that the computing circuits 110_1 to 110_n are selectively organized into a computing circuit group to jointly perform an operation task. At least one first computing circuit (e.g., the computing circuit 110_n) in the computing circuit group issues a storage request to the memory circuit 120 to store data (a target data block) produced during the operation task into the memory circuit 120. The memory circuit 120 includes a synchronization lookup table (not shown in fig. 1). The memory circuit 120 checks the synchronization information of the storage request sent by the computing circuit 110_n to update the count value corresponding to the target data block in the synchronization lookup table (e.g., increment the count value of the corresponding entry in the synchronization lookup table by 1). At least one second computing circuit (e.g., the computing circuit 110_1) in the computing circuit group issues a load request to the memory circuit 120. The memory circuit 120 checks the count value in the synchronization lookup table based on the synchronization information carried by the load request to determine whether to return the target data block corresponding to the load request to the computing circuit 110_1.
In the application scenario where different compute cores in the same stream processor cluster serve as the computing circuits 110_1 to 110_n, the computing circuits 110_1 to 110_n shown in fig. 1 include a plurality of compute cores in the same stream processor cluster, and the memory circuit 120 shown in fig. 1 includes the memory 121 and the synchronization checking circuit 122 in the same stream processor cluster. Depending on the design, in some embodiments the plurality of compute cores includes at least one tensor core and at least one vector core, and the memory 121 may include any memory within the same stream processor cluster (e.g., at least one of a level one cache and an input buffer). The synchronization checking circuit 122 of the memory circuit 120 includes a synchronization lookup table (not shown in fig. 1). At least one first compute core (e.g., the computing circuit 110_n) of the plurality of compute cores issues a storage request to the memory 121 to store data (a target data block) produced during the operation task into the memory 121. The synchronization checking circuit 122 checks the synchronization information of the storage request sent by the first compute core to update the count value corresponding to the target data block in the synchronization lookup table (e.g., increment the count value of the corresponding entry by 1). At least one second compute core (e.g., the computing circuit 110_1) of the plurality of compute cores issues a load request to the memory 121. The synchronization checking circuit 122 checks the count value in the synchronization lookup table based on the synchronization information carried by the load request to determine whether to notify the memory 121 to return the target data block corresponding to the load request to the second compute core.
In the application scenario where different stream processor clusters serve as the computing circuits 110_1 to 110_n, the computing circuits 110_1 to 110_n shown in fig. 1 include a plurality of stream processor clusters, and the memory circuit 120 shown in fig. 1 includes a shared memory (e.g., the memory 121 shown in fig. 1) and the synchronization checking circuit 122. For example, the memory 121 shown in fig. 1 may be a memory shared by the different stream processor clusters (e.g., a secondary cache or other shared memory). At each memory access, the computing circuits 110_1 to 110_n issue an access request with synchronization information to the memory 121 based on the operation task. At least one first stream processor cluster (e.g., the computing circuit 110_n) of the plurality of stream processor clusters issues a storage request to the shared memory 121 to store data (a target data block) produced during the operation task into the shared memory 121. The synchronization checking circuit 122 of the memory circuit 120 includes a synchronization lookup table (not shown in fig. 1). The synchronization checking circuit 122 checks the synchronization information of the storage request sent by the first stream processor cluster to update the count value corresponding to the target data block in the synchronization lookup table (e.g., increment the count value of the corresponding entry by 1). At least one second stream processor cluster (e.g., the computing circuit 110_1) of the plurality of stream processor clusters issues a load request to the shared memory 121. The synchronization checking circuit 122 checks the count value in the synchronization lookup table based on the synchronization information carried by the load request to determine whether to notify the shared memory 121 to return the target data block corresponding to the load request to the second stream processor cluster.
For example, the computing circuit 110_1 issues a load request LR1_1 to the memory 121. The synchronization checking circuit 122 checks the synchronization information S1_1 of the load request LR1_1 to determine whether to notify the memory 121 to return the target data block corresponding to the load request LR1_1 to the computing circuit 110_1. When the result of checking the synchronization information S1_1 indicates that the target data block in the memory 121 is ready, the synchronization checking circuit 122 notifies the memory 121 to return the target data block corresponding to the load request LR1_1 to the computing circuit 110_1. When the result of checking the synchronization information S1_1 indicates that the target data block in the memory 121 is not ready, the synchronization checking circuit 122 may report the synchronization check result R1_1 corresponding to the load request LR1_1 (information indicating that the target data block in the memory 121 is not ready) back to the computing circuit 110_1.
FIG. 3 is a diagram illustrating a data structure of an access request with synchronization information according to an embodiment of the present disclosure. Referring to fig. 1 and fig. 3, the computing circuits 110_1 to 110_n issue access requests carrying synchronization information to the memory 121. FIG. 3 takes the load request LR1_1 as an illustrative example of an access request; other types of access requests may refer to the related description of the load request LR1_1. In the embodiment shown in FIG. 3, the load request LR1_1 includes a cache line D1_1 and the synchronization information S1_1. The specific data structure of the synchronization information S1_1 may be determined according to the actual design, and FIG. 3 shows one specific example among many possible implementations of that data structure. In the embodiment shown in fig. 3, the synchronization information S1_1 includes identification information fields (e.g., a synchronization identification field and a group field). The group field records the identification number of the computing circuit group. The synchronization identification field records the identification number of the target data block. The synchronization information S1_1 further includes a type field, a count field, and a valid field. The type field records the synchronization type of the corresponding data (e.g., one-write-read-many, many-write-read-many, etc.). The count field of the synchronization information S1_1 records how many pieces of data still need to arrive. The valid field records whether the current entry is valid.
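The fields described above can be sketched as a simple record. The following Python sketch is illustrative only; the names (`SyncInfo`, `LoadRequest`, the field names, and the sample values) are hypothetical and not taken from the patent:

```python
from dataclasses import dataclass

@dataclass
class SyncInfo:
    group_id: int    # group field: identification number of the computing circuit group
    sync_id: int     # synchronization identification field: ID of the target data block
    sync_type: str   # type field: e.g. "one-write-read-many" or "many-write-read-many"
    count: int       # count field: in a load request, how many pieces of data must arrive
    valid: bool      # valid field: whether the current entry is valid

@dataclass
class LoadRequest:
    cache_line: int  # stands in for the cache line D1_1 addressed by the request
    sync: SyncInfo   # the synchronization information S1_1 carried by the request

# A load request that expects one piece of target data for group 2, data block 5.
req = LoadRequest(cache_line=0x40,
                  sync=SyncInfo(group_id=2, sync_id=5,
                                sync_type="one-write-read-many",
                                count=1, valid=True))
```

Because the synchronization information travels inside the request itself, no separate synchronization message has to be routed to the memory circuit.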
The synchronization checking circuit 122 of the memory circuit 120 includes a synchronization lookup table (not shown in fig. 1). The specific data structure of each entry of the synchronization lookup table may refer to the related description of the data structure of the synchronization information S1_1 shown in fig. 3. The count field of the synchronization lookup table records how many pieces of data are currently ready. The valid field of the synchronization lookup table records whether the current entry is valid. When the access request is a storage request, the synchronization checking circuit 122 may check the identification information fields (the synchronization identification field and the group field) of the synchronization information carried by the storage request to update the count value corresponding to the storage request in the synchronization lookup table. For example, the synchronization checking circuit 122 may find the corresponding entry in the synchronization lookup table according to the synchronization identification field and the group field of the synchronization information of the storage request, and then increment the count value in the count field of that entry by 1.
When the access request is a load request (e.g., the load request LR1_1 shown in fig. 3), the synchronization checking circuit 122 of the memory circuit 120 may find the count value corresponding to the load request in the synchronization lookup table (not shown in fig. 1) based on the identification information fields (the synchronization identification field and the group field) of the synchronization information of the load request. For example, the synchronization checking circuit 122 may find the corresponding entry in the synchronization lookup table according to the synchronization identification field and the group field of the synchronization information of the load request, and then fetch the count value from the count field of that entry. The synchronization checking circuit 122 may check the count value in the synchronization lookup table against the count field of the synchronization information of the load request to determine whether to notify the memory 121 to return the target data block corresponding to the access request to the computing circuit group. For example, the count field of the synchronization information S1_1 of the load request LR1_1 carries a target value; the synchronization checking circuit 122 may fetch the count value from the count field of the corresponding entry in the synchronization lookup table and then compare the target value of the load request LR1_1 with the count value in the synchronization lookup table.
When the count value in the synchronization lookup table has not reached the target value (e.g., the target value of the load request LR1_1 is greater than the count value in the synchronization lookup table), it indicates that one or more computing circuits in the computing circuit group have not yet stored the relevant data (the target data block) into the memory 121, and the synchronization checking circuit 122 may therefore transmit a synchronization check result indicating that the target data block in the memory 121 is not ready back to the computing circuit group. When the count value in the synchronization lookup table has reached the target value (e.g., the target value of the load request LR1_1 is less than or equal to the count value in the synchronization lookup table), it indicates that the computing circuit group has stored the relevant data (the target data block) into the memory 121, and the synchronization checking circuit 122 can therefore notify the memory 121 to return the target data block corresponding to the access request to the computing circuit group.
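The store/load bookkeeping described above can be modeled behaviorally. The sketch below is an illustration under assumed names (`SyncChecker`, `on_store`, `on_load` do not appear in the patent); it shows only the counting logic, not the hardware implementation:

```python
class SyncChecker:
    """Behavioral sketch of the synchronization checking circuit.

    The lookup table is keyed by (group field, synchronization identification
    field); each entry's count field records how many pieces of the target
    data block are currently ready in the memory.
    """

    def __init__(self):
        self.table = {}  # (group_id, sync_id) -> count of data pieces ready

    def on_store(self, group_id, sync_id):
        # A storage request increments the count of the corresponding entry by 1.
        key = (group_id, sync_id)
        self.table[key] = self.table.get(key, 0) + 1

    def on_load(self, group_id, sync_id, target):
        # A load request is satisfied once the table's count value has reached
        # the target value carried in the request's own count field.
        return self.table.get((group_id, sync_id), 0) >= target

checker = SyncChecker()
checker.on_store(group_id=1, sync_id=7)        # first writer's data arrives
ready_early = checker.on_load(1, 7, target=2)  # target not yet reached: not ready
checker.on_store(group_id=1, sync_id=7)        # second writer's data arrives
ready_late = checker.on_load(1, 7, target=2)   # count reached target: data ready
```

In hardware the "not ready" outcome would be reported back to the requesting circuit as a synchronization check result rather than returned as a boolean.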
In summary, at least one of the computing circuits 110_1 to 110_n may be selectively organized into a computing circuit group to jointly perform an operation task. The access requests issued by the computing circuit group to the memory 121 carry synchronization information. The synchronization checking circuit 122 may check the synchronization information carried by an access request to determine whether to return the target data block corresponding to the access request to the computing circuit group. Therefore, data synchronization between different ones of the computing circuits 110_1 to 110_n can be ensured. Unlike the above embodiments, prior art schemes process shared data through the sequence store, fence, synchronize, load to ensure data synchronization, which incurs a large time overhead. Furthermore, the access requests issued by prior art schemes carry no synchronization information; that is, in the prior art, "synchronization" is separate from "store" and "load" and requires additional modules, which increases performance overhead whenever storage and synchronization must be coordinated. For the prior art schemes, the larger the scope of access synchronization, the farther the synchronization module is from the processing unit, and the larger the synchronization overhead.
Fig. 4 is a circuit block diagram of an AI chip 400 according to an embodiment of the disclosure. The AI chip 400 shown in fig. 4 includes a plurality of stream processor clusters (e.g., the stream processor cluster 410) and a shared memory 420. The stream processor cluster 410 and the shared memory 420 shown in fig. 4 can serve as one of many embodiments of the computing circuits 110_1 to 110_n and the memory circuit 120 shown in fig. 1, and thus the AI chip 400, the stream processor cluster 410, and the shared memory 420 shown in fig. 4 may refer to the related descriptions of the AI chip 100, the computing circuits 110_1 to 110_n, and the memory circuit 120 shown in fig. 1. In the embodiment shown in fig. 4, the shared memory 420 includes a secondary cache 421 and a synchronization checking circuit 422, and the synchronization checking circuit 422 includes a synchronization lookup table 422a and a check circuit 422b. The secondary cache 421 and the synchronization checking circuit 422 shown in fig. 4 may refer to the related descriptions of the memory 121 and the synchronization checking circuit 122 shown in fig. 1.
At each memory access, the stream processor cluster 410 may issue an access request with synchronization information to the secondary cache 421 based on the operation task. For example, when the access request is a storage request, the stream processor cluster 410 may issue a storage request to the secondary cache 421 to store data (a target data block) produced during the operation task into the secondary cache 421. The check circuit 422b may check the identification information fields (the synchronization identification field and the group field) of the synchronization information carried by the storage request to update the count value corresponding to the storage request in the synchronization lookup table 422a. For example, the check circuit 422b may find the corresponding entry in the synchronization lookup table 422a according to the synchronization identification field and the group field of the synchronization information of the storage request, and then increment the count value in the count field of that entry by 1 (indicating that one more piece of target data is ready in the secondary cache 421).
When the access request is a load request, the stream processor cluster 410 may issue a load request to the secondary cache 421. The check circuit 422b may check the count value in the synchronization lookup table 422a based on the synchronization information carried by the load request to determine whether to notify the secondary cache 421 to return the target data block corresponding to the load request to the stream processor cluster 410. For example, the check circuit 422b may find the corresponding entry in the synchronization lookup table 422a according to the synchronization identification field and the group field of the synchronization information of the load request, and then fetch the count value from the count field of that entry. The count field of the synchronization information of the load request issued by the stream processor cluster 410 carries a target value; the check circuit 422b may fetch the count value from the count field of the corresponding entry in the synchronization lookup table 422a and then compare the target value of the load request with that count value. When the count value in the synchronization lookup table 422a has not reached the target value (e.g., the target value of the load request is greater than the count value), the check circuit 422b may return a synchronization check result indicating that the target data block in the secondary cache 421 is not ready back to the stream processor cluster 410. When the count value in the synchronization lookup table 422a has reached the target value (e.g., the target value of the load request is less than or equal to the count value), the check circuit 422b may notify the secondary cache 421 to return the target data block corresponding to the access request to the stream processor cluster 410.
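As a concrete trace of the shared-cache scenario above, the following sketch simulates two stream processor clusters storing partial results into a shared secondary cache before a reader's target value is satisfied. The dictionaries stand in for the synchronization lookup table 422a and the secondary cache 421; all function and variable names, and the sample payload strings, are illustrative assumptions, not items from the patent:

```python
lookup_table = {}  # (group_id, sync_id) -> count field: pieces of data ready
l2_cache = {}      # (group_id, sync_id) -> data pieces stored so far

def store(data, group_id, sync_id):
    """A stream processor cluster stores target data into the secondary
    cache; the check circuit bumps the matching entry's count by 1."""
    key = (group_id, sync_id)
    l2_cache.setdefault(key, []).append(data)
    lookup_table[key] = lookup_table.get(key, 0) + 1

def load(group_id, sync_id, target):
    """Return the target data when the count has reached the load request's
    target value; otherwise signal a 'not ready' synchronization check result."""
    key = (group_id, sync_id)
    if lookup_table.get(key, 0) >= target:
        return l2_cache[key]
    return None  # synchronization check result: target data not ready

store("partial result from SPC0", group_id=3, sync_id=9)
first_try = load(3, 9, target=2)   # only one of two writers has arrived
store("partial result from SPC1", group_id=3, sync_id=9)
second_try = load(3, 9, target=2)  # both arrived: the cache returns the data
```

The same trace applies one level down to the synchronization lookup table 414a inside a stream processor cluster, with the level one cache in place of the secondary cache.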
The stream processor cluster 410 shown in fig. 4 includes a scheduler 411, a plurality of computing cores 412 (e.g., tensor cores and vector cores), and a memory circuit 413. The computing cores 412 and the memory circuit 413 shown in fig. 4 may serve as one embodiment of the computing circuits 110_1 to 110_n and the memory circuit 120 shown in fig. 1, so the computing cores 412 and the memory circuit 413 shown in fig. 4 may refer to the related descriptions of the computing circuits 110_1 to 110_n and the memory circuit 120 shown in fig. 1, and so on. In the embodiment shown in fig. 4, the memory circuit 413 includes an input buffer, a level one cache, and a synchronization check circuit 414, and the synchronization check circuit 414 includes a synchronization lookup table 414a and a checking circuit 414b. The level one cache and the synchronization check circuit 414 of fig. 4 may refer to the related description of the memory 121 and the synchronization check circuit 122 of fig. 1, and so on. The scheduler 411 may issue a computation instruction to the computing cores 412 based on the operation task, and the computing cores 412 may issue an access request carrying synchronization information to the memory circuit 413 based on the computation instruction.
For example, when the access request is a storage request, the computing core 412 may issue the storage request to the level one cache to store data produced during the operation task (a target data block) in the level one cache. The checking circuit 414b may check the identification information field (the synchronization identification field and the group field) of the synchronization information carried by the storage request to update the count value corresponding to the storage request in the synchronization lookup table 414a. For example, the checking circuit 414b may find the corresponding entry in the synchronization lookup table 414a according to the synchronization identification field and the group field of the synchronization information of the storage request, and then increment the count value in the count field of that entry by 1 (indicating that one more piece of target data is ready in the level one cache).
When the access request is a load request, the computing core 412 may issue the load request to the level one cache. The checking circuit 414b may check the count value in the synchronization lookup table 414a based on the synchronization information carried by the load request to determine whether to notify the level one cache to return the target data block corresponding to the load request to the computing core 412. For example, the checking circuit 414b may find the corresponding entry in the synchronization lookup table 414a according to the synchronization identification field and the group field of the synchronization information of the load request, and then fetch the count value from the count field of that entry. The count field of the synchronization information carried by the load request holds a target value, so the checking circuit 414b may compare the target value of the load request with the count value fetched from the synchronization lookup table 414a. When the count value in the synchronization lookup table 414a has not reached the target value (i.e., the target value of the load request is greater than the count value in the synchronization lookup table 414a), the checking circuit 414b may return a synchronization check result indicating that the target data block in the level one cache is not ready to the computing core 412. When the count value in the synchronization lookup table 414a has reached the target value (i.e., the target value of the load request is less than or equal to the count value in the synchronization lookup table 414a), the checking circuit 414b may notify the level one cache to return the target data block corresponding to the access request to the computing core 412.
Fig. 5 is a circuit block diagram of an AI chip 500 according to another embodiment of the disclosure. The AI chip 500 shown in fig. 5 includes a plurality of stream processor clusters (e.g., stream processor clusters SPC0, SPC1, SPC2, and SPC3) and a shared memory 520. The stream processor clusters SPC0 to SPC3 and the shared memory 520 shown in fig. 5 may serve as one embodiment of the computing circuits 110_1 to 110_n and the memory circuit 120 shown in fig. 1, so the AI chip 500, the stream processor clusters SPC0 to SPC3, and the shared memory 520 shown in fig. 5 may refer to the related descriptions of the AI chip 100, the computing circuits 110_1 to 110_n, and the memory circuit 120 shown in fig. 1, and so on. In the embodiment shown in fig. 5, the shared memory 520 includes a secondary cache and a synchronization check circuit SCU51. The secondary cache and the synchronization check circuit SCU51 shown in fig. 5 may refer to the related description of the secondary cache 421 and the synchronization check circuit 422 shown in fig. 4, and so on.
In the embodiment shown in fig. 5, each stream processor cluster includes a plurality of computing cores, a level one cache, and a synchronization check circuit. For example, the stream processor cluster SPC3 includes a plurality of computing cores, a level one cache, and a synchronization check circuit SCU52_3. The computing cores, the level one cache, and the synchronization check circuit SCU52_3 in the stream processor cluster SPC3 shown in fig. 5 may serve as one embodiment of the computing circuits 110_1 to 110_n, the memory 121, and the synchronization check circuit 122 shown in fig. 1. The computing cores, the level one cache, and the synchronization check circuit SCU52_3 in the stream processor cluster SPC3 shown in fig. 5 may refer to the related description of the computing cores 412, the level one cache, and the synchronization check circuit 414 shown in fig. 4, and so on. The other stream processor clusters SPC0 to SPC2 may refer to the related description of the stream processor cluster SPC3, and so on.
For convenience of explanation, it is assumed here that the stream processor clusters SPC0 to SPC3 shown in fig. 5 are selectively organized into a computing circuit group to collectively execute an operation task. During execution of the operation task, the stream processor cluster SPC3 needs to use the calculation results (target data) of the other stream processor clusters SPC0 to SPC2, and the AI chip 500 can ensure data synchronization among the stream processor clusters SPC0 to SPC3. For data synchronization, the stream processor clusters SPC0 to SPC3 send initialization information carrying synchronization information to the synchronization check circuit SCU51, and the synchronization check circuit SCU51 opens a new entry (hereinafter referred to as the corresponding entry) in a synchronization lookup table (not shown in fig. 5) based on the identification information field (the synchronization identification field and the group field) of the synchronization information of the initialization information. The count value in the count field of the newly initialized corresponding entry is an initial value (e.g., 0 or another real number).
When the stream processor cluster SPC3 sends a load request carrying synchronization information to the shared memory 520, the synchronization check circuit SCU51 may find the count value corresponding to the load request in the synchronization lookup table (not shown in fig. 5) based on the identification information field (the synchronization identification field and the group field) of the synchronization information of the load request. The count field of the synchronization information carried by the load request issued by the stream processor cluster SPC3 holds a target value (in this example, the target value is "3", because the stream processor cluster SPC3 needs to use the three calculation results of the stream processor clusters SPC0 to SPC2). The synchronization check circuit SCU51 may fetch the count value (currently the initial value "0") from the count field of the corresponding entry in the synchronization lookup table. Because the count value "0" in the synchronization lookup table has not reached the target value "3" (indicating that no target data block is stored in the secondary cache), the synchronization check circuit SCU51 may return a synchronization check result indicating that the target data block in the secondary cache is not ready to the stream processor cluster SPC3. Based on the synchronization check result, the stream processor cluster SPC3 may wait for the data to become ready.
Each time one of the stream processor clusters SPC0 to SPC3 accesses the secondary cache, it may issue an access request carrying synchronization information to the secondary cache based on the operation task. For example, the stream processor cluster SPC0 may issue a storage request to the secondary cache to store data produced during the operation task (a target data block) in the secondary cache. The synchronization check circuit SCU51 may check the identification information field (the synchronization identification field and the group field) of the synchronization information carried by the storage request to update the count value corresponding to the storage request in the synchronization lookup table (not shown in fig. 5). For example, the synchronization check circuit SCU51 may find the corresponding entry in the synchronization lookup table according to the synchronization identification field and the group field of the synchronization information of the storage request, and then increment the count value in the count field of that entry by 1 (indicating that one more piece of target data is ready in the secondary cache). During this process, when the stream processor cluster SPC3 sends a load request carrying synchronization information to the shared memory 520, the synchronization check circuit SCU51 may fetch the count value (which has not yet reached "3") from the count field of the corresponding entry in the synchronization lookup table. Because the count value in the synchronization lookup table has not reached the target value "3", the synchronization check circuit SCU51 may return a synchronization check result indicating that the target data block in the secondary cache is not ready to the stream processor cluster SPC3.
Based on the synchronization check result, the stream processor cluster SPC3 may block the thread to wait for the data to be ready, or the stream processor cluster SPC3 may directly return a "load failure" signal to the scheduler (not shown in fig. 5) without blocking the thread.
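The two responses to a failed synchronization check described above, blocking the thread until the data is ready versus immediately reporting a "load failure" to the scheduler, might be sketched as follows. The callback-based interface and the return conventions are illustrative assumptions, not taken from the patent.

```python
# Two consumer-side policies after a failed synchronization check:
# blocking (wait and re-check) vs. non-blocking (report load failure).

def load_with_blocking(check_ready, fetch, wait):
    # Block the issuing thread; `wait` is called between re-checks
    # (in hardware this would be a stall, modeled here as a callback).
    while not check_ready():
        wait()
    return fetch()

def load_without_blocking(check_ready, fetch):
    # Do not block: report "load failure" so the scheduler can
    # reissue the load request later.
    if not check_ready():
        return ("LOAD_FAILURE", None)
    return ("OK", fetch())

# Usage: readiness flips after two waits.
state = {"count": 0}
assert load_without_blocking(lambda: state["count"] >= 2,
                             lambda: "data")[0] == "LOAD_FAILURE"
result = load_with_blocking(
    lambda: state["count"] >= 2,
    lambda: "data",
    wait=lambda: state.__setitem__("count", state["count"] + 1))
assert result == "data"
```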
After each of the stream processor clusters SPC0 to SPC2 has stored its calculation result (target data block) in the secondary cache, the count value in the count field of the corresponding entry in the synchronization lookup table has been incremented to "3". When the stream processor cluster SPC3 sends a load request carrying synchronization information to the shared memory 520, the synchronization check circuit SCU51 may find the count value corresponding to the load request (now "3") in the synchronization lookup table (not shown in fig. 5) based on the identification information field (the synchronization identification field and the group field) of the synchronization information of the load request. Because the count value "3" in the synchronization lookup table has reached the target value "3" (indicating that the stream processor clusters SPC0 to SPC2 have all stored their calculation results in the secondary cache), the synchronization check circuit SCU51 may notify the secondary cache to return the target data block corresponding to the access request to the stream processor cluster SPC3. In this way, the AI chip 500 can ensure data synchronization among the stream processor clusters SPC0 to SPC3.
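The producer/consumer walk-through above can be simulated end to end: SPC0 to SPC2 each store one calculation result to the shared secondary cache, and SPC3's load request with target value "3" only succeeds once all three stores have arrived. The dict-based cache and the function names are illustrative assumptions.

```python
# Minimal simulation of the SPC0-SPC3 data synchronization example.
lookup_table = {}     # (sync_id, group_id) -> count value
secondary_cache = {}  # (sync_id, group_id) -> stored target data blocks

SYNC_ID, GROUP = 0, 0

def initialize(sync_id, group):
    # Initialization information opens the corresponding entry at 0.
    lookup_table[(sync_id, group)] = 0

def store(sync_id, group, data):
    # A storage request deposits a target data block and bumps the count.
    secondary_cache.setdefault((sync_id, group), []).append(data)
    lookup_table[(sync_id, group)] += 1

def load(sync_id, group, target):
    # A load request succeeds only when the count reaches its target.
    if lookup_table[(sync_id, group)] < target:
        return None  # synchronization check result: not ready
    return secondary_cache[(sync_id, group)]

initialize(SYNC_ID, GROUP)
assert load(SYNC_ID, GROUP, target=3) is None  # count 0 < target 3
for cluster in ("SPC0", "SPC1", "SPC2"):
    store(SYNC_ID, GROUP, f"result from {cluster}")
assert load(SYNC_ID, GROUP, target=3) == [
    "result from SPC0", "result from SPC1", "result from SPC2"]
```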
Depending on the actual design, the technique shown in fig. 5 may be further optimized in some embodiments. The AI chip 500 of fig. 5 may enable different levels of synchronization check circuits to perform the synchronization check depending on the scope of the data access. When the AI chip 500 finds that the storing and reading of the target data block to be accessed occur only within the same stream processor cluster, the AI chip 500 may enable a local-level synchronization check circuit (e.g., the synchronization check circuit SCU52_3) to achieve data synchronization. When the AI chip 500 finds that the storing and reading of the target data block to be accessed occur in a memory shared by different stream processor clusters (e.g., the secondary cache), the AI chip 500 may enable a wide-area-level synchronization check circuit (e.g., the synchronization check circuit SCU51) to achieve data synchronization. When the AI chip 500 finds that only one stream processor cluster (e.g., the stream processor cluster SPC3) needs to read the target data block, the wide-area-level synchronization check circuit SCU51 may synchronize the corresponding entry related to that stream processor cluster into the synchronization lookup table (not shown in fig. 5) of that cluster's own synchronization check circuit (e.g., the synchronization check circuit SCU52_3). That stream processor cluster may then perform the synchronization check locally to achieve data synchronization.
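The level-selection rule described above is simple enough to state as code: if both the store side and the load side of a target data block are confined to one cluster, the local check circuit suffices; otherwise the wide-area circuit at the shared memory is needed. This is a hedged sketch of the decision only; the enum values and function name are illustrative, not from the patent.

```python
# Sketch of choosing which level of synchronization check circuit to
# enable, based on which stream processor clusters store and load the
# target data block.
from enum import Enum

class SyncLevel(Enum):
    LOCAL = "per-cluster check circuit (e.g., SCU52_3)"
    WIDE_AREA = "shared-memory check circuit (e.g., SCU51)"

def select_sync_level(store_clusters, load_clusters):
    # Store and load confined to a single cluster -> local check;
    # any cross-cluster access -> wide-area check at the shared memory.
    if len(set(store_clusters) | set(load_clusters)) == 1:
        return SyncLevel.LOCAL
    return SyncLevel.WIDE_AREA

assert select_sync_level(["SPC3"], ["SPC3"]) is SyncLevel.LOCAL
assert select_sync_level(["SPC0", "SPC1"], ["SPC3"]) is SyncLevel.WIDE_AREA
```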
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present disclosure, not to limit them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, and some or all of their technical features may be replaced by equivalents; such modifications and substitutions do not cause the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present disclosure.

Claims (18)

1. An artificial intelligence chip, the artificial intelligence chip comprising:
a memory circuit; and
a plurality of computing circuits coupled to the memory circuit, wherein at least one computing circuit of the plurality of computing circuits is selectively organized into a computing circuit group to collectively execute an operation task, the computing circuit group issues an access request carrying synchronization information to the memory circuit based on the operation task, and the memory circuit checks the synchronization information to determine whether to return a target data block corresponding to the access request to the computing circuit group;
wherein the synchronization information includes an identification information field and a count field,
when the access request is a storage request, the memory circuit checks the identification information field of the synchronization information carried by the storage request to update a count value corresponding to the storage request in a synchronization lookup table; and
when the access request is a load request, the memory circuit finds a count value corresponding to the load request in the synchronization lookup table based on the identification information field of the synchronization information carried by the load request, and the memory circuit checks the count value in the synchronization lookup table based on the count field of the synchronization information carried by the load request to determine whether to return the target data block corresponding to the access request to the computing circuit group.
2. The artificial intelligence chip of claim 1, wherein:
when the access request is the load request and the count value in the synchronization lookup table reaches a target value in the count field of the synchronization information carried by the load request, the memory circuit returns the target data block corresponding to the access request to the computing circuit group; and
when the access request is the load request and the count value in the synchronization lookup table does not reach the target value, the memory circuit returns a synchronization check result indicating that the target data block in the memory circuit is not ready to the computing circuit group.
3. The artificial intelligence chip of claim 1, wherein the identification information field includes a synchronization identification field and a group field, the group field to record an identification number of the computing circuit group, and the synchronization identification field to record an identification number of the target data block.
4. The artificial intelligence chip of claim 1, wherein the synchronization information further comprises a type field and a valid field, and wherein the type field is used to record a type number of the operation task.
5. The artificial intelligence chip of claim 1, wherein at least one first computing circuit in the computing circuit group issues the storage request to the memory circuit to store the target data block in the memory circuit, the memory circuit includes the synchronization lookup table, and the memory circuit checks the synchronization information of the storage request issued by the at least one first computing circuit to update a count value corresponding to the target data block in the synchronization lookup table,
and at least one second computing circuit in the computing circuit group issues the load request to the memory circuit, and the memory circuit checks the count value in the synchronization lookup table based on the synchronization information of the load request to determine whether to return the target data block corresponding to the load request to the at least one second computing circuit.
6. The artificial intelligence chip of claim 1, wherein the plurality of computing circuits includes a plurality of computing cores within a same stream processor cluster, the memory circuit includes a memory and a synchronization check circuit within the same stream processor cluster, at least one first computing core of the plurality of computing cores issues the storage request to the memory to store the target data block in the memory, the synchronization check circuit includes the synchronization lookup table, the synchronization check circuit checks the synchronization information of the storage request issued by the at least one first computing core to update a count value corresponding to the target data block in the synchronization lookup table, at least one second computing core of the plurality of computing cores issues the load request to the memory, and the synchronization check circuit checks the count value in the synchronization lookup table based on the synchronization information carried by the load request to determine whether to notify the memory to return the target data block corresponding to the load request to the at least one second computing core.
7. The artificial intelligence chip of claim 6, wherein the plurality of computing cores comprises at least one tensor core and at least one vector core, and the memory comprises an input buffer and a level one cache.
8. The artificial intelligence chip of claim 1, wherein the plurality of computing circuits comprises a plurality of stream processor clusters, the memory circuit comprises a shared memory and a synchronization check circuit, at least one first stream processor cluster of the plurality of stream processor clusters issues the storage request to the shared memory to store the target data block in the shared memory, the synchronization check circuit comprises the synchronization lookup table, the synchronization check circuit checks the synchronization information of the storage request sent by the at least one first stream processor cluster to update a count value corresponding to the target data block in the synchronization lookup table, at least one second stream processor cluster of the plurality of stream processor clusters sends the load request to the shared memory, and the synchronization check circuit checks the count value in the synchronization lookup table based on the synchronization information carried by the load request to determine whether to notify the shared memory to return the target data block corresponding to the load request to the at least one second stream processor cluster.
9. The artificial intelligence chip of claim 8, wherein the shared memory comprises a secondary cache.
10. A method of data synchronization for an artificial intelligence chip, the artificial intelligence chip comprising a memory circuit and a plurality of computation circuits coupled to the memory circuit, the method comprising:
selectively organizing at least one computing circuit of the plurality of computing circuits into a computing circuit group to collectively execute an operation task;
the computing circuit group issuing an access request carrying synchronization information to the memory circuit based on the operation task; and
checking, by the memory circuit, the synchronization information to determine whether to return a target data block corresponding to the access request to the computing circuit group;
wherein the synchronization information includes an identification information field and a count field;
the data synchronization method further comprising:
when the access request is a storage request, checking, by the memory circuit, the identification information field of the synchronization information of the storage request to update a count value corresponding to the storage request in a synchronization lookup table; and
when the access request is a load request, finding, by the memory circuit, a count value corresponding to the load request in the synchronization lookup table based on the identification information field of the synchronization information carried by the load request, and checking, by the memory circuit, the count value in the synchronization lookup table based on the count field of the synchronization information carried by the load request to determine whether to return the target data block corresponding to the access request to the computing circuit group.
11. The data synchronization method according to claim 10, characterized in that the data synchronization method further comprises:
when the access request is the load request and the count value in the synchronization lookup table reaches a target value in the count field of the synchronization information carried by the load request, returning, by the memory circuit, the target data block corresponding to the access request to the computing circuit group; and
when the access request is the load request and the count value in the synchronization lookup table does not reach the target value, returning, by the memory circuit, a synchronization check result indicating that the target data block in the memory circuit is not ready to the computing circuit group.
12. The data synchronization method according to claim 10, wherein the identification information field includes a synchronization identification field and a group field, the group field being used to record an identification number of the calculation circuit group, and the synchronization identification field being used to record an identification number of the target data block.
13. The data synchronization method according to claim 10, wherein the synchronization information further includes a type field and a valid field, and the type field is used to record a type number of the operation task.
14. The data synchronization method according to claim 10, characterized in that the data synchronization method further comprises:
issuing, by at least one first computing circuit of the computing circuit group, the storage request to the memory circuit to store the target data block in the memory circuit, wherein the memory circuit includes the synchronization lookup table;
checking, by the memory circuit, the synchronization information of the storage request issued by the at least one first computing circuit to update a count value corresponding to the target data block in the synchronization lookup table;
issuing, by at least one second computing circuit of the computing circuit group, the load request to the memory circuit; and
checking, by the memory circuit, the count value in the synchronization lookup table based on the synchronization information of the load request to determine whether to return the target data block corresponding to the load request to the at least one second computing circuit.
15. The data synchronization method of claim 10, wherein the plurality of computing circuits comprise a plurality of computing cores within a same stream processor cluster, the memory circuit comprises a memory and a synchronization check circuit within the same stream processor cluster, the data synchronization method further comprising:
issuing, by at least one first computing core of the plurality of computing cores, the storage request to the memory to store the target data block in the memory, wherein the synchronization check circuit includes the synchronization lookup table;
checking, by the synchronization check circuit, the synchronization information of the storage request issued by the at least one first computing core to update a count value corresponding to the target data block in the synchronization lookup table;
issuing, by at least one second computing core of the plurality of computing cores, the load request to the memory; and
checking, by the synchronization check circuit, the count value in the synchronization lookup table based on the synchronization information of the load request to determine whether to notify the memory to return the target data block corresponding to the load request to the at least one second computing core.
16. The data synchronization method of claim 15, wherein the plurality of computing cores includes at least one tensor core and at least one vector core, and the memory includes an input buffer and a level one cache.
17. The data synchronization method of claim 10, wherein the plurality of computing circuits comprise a plurality of stream processor clusters and the memory circuit comprises a shared memory and a synchronization check circuit, the data synchronization method further comprising:
issuing, by at least one first stream processor cluster of the plurality of stream processor clusters, the storage request to the shared memory to store the target data block in the shared memory, wherein the synchronization check circuit includes the synchronization lookup table;
checking, by the synchronization check circuit, the synchronization information of the storage request sent by the at least one first stream processor cluster to update a count value corresponding to the target data block in the synchronization lookup table;
issuing, by at least one second stream processor cluster of the plurality of stream processor clusters, the load request to the shared memory; and
checking, by the synchronization check circuit, the count value in the synchronization lookup table based on the synchronization information of the load request to determine whether to notify the shared memory to return the target data block corresponding to the load request to the at least one second stream processor cluster.
18. The method of claim 17, wherein the shared memory comprises a secondary cache.
CN202410194532.7A 2024-02-22 2024-02-22 Artificial intelligent chip and data synchronization method thereof Active CN117850705B (en)


Publications (2)

Publication Number Publication Date
CN117850705A 2024-04-09
CN117850705B 2024-05-07





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant