US11625587B2 - Artificial intelligence integrated circuit - Google Patents
Artificial intelligence integrated circuit
- Publication number
- US11625587B2 (application US16/745,675)
- Authority
- US
- United States
- Prior art keywords
- data
- circuit
- cache
- artificial intelligence
- integrated circuit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0811—Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Definitions
- the present invention relates to an integrated circuit and, in particular, to an artificial intelligence (AI) integrated circuit (IC).
- Common hardware for artificial intelligence (AI) computation includes central processing units (CPUs), graphics processing units (GPUs), tensor processing units (TPUs), field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs).
- In comparison with a CPU, TPU, and FPGA, a GPU has better programmability, virtualization features, and a good software environment.
- However, the graphics-processing pipeline in the GPU itself has many functions that are redundant for artificial intelligence operations, such as task management for graphics rendering, storage buffers, rasterization, and rendering output units, and these components take up more than 1/3 of the area of the GPU.
- the GPU has a disadvantage in terms of performance/power ratio compared to ASIC.
- Artificial intelligence operations can be divided into a training phase and an inference phase.
- In the training phase, the parameters of the artificial intelligence network are mainly learned from training samples.
- In the inference phase, the trained parameters are used to convert the input data into a classification result.
- deep-learning accelerators on the market can be divided into two categories: the first is based on ASIC/FPGA, and the second is based on CPU/GPU.
- the first type of deep-learning accelerators may be designed to mainly accelerate convolution and matrix calculations.
- the advantage is that the above operations are very efficient, but the disadvantage is that a customized software development package and compilers are needed to re-support all deep-learning frameworks.
- the adaptation period of the software ecosystem of the first type of deep-learning accelerators is very long, and their adaptability to new algorithms is very weak. Therefore, the design of the first type of deep-learning accelerators is only suitable for the inference stage.
- the second type of deep-learning accelerators may include a main acceleration module with different instruction set architectures, such as single instruction multiple thread (SIMT), SSE, Matrix, and the like.
- the advantages of this type of accelerator are an excellent software ecosystem, comprehensive development tools, virtualization capability, and excellent adaptability to algorithms.
- the CPU and GPU can be used in the training phase as well as in the inference phase. Because the GPU itself has parallel rendering units, it can provide a greater speedup ratio than the CPU.
- However, the GPU is not dedicated to artificial intelligence calculation and contains a large number of graphics-processing elements, so the GPU has a disadvantage in performance/power ratio compared to an ASIC accelerator.
- Although the GPU can use, for example, an acceleration scheme such as Matrix or SIMT, there are too many 3D graphics-processing elements in the GPU.
- the GPU also lacks buffer pre-load techniques and data compression and storage methods required for an artificial intelligence chip.
- an artificial intelligence integrated circuit includes: a command processor, a plurality of processing elements, a task constructor, an L1 cache, and an L2 cache.
- the command processor is configured to analyze a command queue to generate one or more tasks.
- the task constructor is configured to receive the task from the command processor to generate a plurality of threads to control the processing elements.
- Each processing element includes: a plurality of arithmetic logic units (ALUs), a plurality of deep-learning accelerators, a common register file, and an access controller.
- the ALUs are configured to perform arithmetic and logic operations.
- the deep-learning accelerators are configured to perform hardware multiplication-addition operations, activation functions, and pooling.
- the common register file is configured to store data and intermediate results of operations performed by the ALUs and deep-learning accelerators.
- the access controller is configured to control data access to the L1 cache and the L2 cache.
- the access controller is configured to control the L1 cache and L2 cache to dynamically prefetch data stored in a memory unit external to the artificial intelligence integrated circuit, and the prefetched data is for use by matrix multiplication-addition operations performed by the deep-learning accelerators.
- the memory unit is a dynamic random access memory or a host buffer memory of a host that is electrically connected to the artificial intelligence integrated circuit.
- the L1 cache comprises a first preload circuit and the L2 cache comprises a second preload circuit, and the first preload circuit and the second preload circuit prefetch data from the L2 cache and the memory unit, respectively.
- when the access controller is tasked to write first data to the L1 cache, the first preload circuit sends the first data to a first data compressor for a first data compression process to generate second data, and writes the second data to the L2 cache.
- the second preload circuit sends the second data to a second data compressor for a second data compression process to generate third data, and the second data compressor writes the third data to the memory unit.
- the first data compression process is tasked to compress the first data using a compression algorithm for expanded matrix data to generate the second data
- the second data compression process is tasked to compress the second data using a residue-based image-compression algorithm and a sparse-matrix-compression algorithm to generate the third data.
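- The claims do not specify the exact compression algorithms. As a purely illustrative sketch, the Python snippet below shows compressed sparse row (CSR) encoding, one common lossless sparse-matrix compression scheme of the kind the second data-compression process could employ; the function names and data layout are assumptions, not the patented method.

```python
import numpy as np

def csr_compress(dense):
    """Encode a 2-D matrix in compressed sparse row (CSR) form: keep only the
    non-zero values plus their column indices and per-row offsets. The encoding
    is lossless, consistent with the requirement that compression preserve
    data correctness."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        nz = np.nonzero(row)[0]
        values.extend(row[nz].tolist())
        col_idx.extend(nz.tolist())
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def csr_decompress(values, col_idx, row_ptr, n_cols):
    """Rebuild the original dense matrix from its CSR encoding."""
    dense = np.zeros((len(row_ptr) - 1, n_cols))
    for r in range(len(row_ptr) - 1):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            dense[r, col_idx[k]] = values[k]
    return dense

# Example: a mostly-zero 4x6 matrix compresses to a handful of values.
m = np.zeros((4, 6)); m[0, 1] = 3.0; m[2, 5] = -1.5
vals, cols, ptrs = csr_compress(m)
assert np.array_equal(csr_decompress(vals, cols, ptrs, 6), m)
```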
- when the access controller is tasked to read the third data stored in the memory unit, the second preload circuit sends the third data to a second decompression circuit to perform a second data decompression process on the third data to obtain the second data.
- the first preload circuit directly transmits the second data to a first decompression circuit in each processing element to perform a first data decompression process on the second data to obtain the first data, and stores the first data in the common register file of each processing element.
- the deep-learning accelerator in each processing element includes: a matrix multiplication-addition calculator, an activation-function circuit, and a pooling circuit.
- the matrix multiplication-addition calculator is configured to perform a matrix multiplication-addition calculation on the first data to obtain a first matrix calculation result.
- the activation-function circuit is configured to perform activation on the first matrix calculation result to generate a second matrix calculation result.
- the pooling circuit is configured to perform pooling on the second matrix calculation result to generate a final result, and store the final result in the common register file.
- in response to the first data for the matrix convolution calculation stored in the common register file being ready, the deep-learning accelerator loads the first data into a register file in the deep-learning accelerator, and loads the first data from the register file into the matrix multiplication-addition calculator to perform matrix multiplication-addition operations.
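- As a rough software model of the data flow described in the preceding claims (matrix multiplication-addition, then activation, then pooling, with the result written back to the common register file), the NumPy sketch below is illustrative only; the function name, tile sizes, and the choice of ReLU and max pooling are assumptions rather than the hardware implementation.

```python
import numpy as np

def dla_forward(inputs, weights, bias, pool=2):
    """Illustrative software model of the claimed data flow:
    matrix multiplication-addition -> activation -> pooling."""
    # Matrix multiplication-addition calculation (first matrix calculation result).
    mac_out = inputs @ weights + bias
    # Activation function (ReLU chosen here; sigmoid and tanh are also claimed).
    act_out = np.maximum(mac_out, 0.0)
    # Max pooling over non-overlapping windows along the last axis.
    trimmed = act_out[:, : act_out.shape[1] // pool * pool]
    pooled = trimmed.reshape(act_out.shape[0], -1, pool).max(axis=2)
    return pooled  # in hardware, written back to the common register file

# Example with a hypothetical 4x8 input tile and 8x8 weight tile.
x = np.random.rand(4, 8).astype(np.float32)
w = np.random.rand(8, 8).astype(np.float32)
b = np.zeros(8, dtype=np.float32)
result = dla_forward(x, w, b)
```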
- the first preload circuit and the second preload circuit can be set to a hardware mode or a software mode.
- in the hardware mode, the first preload circuit and the second preload circuit perform address prediction using previously fetched data, and respectively prefetch data from the L2 cache and the memory unit according to the predicted addresses.
- in the software mode, the first preload circuit and the second preload circuit respectively fetch data from the L2 cache and the memory unit according to hint information from the software.
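- The patent does not disclose how the hardware-mode address prediction works. The sketch below is a minimal, assumed illustration of one plausible approach, a stride-based predictor driven by previously fetched addresses; the class name, prefetch depth, and confirmation rule are not taken from the patent.

```python
class StridePrefetcher:
    """Toy next-address predictor: if recent accesses advance by a constant
    stride, prefetch the addresses that continue the pattern."""

    def __init__(self, depth=4):
        self.last_addr = None
        self.last_stride = None
        self.depth = depth  # how many lines ahead to prefetch

    def access(self, addr):
        prefetches = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                # Pattern confirmed twice in a row: predict the next addresses.
                prefetches = [addr + stride * i for i in range(1, self.depth + 1)]
            self.last_stride = stride
        self.last_addr = addr
        return prefetches

pf = StridePrefetcher()
for a in (0x1000, 0x1040, 0x1080):
    candidates = pf.access(a)  # third access yields predicted prefetch addresses
```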
- the matrix multiplication-addition calculator supports matrix multiplication in any matrix size and accelerated multiplication of sparse matrices, and determines the calculation of loops according to the size and sparsity of matrices.
- the activation-function circuit supports rectified linear unit (ReLU), sigmoid, and tanh functions, and the pooling circuit performs mean pooling or max pooling on the second matrix calculation result to generate the final result.
- the artificial intelligence integrated circuit supports application programming interfaces (API) of OpenCL, CUDA, and DirectCompute, and does not include a three-dimensional (3D) graphics rendering module.
- FIG. 1 is a block diagram of an artificial intelligence integrated circuit in accordance with an embodiment of the invention
- FIG. 2 is a diagram of the data-processing procedure in the artificial intelligence integrated circuit in accordance with an embodiment of the invention.
- FIG. 3 is a diagram of the L1 cache and L2 cache in accordance with an embodiment of the invention.
- FIG. 1 is a block diagram of an artificial intelligence integrated circuit in accordance with an embodiment of the invention.
- the artificial intelligence integrated circuit 100 may mainly include a command buffer 110 , a command processor 120 , a task constructor 130 , a plurality of processing elements (PEs) 140 , a level-1 (L1) cache 150 , a level-2 (L2) cache 160 , and a mixed conversion unit (MXU) 170 .
- the command buffer 110 may be configured to receive an operation command from a host, and sequentially store the operation command in a command queue.
- the command buffer 110 may be a first-in-first-out (FIFO) buffer, but the invention is not limited thereto.
- the command processor 120 may be configured to detect the commands to be processed in the command queue of the command buffer 110, analyze the commands in the command queue (e.g., including entries 110-1 to 110-N) in accordance with a weight-based scheduling algorithm, and dispatch commands to the task constructor 130 to control different processing elements 140.
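- The weight-based scheduling algorithm is not detailed in the text. The sketch below assumes a simple scheme in which each queued command carries a weight and higher-weight commands are dispatched first; the command names and weights are hypothetical.

```python
import heapq

def dispatch_order(commands):
    """Order queued commands the way a weight-based scheduler might:
    commands with a larger weight are dispatched first.
    `commands` is a list of (name, weight) pairs."""
    heap = [(-weight, idx, name) for idx, (name, weight) in enumerate(commands)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, name = heapq.heappop(heap)  # idx keeps FIFO order for equal weights
        order.append(name)
    return order

# Hypothetical queue entries and weights (not from the patent).
print(dispatch_order([("convolution", 8), ("memory copy", 2), ("pooling", 5)]))
# -> ['convolution', 'pooling', 'memory copy']
```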
- the task constructor 130 may be configured to generate tasks or threads to be executed by respective processing elements 140 .
- the command processor 120 may parse commands of a general GPU application programming interface (API) of OpenCL, CUDA, or DirectCompute, etc., and distribute commands to the task constructor 130 .
- Each of the processing elements 140 may include a plurality of computing units 141 , a plurality of data decompressors 142 , a memory controller 143 , and an access controller 144 , wherein each computing unit 141 corresponds to each data decompressor 142 .
- the artificial intelligence integrated circuit 100 may support virtualization, and allow dozens of different tasks to run in parallel. For example, each processing element 140 can perform parallel operations according to individual threads.
- Each computing unit 141 may include an arithmetic logic unit (ALU) 1411, a deep-learning accelerator (DLA) 1412, and a common register file (CRF) 1413.
- the arithmetic logic unit 1411 may be configured to perform common arithmetic and logic operations according to the threads from the task constructor 130 .
- the deep-learning accelerator 1412 may be configured to perform artificial intelligence/deep learning related operations, such as a matrix multiplication-addition calculation (MAC) of any size, sparse matrix acceleration multiplication, activation function, and pooling.
- the common register file 1413 may be configured to store input matrices, calculation results, or intermediate numeric values of the arithmetic logic unit 1411 and the deep-learning accelerator 1412.
- the data decompressor 142 may be a matrix decompressor that is configured to decompress the compressed matrix data read from the L1 cache 150 , and store the decompressed matrix data in the common register file 1413 in each computation unit 141 or the common register file (not shown in FIG. 1 ) in each deep-learning accelerator 1412 .
- The data decompressors 142 are coupled to each other via the bus 145.
- the memory controller 143 may be configured to control accesses of the static random access memory (SRAM) 148 , and the SRAM 148 , for example, may be configured to store the temporary numeric values or data required by calculations performed by each processing element 140 .
- the access controller 144 is configured to write data to the memory hierarchy or read data from the memory hierarchy.
- the memory hierarchy may include the L1 cache 150 , L2 cache 160 , and the dynamic random access memory (DRAM) 180 .
- the L1 cache 150 is electrically connected to bus 145 .
- Each data decompressor 142 may store data in the L1 cache 150 , L2 cache 160 , and/or the DRAM 180 (or the host buffer memory (HBM)) according to the requirements of calculations and a predetermined cache mechanism, where the predetermined cache mechanism, for example, may be controlled by the memory controller in each processing element 140 .
- the L1 cache 150 and L2 cache 160 may be used as the first-level cache memory and the second-level cache memory in the memory hierarchy, respectively, and the storage capacity of the L1 cache 150 is less than that of the L2 cache.
- the storage capacity of the L1 cache 150 for example, may be 8K bits, and the storage capacity of the L2 cache 160 may be 2K bits, but the invention is not limited to the aforementioned storage capacities.
- the first data (e.g., image data or matrix data) stored in the common register file 1413 is written to the L1 cache 150 .
- when the first data is to be written from the L1 cache 150 to the L2 cache 160, the data compressor 155 (e.g., a matrix-data compressor) may perform a first data-compression process to compress the first data to generate second data, and write the second data to the L2 cache 160.
- the compressor 163 may perform a second data-compression process to compress the second data to generate third data, and write the third data to the DRAM 180 , wherein the third data can be regarded as the double compressed data.
- the aforementioned first data-compression process may use a well-known matrix-compression algorithm in the art of the present invention to compress the expanded matrix data.
- the matrix data often consists of sparse matrices and image data, and thus the aforementioned second data-compression process may utilize the well-known residue-based image-compression algorithm and sparse-matrix compression algorithm in the art of the present invention.
- the first data-compression process and the second data-compression process are both lossless compression algorithms to ensure the correctness of the data.
- FIG. 3 is a diagram of the L1 cache 150 and L2 cache 160 .
- the data (e.g., uncompressed data) is stored in the L1 cache 150, which may be divided into four equal segments 301 to 304.
- the size of each segment 301 to 304 may be 2K bits.
- the compressed data of the segments 311 to 314 can be obtained, wherein the size of each of the segments 311 to 314 may be 0.5K bits, and the total storage capacity of the L2 cache 160 is 2K bits.
- the invention is not limited to the aforementioned storage capacities of the L1 cache 150 and L2 cache 160 , and the data compression ratio can be adjusted according to practical conditions.
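- Working through the example numbers above: four 2K-bit L1 segments (8K bits in total) are compressed into four 0.5K-bit segments (2K bits in total), i.e. a 4:1 compression ratio, so the compressed contents of the L1 cache 150 exactly fill the 2K-bit L2 cache 160.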
- the L1 cache 150 may include a preload circuit 151 that is configured to control the L1 cache 150 to write data to the L2 cache 160 or read data (e.g., compressed data) from the L2 cache 160 , and the preload circuit can be set to a software mode or a hardware mode.
- the L2 cache 160 may include a preload circuit 161 that is configured to control the L2 cache 160 to write data to the DRAM 180 or read data from the DRAM 180 , and the preload circuit 161 can also be set to the software mode or hardware mode.
- in the software mode, the preload circuit 151 may prefetch data (e.g., matrix data and associated parameters required by matrix convolution operations) into the L1 cache 150 according to the hint information provided by the software (e.g., executed by the host or CPU).
- in the hardware mode, the preload circuit 151 may perform address prediction according to the previously loaded data, and prefetch data (e.g., matrix data and associated parameters required by matrix convolution operations) according to the predicted addresses.
- the preload circuit 151 may automatically send a read request to prefetch data from the L2 cache 160, and the prefetched data will be marked as preloaded data in the L1 cache 150, wherein the preloaded data, for example, may be matrix data. Because the L1 cache 150 uses a cache replacement mechanism and the preloaded data has not yet been referenced by the processing elements 140, the reference count of the preloaded data is 0. The reference count may indicate the number of times the current data is to be used by the processing elements 140; each time the current data is used by the processing elements 140, the corresponding reference count is decreased by 1.
- the preload circuit 151 still has the opportunity to replace part of the preloaded data before the preloaded data is used.
- when a cache line needs to be replaced, the priority from high to low may be: unassigned cache lines, non-preload cache lines with a reference count of 0, and preloaded cache lines.
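- A minimal sketch of the victim-selection priority described above, assuming each cache line carries an assigned flag, a preloaded flag, and a reference count (the field names are illustrative, not taken from the patent):

```python
def pick_victim(cache_lines):
    """Choose a line to evict following the stated priority:
    1) unassigned lines, 2) non-preload lines whose reference count is 0,
    3) preloaded lines. Lines still referenced by the processing elements
    are not evicted. Each line is a dict such as
    {"assigned": True, "preloaded": False, "ref_count": 0}."""
    def priority(line):
        if not line["assigned"]:
            return 0
        if not line["preloaded"] and line["ref_count"] == 0:
            return 1
        if line["preloaded"]:
            return 2
        return 3  # data still in use: not a replacement candidate

    candidates = [line for line in cache_lines if priority(line) < 3]
    return min(candidates, key=priority) if candidates else None
```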
- the preload operation of the preload circuit 151 in the hardware mode may increase or decrease the aggressiveness of preloading data into the L1 cache 150 according to the replacement status of the preloaded matrix data. If the preloaded matrix data in the L1 cache 150 is quickly replaced, the preload circuit 151 will prefetch matrix data at a higher frequency or with a larger amount of data. Conversely, if the preloaded data in the L1 cache 150 is rarely replaced, the preload circuit 151 will reduce the frequency or amount of data prefetched from the L2 cache 160.
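- The patent states the direction of this adjustment but gives no concrete thresholds; the sketch below assumes simple replacement-rate thresholds and a doubling/halving policy purely for illustration.

```python
class PreloadThrottle:
    """Adjust prefetch aggressiveness from the replacement rate of preloaded
    lines, following the rule stated above. The 0.5/0.1 thresholds and the
    doubling/halving steps are assumptions, not disclosed values."""

    def __init__(self, lines_per_request=4):
        self.lines_per_request = lines_per_request

    def update(self, preloaded_replaced, preloaded_resident):
        total = preloaded_replaced + preloaded_resident
        if total == 0:
            return self.lines_per_request
        replace_rate = preloaded_replaced / total
        if replace_rate > 0.5:        # preloaded data is replaced quickly
            self.lines_per_request = min(self.lines_per_request * 2, 64)
        elif replace_rate < 0.1:      # preloaded data is rarely replaced
            self.lines_per_request = max(self.lines_per_request // 2, 1)
        return self.lines_per_request
```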
- the preload circuit 161 in the L2 cache 160 can also be set to the software mode or hardware mode, but the preload circuit 161 prefetches data from the DRAM 180 .
- the behaviors of the preload circuit 161 are similar to those of the preload circuit 151 , and thus the details thereof will not be described herein. It should be noted that when the preload circuit 161 is tasked to write the one-time-compressed matrix data in the L2 cache 160 to the DRAM 180 or the host buffer memory, a second compression process is performed by the compressor 163 on the one-time-compressed matrix data before writing the two-time-compressed matrix data to the DRAM 180 or the host buffer memory. Accordingly, the data written to the DRAM 180 external to the artificial intelligence integrated circuit 100 is the two-time-compressed data, and thus the amount of data during data transmission can be significantly reduced to save the bandwidth.
- the L1 cache 150 and L2 cache 160 in the invention support dynamically prefetching data, thus reducing the latency of fetching data while performing matrix operations.
- the L1 cache 150 and L2 cache 160 in the invention further support compression storage to significantly reduce the bandwidth requirement for data storage, thereby reducing the pressure of storing the intermediate results of matrix operations.
- the above describes the flow of data compression/encoding for storage. If the access controller 144 of the processing element 140 is tasked to read data from the DRAM 180, the flow of data decompression and decoding is the converse of the flow of data compression and encoding.
- the mixed conversion unit 170 may be configured to perform conversion between virtual addresses and physical addresses, security checking, page-table management, issuing command requests to different buses, and global synchronization, wherein peer-to-peer PCIe equipment 182 can be connected to the mixed conversion unit 170.
- the mixed conversion unit 170 can be connected to other chips in the system through the high-speed bus 184 .
- FIG. 2 is a diagram of the data-processing flow in accordance with an embodiment of the invention.
- the command processor 120 may analyze the commands in the command queue in the command buffer 110, and provide tasks to the task constructor 130.
- the task constructor 130 may generate a plurality of threads 20 - 1 to 20 -N for the processing elements 140 according to the tasks provided by the command processor 120 , wherein each thread may manage 32 to 64 parallel tasks.
- the ALU 1411 in each processing element 140 may perform operations such as address calculation, synchronization, management, and particular calculations according to the tasks from the task constructor 130 .
- the threads 20 - 1 to 20 -N may fetch the parameters and data (e.g., matrix data) required by the convolution calculation from the DRAM 180 via the access controller 144 .
- the threads 20 - 1 to 20 -N may control the access controller 144 to perform two compression processes on the matrix data that is not used yet and write the two-time-compressed matrix data to the DRAM 180 or the host buffer memory through path 251 , where the details of path 251 can be found in the embodiment of FIG. 1 .
- the first compressed data associated with the parameters and data to be fetched will be decompressed using a first-level decompression process by the decompressor 164 via the mixed conversion unit 170 , wherein the first-level decompression process, for example, may include sparse matrix expansion and lossless decompression on the compressed color data to generate second compressed data.
- the data decompressor 142 is disposed between the L1 cache 150 and the common register file 1413 of each processing element 140 .
- the bandwidth between the L1 cache 150 and each processing element 140 is limited. If the data decompressor 142 is disposed external to each processing element 140 , expanded matrix data will be obtained after the data decompressor 142 performs matrix decompression on the second compressed data, and the data amount of expanded matrix data will be expanded by 4 to 8 times that of the second compressed data.
- the data decompressor 142 of the present invention is disposed between the L1 cache 150 and the common register file 1413 of each processing element 140 , and is capable of serving multiple processing elements 140 simultaneously.
- the data decompressor 142 in each processing element 140 may decompress the second compressed data to generate the expanded matrix data, and store the expanded matrix data in the common register file 1413 .
- the matrix convolution calculations in the deep-learning accelerator 1412 require a great amount of matrix data, and the matrix data is pre-stored in the common register file 1413 in each processing element 140. Accordingly, the aforementioned flow of receiving the second compressed data from the L1 cache 150 and decompressing the second compressed data to generate the matrix data will be repeated several times to accumulate the amount of matrix data required for the convolution operations in the deep-learning accelerator 1412.
- the deep-learning accelerator 1412 may load the matrix data from the common register file 1413 to the common register file 211 of the deep-learning accelerator 1412 , and input the matrix data 212 from the common register file 211 to the matrix multiplication-addition calculator 216 .
- the matrix multiplication-addition calculator 216 may support matrix multiplication in any size and accelerated multiplication of sparse matrices, and is capable of determining the calculation of loops according to the size and sparsity of the matrices.
- the matrix multiplication-addition calculator 216 may perform 256 multiplication-addition calculations. If the calculation result generated by the matrix multiplication-addition calculator 216 is the final matrix multiplication-addition result, the matrix multiplication-addition calculator 216 may input the final matrix multiplication-addition result to the activation-function circuit 217 to perform activation to introduce non-linear relationship into the neural network of deep learning.
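- As an illustrative software model (not the disclosed hardware loop structure), the sketch below decomposes an arbitrary-size matrix multiplication into fixed-size tile passes and skips passes whose operand tiles are entirely zero, a crude stand-in for the sparse-matrix acceleration; the tile size and the skipping rule are assumptions.

```python
import numpy as np

def tiled_matmul(a, b, tile=16):
    """Decompose an arbitrary-size matrix multiplication into fixed-size
    tile passes, each multiplying at most a tile x tile block of A by a
    tile x tile block of B. All-zero operand tiles are skipped as a simple
    illustration of sparsity-aware loop control."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((m, n), dtype=np.result_type(a, b))
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                a_blk = a[i:i + tile, p:p + tile]
                b_blk = b[p:p + tile, j:j + tile]
                if not a_blk.any() or not b_blk.any():
                    continue  # skip passes whose operands are entirely zero
                out[i:i + tile, j:j + tile] += a_blk @ b_blk
    return out
```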
- the activation-function circuit 217 may support functions such as rectified linear unit (ReLU), sigmoid, tanh, etc., but the invention is not limited thereto.
- the matrix multiplication-addition calculator 216 may input the final result to the pooling circuit 218 or the activation-function circuit 217 may input the processing result to the pooling circuit 218 , thereby performing operations such as mean pooling or max pooling.
- the matrix multiplication-addition result processed by the activation-function circuit 217 and/or pooling circuit 218 may be the final result in one of the calculation layers in the neural network, and the final result will be written back to the common register file 1413 for use by the next calculation layer.
- an artificial intelligence integrated circuit retains the general-purpose calculation portion (e.g., ALUs) of the GPU architecture, and includes deep-learning accelerators to perform hardware acceleration of the matrix convolution operations (e.g., including accelerated matrix multiplication in any matrix size and accelerated sparse-matrix multiplication) for artificial intelligence and deep learning, and the artificial intelligence integrated circuit does not include any functional modules for 3D graphics rendering.
- the artificial intelligence integrated circuit in the invention can perform artificial intelligence/deep-learning related operations with faster speed and lower power consumption than a conventional GPU, which means that the flexibility and software ecosystem, as well as the performance/power-consumption ratio of hardware acceleration, can both be taken into account, and the circuit can be applied to both the training phase and the inference phase of artificial intelligence/deep learning.
- the L1 cache and L2 cache of the artificial intelligence integrated circuit of the present invention support dynamically prefetching data to reduce the latency of fetching matrix data during artificial intelligence operations, and can support compressed storage such as feature-based matrix-data compression, residue-based image compression, and sparse-matrix encoding compression, which can significantly reduce the requirement for data bandwidth.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- Mathematical Analysis (AREA)
- Computational Mathematics (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Neurology (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Memory System (AREA)
- Memory System Of A Hierarchy Structure (AREA)
- Image Generation (AREA)
Abstract
Description
Claims (13)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010004385.4 | 2020-01-03 | ||
CN202010004385.4A CN111240743B (en) | 2020-01-03 | 2020-01-03 | Artificial intelligence integrated circuit |
Publications (2)
Publication Number | Publication Date |
---|---|
US20210209451A1 (en) | 2021-07-08 |
US11625587B2 (en) | 2023-04-11 |
Family
ID=70874263
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/745,675 Active 2041-09-09 US11625587B2 (en) | 2020-01-03 | 2020-01-17 | Artificial intelligence integrated circuit |
Country Status (2)
Country | Link |
---|---|
US (1) | US11625587B2 (en) |
CN (1) | CN111240743B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20230004804A1 (en) * | 2020-04-09 | 2023-01-05 | Micron Technology, Inc. | System on a Chip with Deep Learning Accelerator and Random Access Memory |
US11874897B2 (en) | 2020-04-09 | 2024-01-16 | Micron Technology, Inc. | Integrated circuit device with deep learning accelerator and random access memory |
US11887647B2 (en) | 2020-04-09 | 2024-01-30 | Micron Technology, Inc. | Deep learning accelerator and random access memory with separate memory access connections |
US11942135B2 (en) | 2020-04-09 | 2024-03-26 | Micron Technology, Inc. | Deep learning accelerator and random access memory with a camera interface |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11726784B2 (en) | 2020-04-09 | 2023-08-15 | Micron Technology, Inc. | Patient monitoring using edge servers having deep learning accelerator and random access memory |
CN112416817B (en) * | 2020-12-02 | 2023-02-17 | 海光信息技术股份有限公司 | Prefetching method, information processing apparatus, device, and storage medium |
US20220067524A1 (en) * | 2021-10-28 | 2022-03-03 | Intel Corporation | Sparsity-aware datastore for inference processing in deep neural network architectures |
CN116520754B (en) * | 2023-06-27 | 2023-09-22 | 厦门芯泰达集成电路有限公司 | DPS module control method and system based on preloading mode |
CN117421112A (en) * | 2023-10-18 | 2024-01-19 | 中科驭数(北京)科技有限公司 | Acceleration unit, network card, host and message processing acceleration method |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4442487A (en) * | 1981-12-31 | 1984-04-10 | International Business Machines Corporation | Three level memory hierarchy using write and share flags |
US7904951B1 (en) * | 1999-03-16 | 2011-03-08 | Novell, Inc. | Techniques for securely accelerating external domains locally |
US20190095333A1 (en) * | 2017-09-28 | 2019-03-28 | Intel Corporation | Independent tuning of multiple hardware prefetchers |
US20190325303A1 (en) * | 2018-04-24 | 2019-10-24 | Intel Corporation | Machine learning accelerator architecture |
US20190362461A1 (en) * | 2018-08-10 | 2019-11-28 | Intel Corporation | Multi-object, three-dimensional modeling and model selection |
US20200042240A1 (en) * | 2018-07-31 | 2020-02-06 | Marvell International Ltd. | Storage edge controller with a metadata computational engine |
US20200265276A1 (en) * | 2019-02-14 | 2020-08-20 | Siemens Healthcare Gmbh | Copd classification with machine-trained abnormality detection |
US20200327048A1 (en) * | 2019-04-09 | 2020-10-15 | Vmware, Inc. | Implementing fine grain data coherency of a shared memory region |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106991477B (en) * | 2016-01-20 | 2020-08-14 | 中科寒武纪科技股份有限公司 | Artificial neural network compression coding device and method |
US10402527B2 (en) * | 2017-01-04 | 2019-09-03 | Stmicroelectronics S.R.L. | Reconfigurable interconnect |
WO2018176435A1 (en) * | 2017-04-01 | 2018-10-04 | Intel Corporation | Execution unit-shared hybrid technique for accelerated computing on graphics processors |
US11934934B2 (en) * | 2017-04-17 | 2024-03-19 | Intel Corporation | Convolutional neural network optimization mechanism |
CN107590533B (en) * | 2017-08-29 | 2020-07-31 | 中国科学院计算技术研究所 | Compression device for deep neural network |
US11636327B2 (en) * | 2017-12-29 | 2023-04-25 | Intel Corporation | Machine learning sparse computation mechanism for arbitrary neural networks, arithmetic compute microarchitecture, and sparsity for training mechanism |
US11270201B2 (en) * | 2017-12-29 | 2022-03-08 | Intel Corporation | Communication optimizations for distributed machine learning |
US10546393B2 (en) * | 2017-12-30 | 2020-01-28 | Intel Corporation | Compression in machine learning and deep learning processing |
CN108090560A (en) * | 2018-01-05 | 2018-05-29 | 中国科学技术大学苏州研究院 | The design method of LSTM recurrent neural network hardware accelerators based on FPGA |
US10572568B2 (en) * | 2018-03-28 | 2020-02-25 | Intel Corporation | Accelerator for sparse-dense matrix multiplication |
US10963394B2 (en) * | 2018-04-16 | 2021-03-30 | Samsung Electronics Co., Ltd. | System and method for optimizing performance of a solid-state drive using a deep neural network |
CN108615074B (en) * | 2018-04-28 | 2021-04-23 | 中国科学院计算技术研究所 | Neural network processing system and method based on compressed sensing |
CN109993297A (en) * | 2019-04-02 | 2019-07-09 | 南京吉相传感成像技术研究院有限公司 | A kind of the sparse convolution neural network accelerator and its accelerated method of load balancing |
-
2020
- 2020-01-03 CN CN202010004385.4A patent/CN111240743B/en active Active
- 2020-01-17 US US16/745,675 patent/US11625587B2/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN111240743B (en) | 2022-06-03 |
US20210209451A1 (en) | 2021-07-08 |
CN111240743A (en) | 2020-06-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11625587B2 (en) | Artificial intelligence integrated circuit | |
EP3274841B1 (en) | Compaction for memory hierarchies | |
CN109643443B (en) | Cache and compression interoperability in graphics processor pipelines | |
US10860326B2 (en) | Multi-threaded instruction buffer design | |
US20200342632A1 (en) | Efficient matrix format suitable for neural networks | |
KR102392816B1 (en) | System having low power computation architecture | |
CN111316261B (en) | Matrix computing engine | |
CN111310910A (en) | Computing device and method | |
CN113344171A (en) | Vector quantization decoding hardware unit for real-time dynamic decompression of neural network parameters | |
US20210026638A1 (en) | Low Latency Fetch Circuitry for Compute Kernels | |
US20240105260A1 (en) | Extended memory communication | |
US9189394B2 (en) | Memory-link compression for graphic processor unit | |
US20170083450A1 (en) | Supporting Data Conversion and Meta-Data in a Paging System | |
US10782918B2 (en) | Near-memory data-dependent gather and packing | |
US11935153B2 (en) | Data compression support for accelerated processor | |
WO2022047802A1 (en) | Processing-in-memory device and data processing method thereof | |
US20210303992A1 (en) | Executing neural networks on electronic devices | |
EP4149008A1 (en) | Verifying compressed stream fused with copy or transform operations | |
KR100463205B1 (en) | Computer system embedded sequantial buffer for improving DSP data access performance and data access method thereof | |
US11263051B2 (en) | Techniques for scaling dictionary-based compression | |
US10620958B1 (en) | Crossbar between clients and a cache | |
US10452401B2 (en) | Hints for shared store pipeline and multi-rate targets | |
KR20220100030A (en) | Pattern-Based Cache Block Compression | |
US20050204122A1 (en) | Hierarchical storage architecture for reconfigurable logic configurations | |
US20220197878A1 (en) | Compressed Read and Write Operations via Deduplication |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AS | Assignment | Owner name: SHANGHAI ZHAOXIN SEMICONDUCTOR CO., LTD., CHINA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: GU, DEMING; REEL/FRAME: 051544/0404. Effective date: 20191229 |
| | FEPP | Fee payment procedure | Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: APPLICATION DISPATCHED FROM PREEXAM, NOT YET DOCKETED |
| | AS | Assignment | Owner name: GLENFLY TECHNOLOGY CO., LTD., CHINA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: SHANGHAI ZHAOXIN SEMICONDUCTOR CO., LTD.; REEL/FRAME: 055083/0139. Effective date: 20210126 |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
| | STCF | Information on status: patent grant | Free format text: PATENTED CASE |