WO2021185287A1 - Decompression device, accelerator, and method for a decompression device (一种解压装置、加速器、和用于解压装置的方法) - Google Patents

Decompression device, accelerator, and method for a decompression device

Info

Publication number
WO2021185287A1
Authority
WO
WIPO (PCT)
Prior art keywords
engine
data
storage device
operation engine
memory
Prior art date
Application number
PCT/CN2021/081353
Other languages
English (en)
French (fr)
Inventor
徐斌 (Xu Bin)
何雷骏 (He Leijun)
王明书 (Wang Mingshu)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2021185287A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06: Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601: Interfaces specially adapted for storage systems
    • G06F 3/0602: Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/062: Securing storage systems
    • G06F 3/0622: Securing storage systems in relation to access
    • G06F 3/0668: Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/0671: In-line storage system
    • G06F 3/0683: Plurality of storage devices
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • One or more embodiments of the present application generally relate to the field of artificial intelligence, and specifically relate to a decompression device, an accelerator, and a method for the decompression device.
  • Artificial Intelligence (AI) has been widely deployed in terminals, the edge side, the cloud, etc., to implement functions such as image recognition, target detection, and speech translation. Among AI technologies, deep learning models are the most widely used, and many manufacturers have developed corresponding AI acceleration chips.
  • However, the computational complexity and parameter redundancy of deep learning models limit their deployment in some scenarios and devices.
  • A model miniaturization algorithm is usually used to compress the deep learning model data (for example, model parameters and/or model input data). Because the model miniaturization algorithm reduces data redundancy, it can reduce storage occupancy, communication bandwidth, and computational complexity. Model miniaturization technology has become a core technology for AI acceleration chips to ease the memory wall, reduce power consumption, and improve application performance.
  • Before computation, the compressed deep learning model data needs to be decompressed.
  • However, current AI acceleration chips usually support only one or two model miniaturization decompression algorithms, which are relatively fixed and cannot effectively support the evolution of subsequent model miniaturization decompression algorithms.
  • Moreover, existing model miniaturization decompression algorithms all use independent large processing units. If several large processing units work in a pipeline, the pipeline order is generally fixed and much hardware is wasted: one processing unit needs to decompress all the data and store it in a large cache before sending all the decompressed data to another processing unit. If several large processing units do not work in a pipeline, each processing unit needs to read data from memory again before operating, which wastes memory bandwidth.
  • the first aspect of the present application provides a decompression device, which is used to perform at least one operation on data related to instructions, and includes:
  • At least one operation engine corresponding to at least one operation
  • At least one storage device is used to store data that has undergone each of the at least one operation
  • The first storage device in the at least one storage device includes a first memory and a first controller, where the first controller is used to generate a first back pressure signal when the storage amount of the first memory is greater than or equal to a first predetermined amount, and to send the first back pressure signal to a first operation engine of the at least one operation engine to control the first operation engine to stop outputting the data operated on by the first operation engine to the first memory.
  • The first predetermined amount may indicate the back pressure threshold of the first memory, where the back pressure threshold is related to the maximum storage amount of the first memory and also to the rate at which the first operation engine outputs data to the first memory. For example, but not limited to, if the maximum storage amount of the first memory is 128 bytes and the rate at which the first operation engine outputs data to the first memory is 64 bytes/clock cycle, then the back pressure threshold can be 64 bytes or higher than 64 bytes (for example, 96 bytes).
  • The first storage device has a real-time back pressure mechanism: once the first operation engine receives the back pressure signal from the first storage device, it immediately suspends all operations and stops outputting data to the first memory, thereby preventing the first memory from overflowing even when the first memory has a small storage capacity.
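  • As an illustration only (not part of the original disclosure), the waterline arithmetic of the example above can be sanity-checked with a short behavioral model; the function below assumes the 128-byte memory and the 64 bytes/clock-cycle output rate from the text, and all names are hypothetical.

```python
# Toy check of the back pressure threshold example: a 128-byte memory filled at
# 64 B per clock cycle, with back pressure asserted once the stored amount
# reaches the waterline. Both thresholds named in the text (64 B and 96 B) work.
def fills_without_overflow(capacity=128, rate=64, waterline=64, cycles=10):
    stored = 0
    for _ in range(cycles):
        if stored >= waterline:  # back pressure signal: engine stops outputting
            break
        stored += rate           # one clock cycle's worth of output data
    return stored <= capacity

print(fills_without_overflow(waterline=64), fills_without_overflow(waterline=96))
# True True
```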
  • the first memory is also used to input data operated by the first operation engine to a second operation engine of the plurality of operation engines.
  • The first storage device may buffer the data that is to be input to the second operation engine after the operation of the first operation engine, to absorb the transmission and reception delay, or delay variation, caused by the large amount of data received by the second operation engine.
  • Because the first storage device has a real-time back pressure mechanism, a first memory with a small storage capacity can realize a concurrent pipeline between the first operation engine and the second operation engine, which improves processing performance without increasing memory bandwidth and minimizes hardware resource consumption to achieve the best end-to-end performance and power consumption.
  • The first predetermined amount at least partially indicates the back pressure threshold of the first memory in the case where the rate at which the first operation engine outputs data to the first memory is higher than the rate at which the first memory inputs data to the second operation engine.
  • When the decompression device includes multiple operation engines and the at least one storage device further includes a second storage device, the second storage device is used to output data operated on by the second operation engine to a third operation engine among the multiple operation engines.
  • When the storage amount of the second memory in the second storage device is greater than or equal to a second predetermined amount, the second controller in the second storage device is used to generate a second back pressure signal and send it to the second operation engine to control the second operation engine to stop outputting the data operated on by the second operation engine to the second memory.
  • The second storage device may buffer the data that is to be input to the third operation engine after the operation of the second operation engine, to absorb the transmission and reception delay, or delay variation, caused by the large amount of data received by the third operation engine.
  • Because the second storage device has a real-time back pressure mechanism, a second memory with a small storage capacity can realize a concurrent pipeline between the second operation engine and the third operation engine, which improves processing performance without increasing memory bandwidth and minimizes hardware resource consumption to achieve the best end-to-end performance and power consumption.
  • The second predetermined amount at least partially indicates the back pressure threshold of the second memory in the case where the rate at which the second operation engine outputs data to the second memory is higher than the rate at which the second memory inputs data to the third operation engine or the calculation engine.
  • The second operation engine is further used to send the second back pressure signal to the first operation engine, to control the first operation engine to stop outputting the data operated on by the first operation engine to the first memory.
  • After the second operation engine receives the back pressure signal from the second storage device, the first storage device stops outputting data to the second operation engine. Therefore, having the second operation engine send the back pressure signal to the first operation engine so that the first operation engine stops outputting data to the first memory can prevent the first storage device from reaching its back pressure threshold in a short time.
  • the decompression device further includes:
  • The policy management device is used to determine the operation sequence of the at least one operation, to start the at least one operation engine and/or the at least one storage device according to the operation sequence, and also to determine the routing sequence between the at least one operation engine and the at least one storage device, where the routing sequence determines the input and output order between each operation engine in the at least one operation engine and each storage device in the at least one storage device.
  • In this way, the model miniaturization decompression algorithm is decomposed into multiple fine-grained operations, and different operation engines are started as required, so that any combination of the operation engines can support subsequent model miniaturization decompression algorithms without modifying the hardware design.
  • the policy management device is further configured to send a start signal to at least one operation engine and/or at least one storage device for starting at least one operation engine and/or at least one storage device.
  • the start signal includes a start signal sent to at least one operation engine and a channel gating signal sent to at least one storage device.
  • the at least one operation includes at least one of table lookup decompression, masking, comparison, and quantization.
  • At least one operation is related to decompression.
  • The second aspect of the present application provides an accelerator, including: the decompression device described above; and a calculation engine, used to perform calculation, according to the instruction, on data that has undergone the at least one operation performed by the decompression device.
  • the first memory is also used to input data operated by the first operation engine to the calculation engine.
  • The first storage device may buffer the data that is to be input to the calculation engine after the operation of the first operation engine, to absorb the transmission and reception delay, or delay variation, caused by the large amount of data received by the calculation engine. In addition, because the first storage device has a real-time back pressure mechanism, a first memory with a small storage capacity can realize a concurrent pipeline between the first operation engine and the calculation engine, which improves processing performance without increasing memory bandwidth and minimizes hardware resource consumption to achieve the best end-to-end performance and power consumption.
  • The first predetermined amount at least partially indicates the back pressure threshold of the first memory in the case where the rate at which the first operation engine outputs data to the first memory is higher than the rate at which the first memory inputs data to the calculation engine.
  • When the decompression device includes multiple operation engines and the at least one storage device further includes a second storage device, the first memory is also used to input data operated on by the first operation engine to a second operation engine among the multiple operation engines, and the second storage device is used to output data operated on by the second operation engine to the calculation engine.
  • When the storage amount of the second memory in the second storage device is greater than or equal to a second predetermined amount, the second controller in the second storage device is used to generate a second back pressure signal and send it to the second operation engine to control the second operation engine to stop outputting the data operated on by the second operation engine to the second memory.
  • The second storage device may buffer the data that is to be input to the calculation engine after the operation of the second operation engine, to absorb the transmission and reception delay, or delay variation, caused by the large amount of data received by the calculation engine. In addition, because the second storage device has a real-time back pressure mechanism, a second memory with a small storage capacity can realize a concurrent pipeline between the second operation engine and the calculation engine, which improves processing performance without increasing memory bandwidth and minimizes hardware resource consumption to achieve the best end-to-end performance and power consumption.
  • The second predetermined amount at least partially indicates the back pressure threshold of the second memory in the case where the rate at which the second operation engine outputs data to the second memory is higher than the rate at which the second memory inputs data to the calculation engine.
  • the third aspect of the present application provides a method for a decompression device, the method including:
  • At least one operation engine of the decompression device performs at least one operation on data related to the instruction
  • At least one storage device of the decompression device stores data operated by each of the at least one operation engine
  • When the storage amount of the first storage device in the at least one storage device is greater than or equal to a first predetermined amount, the first storage device generates a first back pressure signal and sends it to a first operation engine of the at least one operation engine, and the first operation engine stops outputting the data operated on by the first operation engine to the first storage device in response to the first back pressure signal.
  • The first predetermined amount may indicate the back pressure threshold of the first memory, where the back pressure threshold may be related to the maximum storage amount of the first memory and also to the rate at which the first operation engine outputs data to the first memory. For example, but not limited to, if the maximum storage amount of the first memory is 128 bytes and the rate at which the first operation engine outputs data to the first memory is 64 bytes/clock cycle, then the back pressure threshold can be 64 bytes or more than 64 bytes (for example, 96 bytes).
  • The first storage device has a real-time back pressure mechanism: once the first operation engine receives the back pressure signal from the first storage device, it immediately suspends all operations and stops outputting data to the first memory, thereby preventing the first memory from overflowing even when the first memory has a small storage capacity.
  • the method further includes:
  • the first storage device inputs data operated by the first operation engine to a second operation engine among the plurality of operation engines.
  • The first storage device may buffer the data that is to be input to the second operation engine after the operation of the first operation engine, to absorb the transmission and reception delay, or delay variation, caused by the large amount of data received by the second operation engine.
  • Because the first storage device has a real-time back pressure mechanism, a first memory with a small storage capacity can realize a concurrent pipeline between the first operation engine and the second operation engine, which improves processing performance without increasing memory bandwidth and minimizes hardware resource consumption to achieve the best end-to-end performance and power consumption.
  • The first predetermined amount indicates, at least in part, the back pressure threshold of the first storage device in the case where the rate at which the first operation engine outputs data to the first storage device is higher than the rate at which the first storage device inputs data to the second operation engine.
  • the method further includes:
  • When the at least one operation engine includes a plurality of operation engines and the at least one storage device further includes a second storage device, the second storage device outputs data operated on by the second operation engine to a third operation engine among the plurality of operation engines.
  • the method further includes:
  • When the storage amount of the second storage device is greater than or equal to a second predetermined amount, the second storage device generates a second back pressure signal and sends it to the second operation engine to control the second operation engine to stop outputting the data operated on by the second operation engine to the second storage device.
  • The second storage device may buffer the data that is to be input to the third operation engine after the operation of the second operation engine, to absorb the transmission and reception delay, or delay variation, caused by the large amount of data received by the third operation engine.
  • Because the second storage device has a real-time back pressure mechanism, a second storage device with a small storage capacity can realize a concurrent pipeline between the second operation engine and the third operation engine, which improves processing performance without increasing memory bandwidth and minimizes hardware resource consumption to achieve the best end-to-end performance and power consumption.
  • The second predetermined amount indicates, at least in part, the back pressure threshold of the second storage device in the case where the rate at which the second operation engine outputs data to the second storage device is higher than the rate at which the second storage device inputs data to the third operation engine.
  • the method further includes:
  • the second operation engine sends a second back pressure signal to the first operation engine for controlling the first operation engine to stop outputting the data operated by the first operation engine to the first storage device.
  • After the second operation engine receives the back pressure signal, the first storage device stops outputting data to the second operation engine. Therefore, sending the back pressure signal to the first operation engine so that the first operation engine stops outputting data to the first storage device can prevent the first storage device from reaching the back pressure threshold in a short time.
  • the method further includes:
  • The policy management device in the decompression device determines the operation sequence of the at least one operation and starts the at least one operation engine and the at least one storage device according to the operation sequence; the policy management device also determines the routing sequence between the at least one operation engine and the at least one storage device, where the routing sequence determines the input and output order between each of the at least one operation engine and each of the at least one storage device.
  • In this way, the model miniaturization decompression algorithm is decomposed into multiple fine-grained operations, and different operation engines are started as required, so that any combination of the operation engines can support subsequent model miniaturization decompression algorithms without modifying the hardware design.
  • the method further includes:
  • the policy management device sends a start signal to the at least one operation engine and the at least one storage device for starting the at least one operation engine and the at least one storage device.
  • the start signal includes a start signal sent to at least one operation engine and a channel gating signal sent to at least one storage device.
  • the at least one operation includes at least one of table lookup decompression, masking, comparison, and quantization.
  • At least one operation is related to decompression.
  • the fourth aspect of the present application provides a system, including:
  • a memory, on which data related to instructions is stored
  • an accelerator, used to read the data from the memory and perform any of the methods described above on the data.
  • the fifth aspect of the present application provides a decompression device, which is used to perform at least one operation on data related to instructions, and includes:
  • At least one operation engine corresponding to at least one operation
  • At least one storage device for storing data that has undergone each of the at least one operation
  • The policy management device is used to determine the operation sequence of the at least one operation, to start the at least one operation engine and/or the at least one storage device according to the operation sequence, and also to determine the routing sequence between the at least one operation engine and the at least one storage device, where the routing sequence determines the input/output order between each operation engine in the at least one operation engine and each storage device in the at least one storage device.
  • In this way, the model miniaturization decompression algorithm is decomposed into multiple fine-grained operations, and different operation engines are started as required, so that any combination of the operation engines can support subsequent model miniaturization decompression algorithms without modifying the hardware design.
  • Fig. 1 is a schematic structural diagram of an AI acceleration system according to an embodiment of the present application
  • Figure 2 is a schematic structural diagram of a decompression device according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the operation engine and the pipeline register device level selected and activated by the policy management device according to an embodiment of the present application;
  • FIG. 4 is a schematic diagram of the back pressure mechanism of the first-level pipeline register device according to an embodiment of the present application.
  • FIG. 5 is another schematic diagram of the operation engine and the pipeline register device level selected and activated by the policy management device according to an embodiment of the present application;
  • Fig. 6 is a schematic flowchart of a method for an AI accelerator according to an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of a back pressure method of a pipeline register device according to an embodiment of the present application.
  • Fig. 1 shows a schematic structural diagram of an AI acceleration system according to an embodiment of the present application.
  • As shown in Fig. 1, the AI acceleration system includes a main control central processing unit (CPU) 1000, a system memory 2000, and an AI accelerator 4000, which are respectively coupled to the interconnect bus 3000. The AI accelerator 4000 includes a memory transfer engine (MTE) 4100, a decompression device 4200, a post-level memory 4300, and a calculation engine 4400.
  • In some embodiments, the post-level memory 4300 can be located inside the calculation engine 4400 and used as a part of the calculation engine 4400, and the AI acceleration system can also include other modules, such as, but not limited to, an input/output module.
  • On the one hand, the main control CPU 1000 can be a microprocessor, a digital signal processor, a microcontroller, etc., and/or any combination thereof; on the other hand, the main control CPU 1000 can be a single-core processor, a multi-core processor, etc., and/or any combination thereof.
  • The system memory 2000 may include any suitable memory, such as non-volatile memory, volatile memory, etc. Examples of non-volatile memory may include, but are not limited to, Read Only Memory (ROM); examples of volatile memory may include, but are not limited to, Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), cache memory (Cache), and the like.
  • One or more components of the AI accelerator 4000 (for example, one or more of the MTE 4100, the decompression device 4200, and the calculation engine 4400) can be implemented by any one of, or a combination of, hardware, software, and firmware, for example, by application-specific integrated circuits (ASIC), electronic circuits, (shared, dedicated, or group) processors and/or memories that execute one or more software or firmware programs, combinational logic circuits, and other suitable components that provide the described functions.
  • The post-level memory 4300 may include, but is not limited to, Random Access Memory (RAM).
  • AI accelerators can be deployed in any device that requires AI acceleration, such as smart phones, mobile data centers, public clouds, and Internet of Things devices.
  • The system memory 2000 stores data, such as, but not limited to, deep learning model data compressed by a model miniaturization algorithm (for example, but not limited to, the parameters of the deep learning model and/or the input of the deep learning model), original deep learning model data, or other types of data that have not been compressed by a model miniaturization algorithm.
  • the main control CPU 1000 can control the AI accelerator 4000 to start through the interconnect bus 3000, so that the AI accelerator 4000 can read data from the system memory 2000 through the interconnect bus 3000 for processing.
  • In some embodiments, a model miniaturization algorithm is used to compress data, and it may include, but is not limited to, a pruning sparse algorithm, a quantization algorithm, a coding algorithm, a compressed sensing algorithm based on a circulant matrix, a compression algorithm based on matrix decomposition, etc.
  • the pruning sparse algorithm can prune unimportant connections in the deep learning model to make model parameters sparse, which can include weight pruning, channel pruning, and so on.
  • The quantization algorithm can cluster the sparsely pruned model parameters to some discrete, low-precision numerical points, and it can include INT8/INT4/INT2/INT1 quantization, binary network quantization, ternary network quantization, vector quantization, etc. Take INT8 quantization as an example.
  • the parameters of the deep neural network model trained by the backpropagation algorithm are usually represented by 32-bit floating point numbers.
  • INT8 quantization can use a clustering algorithm to cluster the parameters of each layer of the deep learning model into classes; the parameters belonging to the same class share the same value, represented by an 8-bit integer.
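  • As an illustrative aside (not part of the original disclosure), the sketch below models this shared-codebook idea in Python; the helper names and the nearest-centroid assignment are assumptions, and a real implementation would refine the centroids with k-means.

```python
import numpy as np

def int8_cluster_quantize(weights, n_codes=256):
    """Toy cluster-based INT8 quantization: weights in the same class share one
    FP32 codebook entry, and each weight is stored as an 8-bit index."""
    flat = weights.ravel().astype(np.float32)
    # Evenly spaced initial centroids; a k-means pass would refine these.
    codebook = np.linspace(flat.min(), flat.max(), n_codes, dtype=np.float32)
    indices = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
    return indices.reshape(weights.shape).astype(np.uint8), codebook

def dequantize(indices, codebook):
    # The decompression side recovers FP32 values with a simple table lookup.
    return codebook[indices]

w = np.random.randn(4, 4).astype(np.float32)
idx, codebook = int8_cluster_quantize(w)
print(np.abs(dequantize(idx, codebook) - w).max())  # small reconstruction error
```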
  • the coding algorithm can encode data such as model input and quantized model parameters, which can include Huffman coding, run-length coding based on dictionary technology, LZW coding, and so on.
  • the compressed sensing algorithm based on the circulant matrix uses the circulant matrix as the measurement matrix of compressed sensing to obtain a sparse representation of the parameter matrix of the deep learning model.
  • the compression algorithm based on matrix factorization uses matrix factorization to reduce the dimension of the deep learning model parameter matrix.
  • The MTE 4100 is used for the management and distribution of instructions, for example, but not limited to, sending to the decompression device 4200 an instruction to read data from the system memory 2000 and start processing, and sending to the calculation engine 4400 an instruction to read the data processed by the decompression device 4200 from the post-level memory 4300 and start calculation.
  • the decompression device 4200 is configured to perform one or more operations on the data related to the instruction of the MTE 4100 to convert it into data that can be calculated by the calculation engine 4400.
  • The one or more operations may be related to the decompression algorithm corresponding to the model miniaturization algorithm, for example, obtained by decomposing the decompression algorithm, where the decompression algorithm is used to restore the model data compressed by the model miniaturization algorithm; for example, a decoding algorithm can recover model data compressed by an encoding algorithm.
  • Examples of the one or more operations may include, but are not limited to: a decoding operation, used to decode model parameters and/or model input data encoded by an encoding algorithm; a quantization operation, used to perform data type conversion on model input, model parameters, and other data quantized by a quantization algorithm, for example, converting model parameters back to 32-bit floating point numbers or into data types that can be calculated by the calculation engine 4400; a mask operation and/or a comparison operation, used to restore the model parameters pruned by the pruning sparse algorithm; a shift operation, used to obtain the cyclic shift matrix to restore the original model parameter matrix; and a dot multiplication operation and an addition operation, used to restore the original model data matrix from the dimensionality-reduced model parameter matrix, etc.
  • the calculation engine 4400 is configured to perform calculations on data after one or more operations performed by the decompression device 4200 according to the instructions of the MTE 4100.
  • FIG. 2 shows a schematic structural diagram of a decompression device 4200 according to an embodiment of the present application.
  • The decompression device 4200 may include an instruction management device 4210, a policy management device 4220, an operation engine device 4230, a pipeline register device 4240, and a write cache register device 4250.
  • The policy management device 4220 further includes a memory 4221 (for example, but not limited to, RAM) and a controller 4222;
  • the operation engine device 4230 further includes a look-up table decoding engine 4231, a quantization engine 4232, a mask engine 4233, a comparison engine 4234, and REG RAM 4235;
  • the pipeline register device 4240 further includes a primary pipeline register device 4241 and a secondary pipeline register device 4242, while the primary pipeline register device 4241 further includes a primary pipeline register 42411, a counter 42412 and a controller 42413, and a secondary pipeline register device 4242 further includes a secondary pipeline register 42421, a counter 42422, and a controller 42423.
  • The number and types of operation engines included in the operation engine device 4230 are not limited to those shown in FIG. 2; the operation engine device 4230 may also include other operation engines, including, but not limited to, a shift engine, a dot product engine, an addition engine, a transparent transmission engine, etc.
  • The transparent transmission engine does not perform any operation on the model data other than passing it through; it can be used in scenarios where the deep learning model data has not been compressed by a model miniaturization algorithm.
  • the number of stages of the pipeline register device included in the pipeline register device 4240 is not limited to that shown in FIG. 2, and the pipeline register device 4240 may include any number of stages of pipeline register devices.
  • Although FIG. 2 shows the pipeline register device 4240 and the write cache register device 4250 as independent of each other, the write cache register device 4250 can also be used as one level of the pipeline register device 4240.
  • the instruction management device 4210 may receive instructions from the MTE 4100.
  • The data is stored in the form of data blocks in the system memory 2000. Each data block has an index, the data blocks correspond to the indexes one-to-one, and each index can indicate the total length of the corresponding data block, whether it has been compressed, and other information.
  • the instruction from the MTE 4100 may indicate the number of data blocks that need to be processed by the decompression device 4200 and the index corresponding to the starting data block.
  • the instruction management device 4210 may obtain the index corresponding to the data block to be processed from the system memory 2000 according to the instruction information, and generate and maintain an index table including the obtained index.
  • the instruction management device 4210 may also send the index information of the data block to be read to the policy management device 4220 according to the index table.
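  • Purely as an illustration (the on-chip index format is not specified in the text), the index bookkeeping described above can be modeled as follows; the field and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class BlockIndex:
    # One index per data block, recording the information the text names:
    # the block's total length and whether it has been compressed.
    block_id: int
    total_length: int
    compressed: bool

def build_index_table(all_indices, start_id, count):
    """Sketch of the instruction management device: the instruction gives the
    starting data block's index and the number of blocks to process."""
    return [all_indices[i] for i in range(start_id, start_id + count)]

table = build_index_table(
    {i: BlockIndex(i, 128, True) for i in range(8)}, start_id=2, count=3)
print([entry.block_id for entry in table])  # [2, 3, 4]
```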
  • The controller 4222 of the policy management device 4220 may receive the index information from the instruction management device 4210, determine the storage address in the system memory 2000 of the data block to be read according to the index information, and read the corresponding data block from the system memory 2000.
  • the controller 4222 of the policy management device 4220 may also receive global configuration parameters from the MTE 4100, such as, but not limited to, the start address of the system memory 2000 (used to determine the offset address).
  • The memory 4221 of the policy management device 4220 may receive a data block read from the system memory 2000, where, as shown in FIG. 2, the data block may include a policy table, header information, and data for one or more operations (for example, deep learning model data compressed by the model miniaturization algorithm, or original deep learning model data). The policy table can indicate which operations need to be performed on the data related to this instruction and the execution order of the operations, for example, performing a table lookup decoding operation on the data first and then a quantization operation; the header information may include the configuration parameters of one or more operation engines of the operation engine device 4230, such as, but not limited to, the dictionary required by the table lookup decoding engine 4231 and the quantization coefficient required by the quantization engine 4232.
  • The controller 4222 of the policy management device 4220 may also parse the policy table and, according to the instruction information of the policy table, select the operation engines that need to be started from the multiple operation engines of the operation engine device 4230, and select the pipeline register device levels that need to be activated from the multiple levels of pipeline register devices in the pipeline register device 4240. It should be noted that the controller 4222 selects to activate the write cache register device 4250 by default.
  • The controller 4222 may choose to start the operation engines corresponding to the operations indicated in the policy table. For example, if the policy table indicates that the data needs to undergo table lookup decoding first and then a quantization operation, the controller 4222 can accordingly choose to start the table lookup decoding engine 4231 and the quantization engine 4232. If the policy table indicates that the data needs to undergo table lookup decoding first, then quantization, and finally masking, the controller 4222 can accordingly choose to start the table lookup decoding engine 4231, the quantization engine 4232, and the mask engine 4233.
  • the controller 4222 may select the level of the pipeline register device that needs to be started according to the number of operation engines that need to be started.
  • The number of stages of pipeline register devices that need to be started may be the number of operation engines that need to be started minus 1.
  • If one operation engine needs to be started, the controller 4222 may choose not to start any level of pipeline register device; if two operation engines need to be started, the controller 4222 may choose to start the first-level pipeline register device 4241; if three operation engines need to be started, the controller 4222 can choose to start the first-level pipeline register device 4241 and the second-level pipeline register device 4242.
  • The controller 4222 may also determine the routing sequence among the selected operation engines, the selected levels of pipeline register devices, and the write cache register device 4250. The routing sequence determines the order of reading and writing (or input and output) among the selected operation engines, the selected levels of pipeline register devices, and the write cache register device 4250.
  • For example, if the controller 4222 selects to start the table lookup decoding engine 4231, the quantization engine 4232, the first-level pipeline register device 4241, and the write cache register device 4250, then the controller 4222 can determine that the table lookup decoding engine 4231 reads data from the memory 4221 and writes data to the first-level pipeline register device 4241, and that the quantization engine 4232 reads data from the first-level pipeline register device 4241 and writes data to the write cache register device 4250.
  • For another example, if the controller 4222 selects to start the table lookup decoding engine 4231, the quantization engine 4232, the mask engine 4233, the first-level pipeline register device 4241, the second-level pipeline register device 4242, and the write cache register device 4250, then the controller 4222 can determine that the table lookup decoding engine 4231 reads data from the memory 4221 and writes data to the first-level pipeline register device 4241, that the quantization engine 4232 reads data from the first-level pipeline register device 4241 and writes data to the second-level pipeline register device 4242, and that the mask engine 4233 reads data from the second-level pipeline register device 4242 and writes data to the write cache register device 4250.
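  • As a hedged illustration only, the selection and routing rules above (engines chosen from the policy table, pipeline stages equal to the number of engines minus one, and a chain from the memory 4221 through the stages to the write cache register device 4250) can be modeled in a few lines; all identifiers are invented for the sketch.

```python
def plan_routing(policy_table):
    """Map an ordered list of operations to a read-from/engine/write-to chain,
    e.g. ["lut_decode", "quantize"] -> memory -> stage 1 -> write cache."""
    stages = len(policy_table) - 1       # pipeline register stages to start
    route, src = [], "memory_4221"
    for i, op in enumerate(policy_table):
        dst = f"pipeline_stage_{i + 1}" if i < stages else "write_cache_4250"
        route.append((src, op, dst))     # channel gating: where to read, where to write
        src = dst
    return route

for hop in plan_routing(["lut_decode", "quantize", "mask"]):
    print(hop)
# ('memory_4221', 'lut_decode', 'pipeline_stage_1')
# ('pipeline_stage_1', 'quantize', 'pipeline_stage_2')
# ('pipeline_stage_2', 'mask', 'write_cache_4250')
```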
  • The controller 4222 may also send a start signal to the selected operation engines, the selected levels of pipeline register devices, and the write cache register device 4250, for starting the selected operation engines, the selected levels of pipeline register devices, and the write cache register device 4250.
  • The controller 4222 may send a start signal to the selected operation engines; the start signal may instruct an operation engine to start operating on data, and for an operation engine that requires configuration parameters, the controller 4222 may also send the header information to it.
  • the controller 4222 may also send a channel gating signal to the selected operation engine, and the channel gating signal may indicate the routing sequence of the operation engine, that is, where the operation engine reads data from and where to write data.
  • For example, if the controller 4222 selects to start the table lookup decoding engine 4231, the quantization engine 4232, the first-level pipeline register device 4241, and the write cache register device 4250, the channel gating signal sent by the controller 4222 to the table lookup decoding engine 4231 can instruct it to read data from the memory 4221 of the policy management device 4220 and write data to the first-level pipeline register device 4241, and the channel gating signal sent to the quantization engine 4232 can instruct it to read data from the first-level pipeline register device 4241 and write data to the write cache register device 4250. For another example, if the controller 4222 selects to start the table lookup decoding engine 4231, the quantization engine 4232, the mask engine 4233, the first-level pipeline register device 4241, the second-level pipeline register device 4242, and the write cache register device 4250, the channel gating signal sent to the table lookup decoding engine 4231 can instruct it to read data from the memory 4221 and write data to the first-level pipeline register device 4241, the channel gating signal sent to the quantization engine 4232 can instruct it to read data from the first-level pipeline register device 4241 and write data to the second-level pipeline register device 4242, and the channel gating signal sent to the mask engine 4233 can instruct it to read data from the second-level pipeline register device 4242 and write data to the write cache register device 4250.
  • the channel gating signal sent by the controller 4222 to the selected operation engine may also indicate the execution order of the operation engine.
  • The controller 4222 may send channel gating information to the selected levels of pipeline register devices and the write cache register device 4250; the channel gating information indicates, for each level of pipeline register device and for the write cache register device 4250, which operation engine is to write data to it.
  • For example, the channel gating signal sent by the controller 4222 to the first-level pipeline register device 4241 may indicate to the first-level pipeline register device 4241 that the table lookup decoding engine 4231 is to write data to it, and the channel gating signal sent to the write cache register device 4250 can indicate to the write cache register device 4250 that the quantization engine 4232 is to write data to it.
  • For another example, if the controller 4222 selects to start the table lookup decoding engine 4231, the quantization engine 4232, the mask engine 4233, the first-level pipeline register device 4241, the second-level pipeline register device 4242, and the write cache register device 4250, then the channel gating signal sent to the first-level pipeline register device 4241 can indicate that the table lookup decoding engine 4231 is to write data to it, the channel gating signal sent to the second-level pipeline register device 4242 can indicate that the quantization engine 4232 is to write data to it, and the channel gating signal sent to the write cache register device 4250 can indicate that the mask engine 4233 is to write data to it.
  • The above describes the routing sequence in the case where the controller 4222 determines that the selected operation engines write data to the selected levels of pipeline register devices and the write cache register device 4250, and read data from the selected levels of pipeline register devices.
  • Alternatively, the controller 4222 can determine the routing sequence in the case where the selected levels of pipeline register devices and the write cache register device 4250 read data from the selected operation engines, and the selected levels of pipeline register devices write data to the selected operation engines. In this case, the controller 4222 may not send the above-mentioned channel gating signal to the selected operation engines; instead, the channel gating information sent by the controller 4222 to the selected levels of pipeline register devices and the write cache register device 4250 may indicate their routing sequence, that is, from which operation engine each selected level of pipeline register device reads data and to which operation engine it writes data, and from which operation engine the write cache register device 4250 reads data.
  • In other words, "the operation engine writes data to the pipeline register device and the write cache register device" can be replaced by "the pipeline register device and the write cache register device read data from the operation engine", and "the operation engine reads data from the pipeline register device" can be replaced by "the pipeline register device writes data to the operation engine".
  • An operation engine in the operation engine device 4230 can read data from the memory 4221 in the policy management device 4220 or from the pipeline register device of the level selected by the policy management device 4220 (in other words, data is input to the operation engine from the memory 4221 or the pipeline register device), operate on the data, and write the operation result into the pipeline register device of the level selected by the policy management device 4220 or into the write cache register device 4250 (in other words, data is output from the operation engine to the pipeline register device or the write cache register device 4250).
  • Each operation engine included in the operation engine device 4230 can perform different operations on data.
  • the look-up table decoding engine 4231 can perform a decoding operation to decode model parameters and model input encoded by an encoding algorithm
  • the quantization engine 4232 can perform data type conversion on model input and on model parameters quantized by a quantization algorithm, for example, converting model parameters back to 32-bit floating point numbers or into data types that can be calculated by the calculation engine 4400;
  • the mask engine 4233 and the comparison engine 4234 can perform the mask operation and the comparison operation, respectively, to restore the model parameters pruned by the pruning sparse algorithm.
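  • To make the mask operation concrete, here is an illustrative sketch only; the bitmask format (1 = kept weight, 0 = pruned weight) is an assumption, since the text does not specify the on-chip representation.

```python
import numpy as np

def mask_restore(mask_bits, nonzero_values):
    """Scatter the stored non-zero weights back into a dense tensor,
    filling pruned positions with zeros."""
    restored = np.zeros(mask_bits.shape, dtype=nonzero_values.dtype)
    restored[mask_bits == 1] = nonzero_values
    return restored

mask = np.array([1, 0, 0, 1, 1, 0, 1, 0], dtype=np.uint8)
vals = np.array([0.5, -1.2, 0.3, 2.0], dtype=np.float32)
print(mask_restore(mask, vals))
# [ 0.5  0.   0.  -1.2  0.3  0.   2.   0. ]
```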
  • The amount of data operated on by an operation engine per clock cycle may depend on the maximum processing capability of the operation engine, and the maximum processing capability may be related to the design cost and design area of the engine. In addition, in the case where the write cache register device 4250 does not have a back pressure mechanism (described in the following embodiments), the amount of operated data may also depend on the decompression rate level of the operated data and on the maximum transmission bit width between the write cache register device 4250 and the post-level memory 4300, where the decompression rate level of the operated data refers to the ratio of the amount of data after the operation of the operation engine to the amount of data before the operation; this ratio may, but is not limited to, be related to the compression ratio of the model miniaturization algorithm, for example, to the compression ratio of the encoding algorithm.
  • The REG RAM 4235 can store the intermediate results of the operation engines. For example, when the operation of an operation engine on the currently read data depends on the data to be read next time, the operation engine can store the intermediate result of the operation on the currently read data in the REG RAM 4235, and after the operation on the currently read data is completed together with the data read next time, write the final operation result into the pipeline register device 4240 or the write cache register device 4250.
  • Similarly, in the case where an operation requires the operation engine to be called multiple times, the operation result generated by each previous call can be stored in the REG RAM 4235, and the operation result generated by the last call is written into the pipeline register device 4240 or the write cache register device 4250.
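  • The following toy model (an assumption-laden sketch, not the patent's design) shows the REG RAM idea: a partial result is parked until the next read completes it, and only finished results go downstream.

```python
class EngineWithRegRam:
    """Toy engine whose unit of work is a byte pair; an odd trailing byte is an
    intermediate result parked in REG RAM until the next chunk arrives."""

    def __init__(self):
        self.reg_ram = b""  # models REG RAM 4235

    def process(self, chunk, is_last=False):
        data = self.reg_ram + chunk
        self.reg_ram = b""
        if not is_last and len(data) % 2:
            data, self.reg_ram = data[:-1], data[-1:]  # defer the odd byte
        # Only completed results are written to the pipeline/write cache register.
        return bytes(b ^ 0xFF for b in data)  # placeholder "operation"

eng = EngineWithRegRam()
out1 = eng.process(b"abc")             # b"c" is parked in REG RAM
out2 = eng.process(b"d", is_last=True)  # b"c" + b"d" completes the pair
print(len(out1), len(out2))             # 2 2
```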
  • each level of pipeline register device includes a pipeline register, a counter, and a controller.
  • Taking the first-level pipeline register device 4241 as an example: the first-level pipeline register 42411 can store data written by an operation engine and can also output data to an operation engine; the counter 42412 can determine the storage amount of the first-level pipeline register 42411; and the controller 42413 can generate a back pressure signal when the storage amount of the first-level pipeline register 42411 is higher than or equal to the back pressure waterline (also called the back pressure threshold) and, according to the channel gating signal, send the back pressure signal to the operation engine that writes data to it, so that the operation engine stops operating on data, stops reading data from the policy management device 4220, and stops writing data to the first-level pipeline register 42411. In this way, the first-level pipeline register 42411 can be prevented from overflowing.
  • The controller 42413 of the first-level pipeline register device 4241 can determine the back pressure waterline of the first-level pipeline register 42411 according to the maximum storage amount of the first-level pipeline register 42411 and the write rate of the operation engine that writes data to it. For example, but not limited to, if the maximum storage amount of the first-level pipeline register 42411 is 128 bytes and the write rate of the operation engine that writes data to it is 64 bytes/clock cycle, then the controller 42413 can set the back pressure waterline of the first-level pipeline register 42411 to 64 bytes or higher than 64 bytes (for example, 96 bytes).
  • The case where the storage amount of the first-level pipeline register 42411 is higher than or equal to the back pressure waterline may include the case where the write rate of the operation engine that writes data to the first-level pipeline register 42411 (that is, the amount of data written per clock cycle) is higher than the read rate of the operation engine that reads data from the first-level pipeline register 42411 (that is, the amount of data read per clock cycle).
  • Examples of the back pressure signal may include, but are not limited to, a high-level signal with a value of 1 represented by 1 bit.
  • When the operation engine stops operating on the data, the internal register of the operation engine that stores its operation result stops toggling and maintains its current state.
  • the operation engine can include a multiplier and an adder.
  • The multiplier stores the result of its operation in a register, and the adder reads data from the register for its operation. After the operation engine receives the back pressure signal, the multiplier and the adder suspend their operations, and the register maintains its current state.
  • When the storage amount of the first-level pipeline register 42411 falls back below the back pressure waterline, the controller 42413 can generate a back pressure release signal and send it to the operation engine that writes data to the first-level pipeline register 42411, so that the operation engine resumes operating on data, resumes reading data from the policy management device 4220, and resumes writing data to the first-level pipeline register 42411.
  • Examples of the back pressure release signal may include, but are not limited to, a low-level signal with a value of 0 represented by 1 bit. When the operation engine resumes operating on the model data, it can continue the operation on the basis of the operation data stored in its internal register.
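  • The assert/release cycle described above can be summarized in a small behavioral model (a software illustration only, not RTL; the class and signal names are invented):

```python
class PipelineRegisterDevice:
    """Behavioral model of one pipeline register level: a counter tracks the
    stored bytes and a 1-bit back pressure signal is derived from a waterline."""

    def __init__(self, capacity=128, waterline=64):
        self.capacity, self.waterline = capacity, waterline
        self.count = 0         # counter (e.g. 42412): bytes currently stored
        self.backpressure = 0  # 1 = stop the upstream engine, 0 = released

    def write(self, nbytes):
        assert self.count + nbytes <= self.capacity, "waterline should prevent this"
        self.count += nbytes
        self._update()

    def read(self, nbytes):
        granted = min(nbytes, self.count)
        self.count -= granted
        self._update()
        return granted

    def _update(self):
        # Controller (e.g. 42413): assert at/above the waterline, release below it.
        self.backpressure = 1 if self.count >= self.waterline else 0

reg = PipelineRegisterDevice()
reg.write(64); print(reg.backpressure)  # 1: upstream engine pauses, holds state
reg.read(32);  print(reg.backpressure)  # 0: release signal, upstream resumes
```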
  • For pipeline register devices of other levels, reference may be made to the above description of the first-level pipeline register device 4241; pipeline register devices of different levels may have different back pressure waterlines.
  • In some embodiments, the operation engine that receives the back pressure signal can, according to the channel gating signal, send the back pressure signal to each operation engine that precedes it in the execution order, so that those operation engines stop operating on data, stop reading data, and stop writing data to the pipeline register device 4240.
  • The write cache register 4251 of the write cache register device 4250 can store data written by an operation engine and can also output data to the post-level memory 4300; the counter 4252 can determine the storage amount of the write cache register 4251; and the controller 4253 can generate a back pressure signal when the storage amount of the write cache register 4251 is higher than or equal to the back pressure waterline of the write cache register 4251, and send the back pressure signal to the operation engine that writes data to the write cache register 4251, so that the operation engine stops operating on data, stops reading data, and stops writing data to the write cache register 4251. In this way, the write cache register 4251 can be prevented from overflowing.
  • the case where the storage capacity of the write cache register 4251 is higher than or equal to the back pressure waterline may include that the rate at which the operating engine writes data to the write cache register 4251 is higher than the rate at which the write cache register 4251 outputs data to the downstream memory 4300.
  • the back pressure waterline of the write cache register 4251 may depend on the maximum storage capacity of the write cache register 4251.
  • examples of the back pressure signal may include, but are not limited to, a high-level signal with a value of 1 represented by 1 bit.
  • the controller 4253 can generate a back pressure release signal and send the back pressure release signal To the operation engine that writes data to the write cache register 4251, so that the operation engine resumes the operation of data, resumes reading of data, and resumes writing data to the write cache register 4251.
  • the back pressure release signal may include, but are not limited to, a low-level signal with a value of 0 represented by 1 bit.
  • the reverse of the write cache register device 4250 can be cancelled.
  • the pressure mechanism, that is, the write cache register device 4250 may not include the counter 4252.
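The counter/controller pair described above can be modeled in a few lines of C. This is a sketch under stated assumptions: occupancy bookkeeping once per clock and a single watermark compare; the names (wcache_ctrl_t, bytes_in, bytes_out) are invented for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

/* Model of counter 4252 plus controller 4253: track occupancy of the write
 * cache register, assert back pressure at or above the watermark, and
 * release it once occupancy drops below the watermark again. */
typedef struct {
    uint32_t occupancy;
    uint32_t watermark;
    bool     back_pressure; /* true = stop the writing engine */
} wcache_ctrl_t;

static void wcache_cycle(wcache_ctrl_t *c, uint32_t bytes_in, uint32_t bytes_out)
{
    c->occupancy += bytes_in;                       /* engine writes */
    c->occupancy -= (bytes_out < c->occupancy) ? bytes_out
                                                : c->occupancy; /* drain */
    c->back_pressure = (c->occupancy >= c->watermark);
}
```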
FIG. 3 shows an example of the operation engines and the pipeline register device level that the policy management device 4220 selects and starts according to an embodiment of the present application, and also shows the flow of data inside the decompression device 4200. In FIG. 3, the controller 4222 of the policy management device 4220 selects, according to the policy table, to start the look-up table decoding engine 4231, the quantization engine 4232, the first-level pipeline register device 4241, and the write cache register device 4250.

After the look-up table decoding engine 4231 receives the start-of-work signal, header information, and channel gating signal from the policy management device 4220, it reads data from the memory 4221 of the policy management device 4220. The amount of data read may depend on the maximum processing capacity of the look-up table decoding engine 4231, which may in turn be related to its design cost and design area. In addition, when the write cache register device 4250 has no back-pressure mechanism, the amount read may also depend on the compression ratio of the encoding algorithm and on the maximum transmission bit width between the write cache register device 4250 and the downstream memory 4300. For example, if that maximum transmission bit width is 64 bytes (B) and the compression ratio of the encoding algorithm is 8x, the look-up table decoding engine 4231 can read at most 8B of data from the memory 4221 per clock cycle.

In each clock cycle, the look-up table decoding engine 4231 can decode encoded data (for example, but not limited to, run-length encoded data) based on the dictionary in the header information, and write the decoded data into the first-level pipeline register 42411. For example, when the look-up table decoding engine 4231 reads 8B from the memory 4221 per clock cycle for decoding, it writes 64B into the first-level pipeline register 42411 per clock cycle.

After the quantization engine 4232 receives the start-of-work signal, header information, and channel gating signal from the policy management device 4220, it can read data from the first-level pipeline register 42411. The amount of data read can depend on the maximum processing capacity of the quantization engine 4232, which can be related to its design cost and design area. For example, if the maximum data processing capacity of the quantization engine 4232 is 32B/clk, the quantization engine 4232 can read at most 32B per clock cycle from the first-level pipeline register 42411. In addition, when the write cache register device 4250 has no back-pressure mechanism, the amount read can also depend on the data types before and after conversion and on the maximum transmission bit width between the write cache register device 4250 and the downstream memory 4300. For example, if the quantization engine 4232 is to convert 16-bit floating-point numbers into 32-bit floating-point numbers and that maximum transmission bit width is 64B, the quantization engine 4232 can read at most 32B of data per clock cycle.

In each clock cycle, the quantization engine 4232 can convert the data type of the data based on the quantization coefficients in the header information, for example converting 16-bit floating-point numbers into 8-bit integers. Then, when the quantization engine 4232 reads 32B per clock cycle, it writes 16B per clock cycle into the write cache register 4251.

Because the transmission bit width between the write cache register 4251 and the downstream memory 4300 is relatively large, the write cache register 4251 can accumulate a predetermined amount of data before writing it to the downstream memory 4300.
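The per-clock amounts in this walk-through chain together as simple ratios. The sketch below merely restates them as arithmetic; the 8x expansion and the 2:1 quantization shrink are the example's figures, and the program is a hypothetical illustration rather than part of the design.

```c
#include <stdio.h>

/* Per-clock data amounts from the FIG. 3 example: an 8x decoder turns 8B
 * of input into 64B, and 16-bit -> 8-bit quantization halves 32B to 16B.
 * Net growth of the first-level pipeline register is 64B - 32B = 32B/clk. */
int main(void)
{
    const int decode_in = 8, expansion = 8;      /* bytes/clk, ratio */
    const int quant_in = 32, quant_out = 32 / 2; /* bytes/clk */
    int decode_out = decode_in * expansion;

    printf("decoder: %dB in -> %dB out per clk\n", decode_in, decode_out);
    printf("quantizer: %dB in -> %dB out per clk\n", quant_in, quant_out);
    printf("level-1 register grows %dB per clk\n", decode_out - quant_in);
    return 0;
}
```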
FIG. 4 is a schematic diagram of the back-pressure mechanism of the first-level pipeline register device 4241 of FIG. 3 according to an embodiment of the present application. As shown in FIG. 4, the look-up table decoding engine 4231 writes data into the first-level pipeline register 42411 at 64B/clk, and the quantization engine 4232 reads data from the first-level pipeline register 42411 at 32B/clk. Therefore, the storage capacity of the first-level pipeline register 42411 increases by 32B per clock cycle. Assuming the back-pressure watermark of the first-level pipeline register 42411 is 64B, then two clock cycles after the look-up table decoding engine 4231 starts, the storage capacity of the first-level pipeline register 42411 equals the watermark, and the controller 42413 can send a back pressure signal (for example, but not limited to, a high-level signal) to the look-up table decoding engine 4231. After the look-up table decoding engine 4231 receives the back pressure signal, it stops decoding data, stops reading data from the memory 4221 of the policy management device 4220, and stops writing data into the first-level pipeline register 42411.

In one example, the look-up table decoding engine 4231 stops for one clock cycle after receiving the back pressure signal, so the storage capacity of the first-level pipeline register 42411 drops to 32B, and the controller 42413 can send a back pressure release signal (for example, but not limited to, a low-level signal) to the look-up table decoding engine 4231. After receiving it, the look-up table decoding engine 4231 resumes decoding the data, resumes reading data from the memory 4221 of the policy management device 4220, and resumes writing data into the first-level pipeline register 42411. In addition, after the look-up table decoding engine 4231 resumes work, the controller 42413 will apply back pressure once every other clock cycle.

In another example, the look-up table decoding engine 4231 stops for two clock cycles after receiving the back pressure signal, so the storage capacity of the first-level pipeline register 42411 drops to 0B, and the controller 42413 can send a back pressure release signal (for example, but not limited to, a low-level signal) to the look-up table decoding engine 4231. After receiving it, the engine resumes decoding the data, resumes reading data from the memory 4221 of the policy management device 4220, and resumes writing data into the first-level pipeline register 42411. In this case, after the engine resumes work, the controller 42413 will apply back pressure once every two clock cycles.
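This stall cadence can be checked with a tiny cycle-level simulation. The write-then-read-then-compare ordering within a cycle is an assumption made to keep the model simple; it reproduces the every-other-cycle pattern of the one-cycle-stall example.

```c
#include <stdio.h>
#include <stdbool.h>

/* Cycle-level model of FIG. 4: writer adds 64B/clk unless back-pressured,
 * reader drains 32B/clk, watermark is 64B, and the writer stalls for one
 * cycle per assertion. Occupancy alternates 32B, 64B, 32B, 64B, ..., so
 * back pressure fires every other clock cycle, as the text describes. */
int main(void)
{
    int occ = 0;
    bool bp = false;
    for (int clk = 0; clk < 8; clk++) {
        if (!bp)
            occ += 64;                 /* look-up table decoding engine */
        occ -= (occ >= 32) ? 32 : occ; /* quantization engine */
        bp = (occ >= 64);              /* controller 42413's check */
        printf("clk %d: occupancy %2dB, back pressure %d\n", clk, occ, bp);
    }
    return 0;
}
```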
FIG. 5 shows another example of the operation engines and the pipeline register device levels that the policy management device 4220 selects and starts according to an embodiment of the present application, and also shows the flow of model data inside the decompression device 4200. For the operation engines and pipeline register device levels that are the same as in FIG. 3, refer to the description of FIG. 3. In addition, in FIG. 5 the controller 4222 of the policy management device 4220 also selects to start the mask engine 4233 and the second-level pipeline register device 4242. The quantization engine 4232 writes data into the second-level pipeline register device 4242, and the mask engine 4233 reads data from the second-level pipeline register device 4242 and writes data into the write cache register 4251.

When the rate at which the quantization engine 4232 writes data into the second-level pipeline register 42421 is higher than the rate at which the mask engine 4233 reads data from it, and the storage capacity of the second-level pipeline register 42421 is higher than or equal to its back-pressure watermark, the controller 42423 will generate a back pressure signal (for example, but not limited to, a high-level signal) and send it to the quantization engine 4232 according to the channel gating signal, so that the quantization engine 4232 stops reading data from the first-level pipeline register 42411, stops converting the data type of the data, and stops writing data into the second-level pipeline register 42421. Because the quantization engine 4232 stops reading from the first-level pipeline register 42411, the storage capacity of the first-level pipeline register 42411 is affected; if it becomes higher than or equal to its own watermark, the controller 42413 can send a back pressure signal to the look-up table decoding engine 4231 according to the channel gating signal. In other words, the back pressure of the first-level pipeline register 42411 and that of the second-level pipeline register 42421 can proceed independently of each other.

In addition, if the storage capacity of the second-level pipeline register 42421 falls below its back-pressure watermark, the controller 42423 will generate a back pressure release signal (for example, but not limited to, a low-level signal) and send it to the quantization engine 4232 according to the channel gating signal, so that the quantization engine 4232 resumes reading data from the first-level pipeline register 42411, resumes data type conversion on the data, and resumes writing data into the second-level pipeline register 42421.

In another example, when the quantization engine 4232 receives the back pressure signal from the controller 42423, the quantization engine 4232 can forward the signal to the look-up table decoding engine 4231 according to the channel gating signal, so that the decoding engine stops reading data from the memory 4221 of the policy management device 4220, stops decoding the data, and stops writing data into the first-level pipeline register 42411. When the quantization engine 4232 receives the back pressure release signal from the controller 42423, it can likewise forward the release signal to the look-up table decoding engine 4231, so that the decoding engine resumes reading data from the memory 4221 of the policy management device 4220, resumes decoding the data, and resumes writing data into the first-level pipeline register 42411.
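Forwarding a stall to every engine earlier in the execution order is a simple upstream sweep. The array layout below (index 0 upstream-most) and the function name are assumptions for illustration; the point is only that assert and release walk the same chain.

```c
#include <stdbool.h>

/* Sketch of back-pressure forwarding along the channel-gated chain: the
 * engine at index 'from' and every engine that precedes it in execution
 * order are stalled together on assert, and resumed together on release. */
enum { DECODE = 0, QUANTIZE = 1, MASK = 2, N_ENGINES = 3 };

static void propagate_back_pressure(bool stalled[N_ENGINES], int from, bool assert_bp)
{
    for (int i = from; i >= 0; i--)
        stalled[i] = assert_bp; /* true = stop; false = resume */
}
```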
After model miniaturization, model data usually needs to be restored by decompression before it enters the calculation engine 4400 through the system memory 2000, and one of the most prominent characteristics of decompression is that it significantly enlarges the amount of data. If the decompressed data must then be processed further, a larger buffer is usually needed to absorb data jitter (because a device's processing capacity is limited, a large inflow of data causes delays, or variations in delay, in receiving and sending; this is called jitter, and the buffer that temporarily stores the data is said to absorb it). In the embodiments of the present application, the pipeline register devices at all levels have a real-time back-pressure mechanism: once an operation engine receives a back pressure signal, it immediately suspends all operations and holds its current state, and once the back pressure signal is cancelled, it immediately resumes the suspended operations. Jitter absorption is therefore achieved with very small pipeline registers, minimizing the buffer resource overhead at every pipeline level.

In the embodiments of the present application, the model miniaturization decompression algorithm is decomposed into multiple fine-grained operation engines, and different operation engines can be started as required. The embodiments can therefore support the evolution of subsequent model miniaturization decompression algorithms through arbitrary combinations of operation engines, without modifying the hardware design.

In the embodiments of the present application, the deep learning model data is decomposed into small-granularity data operated on by the operation engines, and different operation engines can operate at different data granularities, so the embodiments achieve fine-grained control of the deep learning model data. Because the compression ratios of the various model miniaturization algorithms differ, the magnification ratios of the corresponding decompression algorithms also differ. By identifying the compression ratio of each model miniaturization algorithm, the embodiments can reasonably select the data granularity on which each operation engine operates in each clock cycle, as sketched below.
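One way to read the granularity selection is as a cap derived from the downstream bus width and the algorithm's expansion ratio. The helper below is a hypothetical sketch of that rule, matching the 64B-bus, 8x-encoder example that yields 8B per clock.

```c
#include <stdint.h>

/* Illustrative per-clock read granularity: bus_width / expansion_ratio bytes,
 * further capped by the engine's own maximum processing capacity. */
static uint32_t read_granularity(uint32_t bus_width_bytes,
                                 uint32_t expansion_ratio,
                                 uint32_t engine_max_bytes_per_clk)
{
    uint32_t g = bus_width_bytes / expansion_ratio; /* e.g. 64 / 8 = 8B */
    return (g < engine_max_bytes_per_clk) ? g : engine_max_bytes_per_clk;
}
```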
In the embodiments of the present application, through multiple fine-grained operation engines, small-granularity model data, and the real-time back-pressure mechanism of the pipeline registers, concurrent pipelining among the stages of the model miniaturization decompression algorithm can be realized without increasing memory bandwidth, improving processing performance while minimizing hardware resource consumption to achieve optimal end-to-end performance and power consumption.
FIG. 6 is a schematic flowchart of a method for the AI accelerator 4000 according to an embodiment of the present application. Different components of the AI accelerator 4000 shown in FIGS. 1 and 2, or other components, may implement different blocks or other parts of the method. For content not described in the foregoing device embodiments, refer to the following method embodiment; likewise, for content not described in the method embodiment, refer to the foregoing device embodiments. As shown in FIG. 6, the method for the AI accelerator 4000 may include:

Block 601: read a data block from the system memory 2000 through the policy management device 4220 or another unit.

In one example, data is stored in the system memory 2000 in the form of data blocks. Each data block has an index, data blocks correspond one-to-one with indexes, and each index can indicate information such as the total length of the corresponding data block and whether it has been compressed. The instruction from the MTE 4100 can indicate the number of data blocks that the decompression device 4200 needs to process and the index of the starting data block. According to the instruction information, the instruction management device 4210 can obtain from the system memory 2000 the indexes of the data blocks to be processed, and generate and maintain an index table containing the obtained indexes. The instruction management device 4210 can also, according to the index table, send the index information of the data blocks to be read to the policy management device 4220. The controller 4222 of the policy management device 4220 can receive the index information from the instruction management device 4210, determine from it the storage addresses in the system memory 2000 of the data blocks to be read, and read the corresponding data blocks from the system memory 2000.
Block 602: through the policy management device 4220 or another unit, according to the indication information of the policy table, select the operation engines to start from the multiple operation engines of the operation engine device 4230, and select the pipeline register device levels to start from the multiple levels of pipeline register devices of the pipeline register device 4240.

The memory 4221 of the policy management device 4220 can receive the data block read from the system memory 2000, where the data block can include the policy table, the header information, and the data requiring one or more operations (for example, deep learning model data compressed by a model miniaturization algorithm, or original deep learning model data). The policy table can indicate which operations need to be performed on the data related to this instruction and the execution order of those operations. The header information can include configuration parameters of one or more operation engines of the operation engine device 4230, such as, but not limited to, the dictionary used by the look-up table decoding engine 4231 and the quantization coefficients required by the quantization engine 4232.

In one example, the controller 4222 of the policy management device 4220 can select to start the operation engines corresponding to the operations indicated in the policy table.

In one example, the controller 4222 can select the pipeline register device levels to start according to the number of operation engines to be started; for example, the number of pipeline register device levels to start can be the number of operation engines to start minus one. It should be noted that if only one operation engine needs to be started, the controller 4222 can choose not to start any level of pipeline register device.

It should be noted that the controller 4222 can select to start the write cache register device 4250 by default.

Block 603: through the policy management device 4220 or another unit, determine the routing order between the selected operation engines, the selected levels of pipeline register devices, and the write cache register device 4250. The routing order can determine the read/write (that is, input/output) order among the selected operation engines, the selected levels of pipeline register devices, and the write cache register device 4250.
Block 604: through the policy management device 4220 or another unit, send start signals to the selected operation engines, the selected levels of pipeline register devices, and the write cache register device 4250, in order to start them.

In one example, the controller 4222 can send a start-of-work signal to each selected operation engine; this signal can instruct the engine to start operating on data. For operation engines that require configuration parameters, the controller 4222 can also send them the header information.

In addition, the controller 4222 can send a channel gating signal to each selected operation engine; the channel gating signal can indicate the engine's routing order, that is, where the engine reads data from and where it writes data to. In another example, the channel gating signal sent to a selected operation engine can also indicate the execution order of the operation engines.

In one example, the controller 4222 can send channel gating information to the selected levels of pipeline register devices and to the write cache register device 4250; this information indicates, for each level of pipeline register device and for the write cache register device 4250, the operation engine that will write data into it.
Block 605: through the started operation engines or other units, read data and perform the corresponding operations.

A started operation engine reads model data from the memory 4221 of the policy management device 4220 or from the started pipeline register devices at each level. The amount of data read can depend on the maximum processing capacity of the operation engine, which can be related to the engine's design cost and design area. In addition, when the write cache register device 4250 has no back-pressure mechanism, the amount read can also depend on the decompression-rate level of the operated data and on the maximum transmission bit width between the write cache register device 4250 and the downstream memory 4300, where the decompression-rate level of the operated data refers to the ratio of the amount of the data after being operated on by the operation engine to the amount before being operated on. In one example, this ratio can be, but is not limited to being, related to the compression ratio of the model miniaturization algorithm, for example to the compression ratio of the encoding algorithm.

Each operation engine included in the operation engine device 4230 can perform a different operation on the data. For example, the look-up table decoding engine 4231 can perform a decoding operation to decode model parameters, model inputs, and other data encoded by an encoding algorithm; the quantization engine 4232 can perform data-type conversion on model parameters, model inputs, and other data, for example converting model parameters back into 32-bit floating-point numbers or into data types on which the calculation engine 4400 can compute; and the mask engine 4233 and the comparison engine 4234 can perform mask operations and comparison operations, respectively, to restore model parameters pruned by a pruning/sparsification algorithm.
Block 606: through the started operation engines or other units, write the operation results into the corresponding levels of pipeline register devices and into the write cache register device 4250.

Block 607: through the write cache register device 4250 or another unit, output the data to the downstream memory 4300.

Block 608: through the downstream memory 4300 or another unit, output the data to the calculation engine 4400.

Block 609: through the calculation engine 4400 or another unit, perform calculation on the data.

Block 610: through the policy management device 4220 or another unit, determine whether processing of the current data block has finished; if not, return to block 605, and if so, continue to block 611.

In one example, the controller 4222 can determine whether the operation engine that reads data from the memory 4221 has read all the model data in the current data block; if so, it determines that processing of the current data block has finished, and if not, it determines that processing has not finished.

Block 611: through the instruction management device 4210 or another unit, determine whether unprocessed data blocks remain; if so, return to block 601, and if not, end the flow.
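Read as software, blocks 601 to 611 form two nested loops: an outer loop over data blocks and an inner loop over the chunks of one block. The C skeleton below is a hypothetical rendering with stub functions standing in for the hardware steps; the block and chunk counts are invented purely to make it runnable.

```c
#include <stdio.h>
#include <stdbool.h>

/* Control-flow skeleton of FIG. 6 (blocks 601-611); the stubs are
 * illustrative stand-ins for the hardware behavior described above. */
static int blocks_left = 2; /* assume two data blocks await processing */
static int chunks_left;

static void read_data_block(void)    { blocks_left--; chunks_left = 3; } /* 601 */
static void configure_pipeline(void) { puts("602-604: select, route, start"); }
static void operate_once(void)       { chunks_left--; }            /* 605-609 */
static bool block_done(void)         { return chunks_left == 0; }  /* 610 */
static bool blocks_remaining(void)   { return blocks_left > 0; }   /* 611 */

int main(void)
{
    do {
        read_data_block();
        configure_pipeline();
        do {
            operate_once();
        } while (!block_done());
    } while (blocks_remaining());
    return 0;
}
```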
FIG. 7 is a schematic flowchart of a back-pressure method of a pipeline register device according to an embodiment of the present application. One or more of the components of the pipeline register device 4240 shown in FIG. 2, or other components, can implement different blocks or other parts of the method. The back-pressure method of the second-level pipeline register device 4242 is taken as an example here; the back-pressure methods of the pipeline register devices at other levels and of the write cache register device 4250 follow a similar principle, so reference may be made to the method described here. As shown in FIG. 7, the back-pressure method of the second-level pipeline register device 4242 may include:

Block 701: determine the storage capacity of the second-level pipeline register 42421 through the counter 42422 or another unit.

Block 702: through the controller 42423 or another unit, determine whether the storage capacity of the second-level pipeline register 42421 is higher than or equal to the back-pressure watermark of the second-level pipeline register 42421; if so, continue to block 703, and if not, return to block 701.

In one example, the storage capacity of the second-level pipeline register 42421 reaching or exceeding the back-pressure watermark can occur when the write rate of the operation engine that writes data into the second-level pipeline register 42421 (that is, the amount of data written per clock cycle) is higher than the read rate of the operation engine that reads data from it (that is, the amount of data read per clock cycle).

In one example, the back-pressure watermark of the second-level pipeline register 42421 can depend on the maximum storage capacity of the second-level pipeline register 42421.

Block 703: through the controller 42423 or another unit, generate a back pressure signal and, according to the channel gating signal, send it to the operation engine that writes data into the second-level pipeline register 42421.

In one example, the back pressure signal can be a high-level signal, and the operation engine that receives it stops reading data, stops operating on the data, and stops writing data into the second-level pipeline register 42421.

In another example, the operation engine that receives the back pressure signal can forward it, according to the channel gating signal, to each operation engine that precedes it in the execution order, so that those engines stop operating on data, stop reading data, and stop writing data into the pipeline register device 4240.
Block 704: through the controller 42423 or another unit, determine whether the storage capacity of the second-level pipeline register 42421 is still higher than or equal to the back-pressure watermark of the second-level pipeline register 42421; if so, repeat block 704, and if not, continue to block 705.

Block 705: through the controller 42423 or another unit, generate a back pressure release signal and, according to the channel gating signal, send it to the operation engine that writes data into the second-level pipeline register 42421.

In one example, the back pressure release signal can be a low-level signal, and the operation engine that receives the back pressure release signal resumes reading data, resumes operating on the data, and resumes writing data into the second-level pipeline register 42421.

In another example, the operation engine that receives the back pressure release signal can forward it, according to the channel gating signal, to each operation engine that precedes it in the execution order, so that those engines resume operating on data, resume reading data, and resume writing data into the pipeline register device 4240.

After block 705 completes, execution can return to block 701.
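Blocks 701 to 705 amount to a small two-state machine per pipeline level. The sketch below assumes occupancy is sampled once per step and uses a single watermark for both assert and release, as in the text; the type and field names are invented.

```c
#include <stdint.h>
#include <stdbool.h>

/* State machine for FIG. 7 (blocks 701-705) at one pipeline level:
 * occ mirrors counter 42422, watermark is the configured threshold,
 * and bp is the signal controller 42423 drives to the writing engine. */
typedef struct {
    uint32_t occ;
    uint32_t watermark;
    bool     bp;
} level_t;

static void bp_step(level_t *lv)
{
    if (!lv->bp && lv->occ >= lv->watermark)
        lv->bp = true;   /* blocks 701-703: assert back pressure */
    else if (lv->bp && lv->occ < lv->watermark)
        lv->bp = false;  /* blocks 704-705: release, return to 701 */
}
```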
It should be noted that, in the embodiments of the present application, the order in which the method steps are described should not be interpreted as meaning that the steps must be executed in that order; the steps need not be executed in the order described and may even be executed simultaneously. In addition, the method may include steps other than these, or only some of these steps.
As used here, the term "module" or "unit" can refer to, be, or include an application-specific integrated circuit (ASIC), an electronic circuit, a (shared, dedicated, or group) processor and/or memory that executes one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that provide the described functions.
The various embodiments of the mechanisms disclosed in this application may be implemented in hardware, software, firmware, or a combination of these implementation methods. The embodiments of the present application can be implemented as computer programs or program code executed on a programmable system, where the programmable system includes at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.

Program code can be applied to input instructions to perform the functions described in this application and to generate output information. The output information can be applied to one or more output devices in a known manner. For the purposes of this application, a processing system includes any system having a processor such as, for example, a digital signal processor (DSP), a microcontroller, an application-specific integrated circuit (ASIC), or a microprocessor.
The program code can be implemented in a high-level procedural or object-oriented programming language to communicate with the processing system. When needed, the program code can also be implemented in assembly or machine language. In fact, the mechanisms described in this application are not limited in scope to any particular programming language; in either case, the language can be a compiled or an interpreted language.
In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. In some cases, one or more aspects of at least some embodiments may be implemented by representative instructions stored on a computer-readable storage medium; the instructions represent various logic in a processor and, when read by a machine, cause the machine to fabricate logic for performing the techniques described in this application. Such representations, known as "IP cores", can be stored on a tangible computer-readable storage medium and provided to customers or production facilities to be loaded into the fabrication machines that actually manufacture the logic or the processor.
Such computer-readable storage media may include, but are not limited to, non-transitory tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as: hard disks; any other type of disk, including floppy disks, optical disks, compact disk read-only memory (CD-ROM), compact disk rewritable (CD-RW), and magneto-optical disks; semiconductor devices such as read-only memory (ROM), random access memory (RAM) such as dynamic random access memory (DRAM) and static random access memory (SRAM), erasable programmable read-only memory (EPROM), flash memory, and electrically erasable programmable read-only memory (EEPROM); phase-change memory (PCM); magnetic or optical cards; or any other type of medium suitable for storing electronic instructions.
Accordingly, the embodiments of the present application also include non-transitory computer-readable storage media containing instructions or containing design data, such as hardware description language (HDL), that defines the structures, circuits, devices, processors, and/or system features described in this application.


Abstract

A decompression device is provided for performing at least one operation on data related to an instruction, and includes: at least one operation engine corresponding to the at least one operation; and at least one storage device for storing the data after each of the at least one operation, where a first storage device of the at least one storage device includes a first memory and a first controller, and where, when the storage amount of the first memory is greater than or equal to a first predetermined amount, the first controller generates a first back pressure signal and sends the first back pressure signal to a first operation engine of the at least one operation engine, to control the first operation engine to stop outputting, to the first memory, the data operated on by the first operation engine. The device can realize pipelined operation of the operation engines through the back-pressure mechanism of the storage devices.


Claims (31)

  1. A decompression apparatus, configured to perform at least one operation on data related to an instruction, comprising:
    at least one operation engine corresponding to the at least one operation; and
    at least one storage device, configured to store the data having undergone each of the at least one operation, wherein a first storage device in the at least one storage device comprises a first memory and a first controller, and the first controller is configured to: when a storage amount of the first memory is greater than or equal to a first predetermined amount, generate a first backpressure signal and send the first backpressure signal to a first operation engine in the at least one operation engine, so as to control the first operation engine to stop outputting, to the first memory, the data operated on by the first operation engine.
  2. The decompression apparatus according to claim 1, wherein, when the decompression apparatus comprises multiple operation engines, the first memory is further configured to input the data operated on by the first operation engine to a second operation engine in the multiple operation engines.
  3. The decompression apparatus according to claim 2, wherein the first predetermined amount at least partially indicates a backpressure threshold of the first memory for the case in which a rate at which the first operation engine outputs the data to the first memory is higher than a rate at which the first memory inputs the data to the second operation engine.
  4. The decompression apparatus according to claim 2, wherein, when the decompression apparatus comprises multiple operation engines and the at least one storage device further comprises a second storage device, the second storage device is configured to output the data operated on by the second operation engine to a third operation engine in the multiple operation engines.
  5. The decompression apparatus according to claim 4, wherein, when a storage amount of a second memory in the second storage device is greater than or equal to a second predetermined amount, a second controller in the second storage device is configured to generate a second backpressure signal and send the second backpressure signal to the second operation engine, so as to control the second operation engine to stop outputting, to the second memory, the data operated on by the second operation engine.
  6. The decompression apparatus according to claim 5, wherein the second predetermined amount at least partially indicates a backpressure threshold of the second memory for the case in which a rate at which the second operation engine outputs the data to the second memory is higher than a rate at which the second memory inputs the data to the third operation engine.
  7. The decompression apparatus according to claim 5 or 6, wherein the second operation engine is further configured to send the second backpressure signal to the first operation engine, so as to control the first operation engine to stop outputting, to the first memory, the data operated on by the first operation engine.
  8. The decompression apparatus according to any one of claims 1 to 7, further comprising:
    a policy management device, configured to determine an operation order of the at least one operation, start the at least one operation engine and/or the at least one storage device according to the operation order, and further determine a routing order between the at least one operation engine and the at least one storage device, wherein the routing order determines an input/output order between each operation engine in the at least one operation engine and each storage device in the at least one storage device.
  9. The decompression apparatus according to claim 8, wherein the policy management device is further configured to send a start signal to the at least one operation engine and/or the at least one storage device, so as to start the at least one operation engine and/or the at least one storage device.
  10. The decompression apparatus according to claim 9, wherein the start signal comprises a go signal sent to the at least one operation engine and a channel strobe signal sent to the at least one storage device.
  11. The decompression apparatus according to any one of claims 1 to 10, wherein the at least one operation comprises at least one of table-lookup decompression, masking, comparison, and quantization.
  12. The decompression apparatus according to any one of claims 1 to 11, wherein the at least one operation is related to decompression.
  13. An accelerator, comprising:
    the decompression apparatus according to any one of claims 1 to 12; and
    a computing engine, configured to perform computation, according to an instruction, on the data having undergone the at least one operation by the decompression apparatus.
  14. The accelerator according to claim 13, wherein, when the decompression apparatus comprises one operation engine, the first memory is further configured to input the data operated on by the first operation engine to the computing engine.
  15. The accelerator according to claim 14, wherein the first predetermined amount at least partially indicates a backpressure threshold of the first memory for the case in which a rate at which the first operation engine outputs the data to the first memory is higher than a rate at which the first memory inputs the data to the computing engine.
  16. The accelerator according to claim 13, wherein, when the decompression apparatus comprises multiple operation engines and the at least one storage device further comprises a second storage device, the first memory is further configured to input the data operated on by the first operation engine to a second operation engine in the multiple operation engines, and the second storage device is configured to output the data operated on by the second operation engine to the computing engine.
  17. The accelerator according to claim 16, wherein, when a storage amount of a second memory in the second storage device is greater than or equal to a second predetermined amount, a second controller in the second storage device is configured to generate a second backpressure signal and send the second backpressure signal to the second operation engine, so as to control the second operation engine to stop outputting, to the second memory, the data operated on by the second operation engine.
  18. The accelerator according to claim 17, wherein the second predetermined amount at least partially indicates a backpressure threshold of the second memory for the case in which a rate at which the second operation engine outputs the data to the second memory is higher than a rate at which the second memory inputs the data to the computing engine.
  19. A method for a decompression apparatus, comprising:
    performing, by at least one operation engine of the decompression apparatus, at least one operation on data related to an instruction; and
    storing, by at least one storage device of the decompression apparatus, the data operated on by each operation engine in the at least one operation engine;
    wherein, when a storage amount of a first storage device in the at least one storage device is greater than or equal to a first predetermined amount, the first storage device generates a first backpressure signal and sends it to a first operation engine in the at least one operation engine, and the first operation engine, in response to the first backpressure signal, stops outputting, to the first storage device, the data operated on by the first operation engine.
  20. The method according to claim 19, further comprising:
    when the at least one operation engine comprises multiple operation engines, inputting, by the first storage device, the data operated on by the first operation engine to a second operation engine in the multiple operation engines.
  21. The method according to claim 20, wherein the first predetermined amount at least partially indicates a backpressure threshold of the first storage device for the case in which a rate at which the first operation engine outputs the data to the first storage device is higher than a rate at which the first storage device inputs the data to the second operation engine.
  22. The method according to claim 20, further comprising:
    when the at least one operation engine comprises multiple operation engines and the at least one storage device further comprises a second storage device, outputting, by the second storage device, the data operated on by the second operation engine to a third operation engine in the multiple operation engines.
  23. The method according to claim 22, further comprising:
    when a storage amount of the second storage device is greater than or equal to a second predetermined amount, generating, by the second storage device, a second backpressure signal and sending the second backpressure signal to the second operation engine, so as to control the second operation engine to stop outputting, to the second storage device, the data operated on by the second operation engine.
  24. The method according to claim 23, wherein the second predetermined amount at least partially indicates a backpressure threshold of the second storage device for the case in which a rate at which the second operation engine outputs the data to the second storage device is higher than a rate at which the second storage device inputs the data to the third operation engine.
  25. The method according to claim 23 or 24, further comprising:
    sending, by the second operation engine, the second backpressure signal to the first operation engine, so as to control the first operation engine to stop outputting, to the first storage device, the data operated on by the first operation engine.
  26. The method according to any one of claims 19 to 25, further comprising:
    determining, by a policy management device in the decompression apparatus, an operation order of the at least one operation, and starting the at least one operation engine and the at least one storage device according to the operation order, wherein the policy management device further determines a routing order between the at least one operation engine and the at least one storage device, and the routing order determines an input/output order between each operation engine in the at least one operation engine and each storage device in the at least one storage device.
  27. The method according to claim 26, further comprising:
    sending, by the policy management device, a start signal to the at least one operation engine and the at least one storage device, so as to start the at least one operation engine and the at least one storage device.
  28. The method according to claim 27, wherein the start signal comprises a go signal sent to the at least one operation engine and a channel strobe signal sent to the at least one storage device.
  29. The method according to any one of claims 19 to 28, wherein the at least one operation comprises at least one of table-lookup decompression, masking, comparison, and quantization.
  30. The method according to any one of claims 19 to 29, wherein the at least one operation is related to decompression.
  31. A system, comprising:
    a memory, wherein data related to an instruction is stored on the memory; and
    an accelerator, configured to read the data from the memory and perform, on the data, the method according to any one of claims 19 to 30.
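As a non-authoritative illustration of the claimed arrangement, in which operation engines are chained through storage devices under a policy management device (claims 1 to 12) and a stall at any storage device cascades upstream (claims 5 to 7), the following Python sketch is offered. The operation stubs, the threshold and capacity values, and the PolicyManager wiring are assumptions introduced for illustration, not the patented implementation.

    from collections import deque

    # Illustrative per-word operation stubs; claim 11 names table-lookup decompression,
    # masking, comparison, and quantization, and the bodies here are placeholders only.
    def table_lookup_decompress(x): return x
    def mask_op(x): return x & 0xFF
    def compare_op(x): return x
    def quantize_op(x): return x

    class StorageDevice:
        # Memory plus controller from claim 1: reports backpressure at a threshold.
        def __init__(self, threshold, capacity):
            self.buf = deque()
            self.threshold = threshold   # the "first/second predetermined amount"
            self.capacity = capacity

        def backpressure(self):
            return len(self.buf) >= self.threshold

    class OperationEngine:
        def __init__(self, op, dst):
            self.op = op
            self.dst = dst               # storage device this engine writes into

        def step(self, src):
            # Stop outputting while the downstream storage device asserts backpressure.
            if src and not self.dst.backpressure():
                self.dst.buf.append(self.op(src.popleft()))

    class PolicyManager:
        # Claim 8: fixes the operation order and the engine-to-storage routing order.
        def build(self, ops, threshold=6, capacity=8):
            stores = [StorageDevice(threshold, capacity) for _ in ops]
            engines = [OperationEngine(op, st) for op, st in zip(ops, stores)]
            return engines, stores

    # Hypothetical pipeline: input -> lookup -> mask -> compare -> quantize -> computing engine.
    engines, stores = PolicyManager().build(
        [table_lookup_decompress, mask_op, compare_op, quantize_op])
    inbox, out = deque(range(100)), []
    for _ in range(500):                     # crude cycle loop
        sources = [inbox] + [st.buf for st in stores[:-1]]
        for eng, src in zip(engines, sources):
            eng.step(src)
        if stores[-1].buf:
            out.append(stores[-1].buf.popleft())   # the computing engine (claim 13) consumes output

Because each engine refuses to write while its downstream storage device asserts backpressure, a stall at any stage propagates upstream on its own, which mirrors the cascaded backpressure of claims 5 to 7.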
PCT/CN2021/081353 2020-03-19 2021-03-17 Decompression apparatus, accelerator, and method for a decompression apparatus WO2021185287A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010196700.8A CN113495669B (zh) 2020-03-19 2020-03-19 Decompression apparatus, accelerator, and method for a decompression apparatus
CN202010196700.8 2020-03-19

Publications (1)

Publication Number Publication Date
WO2021185287A1 (zh)

Family

ID=77770148

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/081353 WO2021185287A1 (zh) 2020-03-19 2021-03-17 Decompression apparatus, accelerator, and method for a decompression apparatus

Country Status (2)

Country Link
CN (1) CN113495669B (zh)
WO (1) WO2021185287A1 (zh)


Also Published As

Publication number Publication date
CN113495669A (zh) 2021-10-12
CN113495669B (zh) 2023-07-18


Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (ref document number: 21772100; country of ref document: EP; kind code of ref document: A1)
NENP Non-entry into the national phase (ref country code: DE)
122 EP: PCT application non-entry in European phase (ref document number: 21772100; country of ref document: EP; kind code of ref document: A1)