WO2020062252A1 - Operation accelerator and compression method - Google Patents

Operation accelerator and compression method

Info

Publication number
WO2020062252A1
WO2020062252A1 (application PCT/CN2018/109117)
Authority
WO
WIPO (PCT)
Prior art keywords
calculation result
compression
data
sub
control instruction
Application number
PCT/CN2018/109117
Other languages
English (en)
French (fr)
Inventor
刘保庆
刘虎
陈清龙
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority to PCT/CN2018/109117: WO2020062252A1 (zh)
Priority to CN201880098124.4A: CN112771546A (zh)
Priority to EP18935203.2A: EP3852015A4 (en)
Publication of WO2020062252A1 (zh)
Priority to US17/216,476: US11960421B2 (en)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00: Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14: Handling requests for interconnection or transfer
    • G06F 13/20: Handling requests for interconnection or transfer for access to input/output bus
    • G06F 13/28: Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • H: ELECTRICITY
    • H03: ELECTRONIC CIRCUITRY
    • H03M: CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M 7/00: Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M 7/30: Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M 7/60: General implementation details not specific to a particular type of compression
    • H03M 7/6047: Power optimization with respect to the encoder, decoder, storage or transmission
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10: Complex mathematical operations
    • G06F 17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • H: ELECTRICITY
    • H03: ELECTRONIC CIRCUITRY
    • H03M: CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M 7/00: Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M 7/30: Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M 7/3059: Digital compression and data reduction techniques where the original information is represented by a subset or similar information, e.g. lossy compression
    • H: ELECTRICITY
    • H03: ELECTRONIC CIRCUITRY
    • H03M: CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M 7/00: Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M 7/30: Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M 7/60: General implementation details not specific to a particular type of compression
    • H03M 7/6017: Methods or arrangements to increase the throughput
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/048: Activation functions
    • H: ELECTRICITY
    • H03: ELECTRONIC CIRCUITRY
    • H03M: CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M 7/00: Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M 7/30: Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H: ELECTRICITY
    • H03: ELECTRONIC CIRCUITRY
    • H03M: CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M 7/00: Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M 7/30: Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M 7/70: Type of the data to be coded, other than image and sound

Definitions

  • the present application relates to data computing technology in the field of Artificial Intelligence (AI), and in particular, to an arithmetic accelerator, a processing device, a compression method, and a processing method.
  • Convolutional neural networks can be used to identify specific features in the input image.
  • the input image usually passes through at least four kinds of layers in the convolutional neural network, namely the convolution (Conv) layer, the Rectified Linear Unit (ReLU, also known as the activation function) layer, the pooling layer, and the fully connected (FC) layer.
  • the role of the Conv layer is to identify the input data (that is, the data of the input image) through multiple filters. Each filter has a scanning range, which is used to scan data information in a certain area of the input image.
  • the calculation result obtained by the current Conv layer will be input to the next layer (such as the Relu layer, Pooling layer or FC layer) for processing.
  • the Relu layer performs an operation similar to MAX (0, x) on the input data, that is, compares each value in the input data with a value of 0, if it is greater than 0, it is retained, and if it is less than 0, it is set to 0.
  • the Relu layer increases the sparse rate of the input data (the number of zero values as a percentage of the total number of values) and does not change the size of the input data, as illustrated in the sketch below.
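
The following minimal Python sketch (not part of the patent; the function names are illustrative) shows the MAX(0, x) operation and how it raises the sparse rate, i.e. the share of zero values, without changing the data size.

```python
# Minimal sketch: ReLU as MAX(0, x) and the resulting sparse rate.

def relu(values):
    # Each value is compared with 0: keep it if positive, otherwise set it to 0.
    return [v if v > 0 else 0 for v in values]

def sparse_rate(values):
    # Number of zero values as a fraction of the total number of values.
    return sum(1 for v in values if v == 0) / len(values)

feature_map = [0.7, -1.2, 0.0, 3.4, -0.5, 2.1, -0.9, 0.0]
activated = relu(feature_map)

print(sparse_rate(feature_map))  # 0.25 before ReLU
print(sparse_rate(activated))    # 0.625 after ReLU: sparsity grows, size does not change
```
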
  • the function of the pooling (Pooling) layer is down-sampling, that is, extracting data alternately in rows or columns in the two-dimensional matrix of each layer of the input data, thereby reducing the size of the input data.
  • the full connection (FC) layer is similar to the Conv layer.
  • the filter of the FC layer does not scan a small area of the input data, but scans the entire input data at one time and outputs a value.
  • the output value is equivalent to a "score", which is used to indicate the "possibility" that the features are included in the input data.
  • the core of the AI operation accelerator is Conv and FC operations.
  • in most neural networks, Conv and FC operations account for more than 90% of the computation of the entire network, so the performance of Conv and FC operations usually determines the overall performance of the AI operation accelerator.
  • when the AI operation accelerator implements Conv and FC operations, the amount of weight data involved is too large to be stored in the on-chip cache, so during inference the weight data needs to be imported from the memory outside the operation accelerator to complete the calculation; likewise, the calculation result obtained after performing the operation of the previous layer of the neural network is too large to be held in the on-chip cache, so the calculation result of the previous layer needs to be exported to the memory outside the AI operation accelerator, and when the AI operation accelerator needs to perform the calculation of the next layer of the neural network, the calculation result of the previous layer is imported from the memory as input data.
  • importing and exporting input data both occupy the input/output (I/O) bandwidth of the AI operation accelerator; if the I/O bandwidth becomes a bottleneck, the computing resources of the AI operation accelerator will sit idle, reducing the overall performance of the AI operation accelerator.
  • the embodiments of the present application provide an arithmetic accelerator, a processing device, a compression method, and a processing method, which aim to save the I / O bandwidth of the arithmetic accelerator and improve the computing performance of the arithmetic accelerator.
  • an arithmetic accelerator including:
  • a first buffer, used to store first input data; a second buffer, used to store weight data; an operation circuit connected to the first buffer and the second buffer, used to perform a matrix multiplication operation on the first input data and the weight data to obtain a calculation result; a compression module, used to compress the calculation result to obtain compressed data; and a direct memory access controller (DMAC) connected to the compression module, used to store the compressed data in a memory outside the operation accelerator.
  • the first cache is an input cache in the operation accelerator
  • the second cache is a weight cache memory in the operation accelerator.
  • the compression module is added to the operation accelerator, which reduces the amount of data transferred from the operation accelerator to the memory outside the operation accelerator, saves the I / O bandwidth of the operation accelerator, and improves the calculation performance of the operation accelerator.
  • the operation accelerator further includes:
  • a decompression module connected to the DMAC and the first cache, configured to receive the compressed data obtained by the DMAC from the memory, decompress the compressed data, and store the decompressed data into the first buffer as second input data;
  • the operation circuit is further configured to obtain the second input data from the first buffer to perform a matrix multiplication operation.
  • since the decompression module is added to the operation accelerator, the amount of data transferred from the memory into the operation accelerator for the next calculation is reduced, the I/O bandwidth of the operation accelerator is saved, and the calculation performance of the operation accelerator is improved; the sketch below illustrates this export/import round trip.
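
As a rough illustration of the round trip described above, the sketch below models the export and import path in plain Python; zlib stands in for whatever compression algorithm the compression engine actually implements, and all names are assumptions rather than the patent's interfaces.

```python
# Illustrative sketch (assumed names, not the patent's actual interfaces):
# compute -> compress -> store to external memory -> load -> decompress -> next layer.
import zlib  # stand-in for whichever hardware compression algorithm the engine uses

external_memory = {}  # models the memory outside the operation accelerator

def export_result(layer, calculation_result: bytes):
    # Compression module: shrink the calculation result before the DMAC moves it off-chip.
    external_memory[layer] = zlib.compress(calculation_result)

def import_input(layer) -> bytes:
    # Decompression module: restore the data before it is placed in the input buffer.
    return zlib.decompress(external_memory[layer])

result_layer_i = bytes(1024)            # a highly sparse (all-zero) calculation result
export_result("layer_i", result_layer_i)
print(len(external_memory["layer_i"]))  # far fewer bytes cross the I/O interface
assert import_input("layer_i") == result_layer_i
```
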
  • the operation accelerator further includes:
  • a third cache, used to store a control instruction, where the control instruction is used to indicate whether to compress and decompress the calculation result, and the third cache is an instruction fetch cache in the operation accelerator; and a controller connected to the third cache, used to obtain the control instruction from the third cache, parse the control instruction, and, when the control instruction instructs to compress and decompress the calculation result, control the compression module to compress the calculation result to obtain the compressed data and control the decompression module to decompress the obtained compressed data.
  • the operation accelerator further includes:
  • a fourth buffer connected to the operation circuit, used to store the calculation result calculated by the operation circuit, where the fourth buffer is a unified buffer in the operation accelerator;
  • the controller is further configured to, when the control instruction indicates that the calculation result is not to be compressed and decompressed, control the DMAC to store the calculation result in the fourth buffer into the memory, and control the DMAC to store the calculation result in the memory into the first buffer.
  • since the controller in the operation accelerator determines whether to enable the compression and decompression functions, compressing and decompressing calculation results produced from input data with a low sparse rate in the neural network can be avoided, thereby improving the compression and decompression gains.
  • the operation accelerator further includes:
  • a third cache, for storing control instructions, where the control instructions are used to indicate whether to compress and decompress the calculation result; a controller connected to the third cache, for obtaining the control instructions from the third cache and distributing the control instructions to the compression module and the decompression module; the compression module is used to parse the control instruction and, when the control instruction instructs to compress the calculation result, compress the calculation result to obtain the compressed data; the decompression module is used to parse the control instruction and, when the control instruction instructs to decompress the calculation result, decompress the obtained compressed data.
  • since the compression module in the operation accelerator determines whether to start compression, compressing calculation results produced from input data with a low sparse rate in the neural network can be avoided, improving the compression gain; likewise, since the decompression module in the operation accelerator determines whether to start decompression, decompressing such calculation results can be avoided, improving the decompression gain.
  • the operation accelerator further includes:
  • the compression module is further configured to control the DMAC to store the calculation result into the memory when the control instruction indicates that the calculation result is not to be compressed; the decompression module is further configured to control the DMAC to store the calculation result in the memory into the first buffer when the control instruction indicates that the calculation result is not to be decompressed.
  • the compression module includes a sharding module and at least one compression engine
  • the sharding module is configured to shard the calculation result to obtain at least one sub-computation result; each compression engine in the at least one compression engine is configured to compress one of the at least one sub-computation result to obtain sub-compressed data, where the sum of the sub-compressed data generated by the compression engines constitutes the compressed data.
  • the data to be compressed is first fragmented, and each sub-computation result is then compressed separately, which can improve compression efficiency; the sketch below illustrates this sharding scheme.
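
A possible reading of the sharding scheme is sketched below; the shard count, the zlib stand-in algorithm, and the function names are assumptions for illustration only.

```python
# Sketch of sharding plus per-shard compression engines (assumed details;
# the patent does not fix a particular algorithm or shard count).
import zlib

NUM_ENGINES = 4  # e.g. one engine per shard

def shard(calculation_result: bytes, n: int = NUM_ENGINES):
    # Fragmentation module: split the calculation result into n sub-computation results.
    step = (len(calculation_result) + n - 1) // n
    return [calculation_result[i:i + step] for i in range(0, len(calculation_result), step)]

def compress_module(calculation_result: bytes):
    # Each compression engine compresses one sub-computation result; the sum of the
    # sub-compressed data constitutes the compressed data handed to the DMAC.
    return [zlib.compress(sub) for sub in shard(calculation_result)]

sub_compressed = compress_module(bytes(4096))
print([len(s) for s in sub_compressed])  # each shard shrinks independently
```
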
  • each compression engine in the at least one compression engine is specifically configured to: compress the received sub-computation result to obtain a sub-compression result and compare the size of the sub-compression result with the size of the sub-computation result;
  • each compression engine in the at least one compression engine is further configured to: when the sub-compression result is greater than the sub-computation result, use the sub-computation result as the sub-compressed data and generate a compression failure identifier corresponding to the sub-compressed data, where the compression failure identifier is stored in the memory via the DMAC; when the sub-compression result is not greater than the sub-computation result, use the sub-compression result as the sub-compressed data and generate a compression success identifier corresponding to the sub-compressed data, where the compression success identifier is stored in the memory via the DMAC (see the sketch below).
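
The success/failure identifier logic can be pictured as in the following sketch; zlib is only a stand-in, since the patent does not prescribe an algorithm.

```python
# Sketch of the per-shard fallback (zlib is only a stand-in algorithm).
import zlib

def compress_engine(sub_result: bytes):
    candidate = zlib.compress(sub_result)
    if len(candidate) > len(sub_result):
        # Compression did not pay off: output the raw sub-computation result
        # together with a "compression failed" identifier.
        return sub_result, False
    # Otherwise output the sub-compression result with a "compression succeeded" identifier.
    return candidate, True

sub_data, compressed_ok = compress_engine(b"\x00" * 512)
print(compressed_ok, len(sub_data))  # True, and far smaller than 512 bytes
```
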
  • the decompression module is specifically configured to: identify the identifier corresponding to each piece of sub-compressed data obtained from the memory, store the sub-compressed data directly into the first cache when the identifier indicates a compression failure, and decompress the sub-compressed data before storing it into the first cache when the identifier indicates a compression success.
  • an embodiment of the present application provides a processing apparatus, including:
  • a judging module, configured to determine, according to the sparse rate of the input data of the i-th layer in the neural network, whether the operation accelerator compresses and decompresses the calculation result obtained by calculating the input data of the i-th layer, where 1 ≤ i ≤ N, N is the number of layers of the neural network, and the operation accelerator is a coprocessor other than the processing device;
  • the compiling module is configured to generate a control instruction according to a determination result of the determination module, and the control instruction is used to instruct the operation accelerator to compress and decompress the calculation result.
  • the processor generates, according to the sparse rate of the input data in the neural network, control instructions that indicate whether the operation accelerator should compress and decompress, which prevents the operation accelerator from compressing and decompressing calculation results produced from input data with a low sparse rate and thereby increases the compression and decompression gains.
  • the judgment module is specifically configured to:
  • when the sparse rate of the input data of the i-th layer of the neural network is greater than a threshold, determine that the operation accelerator compresses the calculation result and decompresses the calculation result, used as the input data of the (i+1)-th layer, when performing the (i+1)-th layer calculation;
  • when the sparse rate of the input data of the i-th layer of the neural network is not greater than the threshold, determine that the operation accelerator does not compress the calculation result and does not decompress the calculation result, used as the input data of the (i+1)-th layer, when performing the (i+1)-th layer calculation.
  • the threshold is determined based on an input/output (I/O) bandwidth gain and a power consumption cost, where the I/O bandwidth gain is used to indicate the I/O bandwidth saved by the operation accelerator compressing and decompressing the calculation result, and the power consumption cost is used to indicate the power consumption added by the operation accelerator compressing and decompressing the calculation result; the sketch below illustrates the resulting per-layer decision.
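
The per-layer judgment can be pictured as follows; the threshold and sparse-rate values are illustrative assumptions, not figures from the patent.

```python
# Sketch of the per-layer judgment (all values are illustrative, not measured).
SPARSE_RATE_THRESHOLD = 0.5  # assumed threshold derived from I/O gain vs. power cost

def should_compress(layer_sparse_rates, i):
    # Compress/decompress the i-th layer's calculation result only if the sparse
    # rate of the i-th layer's input data exceeds the threshold.
    return layer_sparse_rates[i] > SPARSE_RATE_THRESHOLD

estimated_sparse_rates = [0.05, 0.30, 0.55, 0.70, 0.80]  # grows as ReLU layers accumulate
control_flags = [should_compress(estimated_sparse_rates, i)
                 for i in range(len(estimated_sparse_rates))]
print(control_flags)  # [False, False, True, True, True]
```
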
  • an operation acceleration processing system including:
  • a processor, for generating a control instruction, where the control instruction is used to indicate whether the arithmetic accelerator compresses and decompresses a calculation result obtained after calculating the input data of the i-th layer of the neural network, where 1 ≤ i ≤ N and N is the number of layers of the neural network;
  • the operation accelerator is configured to calculate the input data of the i-th layer of the neural network to obtain the calculation result, and obtain the control instruction generated by the processor, and determine whether to compress and decompress the calculation result according to the control instruction.
  • the operation accelerator includes:
  • a compression module, used to compress the calculation result; and a decompression module, used to decompress the calculation result.
  • the operation acceleration processing system further includes:
  • the memory is configured to store the control instruction generated by the processor; correspondingly, the processor is further configured to store the generated control instruction in the memory; the operation accelerator is further configured to obtain the control instruction from the memory .
  • an embodiment of the present application provides a compression method, which is applied to an operation accelerator.
  • the operation accelerator includes a first cache and a second cache.
  • the method includes:
  • the compression method further includes:
  • the compression method includes:
  • obtaining a control instruction, where the control instruction is used to indicate whether to compress and decompress the calculation result; and parsing the control instruction; the compressing of the calculation result to obtain compressed data includes: when the control instruction instructs to compress the calculation result, compressing the calculation result to obtain the compressed data.
  • the decompressing the compressed data includes: when the control instruction instructs to decompress the calculation result, decompressing the compressed data.
  • the compression method further includes:
  • when the control instruction indicates that the calculation result is not to be compressed and decompressed, the calculation result is stored in the memory, and the calculation result is obtained from the memory and stored in the first buffer.
  • an embodiment of the present application provides a compression method, which is applied to an operation accelerator.
  • the operation accelerator includes a first cache and a second cache.
  • the compression method includes:
  • the compression method further includes:
  • when the control instruction instructs to decompress the calculation result, the compressed data obtained from the memory is decompressed and a matrix multiplication operation is performed on the decompressed data as the second input data; when the control instruction instructs not to decompress the calculation result, a matrix multiplication operation is performed on the calculation result obtained from the memory as the second input data.
  • an embodiment of the present application provides a processing method, which is applied to a processing device and includes:
  • determining, according to the sparse rate of the input data of the i-th layer in the neural network, whether the operation accelerator compresses and decompresses the calculation result obtained by calculating the input data of the i-th layer, where 1 ≤ i ≤ N and N is the number of layers of the neural network.
  • the operation accelerator is a coprocessor other than the processing device.
  • a control instruction is generated, and the control instruction is used to instruct the operation accelerator to compress and decompress the calculation result.
  • determining whether the operation accelerator compresses and decompresses the calculation result obtained by calculating the i-th input data according to the sparse rate of the i-th input data in the neural network includes:
  • when the sparse rate of the input data of the i-th layer of the neural network is greater than the threshold, determining that the operation accelerator compresses the calculation result and decompresses the calculation result, used as the input data of the (i+1)-th layer, when performing the (i+1)-th layer calculation;
  • when the sparse rate of the input data of the i-th layer of the neural network is not greater than the threshold, determining that the operation accelerator does not compress the calculation result and does not decompress the calculation result, used as the input data of the (i+1)-th layer, when performing the (i+1)-th layer calculation.
  • the threshold is determined based on an input/output (I/O) bandwidth gain and a power consumption cost, where the I/O bandwidth gain is used to indicate the I/O bandwidth saved by the operation accelerator compressing and decompressing the calculation result, and the power consumption cost is used to indicate the power consumption added by the operation accelerator compressing and decompressing the calculation result.
  • an embodiment of the present application further provides a processing apparatus, where the processing apparatus includes: a memory, for storing instructions; and a processor, for reading the instructions in the memory and executing the processing method of the sixth aspect or of the various possible implementations of the sixth aspect.
  • An embodiment of the present application further provides a computer storage medium.
  • a software program is stored in the storage medium, and the software program is read by one or more processors and executes the foregoing sixth aspect or various possible processing methods of the sixth aspect.
  • the embodiment of the present application further provides a computer program product containing instructions, which when executed on a computer, causes the computer to execute the foregoing sixth aspect or the various possible processing methods of the sixth aspect.
  • FIG. 1 is a structural diagram of an operation accelerator provided by the present application.
  • FIG. 2 is a structural diagram of an operation accelerator provided by an embodiment of the present application.
  • FIG. 3 is a structural diagram of an arithmetic accelerator according to another embodiment of the present application.
  • FIG. 4 is a structural diagram of an operation accelerator provided by another embodiment of the present application.
  • FIG. 5 is a structural diagram of a compression module applied to an operation accelerator provided by an embodiment of the present application.
  • FIG. 6 is a structural diagram of another compression module applied to an operation accelerator according to an embodiment of the present application.
  • FIG. 7 is a flowchart of a method for controlling an arithmetic accelerator to perform compression according to an embodiment of the present application.
  • the operation accelerator provided by the embodiments of the present application can be applied to the fields of machine learning, deep learning, and convolutional neural networks, and also to the fields of digital image processing and digital signal processing. It can also be applied to other fields involving matrix multiplication operations.
  • the operation accelerator may be a neural network processor (Neural Network Processing Unit, NPU) or other processors, and may be applied to devices that can perform convolution operations such as mobile phones, tablet computers, servers, and wearable devices.
  • the input data can be raw data initially input to the operation accelerator for an inference operation, such as picture data or voice data, or intermediate data generated by the operation accelerator while executing the neural network operation; because the amount of intermediate data is usually large, the operation accelerator stores the intermediate data calculated by the previous layer of the neural network in external memory, and when performing the calculation of the next layer of the neural network it reads the intermediate data from the memory and loads it into the operation accelerator for calculation;
  • the weight data refers to the weight data obtained after training the neural network, and the training process of the neural network is a process of continuously adjusting the weight value;
  • the calculation result refers to the intermediate data or final data generated by the operation accelerator while executing the neural network operation; it can be the data output by the operation circuit in the operation accelerator, or the data obtained after the vector calculation unit further processes the data output by the operation circuit. It should be noted that a calculation result is also input data: the calculation result of one layer of the neural network is often used as input data for the calculation of the next layer.
  • the sparseness of data usually refers to the proportion of data with missing or zero values in the data set to the total data.
  • FIG. 1 is a hardware structural diagram of an operation accelerator provided by the present application.
  • the arithmetic accelerator 30 is mounted on the host central processing unit (CPU) 10 as a coprocessor, and the main CPU 10 assigns tasks.
  • a core part of the operation accelerator 30 is an operation circuit 303.
  • the controller 304 controls the operation circuit 303 to extract data in an input buffer (Input buffer) 301 or a weight buffer (Weight buffer) 302 and perform operations.
  • the computing circuit 303 internally includes multiple processing engines (Process Engines, PEs).
  • the arithmetic circuit 303 is a two-dimensional systolic array.
  • the arithmetic circuit 303 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition.
  • the arithmetic circuit 303 is a general-purpose matrix processor.
  • the arithmetic circuit fetches the corresponding data of the matrix B from the weight buffer 302, and buffers the data on each PE in the arithmetic circuit.
  • the arithmetic circuit takes the data of matrix A from the input buffer 301, performs a matrix multiplication operation with the data of matrix B to obtain a partial result or a final result of the matrix, and stores the result in the accumulator 308; a minimal sketch of this matrix multiplication follows.
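
For orientation only, the sketch below spells out the multiply-accumulate arithmetic that the operation circuit performs in hardware; plain Python lists stand in for the input buffer, the weight buffer, and the accumulator.

```python
# Minimal sketch of the A x B matrix multiplication the operation circuit performs.

def matmul(a, b):
    rows, inner, cols = len(a), len(b), len(b[0])
    acc = [[0] * cols for _ in range(rows)]     # accumulator
    for i in range(rows):
        for k in range(inner):
            for j in range(cols):
                acc[i][j] += a[i][k] * b[k][j]  # multiply-accumulate, as in each PE
    return acc

input_buffer  = [[1, 2], [3, 4]]            # matrix A (input data)
weight_buffer = [[5, 6], [7, 8]]            # matrix B (weight data)
print(matmul(input_buffer, weight_buffer))  # [[19, 22], [43, 50]]
```
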
  • the vector calculation unit 307 may further process the output of the operation circuit 303, such as vector multiplication, vector addition, exponential operation, logarithmic operation, and size comparison.
  • the vector calculation unit 307 can be specifically used for network calculation of non-convolution/non-FC layers in a convolutional neural network, such as pooling, batch normalization, and local response normalization.
  • the vector calculation unit 307 stores the processed output vector into the unified buffer 306.
  • the vector calculation unit 307 may apply a non-linear function to the output of the arithmetic circuit 303, such as a vector of accumulated values, to generate an activation value.
  • the vector calculation unit 307 generates a normalized value, a merged value, or both.
  • a vector of the processed output can be used as an activation input to the arithmetic circuit 303, for example for use in subsequent layers in a neural network.
  • a unified buffer (Unified buffer) 306 is used to store output calculation results and input data of some layers.
  • a direct memory access controller (DMAC) 305 is used to store input data (or input matrices) from the memory 20 outside the operation accelerator 30 into the input buffer 301 and the unified buffer 306, to store weight data (also called a weight matrix) into the weight buffer 302, or to store data in the unified buffer 306 into the memory 20.
  • a bus interface unit (BIU) 310 is used for interaction among the main CPU 10, the DMAC 305, and the instruction fetch buffer 309 through a bus.
  • An instruction fetch buffer 309 connected to the controller 304 is used to store instructions used by the controller 304;
  • the controller 304 is configured to call an instruction buffered in the instruction fetch buffer 309 to control the working process of the operation accelerator 30.
  • the unified cache 306, the input cache 301, the weight cache 302, and the fetch cache 309 are all on-chip buffers.
  • the memory 20 is a memory external to the operation accelerator 30.
  • the memory 20 may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
  • the input cache is the first cache
  • the weight cache is the second cache
  • the fetch cache is the third cache
  • the unified cache is the fourth cache.
  • when the above operation accelerator implements convolution and FC operations, the amount of weight data involved is too large to be stored in the weight cache, so during execution the operation accelerator needs to import the weight data from the memory in real time to perform the calculation; similarly, the calculation result obtained after performing the operation of the previous layer of the neural network is too large to be stored in the unified cache, so the calculation result of the previous layer needs to be exported to the memory, and when the operation accelerator needs to perform the calculation of the next layer of the neural network, the calculation result of the previous layer is imported from the memory as input data.
  • both exporting and importing calculation results occupy the input/output (I/O) bandwidth of the operation accelerator; if the I/O bandwidth becomes a bottleneck, the computing resources of the operation accelerator will sit idle and the operation performance of the operation accelerator will be reduced.
  • FIG. 2 is a hardware structure diagram of an operation accelerator 40 provided in an embodiment of the present application.
  • the operation accelerator 40 mainly includes a compression module 311 and a decompression module 312.
  • the input buffer 301 stores input data
  • the weight buffer 302 stores weight data
  • the arithmetic circuit 303 performs matrix multiplication operation on the input data obtained from the input buffer 301 and the weight data obtained from the weight buffer 302 to obtain a calculation result.
  • the calculation result may be an intermediate result or a final result.
  • the calculation result is stored in the accumulator 308.
  • the vector calculation unit 307 can take the calculation result from the accumulator 308 for further processing, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and so on, and the vector calculation unit 307 stores the processed calculation result in the unified buffer 306.
  • the compression module 311 obtains the calculation result from the unified buffer 306, and compresses the calculation result to obtain compressed data, and the DMAC 305 stores the compressed data output by the compression module 311 into the memory 20.
  • the operation accelerator 40 also includes a decompression module 312, configured to obtain the compressed data from the memory 20 through the DMAC 305, decompress the compressed data to obtain decompressed data, and store the decompressed data as input data in the input buffer 301.
  • the arithmetic circuit 303 performs a matrix multiplication operation on the input data obtained from the input buffer 301 and the weight data obtained from the weight buffer 302.
  • the compression module and decompression module added to the operation accelerator reduce the amount of data transferred from the operation accelerator to the memory and the amount of data transferred from the memory back into the operation accelerator for the next calculation, which saves the I/O bandwidth of the operation accelerator and improves its calculation performance.
  • the first layer of data in the neural network (that is, the initial input data) is calculated in the operation accelerator, and the calculation result serves as the input data of the second layer; thereafter, the calculation result output by each layer is used as the input data of the next layer, until the final layer (the fully connected layer) produces the final result. Because the sparse rate of the first layer of data is usually low, compressing it brings little I/O bandwidth benefit while still incurring the power cost of the compression function, so the compression gain is low. However, as the neural network gets deeper, the repeatedly applied Rectified Linear Unit (ReLU, also known as the activation function) gradually increases the sparse rate of the calculation results, and a higher sparse rate yields a larger I/O bandwidth benefit. Therefore, the operation accelerator can start the compression function once the calculation reaches a certain layer, so as to maximize its compression gain.
  • FIG. 3 is a structure of an operation accelerator 50 provided in an embodiment of the present application.
  • in the operation accelerator 50, the controller 504 is connected to the compression module 311, the decompression module 312, the unified cache 306, the DMAC 305, and the instruction fetch cache 309; the instruction fetch buffer 309 obtains a control instruction from the memory 20 and stores it, where the control instruction is used to indicate whether the operation accelerator 50 compresses the calculation result of each layer operation in the neural network and whether the operation accelerator 50 decompresses the calculation result obtained from the memory 20; the controller 504 reads the control instruction from the instruction fetch buffer to control the relevant components in the operation accelerator.
  • the controller 504 obtains the control instruction from the instruction fetch cache and parses it; when the control instruction instructs to compress the calculation result, the controller controls the compression module 311 to compress the calculation result obtained from the unified cache 306, and the DMAC 305 transfers the compressed calculation result to the memory 20; when the control instruction indicates that the calculation result is not to be compressed, the controller controls the unified buffer 306 to send the calculation result to the DMAC 305, and the DMAC 305 transfers the calculation result to the memory 20 without it passing through the compression module.
  • the controller 504 also needs to control the decompression module 312 to perform the decompression process.
  • the above control instruction not only instructs the operation accelerator 50 to compress the calculation result of each layer in the neural network, but also indicates whether the operation accelerator 50 decompresses the calculation result obtained from the memory 20.
  • the controller 504 obtains the control instruction from the instruction fetch buffer 309, parses the control instruction, and when the control instruction instructs to decompress the calculation result, the control decompression module 312 performs decompression processing on the obtained calculation result, and The decompression module 312 stores the decompressed data as input data in the input buffer 301.
  • when the control instruction indicates that the calculation result is not to be decompressed, the controller controls the DMAC 305 to store the calculation result directly into the input buffer 301 as input data; in this case the calculation result does not pass through the decompression module 312.
  • FIG. 3 also shows the structure of the main CPU 10.
  • the main CPU 10 includes a software-implemented acceleration library and a compilation module.
  • the acceleration library may include multiple components to complete different acceleration optimization operations, such as a quantization module that quantizes data and a sparsity module that supports sparse computing architectures.
  • the compilation module is used to generate instructions to control the operation accelerator to complete calculation operations.
  • the main CPU may further include a driver and task scheduling module (not shown in FIG. 4), and the connection between the main CPU and the operation accelerator is achieved through the driver and task scheduling module.
  • the acceleration library further includes a judgment module, which is used to analyze the characteristics of the neural network formed after training (the neural network formed after training is the neural network on which the operation accelerator performs inference operations), for example through algorithmic analysis or measured data, to infer from these characteristics the sparse rate of the input data of each layer of the neural network, and to determine, based on the sparse rate of the input data of each layer, whether the calculation result of that layer is to be compressed and decompressed; the information on whether the calculation result of each layer is compressed and decompressed is sent to the compilation module, and the compilation module generates the specific control instructions.
  • the judgment module compares the sparse rate of the input data of the i-th layer in the neural network with a preset threshold. When the sparse rate of the input data of the i-th layer of the neural network is greater than the threshold, it is determined that the calculation result of the i-th layer needs to be compressed.
  • the above threshold can be determined according to the benefits of I / O bandwidth and the cost of power consumption.
  • the benefit of I / O bandwidth refers to the I / O bandwidth reduced by the compression and decompression of the calculation result by the operation accelerator.
  • the cost of power consumption refers to the power consumption added by the operation accelerator when compressing and decompressing the calculation results.
  • the threshold may be determined in advance; for example, in a prior test, the critical value of the sparse rate at which the I/O bandwidth gain brought by the compression and decompression of the operation accelerator equals the power consumption cost can be found, and this critical value can be used as the threshold, or the threshold can be obtained by adjusting this critical value; the method for determining the threshold is not limited in this application.
  • in different cases, the preset thresholds may be different; a sketch of such a threshold search follows.
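
One way to picture such a threshold search is sketched below; the gain and cost models are assumptions made up for illustration and are not taken from the patent.

```python
# Sketch of picking the threshold where I/O bandwidth gain equals power cost.
# Both model functions below are assumptions for illustration only.

def io_bandwidth_gain(sparse_rate, result_bytes=1 << 20):
    # Assumed model: bytes saved grows roughly with the share of zero values.
    return result_bytes * sparse_rate

def power_cost(result_bytes=1 << 20, cost_per_byte=0.4):
    # Assumed model: compressing/decompressing costs a fixed amount per byte.
    return result_bytes * cost_per_byte

def find_threshold():
    for pct in range(0, 101):
        rate = pct / 100
        if io_bandwidth_gain(rate) >= power_cost():
            return rate  # the critical sparse rate used as the preset threshold
    return 1.0

print(find_threshold())  # 0.4 under these assumed models
```
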
  • the compilation module is used to compile the information, obtained from the judgment module, on whether the calculation result of each layer is compressed and decompressed into the above-mentioned control instruction; that is, the control instruction is used to instruct the operation accelerator whether to compress and decompress the calculation result of each layer in the neural network.
  • when the sparse rate of the input data of a layer in the neural network is low, starting compression and decompression in the operation accelerator brings a small I/O bandwidth benefit while still incurring the power loss of the compression and decompression functions, so the compression gain is low.
  • when the main CPU judges that the sparse rate of the input data of a layer in the neural network is high, it controls the operation accelerator to compress and decompress the calculation result of that layer; because the I/O bandwidth gain is large in this case, it can offset part of the power loss caused by the compression and decompression functions and increase the compression gain.
  • the controller analyzes the control instructions stored in the fetch buffer, and performs different control operations on the compression module, the unified cache, the decompression module, and the DMAC according to the analysis result.
  • the controller may not execute the analysis of the control instruction, and the analysis of the control instruction is handed over to the compression module and the decompression module in the operation accelerator.
  • FIG. 4 is a structure of an operation accelerator 60 according to an embodiment of the present application.
  • the instruction fetch buffer 309 obtains a control instruction from the memory 20 and stores it, and the controller 304 distributes the control instruction to the compression module 611 and the decompression module 612.
  • the compression module 611 parses the control instruction; when the control instruction instructs to compress the calculation result, the calculation result is compressed to obtain the compressed calculation result, and the compressed calculation result is transferred to the memory 20 by the DMAC 305; when the control instruction indicates that the calculation result is not to be compressed, the calculation result is sent directly to the DMAC 305, and the DMAC 305 transfers the calculation result to the memory 20.
  • the decompression module 612 parses the control instruction; when the control instruction instructs to decompress the calculation result obtained from the memory 20, the obtained calculation result is decompressed and the decompressed data is stored as input data in the input buffer 301; when the control instruction indicates that the calculation result obtained from the memory 20 is not to be decompressed, the obtained calculation result is stored directly as input data in the input buffer 301.
  • a fragmentation module 3110 is introduced into the compression module 311; it is used to fragment the calculation result received by the compression module 311, and compression processing is then performed on each fragment separately.
  • FIG. 5 is a structural diagram of a compression module 311 provided in the present application.
  • the compression module 311 includes a fragmentation module 3110 and at least one compression engine 3111.
  • the fragmentation module 3110 is configured to fragment the received calculation result to obtain at least one sub-computation result, and each compression engine 3111 is configured to compress one of the sub-computation results to obtain sub-compressed data, where the sum of the sub-compressed data generated by the compression engines constitutes the compressed data output by the compression module 311.
  • this application does not limit the compression algorithm used in the compression engine 3111.
  • Commonly used compression algorithms in the industry include entropy coding and run-length coding. Different compression algorithms have their own applicable scenarios.
  • the compression engine can compress the zero values in the data; since the compression module runs on the hardware logic circuit of the operation accelerator 40, the selection of the compression algorithm needs to take hardware resources, power consumption, performance, and so on into account, and this application does not limit which compression algorithm the compression engine uses; a toy run-length sketch follows below.
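
As a toy illustration of why a high share of zero values compresses well, the sketch below implements a simple zero-run-length coder; this is not the algorithm mandated by the patent, which leaves the choice open.

```python
# Toy zero-run-length coder (illustration only; the patent does not mandate an algorithm).

def rle_encode(values):
    out = []
    zero_run = 0
    for v in values:
        if v == 0:
            zero_run += 1          # extend the current run of zeros
        else:
            if zero_run:
                out.append(("Z", zero_run))  # a run of zeros collapses to one token
                zero_run = 0
            out.append(("V", v))   # non-zero values are kept as-is
    if zero_run:
        out.append(("Z", zero_run))
    return out

data = [0, 0, 0, 0, 7, 0, 0, 3, 0, 0, 0, 0, 0]
encoded = rle_encode(data)
print(encoded)                       # [('Z', 4), ('V', 7), ('Z', 2), ('V', 3), ('Z', 5)]
print(len(encoded), "<", len(data))  # 5 tokens instead of 13 values
```
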
  • the calculation result is divided into four sub-calculation results as shown in FIG. 3, and each sub-calculation result is assigned to a corresponding compression engine for compression processing, that is, a total of four compression engines.
  • the number of compression engines can be determined by the performance of a single compression engine and the performance requirements of the operation accelerator, among other factors; this application does not limit the number of compression engines, and the number of compression engines and the number of sub-computation results need not correspond one-to-one, for example, one compression engine can process two or more sub-computation results.
  • after the compression engine 3111 compresses the data to be compressed, the size of the compressed data may be larger than the size of the data to be compressed; in this case, moving the compressed data to the memory does not help reduce the I/O bandwidth of the operation accelerator, so in order to further reduce the I/O bandwidth of the operation accelerator, a further design can be made for the compression engine 3111.
  • each compression engine 3111 compresses the received sub-computation result to obtain a sub-compression result and compares the size of the sub-compression result with that of the sub-computation result; when the sub-compression result is larger than the sub-computation result, the sub-computation result is used as the output sub-compressed data and a compression failure identifier corresponding to the sub-compressed data is generated, because the output sub-compressed data in this case has not undergone compression processing; when the sub-compression result is not larger than the sub-computation result, the sub-compression result is used as the output sub-compressed data and a compression success identifier corresponding to the sub-compressed data is generated, because the output sub-compressed data in this case is compression-processed data. The compression failure identifier and the compression success identifier are stored in the memory through the DMAC.
  • when the size of the sub-compression result equals the size of the sub-computation result, two implementations are possible. In the first implementation, the compression engine 3111 uses the sub-compression result as the sub-compressed data and outputs it to the DMAC; although a decompression step is added for this sub-compressed data in the next calculation, the compression engine 3111 does not have to read the sub-computation result again and output it to the DMAC, which saves the power consumption of re-reading the sub-computation result. In the second implementation, the compression engine 3111 directly outputs the sub-computation result to the DMAC as the sub-compressed data; in this case the sub-compressed data is uncompressed, and although the compression engine 3111 has to read the sub-computation result again, the decompression of the sub-compressed data can be avoided in the next calculation, saving the power consumption of decompression. For the scenario where the two sizes are equal, the two implementations therefore each have advantages and disadvantages, and an appropriate choice can be made in practice according to the performance requirements of the operation accelerator.
  • the decompression module 312 may include multiple decompression engines, each of which decompresses the sub-compressed data it receives. Specifically, after obtaining the sub-compressed data, each decompression engine identifies the identifier corresponding to the sub-compressed data: when the identifier indicates a compression failure, the sub-compressed data is stored directly in the input cache as input data, that is, no decompression is performed; when the identifier indicates a compression success, the sub-compressed data is decompressed to obtain decompressed data, and the decompressed data is stored in the input buffer as input data. A sketch of this identifier check follows.
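
The identifier check on the decompression side can be pictured as follows; this pairs with the compress_engine sketch above, and zlib is again only a stand-in.

```python
# Sketch of the identifier-aware decompression engine (assumed interface; zlib is a stand-in).
import zlib

def decompress_engine(sub_data: bytes, compressed_ok: bool) -> bytes:
    if not compressed_ok:
        # "Compression failed" identifier: the data was stored uncompressed,
        # so it goes into the input buffer without decompression.
        return sub_data
    # "Compression succeeded" identifier: decompress before use as input data.
    return zlib.decompress(sub_data)

raw = b"\x01\x02" * 8                                       # an incompressible-looking shard
print(decompress_engine(raw, False) == raw)                 # True: passed through untouched
print(decompress_engine(zlib.compress(raw), True) == raw)   # True: restored by decompression
```
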
  • the number of decompression engines is determined by the decompression performance of a single decompression engine and the performance requirements of the operation accelerator. At the same time, the coupling between the decompression process and the compression process needs to be taken into account.
  • the decompression algorithm used by the decompression engine corresponds to the compression algorithm used by the compression engine in the compression module.
  • FIG. 6 is a structural diagram of a compression module 611 according to another embodiment of the present application; compared with the compression module 311 described in FIG. 5, the compression module 611 mainly adds a parsing module 610.
  • the parsing module 610 is used for parsing a control instruction.
  • when the control instruction instructs to compress the calculation result, the calculation result is provided to the sharding module 3110 for sharding, the compression engine 3111 then compresses the sharded sub-calculation results and sends them to the DMAC 305, and finally the compressed calculation result is transferred to the memory 20 by the DMAC 305; this part of the implementation is the same as that of the compression module 311 in FIG. 5 described above and is not described in detail here. When the control instruction indicates that the calculation result is not to be compressed, the calculation result is sent directly to the DMAC 305, and the DMAC 305 transfers the calculation result to the memory 20.
  • the decompression module 612 is used for parsing the control instruction; when the control instruction instructs to decompress the calculation result obtained from the memory 20, the obtained calculation result is decompressed and the decompressed data is stored in the input buffer 301 as input data; when the control instruction indicates that the calculation result obtained from the memory 20 is not to be decompressed, the obtained calculation result is stored directly in the input buffer 301 as input data.
  • the structural diagram of the decompression module 612 is not given in this application.
  • this embodiment of the present application provides a method for controlling the operation accelerator to perform compression; as shown in FIG. 7, the method may include the following steps S701 to S709, where S701 to S704 are executed by the main CPU and S705 to S709 are executed by the operation accelerator.
  • the CPU analyzes the characteristics of the neural network formed after training (the neural network formed after training is the neural network on which the operation accelerator performs inference operations), for example through algorithmic analysis or measured data, then infers from these characteristics the sparse rate of the input data of each layer of the neural network, and determines, based on the sparse rate of the input data of each layer, whether to compress and decompress the calculation result of that layer; specifically, this can be implemented by comparing the sparse rate of the input data of each layer with the threshold. Since the specific implementation has been described in detail in the above embodiment, it is not repeated here.
  • the compilation module inside the CPU performs instruction compilation to generate control instructions according to the information on whether the calculation result of each layer is compressed and decompressed; the control instructions are used to instruct the operation accelerator whether to compress and decompress the calculation result of each layer in the neural network.
  • the CPU stores the generated control instructions in a memory external to the operation accelerator.
  • the CPU controls the transfer of the control instructions stored in the memory to the instruction fetch buffer in the operation accelerator.
  • the operation accelerator reads the control instruction from the instruction fetch buffer.
  • the operation accelerator calculates the input data of each layer in the neural network and obtains the calculation result. Specifically, the operation can be performed by an operation circuit in the operation accelerator.
  • S707. Determine whether to perform compression and decompression processing on the calculation result according to the control instruction.
  • the operation accelerator parses the control instruction and determines, according to the control instruction, whether to perform compression and decompression processing on the calculation result of each layer in the neural network; when it is determined that compression and decompression are to be performed, S708 is performed, and when it is determined that compression and decompression are not to be performed, S709 is performed.
  • when the operation accelerator determines that the calculation result needs to be compressed, it compresses the calculation result and stores the compressed calculation result in the memory.
  • the operation accelerator then obtains the calculation result from the memory, decompresses the obtained calculation result to obtain the decompressed data, and uses the decompressed data as input data to participate in the calculation of the next layer of the neural network.
  • the operation accelerator directly stores the calculation result in the memory when it is determined that the calculation result does not need to be compressed.
  • the operation accelerator then obtains the calculation result from the memory and uses the obtained calculation result directly as input data to participate in the calculation of the next layer; the calculation result obtained by the operation accelerator in this case does not need to be decompressed.
  • It can be seen from the above that, for input data with a low sparsity rate in the neural network, the compression gain is low if the operation accelerator enables compression and decompression.
  • In this embodiment of the present application, the CPU controls the operation accelerator to compress and decompress the calculation result of a layer only when it determines that the sparsity rate of the input data of that layer is relatively high. Because the I/O bandwidth gain is relatively large in this case, the compression gain is improved.
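As a back-of-the-envelope illustration of why a threshold based on I/O bandwidth gain versus power cost makes sense, the following sketch compares the energy saved on data movement with the extra energy spent on compression; every number in it is an invented assumption, not a value from this document.

```python
# Toy cost model for the compress-or-not decision. The figures are made up
# purely to show the shape of the trade-off: compression pays off only when
# the I/O traffic removed outweighs the energy spent by the compression and
# decompression engines.

def io_bytes_saved(result_bytes: int, sparsity: float, overhead: float = 0.05) -> float:
    """Rough estimate: a zero-aware encoder removes roughly the zero fraction,
    minus a fixed metadata overhead."""
    return result_bytes * max(sparsity - overhead, 0.0)

def worth_compressing(result_bytes: int, sparsity: float,
                      joules_per_byte_io: float = 1e-9,
                      joules_compression: float = 2e-5) -> bool:
    energy_saved = io_bytes_saved(result_bytes, sparsity) * joules_per_byte_io
    return energy_saved > joules_compression

print(worth_compressing(1 << 20, sparsity=0.05))  # low sparsity: not worth it
print(worth_compressing(1 << 20, sparsity=0.60))  # high sparsity: worth it
```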
  • The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • the above embodiments may be implemented in whole or in part in the form of a computer program product.
  • The computer program product described above includes one or more computer instructions. When the computer program instructions are loaded or executed on the operation accelerator, the procedures or functions according to the embodiments of the present application are generated in whole or in part.
  • The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or a data center, that integrates one or more available media.
  • the available medium may be a magnetic medium (for example, a floppy disk, a hard disk, a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium.
  • the semiconductor medium may be a solid state drive (SSD).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Advance Control (AREA)

Abstract

本申请公开了一种运算加速器,该运算加速器包括:第一缓存,用于存储第一输入数据;第二缓存,用于存储权重数据;与该输入缓存和该权重缓存连接的运算电路,用于对第一输入数据和该权重数据进行矩阵乘运算以得到计算结果;压缩模块,用于对该计算结果进行压缩以得到压缩数据;与该压缩模块连接的直接存储器访问控制器DMAC,用于将该压缩数据存入到该运算加速器之外的存储器。由于在该运算加速器中增加了压缩模块,降低了从该运算加速器中搬运计算结果到存储器的数据量,节省该运算加速器的I/O带宽,提升该运算加速器的计算性能。

Description

运算加速器和压缩方法 技术领域
本申请涉及人工智能(Artificial Intelligence,AI)领域的数据计算技术,尤其涉及一种运算加速器、处理装置、压缩方法和处理方法。
背景技术
由于卷积神经网络在图像分类、图像识别、音频识别以及其他相关领域的不俗表现,使其成为了学术界与工业界的研究和开发热门。使用AI运算加速器的方法对卷积神经网络进行运算加速,可以提升卷积神经网络相关应用的运行效率,缩短卷积神经网络相关应用的执行时间,是当前的研究热点。
卷积神经网络可用于对输入图像中的具体特征进行识别,输入图像在卷积神经网络中通常要至少经过4种层,分别是卷积(Conv)层、修正线性单元(Rectified Linear Unit,Relu)(又称激活函数)层、池化(Pooling)层和全连接(FC)层。卷积(Conv)层的作用是通过多个滤波器对输入数据(即输入图像的数据)进行特征识别,每个滤波器具有一个扫描范围,用来扫描输入图像的一定区域内的数据信息。当前的Conv层得到的计算结果会被输入到下一层(比如Relu层、Pooling层或者FC层)进行处理。Relu层是对输入数据进行类似求MAX(0,x)的运算,即将输入数据中的每个值与0值进行比较,如果比0值大就保留,比0值小就置为0值。Relu层会提高输入数据的稀疏率(数据中0值个数占数据总个数的百分比),不会改变输入数据的尺寸。池化(Pooling)层的作用是下采样,即在输入数据的每一层的二维矩阵中隔行或者隔列抽取数据,进而缩小输入数据的尺寸。全连接(FC)层与Conv层相似,唯一不同在于:FC层的滤波器不是对输入数据的某一个小区域进行扫描,而是一次性扫描整个输入数据,然后输出一个值。FC层中会有多个滤波器,对应多个不同的非常具体的图像特征,而输出的值则相当于“分值”,用以表示输入数据中包含这些特征的“可能性”。
AI运算加速器的核心是Conv和FC运算,在多数神经网络中Conv和FC运算的计算量占整个网络计算量的比例可达90%以上,因此可以说Conv和FC的运算性能通常决定了AI运算加速器的总体性能。AI运算加速器在实现Conv和FC运算时,由于涉及到的权重数据的数量较大,无法全部保存在片上缓存内,因此在推理过程中,需要将权重数据从运算加速器外部的存储器导入到运算加速器来完成计算,并且AI运算加速器在执行神经网络的上一层的运算之后所得到的计算结果的数据量也较大,难以保存在片上缓存内,需要将上一层的计算结果导出到AI运算加速器外部的存储器,在AI运算加速器需要执行该神经网络的下一层计算时,再从该存储器中导入上一层的计算结果作为输入数据来进行运算。
导入和导出输入数据都将占用AI运算加速器的输入/输出(I/O)带宽,如果I/O带宽成为瓶颈,将导致该AI运算加速器的计算功能空置,降低AI运算加速器的整体性能。
发明内容
本申请实施例提供了一种运算加速器、处理装置、压缩方法和处理方法,旨在节省运 算加速器的I/O带宽,提升运算加速器的计算性能。
为达到上述目的,本申请实施例提供如下技术方案:
第一方面,本申请实施例提供了一种运算加速器,包括:
第一缓存,用于存储第一输入数据;第二缓存,用于存储权重数据;与第一缓存和第二缓存连接的运算电路,用于对第一输入数据和权重数据进行矩阵乘运算以得到计算结果;压缩模块,用于对该计算结果进行压缩以得到压缩数据;与该压缩模块连接的直接存储器访问控制器DMAC,用于将该压缩数据存入到运算加速器之外的存储器。
其中,第一缓存是运算加速器中的输入缓存,第二缓存是运算加速器中的权重缓存存储器。
由于该运算加速器中增加了压缩模块,降低了从该运算加速器中搬运计算结果到运算加速器外部的存储器的数据量,节省该运算加速器的I/O带宽,提升该运算加速器的计算性能。
在一个可选的实现方式中,该运算加速器还包括:
与该DMAC和该第一缓存连接的解压缩模块,用于接收由该DMAC从该存储器中获取的该压缩数据,对该压缩数据进行解压,并将解压后的数据作为第二输入数据存入该第一缓存;该运算电路,还用于从该第一缓存中获取该第二输入数据以进行矩阵乘运算。
由于该运算加速器中增加了解压缩模块,降低了从存储器中搬运计算结果到该运算加速器中进行下一次计算的数据量,节省该运算加速器的I/O带宽,提升该运算加速器的计算性能。
在一个可选的实现方式中,该运算加速器还包括:
第三缓存,用于存储控制指令,该控制指令用于指示是否对该计算结果进行压缩和解压缩,其中,第三缓存是该运算加速器中的取指缓存;与该第三缓存连接的控制器,用于从该第三缓存中获取该控制指令,并且解析该控制指令,在该控制指令指示对该计算结果进行压缩和解压缩时,控制该压缩模块对该计算结果进行压缩以得到该压缩数据,以及控制该解压缩模块对获取的该压缩数据进行解压缩。
在一个可选的实现方式中,该运算加速器还包括:
与该运算电路连接的第四缓存,用于存储该运算电路计算的该计算结果,其中,第四缓存为该运算加速器中的统一缓存;该控制器,还用于在该控制指令指示对该计算结果不进行压缩和解压缩时,控制该DMAC将该第四缓存中的该计算结果存入该存储器,以及控制该DMAC将该存储器中的该计算结果存入该第一缓存。
由于运算加速器中控制器确定是否启动压缩和解压缩功能,可以避免对神经网络中稀疏率较低的输入数据进行计算所生成的计算结果启动压缩和解压缩,从而提高压缩收益和解压缩收益。
在一个可选的实现方式中,该运算加速器还包括:
第三缓存,用于存储控制指令,该控制指令用于指示是否对该计算结果进行压缩和解压缩;与该第三缓存连接的控制器,用于从该第三缓存中获取该控制指令,并且将该控制指令分发给该压缩模块和该解压缩模块;该压缩模块,用于解析该控制指令,在该控制指令指示对该计算结果进行压缩时,对该计算结果进行压缩以得到该压缩数据;该解压缩模 块,用于解析该控制指令,在该控制指令指示对该计算结果进行解压缩时,对获取的该压缩数据进行解压缩。
由于运算加速器中压缩模块确定是否启动压缩,可以避免对神经网络中稀疏率较低的输入数据进行计算所生成的计算结果启动压缩,提高压缩收益,以及运算加速器中解压缩模块确定是否启动解压缩,可以避免对神经网络中稀疏率较低的输入数据进行计算所生成的计算结果启动解压缩,提高解压缩收益。
在一个可选的实现方式中,该运算加速器还包括:
该压缩模块,还用于在该控制指令指示对该计算结果不进行压缩时,控制该DMAC将该计算结果存入该存储器;该解压缩模块,还用于在该控制指令指示对该计算结果不进行解压缩时,控制该DMAC将该存储器中的该计算结果存入该第一缓存。
在一个可选的实现方式中,该压缩模块包括分片模块和至少一个压缩引擎,
该分片模块,用于对该计算结果进行分片处理以得到至少一个子计算结果;该至少一个压缩引擎中每个压缩引擎,用于对该至少一个子计算结果中的一个子计算结果进行压缩以得到子压缩数据,其中,该至少一个压缩引擎中每个压缩引擎生成的子压缩数据的总和组成该压缩数据。
对待压缩的数据进行分片处理,然后针对分片后的每个子压缩数据进行压缩处理,可以提高压缩效率。
在一个可选的实现方式中,该至少一个压缩引擎中每个压缩引擎,具体用于:
对该子计算结果进行压缩以得到子压缩结果;比较该子压缩结果和该子计算结果的大小;在该子压缩结果大于该子计算结果时,将该子计算结果作为该子压缩数据;在该子压缩结果不大于该子计算结果时,将该子压缩结果作为该子压缩数据。
在一个可选的实现方式中,该至少一个压缩引擎中每个压缩引擎还用于:
在该子压缩结果大于该子计算结果时,生成一个与该子压缩数据对应的压缩失败的标识,其中,该压缩失败的标识经由该DMAC进行控制以存入该存储器;在该子压缩结果不大于该子计算结果时,生成一个与该子压缩数据对应的压缩成功的标识,其中,该压缩成功的标识经由该DMAC进行控制以存入该存储器。
在一个可选的实现方式中,该解压缩模块具体用于:
接收由该DMAC从该存储器中获取的该子压缩数据;在该子压缩数据对应的标识为压缩失败的标识时,将该子压缩数据作为该第二输入数据存入该第一缓存;在该子压缩数据对应的标识为压缩成功的标识时,对该子压缩数据进行解压,并将解压后的数据作为该第二输入数据存入该第一缓存。
第二方面,本申请实施例提供了一种处理装置,包括:
判断模块,用于根据神经网络中第i层输入数据的稀疏率,确定运算加速器是否对第i层输入数据进行计算后所得到的计算结果进行压缩和解压缩,其中,1≤i≤N,N为该神经网络的层数,该运算加速器为处理装置之外的协处理器;
编译模块,用于根据该判断模块的确定结果生成控制指令,该控制指令用于指示该运算加速器是否对该计算结果进行压缩和解压缩。
处理器根据神经网络中输入数据的稀疏率来生成是否指示运算加速器进行压缩和解压 缩的控制指令,可以避免运算加速器对神经网络中稀疏率较低的输入数据进行计算所生成的计算结果启动压缩和解压缩,从而提高压缩收益和解压缩收益。
在一个可选的实现方式中,该判断模块,具体用于:
在该神经网络第i层输入数据的稀疏率大于阈值时,确定该运算加速器对该计算结果进行压缩,以及在将该计算结果作为第i+1层输入数据进行第i+1层计算时进行解压缩;
在该神经网络第i层输入数据的稀疏率不大于阈值时,确定该运算加速器对该计算结果不进行压缩,以及在将该计算结果作为第i+1层输入数据进行第i+1层计算时不进行解压缩。
在一个可选的实现方式中,该阈值基于输入/输出(I/O)带宽的收益和功耗代价确定,该I/O带宽的收益用于指示该运算加速器对该计算结果进行压缩和解压缩处理所减少的I/O带宽,该功耗代价用于指示该运算加速器对该计算结果进行压缩和解压缩处理所增加的功耗。
第三方面,本申请实施例提供了一种运算加速处理系统,包括:
处理器,用于生成控制指令,该控制指令用于指示运算加速器是否对神经网络第i层输入数据进行计算后所得到的计算结果进行压缩和解压缩,其中,1≤i≤N,N为该神经网络的层数;
该运算加速器,用于对该神经网络第i层输入数据进行计算以得到该计算结果,并且获取该处理器生成的该控制指令,根据该控制指令确定是否实现对该计算结果进行压缩和解压缩。
在一个可选的实现方式中,该运算加速器包括:
运算电路,用于对该神经网络中第i层输入数据进行计算以得到该计算结果;控制器,用于根据获取的该控制指令控制压缩模块对该计算结果进行压缩,以及解压缩模块对该计算结果进行解压缩;该压缩模块,用于对该计算结果进行压缩;该解压缩模块,用于对该计算结果进行解压。
在一个可能的实现方式中,该运算加速处理系统还包括:
存储器,用于存储该处理器生成的该控制指令;对应地,该处理器,还用于将生成的该控制指令存储在该存储器;该运算加速器,还用于从该存储器中获取该控制指令。
第四方面,本申请实施例提供了一种压缩方法,该压缩方法应用于运算加速器,该运算加速器包括第一缓存和第二缓存,该方法包括:
对从该第一缓存中获取的第一输入数据和从该第二缓存中获取的权重数据进行矩阵乘运算以得到计算结果;将该计算结果进行压缩以得到压缩数据;将该压缩数据存入该运算加速器之外的存储器。
在一个可选的实现方式中,该压缩方法还包括:
从该存储器中获取该压缩数据;对该压缩数据进行解压,并将解压后的数据作为第二输入数据存入该第一缓存;对从该第一缓存中获取的第二输入数据进行矩阵乘运算。
在一个可选的实现方式中,该压缩方法包括:
获取控制指令,该控制指令用于指示是否对该计算结果进行压缩和解压缩;解析该控制指令;该将该计算结果进行压缩以得到压缩数据包括:在该控制指令指示对该计算结果 进行压缩时,将该计算结果进行压缩以得到该压缩数据。
对应地,该对该压缩数据进行解压包括:在该控制指令指示对该计算结果进行解压缩时,对该压缩数据进行解压。
在一个可选的实现方式中,该压缩方法还包括:
在该控制指令指示对该计算结果不进行压缩和解压缩时,将该计算结果存储在该存储器,以及从该存储器中获取该计算结果存入该第一缓存。
第五方面,本申请实施例提供了一种压缩方法,该压缩方法应用于运算加速器,该运算加速器包括第一缓存和第二缓存,该压缩方法包括:
对从该第一缓存中获取的第一输入数据和从该第二缓存中获取的权重数据进行矩阵乘运算以得到计算结果;获取控制指令,该控制指令用于指示是否对该计算结果进行压缩和解压缩;在该控制指令指示对该计算结果进行压缩时,对该计算结果进行压缩以得到压缩数据,并且将该压缩数据存储在该运算加速器之外的存储器;在该控制指令指示对该计算结果不进行压缩时,将该计算结果存储在该运算加速器之外的存储器。
在一个可选的实现方式中,该压缩方法还包括:
在该控制指令指示对该计算结果进行解压缩时,将从该存储器中获取的该压缩数据进行解压,并且对解压后的数据作为第二输入数据进行矩阵乘运算;在该控制指令指示该计算结果不进行解压缩时,对从该存储器中获取的该计算结果作为第二输入数据进行矩阵乘运算。
第六方面,本申请实施例提供了一种处理方法,应用于处理装置,包括:
根据神经网络中第i层输入数据的稀疏率,确定运算加速器是否对第i层输入数据进行计算后所得到的计算结果进行压缩和解压缩,其中,1≤i≤N,N为该神经网络的层数,该运算加速器为处理装置之外的协处理器;生成控制指令,该控制指令用于指示该运算加速器是否对该计算结果进行压缩和解压缩。
在一个可选的实现方式中,根据神经网络中第i层输入数据的稀疏率,确定该运算加速器是否对第i层输入数据进行计算后所得到的计算结果进行压缩和解压缩包括:
在该神经网络第i层输入数据的稀疏率大于阈值时,确定该运算加速器对该计算结果进行压缩,以及在将该计算结果作为第i+1层输入数据进行第i+1层计算时进行解压缩;在该神经网络第i层输入数据的稀疏率不大于阈值时,确定该运算加速器对该计算结果不进行压缩,以及在将该计算结果作为第i+1层输入数据进行第i+1层计算时不进行解压缩。
在一个可选的实现方式中,该阈值基于输入/输出(I/O)带宽的收益和功耗代价确定,所述I/O带宽的收益用于指示该运算加速器对该计算结果进行压缩和解压缩处理所减少的I/O带宽,该功耗代价用于指示所述运算加速器对该计算结果进行压缩和解压缩处理所增加的功耗。
本申请实施例还提供一种处理装置,所述处理装置包括:存储器,用于存储指令;处理器,用于读取该存储器中的指令并执行上述第六方面或第六方面各种可能的处理方法。
本申请实施例还提供一种计算机存储介质,该存储介质中存储软件程序,该软件程序在被一个或多个处理器读取并执行上述第六方面或第六方面各种可能的处理方法。
本申请实施例还提供一种包含指令的计算机程序产品,当其在计算机上运行时,使得 计算机执行上述第六方面或第六方面各种可能的处理方法。
附图说明
为了更清楚地说明本申请实施例或背景技术中的技术方案,下面将对本申请实施例或背景技术中所需要使用的附图进行说明。
图1为本申请提供的一种运算加速器的结构图;
图2为本申请实施例提供的一种运算加速器的结构图;
图3为本申请又一个实施例提供的一种运算加速器的结构图;
图4为本申请又一个实施例提供的一种运算加速器的结构图;
图5为本申请实施例提供的一种应用于运算加速器中的压缩模块的结构图;
图6为本申请实施例提供的又一种应用于运算加速器中的压缩模块的结构图;
图7为本申请实施例提供的一种控制运算加速器进行压缩的方法流程图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行描述。
本申请实施例提供的运算加速器可以应用于机器学习、深度学习以及卷积神经网络等领域,也可以应用到数字图像处理和数字信号处理等领域,还可以应用在其他涉及矩阵乘法运算的领域。本申请中,运算加速器可以是神经网络处理器(Neural Network Processing Unit,NPU)或者其他处理器,可以应用到手机、平板电脑、服务器、可穿戴设备等可执行卷积运算的设备中。
首先对本申请中涉及的几个术语进行解释:
输入数据,可以是初始输入给运算加速器进行推理运算的原始数据,比如图片数据、语音数据等,也可以是运算加速器在执行神经网络运算过程中所产生的中间数据,由于中间数据的数据量通常较大,因此运算加速器会将神经网络上一层计算得到的中间数据存入外部储存器,在执行神经网络下一层计算时再从存储器中读取该中间数据并加载到运算加速器中进行计算;
权重数据,是指对神经网络进行训练后得到的权重数据,神经网络的训练过程就是不断的对权重值进行调整的过程;
计算结果,是指运算加速器在执行神经网络运算过程中所产生的中间数据或最终数据,可以是运算加速器中运算单元运算后输出的数据,也可以是向量计算单元对运算单元输出的数据进行再次运算后得到的数据。需要说明的是,计算结果也是一种输入数据,神经网络上一层的计算结果往往作为输入数据参与到神经网络下一层计算;
数据的稀疏率,通常是指数据集中数值缺失或数值为0的数据占总体数据的比例。
图1是本申请提供的一种运算加速器的硬件结构图。运算加速器30作为协处理器挂载到主中央处理器(Host CPU)10上,由主CPU10分配任务。运算加速器30的核心部分为运算电路303,控制器304控制运算电路303提取输入缓存(Input Buffer)301或权重缓存 (Weight Buffer)302中的数据并进行运算。
在一些实现中,运算电路303内部包括多个处理引擎(Process Engine,PE)。在一些实现中,运算电路303是二维脉动阵列。运算电路303还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路303是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重缓存302中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入缓存301中取矩阵A相应的数据与矩阵B相应的数据进行矩阵乘运算,得到矩阵的部分结果或最终结果,保存在累加器308accumulator中。
向量计算单元307可以对运算电路303的输出做进一步处理,如向量乘,向量加,指数运算,对数运算和大小比较等处理。例如,向量计算单元307具体可以用于卷积神经网络中非卷积/非FC层的网络计算,如池化(pooling),批归一化(batch normalization),局部响应归一化(local response normalization)等。
在一些实现中,向量计算单元307将经处理过的输出的向量存储到统一缓存器306中。例如,向量计算单元307可以将非线性函数应用到运算电路303的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元307生成归一化的值、合并值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路303的激活输入,例如用于在神经网络中的后续层中的使用。
统一缓存(Unified Buffer)306用于存放输出的计算结果和某些层的输入数据。
直接存储器访问控制器(Direct Memory Access Controller,DMAC)305用于将运算加速器30之外的存储器20中的输入数据(或称输入矩阵)存入输入缓存301和统一缓存306,将权重数据(或称权重矩阵)存入权重缓存302中、或者将统一缓存306中的数据存入存储器20。
总线接口单元(Bus Interface Unit,BIU)310,用于通过总线在主CPU10、DMAC305和取指缓存(Instruction Fetch Buffer)309之间进行交互。
与控制器304连接的取指缓存(instruction fetch buffer)309,用于存储控制器304使用的指令;
控制器304,用于调用取指缓存309中缓存的指令,实现控制该运算加速器30的工作过程。
一般地,统一缓存306,输入缓存301,权重缓存302以及取指缓存309均为片上缓存(On-Chip Buffer),存储器20为该运算加速器30外部的存储器,该存储器20可以为双倍数据率同步动态随机存储器(Double Data Rate Synchronous Dynamic Random Access Memory,简称DDR SDRAM)、高带宽存储器(High Bandwidth Memory,HBM)或其他可读可写的存储器。
在本申请中,输入缓存是第一缓存,权重缓存是第二缓存,取指缓存是第三缓存,统一缓存是第四缓存。
上述运算加速器在实现卷积和FC运算时,由于运算所涉及到的权重数据的数据量较大,无法全部保存在权重缓存中,因此该运算加速器在执行运算的过程中,需要实时的从 存储器中导入权重数据来进行计算,并且该运算加速器在执行神经网络的上一层的运算之后所得到的计算结果的数据量也较大,难以保存在统一缓存中,需要将上一层的计算结果导出到存储器,在运算加速器需要执行该神经网络的下一层计算时,再从存储器中导入上一层的计算结果作为输入数据来进行运算。
导出和导入运算结果都将占用该运算加速器的输入/输出(I/O)带宽,如果I/O带宽成为瓶颈,将导致该运算加速器的计算功能空置,降低该运算加速器的运算性能。
图2是本申请实施例提供的一种运算加速器40的硬件结构图,该运算加速器40和图1提供的运算加速器30相比,主要增加了压缩模块311和解压缩模块312。
输入缓存301存储输入数据,权重缓存302存储权重数据,运算电路303将从输入缓存301中获取的输入数据和从权重缓存302中获取的权重数据进行矩阵乘运算以得到计算结果,该计算结果可以是中间结果或最终结果,该计算结果被保存在累加器308中,向量计算单元307可以从累加器308中取出计算结果做进一步处理,比如向量乘、向量加、指数运算、对数运算和大小比较等,并且向量计算单元307将经过处理后的计算结果存储到统一缓存306。
压缩模块311从统一缓存器306中获取计算结果,并且对计算结果进行压缩以得到压缩数据,再由DMAC305将压缩模块311输出的压缩数据存入存储器20。
进一步,由于该压缩数据是该运算加速器40在对神经网络的某层进行计算后的计算结果,该压缩数据可以作为输入数据参与到该运算加速器40的下一次计算,因此,该运算加速器40中还包括解压缩模块312,用于通过DMAC305从存储器20中获取该压缩数据,对该压缩数据进行解压以获得解压后的数据,并将解压后的数据作为输入数据存入输入缓存301中。运算电路303将从输入缓存301中获取的输入数据和从权重缓存302中获取的权重数据进行矩阵乘运算。
由上可知,由于在该运算加速器中增加了压缩模块和解压缩模块,降低了从该运算加速器中搬运计算结果到存储器的数据量,以及降低了从存储器中搬运计算结果到该运算加速器中进行下一次计算的数据量,节省该运算加速器的I/O带宽,提升该运算加速器的计算性能。
神经网络中第一层数据(即初始的输入数据)在运算加速器中运算后得到的计算结果作为第二层的输入数据,之后都是上一层输出的计算结果作为下一层的输入数据,直至做完最后一层(全连接层)运算后得到最终结果。由于第一层数据的稀疏率通常较低,对第一层数据进行压缩会带来较小的I/O带宽收益,同时还造成启动压缩功能所带来的功耗损失,导致压缩收益较低,然而随着神经网络计算层数的深入,不断出现的修正线性单元(Rectified Linear Unit,Relu)(又称激活函数)会逐渐提高计算结果的稀疏率,较高的稀疏率可以提高I/O带宽收益,因此,运算加速器在计算神经网络到一定层级时再启动压缩功能,可以实现运算加速器的压缩收益最大化。
基于上述考虑,图3为本申请实施例提供的一种运算加速器50的结构,在该运算加速器50中,控制器504和压缩模块311、解压缩模块312、统一缓存306、DMAC305、取指 缓存309连接,取指缓存309从存储器20中获取控制指令并存储该控制指令,该控制指令用于指示该运算加速器50是否对神经网络中每层运算后的计算结果进行压缩,以及指示该运算加速器50是否对从存储器20中获取的计算结果进行解压缩,控制器504从取指缓存中读取控制指令以实现对运算加速器中相关组件的控制。
具体地,控制器504从取指缓存中获取该控制指令,解析该控制指令,在该控制指令指示对计算结果进行压缩时,则控制压缩模块311对从统一缓存306中获取的计算结果进行压缩以得到压缩后的计算结果,由DMAC305将压缩后的计算结果搬运到存储器20;在该控制指令指示对计算结果不进行压缩时,则控制统一缓存306将计算结果发送给DMAC305,由DMAC305将计算结果搬运到存储器20,此时计算结果未经过压缩模块的压缩处理。由上可知,在该控制指令指示对计算结果进行压缩时,存储器20中存储的是压缩后的计算结果,由于该压缩后的计算结果还会作为输入数据参与到运算加速器50对神经网络的下一层计算中,因此,控制器504还需控制解压缩模块312进行解压缩处理。
具体地,上述控制指令除了指示该运算加速器50是否对神经网络中每层运算后的计算结果进行压缩,还用于指示该运算加速器50是否对从存储器20中获取的计算结果进行解压缩。控制器504从取指缓存309中获取该控制指令,解析该控制指令,在该控制指令指示对计算结果进行解压缩时,则控制解压缩模块312对获取的计算结果进行解压缩处理,并由解压缩模块312将解压后的数据作为输入数据存入输入缓存301;在该控制指令指示对计算结果不进行解压缩时,则控制DMAC305直接将计算结果作为输入数据存入输入缓存301,此时计算结果未经过解压缩模块312的解压缩处理。
如下将结合图3进一步描述存储器20中存储的控制指令如何生成。
图3除了提供运算加速器50的结构之外,也给出了主CPU10的结构,该主CPU10包括软件实现的加速库和编译模块,其中,该加速库可以包含多个组件,用以完成不同的加速优化操作,比如对数据进行量化的量化(Quantization)模块,支持稀疏计算架构的稀疏(Sparsity)模块等。编译模块用于生成指令,以控制运算加速器完成计算操作。除了加速库和编译模块之外,主CPU中还可以包括驱动和任务调度模块(图4中未示出),通过驱动和任务调度模块实现主CPU与运算加速器的连接。
在本申请提供的实施例中,该加速库还包括判断模块,该判断模块用于分析训练后所形成的神经网络的特征(该训练后所形成的神经网络即为运算加速器执行推理运算的神经网络),比如通过算法分析或实测数据,然后根据分析得出的神经网络的特征推断该神经网络过程中每层输入数据的稀疏率,根据每层输入数据的稀疏率确定是否对该层的计算结果进行压缩和解压缩,并且将每层计算结果是否进行压缩和解压缩的信息发送给编译模块,由编译模块来生成具体的控制指令。
具体地,判断模块将神经网络中第i层输入数据的稀疏率和预设的阈值进行比较,在该神经网络第i层输入数据的稀疏率大于阈值时,确定第i层计算结果需要进行压缩,以及在将i层计算结果作为第i+1层输入数据进行第i+1层计算时需要解压缩;在该神经网络中第i层计算结果的稀疏率不大于阈值时,确定第i层计算结果不需要进行压缩,以及在将第i层计算结果作为第i+1层输入数据进行第i+1层计算时不需要解压缩,其中,1≤i≤N, N为该神经网络的层数。
上述阈值可以根据I/O带宽的收益和功耗代价确定,其中,I/O带宽的收益是指运算加速器对计算结果进行压缩和解压缩处理所减少的I/O带宽,该功耗代价是指运算加速器对计算结果进行压缩和解压缩处理所增加的功耗。
具体地,该阈值可以预先确定,例如,在预先的测试中,当输入数据的稀疏率等于临界值时,运算加速器开启压缩和解压缩所带来的I/O带宽的收益等于功耗代价,则可以将该临界值作为上述阈值,当然实际实现中,考虑到希望I/O带宽的收益更多,可以对该临界值做些调整以确定阈值,本申请对于上述阈值的确定方法不做限定。
需要说明的是,针对不同的神经网络模型,上述预设的阈值可以不同。
编译模块,用于对从判断模块获取到的每层计算结果是否进行压缩和解压缩的信息进行指令译码以得到上述控制指令,即该控制指令用于指示该运算加速器是否对神经网络中每层运算后的计算结果进行压缩和解压缩。
由上可知,针对神经网络中稀疏率较低的输入数据,如果运算加速器启动压缩和解压缩,会带来较小的I/O带宽收益,同时还造成因为启动压缩和解压缩功能所带来的功耗损失,压缩收益低。本申请实施例中,主CPU在判断神经网络中某层输入数据的稀疏率较大时,才控制运算加速器对该层的计算结果进行压缩和解压缩,由于此时I/O带宽收益较大,可以抵消部分因为启动压缩和解压缩功能所带来的功耗损失,提高了压缩收益。
在图3所述的运算加速器中,由控制器来对取指缓存中存储的控制指令进行解析,根据解析的结果对压缩模块、统一缓存、解压缩模块和DMAC做出不同的控制操作,在另外一种可实现的方式中,控制器也可以不执行对控制指令的解析,将控制指令的解析交由运算加速器中压缩模块和解压缩模块处理。
图4为本申请一个实施例提供的运算加速器60的结构,在该运算加速器60中,取指缓存309从存储器20中获取控制指令并存储该控制指令,控制器304将该控制指令分配给压缩模块611和解压缩模块612。
压缩模块611解析该控制指令,在该控制指令指示对计算结果进行压缩时,对该计算结果进行压缩以得到压缩后的计算结果,由DMAC305将压缩后的计算结果搬运到存储器;在控制指令指示对该计算结果不进行压缩时,直接将该计算结果发送给DMAC305,由DMAC305将该计算结果搬运到存储器20。
同样地,解压缩模块612解析该控制指令,在该控制指令指示对从存储器20中获取的计算结果进行解压缩时,对获取的计算结果进行解压缩处理,并且将解压缩后的数据作为输入数据存入输入缓存301;在该控制指令指示对从存储器20中获取的计算结果不进行解压缩时,直接将获取的计算结果作为输入数据存入输入缓存301。
下面将描述上述实施例中提到的运算加速器中压缩模块311如何实现压缩的功能。
为了提高压缩模块311的压缩效率,通常在设计压缩模块311时,在压缩模块311中引入一个分片模块3110,用于对压缩模块311接收的计算结果进行分片处理,针对每个片分别进行压缩处理。
具体地,图5为本申请提供的一种压缩模块311的结构图,该压缩模块311包括分片模块3110和至少一个压缩引擎3111,分片模块3110,用于对接收的计算结果进行分片处理以得到至少一个子计算结果,每个压缩引擎3111,用于对其中的一个子计算结果进行压缩以得到子压缩数据,其中,该至少一个压缩引擎中每个压缩引擎生成的子压缩数据的总和组成了压缩模块311输出的压缩数据。
需要说明的是,本申请对于压缩引擎3111中采用的压缩算法不做限制,业界常用的压缩算法有熵编码、游程编码等,不同压缩算法有各自的适用场景,例如,压缩引擎可以根据数据中的0值进行压缩。由于压缩模块是运行在运算加速器40的硬件逻辑电路上,因此,压缩算法的选择需要考虑硬件的资源、功耗和性能等,本申请对于压缩引擎采用哪种压缩算法不做限定。另外,图3中示例的将计算结果分成了四个子计算结果,并且将每个子计算结果分给一个对应的压缩引擎进行压缩处理,即共有四个压缩引擎,实际实现中,压缩引擎的个数可以由单个压缩引擎的性能和运算加速器的性能需求等决定,本申请对于压缩引擎的个数不做限制,并且压缩引擎的个数和子计算结果的个数可以不一一对应,比如,一个压缩引擎可以处理两个或两个以上的子计算结果。
进一步,由于压缩引擎3111对待压缩数据进行压缩后可能出现压缩后的数据的大小大于待压缩数据的大小,此时如果将压缩后的数据搬运到存储器存储不利于减少运算加速器的I/O带宽,因此,为了进一步减少运算加速器的I/O带宽,针对压缩引擎3111可以做进一步设计。
具体地,每个压缩引擎3111对接收的子计算结果进行压缩以得到子压缩结果,比较该子压缩结果和该子计算结果的大小,在该子压缩结果大于该子计算结果时,将该子计算结果作为输出的子压缩数据,并生成一个与该子压缩数据对应的压缩失败的标识,由于此时输出的子压缩数据为没有经过压缩处理的数据,因此压缩失败;在该子压缩结果不大于该子计算结果时,将该子压缩结果作为输出的子压缩数据,并生成一个与该子压缩数据对应的压缩成功的标识,由于此时输出的子压缩数据为经过压缩处理的数据,因此压缩成功,其中,该压缩失败的标识和该压缩成功的标识通过DMAC存入存储器。
需要说明的是,针对该子压缩结果的大小等于该子计算结果的大小的场景,可以有两种实现方式:第一种实现方式,如上所述,压缩引擎3111可以将该子压缩结果作为该子压缩数据输出给DMAC,这虽然导致在下次计算中增加了对该子压缩数据的解压缩过程,但是可以避免压缩引擎3111再次读取该子计算结果然后将该子计算结果输出给DMAC,节省读取该子计算结果所带来的功耗;第二种实现方式,压缩引擎3111也可以将该子计算结果直接作为该子压缩数据输出给DMAC,此时该子压缩数据为没有经过压缩处理的数据,虽然压缩引擎3111会再次读取该子计算结果,但是在下次计算中可以避免对该子压缩数据的解压缩过程,节省解压缩该子压缩数据所带来的功耗。因此,针对该子压缩结果的大小等于该子计算结果的大小的场景,上述两种实现方式互有优劣,在实际中,可以根据运算加速器所需达到的性能要求作出合适的选择。
由上可知,压缩模块中压缩引擎在压缩过程中会出现压缩失败和压缩成功的情形,因此,DMAC305在将存储器中存储的子压缩数据搬运到输入缓存的过程中,增加了解压缩模块312,解压缩模块312中可以包括多个解压缩引擎,每个解压缩引擎对接收的子压缩 数据分别进行解压缩处理,具体地,每个解压缩引擎在获取子压缩数据之后,对接收的子压缩数据对应的标识进行识别,在该子压缩数据对应的标识为压缩失败的标识时,直接将该子压缩数据作为输入数据存储在输入缓存,即不进行解压缩处理;在该子压缩数据对应的标识为压缩成功的标识时,对该子压缩数据进行解压缩处理以得到解压后的数据,并且将解压后的数据作为输入数据存储在输入缓存。
需要说明的是,解压缩引擎的个数由单个解压缩引擎的解压缩性能和运算加速器的性能需求等决定,同时需要兼顾解压缩过程和压缩过程的耦合性;另外,解压缩引擎中采用的解压缩算法是和压缩模块中压缩引擎所采用的压缩算法相对应的算法。
下面将描述上述实施例中提到的运算加速器中压缩模块611如何实现压缩的功能。图6为本申请另一个实施例提供的压缩模块611的结构图,该压缩模块611与图5所述的压缩模块311相比,主要是增加了解析模块610。
解析模块610用于解析控制指令,在该控制指令指示对计算结果进行压缩时,将计算结果提供给分片模块3110进行分片,然后由压缩引擎3111对分片后的子计算结果进行压缩并发送给DMAC305,最后由DMAC305将压缩后的计算结果搬运到存储器20,这部分实现和上述图5中关于压缩模块311的实现相同,此处不再进行具体描述;在控制指令指示对该计算结果不进行压缩时,直接将该计算结果发送给DMAC305,由DMAC305将该计算结果搬运到存储器20。
对应地,解压缩模块612中也存在一个解析模块,用于解析控制指令,在该控制指令指示对从存储器20中获取的计算结果进行解压时,对获取的计算结果进行解压处理,并且将解压缩后的数据作为输入数据存入输入缓存301;在该控制指令指示对从存储器20中获取的计算结果不进行解压缩时,直接将获取的计算结果作为输入数据存入输入缓存301。解压缩模块612的结构图本申请不再给出。
结合上述实施例中运算加速器的硬件结构图,本申请实施例提供了一种控制运算加速器进行压缩的方法,如图7所示,该方法可以包括如下步骤S701~S709,其中,S701~S704由主CPU执行,S705~S709由运算加速器执行。
S701、判断是否对神经网络中每层计算结果进行压缩和解压缩。
CPU分析训练后所形成的神经网络的特征(该训练后所形成的神经网络即为运算加速器执行推理运算的神经网络),比如通过算法分析或实测数据,然后根据分析得出的神经网络的特征推断该神经网络过程中每层输入数据的稀疏率,根据每层输入数据的稀疏率确定是否对该层的计算结果进行压缩和解压缩,具体地,可以通过将每层输入数据的稀疏率和阈值进行比较来实现,由于具体的实现方法在上述实施例中已做了详细描述,此处不再赘述。
S702、生成控制指令。
CPU内部的编译模块根据每层计算结果是否进行压缩和解压缩的信息进行指令译码以生成控制指令,该控制指令用于指示运算加速器是否对神经网络中每层运算后的计算结果进行压缩和解压缩。
S703、将控制指令存储在存储器中。
CPU将生成的控制指令存储在运算加速器外部的存储器中。
S704、将控制指令置于运算加速器中的取指缓存中。
CPU控制将存储在存储器中的控制指令搬运到运算加速器中的取指缓存。
S705、读取取指缓存中的控制指令。
运算加速器从取指缓存中读取控制指令。
S706、对神经网络中每层进行计算得到计算结果。
运算加速器对神经网络中每层的输入数据进行计算并得到计算结果,具体地,可以由运算加速器中的运算电路来执行运算。
S707、根据控制指令确定是否对计算结果进行压缩和解压缩处理。
运算加速器解析该控制指令，根据控制指令确定是否对神经网络中每层的计算结果进行压缩和解压缩处理，在确定进行压缩和解压缩处理时，执行S708，在确定不进行压缩和解压缩处理时，执行S709。
S708、对计算结果进行压缩,将压缩后的计算结果存储在存储器,并且在下次计算中对从存储器中获取的计算结果进行解压缩以得到输入数据。
运算加速器在确定需要对计算结果进行压缩时,对计算结果进行压缩,并且将压缩后的计算结果存储在存储器,在执行神经网络的下一层计算时,运算加速器从存储器中获取计算结果,并且对获取的计算结果进行解压缩处理以得到解压后的数据,将解压后的数据作为输入数据参与到神经网络的下一层计算中。
S709、将计算结果存储在存储器,并且在下次计算中将从该存储器中读取的计算结果作为输入数据。
运算加速器在确定无需对计算结果进行压缩时,直接将计算结果存储在存储器,在执行神经网络的下一层计算时,运算加速器从存储器中获取计算结果,并且将获取的计算结果作为输入数据参与到神经网络的下一层计算中,此时运算加速器获取的计算结果无需经过解压缩处理。
由上可知,针对神经网络中稀疏率较低的输入数据,如果运算加速器启动压缩和解压缩,压缩收益低。本申请实施例中,CPU在判断神经网络中某层输入数据的稀疏率较大时,才控制运算加速器对该层的计算结果进行压缩和解压缩,由于此时I/O带宽收益较大,提高了压缩收益。
上述实施例,可以全部或部分地通过软件、硬件、固件或其他任意组合来实现。当使用软件实现时,上述实施例可以全部或部分地以计算机程序产品的形式实现。上述计算机程序产品包括一个或多个计算机指令。在运算加速器上加载或执行上述计算机程序指令时,全部或部分地产生按照本申请实施例上述的流程或功能。上述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集合的服务器、数据中心等数据存储设备。上述可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质。半导体介质可以是固态硬盘(solid state Drive,SSD)。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以权利要求的保护范围为准。

Claims (24)

  1. 一种运算加速器,其特征在于,包括:
    第一缓存,用于存储第一输入数据;
    第二缓存,用于存储权重数据;
    与所述第一缓存和所述第二缓存连接的运算电路,用于对所述第一输入数据和所述权重数据进行矩阵乘运算以得到计算结果;
    压缩模块,用于对所述计算结果进行压缩以得到压缩数据;
    与所述压缩模块连接的直接存储器访问控制器DMAC,用于将所述压缩数据存入到所述运算加速器之外的存储器。
  2. 如权利要求1所述的运算加速器,其特征在于,还包括:
    与所述DMAC和所述第一缓存连接的解压缩模块,用于接收由所述DMAC从所述存储器中获取的所述压缩数据,对所述压缩数据进行解压,并将解压后的数据作为第二输入数据存入所述第一缓存;
    所述运算电路,还用于从所述第一缓存中获取所述第二输入数据以进行矩阵乘运算。
  3. 如权利要求2所述的运算加速器,其特征在于,还包括:
    第三缓存,用于存储控制指令,所述控制指令用于指示是否对所述计算结果进行压缩和解压缩;
    与所述第三缓存连接的控制器,用于从所述第三缓存中获取所述控制指令,并且解析所述控制指令,在所述控制指令指示对所述计算结果进行压缩和解压缩时,控制所述压缩模块对所述计算结果进行压缩以得到所述压缩数据,以及控制所述解压缩模块对获取的所述压缩数据进行解压缩。
  4. 如权利要求3所述的运算加速器,其特征在于,还包括:
    与所述运算电路连接的第四缓存,用于存储所述运算电路计算的所述计算结果;
    所述控制器,还用于在所述控制指令指示对所述计算结果不进行压缩和解压缩时,控制所述DMAC将所述第四缓存中的所述计算结果存入所述存储器,以及控制所述DMAC将所述存储器中的所述计算结果存入所述第一缓存。
  5. 如权利要求2所述的运算加速器,其特征在于,还包括:
    第三缓存,用于存储控制指令,所述控制指令用于指示是否对所述计算结果进行压缩和解压缩;
    与所述第三缓存连接的控制器,用于从所述第三缓存中获取所述控制指令,并且将所述控制指令分发给所述压缩模块和所述解压缩模块;
    所述压缩模块,用于解析所述控制指令,在所述控制指令指示对所述计算结果进行压缩时,对所述计算结果进行压缩以得到所述压缩数据;
    所述解压缩模块,用于解析所述控制指令,在所述控制指令指示对所述计算结果进行 解压缩时,对获取的所述压缩数据进行解压缩。
  6. 如权利要求5所述的运算加速器,其特征在于,
    所述压缩模块,还用于在所述控制指令指示对所述计算结果不进行压缩时,控制所述DMAC将所述计算结果存入所述存储器;
    所述解压缩模块,还用于在所述控制指令指示对所述计算结果不进行解压缩时,控制所述DMAC将所述存储器中的所述计算结果存入所述第一缓存。
  7. 如权利要求2-5任一所述的运算加速器,其特征在于,所述压缩模块包括分片模块和至少一个压缩引擎,
    所述分片模块,用于对所述计算结果进行分片处理以得到至少一个子计算结果;
    所述至少一个压缩引擎中每个压缩引擎,用于对所述至少一个子计算结果中的一个子计算结果进行压缩以得到子压缩数据,其中,所述至少一个压缩引擎中每个压缩引擎生成的子压缩数据的总和组成所述压缩数据。
  8. 如权利要求7所述的运算加速器,其特征在于,所述至少一个压缩引擎中每个压缩引擎,具体用于:
    对所述子计算结果进行压缩以得到子压缩结果;
    比较所述子压缩结果和所述子计算结果的大小;
    在所述子压缩结果大于所述子计算结果时,将所述子计算结果作为所述子压缩数据;
    在所述子压缩结果不大于所述子计算结果时,将所述子压缩结果作为所述子压缩数据。
  9. 如权利要求8所述的运算加速器,其特征在于,所述至少一个压缩引擎中每个压缩引擎还用于:
    在所述子压缩结果大于所述子计算结果时,生成一个与所述子压缩数据对应的压缩失败的标识,其中,所述压缩失败的标识经由所述DMAC进行控制以存入所述存储器;
    在所述子压缩结果不大于所述子计算结果时,生成一个与所述子压缩数据对应的压缩成功的标识,其中,所述压缩成功的标识经由所述DMAC进行控制以存入所述存储器。
  10. 如权利要求9所述的运算加速器,其特征在于,所述解压缩模块具体用于:
    接收由所述DMAC从所述存储器中获取的所述子压缩数据;
    在所述子压缩数据对应的标识为压缩失败的标识时,将所述子压缩数据作为所述第二输入数据存入所述第一缓存;
    在所述子压缩数据对应的标识为压缩成功的标识时,对所述子压缩数据进行解压,并将解压后的数据作为所述第二输入数据存入所述第一缓存。
  11. 一种处理装置,其特征在于,包括:
    判断模块,用于根据神经网络中第i层输入数据的稀疏率,确定运算加速器是否对第i 层输入数据进行计算后所得到的计算结果进行压缩和解压缩,其中,1≤i≤N,N为所述神经网络的层数,所述运算加速器为处理装置之外的协处理器;
    编译模块,用于根据所述判断模块的确定结果生成控制指令,所述控制指令用于指示所述运算加速器是否对所述计算结果进行压缩和解压缩。
  12. 如权利要求11所述的处理装置,其特征在于,所述判断模块,具体用于:
    在所述神经网络第i层输入数据的稀疏率大于阈值时,确定所述运算加速器对所述计算结果进行压缩,以及在将所述计算结果作为第i+1层输入数据进行第i+1层计算时进行解压缩;
    在所述神经网络第i层输入数据的稀疏率不大于阈值时,确定所述运算加速器对所述计算结果不进行压缩,以及在将所述计算结果作为第i+1层输入数据进行第i+1层计算时不进行解压缩。
  13. 如权利要求12所述的处理器,其特征在于,所述阈值基于输入/输出(I/O)带宽的收益和功耗代价确定,所述I/O带宽的收益用于指示所述运算加速器对所述计算结果进行压缩和解压缩处理所减少的I/O带宽,所述功耗代价用于指示所述运算加速器对所述计算结果进行压缩和解压缩处理所增加的功耗。
  14. 一种压缩方法,其特征在于,所述压缩方法应用于运算加速器,所述运算加速器包括第一缓存和第二缓存,所述方法包括:
    对从所述第一缓存中获取的第一输入数据和从所述第二缓存中获取的权重数据进行矩阵乘运算以得到计算结果;
    将所述计算结果进行压缩以得到压缩数据;
    将所述压缩数据存入所述运算加速器之外的存储器。
  15. 如权利要求14所述的压缩方法,其特征在于,还包括:
    从所述存储器中获取所述压缩数据;
    对所述压缩数据进行解压,并将解压后的数据作为第二输入数据存入所述第一缓存;
    对从所述第一缓存中获取的第二输入数据进行矩阵乘运算。
  16. 如权利要求15所述的压缩方法,其特征在于,所述方法包括:
    获取控制指令,所述控制指令用于指示是否对所述计算结果进行压缩和解压缩;
    解析所述控制指令;
    所述将所述计算结果进行压缩以得到压缩数据包括:
    在所述控制指令指示对所述计算结果进行压缩时,将所述计算结果进行压缩以得到所述压缩数据。
    所述对所述压缩数据进行解压包括:
    在所述控制指令指示对所述计算结果进行解压缩时,对所述压缩数据进行解压。
  17. 如权利要求16所述的压缩方法,其特征在于,所述方法还包括:
    在所述控制指令指示对所述计算结果不进行压缩和解压缩时,将所述计算结果存储在所述存储器,以及从所述存储器中获取所述计算结果存入所述第一缓存。
  18. 一种压缩方法,其特征在于,所述压缩方法应用于运算加速器,所述运算加速器包括第一缓存和第二缓存,所述方法包括:
    对从所述第一缓存中获取的第一输入数据和从所述第二缓存中获取的权重数据进行矩阵乘运算以得到计算结果;
    获取控制指令,所述控制指令用于指示是否对所述计算结果进行压缩和解压缩;
    在所述控制指令指示对所述计算结果进行压缩时,对所述计算结果进行压缩以得到压缩数据,并且将所述压缩数据存储在所述运算加速器之外的存储器;
    在所述控制指令指示对所述计算结果不进行压缩时,将所述计算结果存储在所述运算加速器之外的存储器。
  19. 如权利要求18所述的方法,其特征在于,所述方法还包括:
    在所述控制指令指示对所述计算结果进行解压缩时,将从所述存储器中获取的所述压缩数据进行解压,并且对解压后的数据作为第二输入数据进行矩阵乘运算;
    在所述控制指令指示所述计算结果不进行解压缩时,对从所述存储器中获取的所述计算结果作为第二输入数据进行矩阵乘运算。
  20. 一种处理方法,其特征在于,应用于处理装置,包括:
    根据神经网络中第i层输入数据的稀疏率,确定运算加速器是否对第i层输入数据进行计算后所得到的计算结果进行压缩和解压缩,其中,1≤i≤N,N为所述神经网络的层数,所述运算加速器为处理装置之外的协处理器;
    生成控制指令,所述控制指令用于指示所述运算加速器是否对所述计算结果进行压缩和解压缩。
  21. 如权利要求20所述的方法,其特征在于,所述根据神经网络中第i层输入数据的稀疏率,确定运算加速器是否对第i层输入数据进行计算后所得到的计算结果进行压缩和解压缩包括:
    在所述神经网络第i层输入数据的稀疏率大于阈值时,确定所述运算加速器对所述计算结果进行压缩,以及在将所述计算结果作为第i+1层输入数据进行第i+1层计算时进行解压缩;
    在所述神经网络第i层输入数据的稀疏率不大于阈值时,确定所述运算加速器对所述计算结果不进行压缩,以及在将所述计算结果作为第i+1层输入数据进行第i+1层计算时不进行解压缩。
  22. 如权利要求21所述的方法,其特征在于,所述阈值基于输入/输出(I/O)带宽的 收益和功耗代价确定,所述I/O带宽的收益用于指示所述运算加速器对所述计算结果进行压缩和解压缩处理所减少的I/O带宽,所述功耗代价用于指示所述运算加速器对所述计算结果进行压缩和解压缩处理所增加的功耗。
  23. 一种处理装置,其特征在于,包括:
    存储器,用于存储指令;
    处理器,用于读取所述存储器中的指令并执行权利要求20至权利要求22中任一项所述的处理方法。
  24. 一种计算机存储介质,其特征在于,所述存储介质中存储软件程序,该软件程序在被一个或多个处理器读取并执行时实现权利要求20至权利要求22中任一项所述的处理方法。
PCT/CN2018/109117 2018-09-30 2018-09-30 运算加速器和压缩方法 WO2020062252A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
PCT/CN2018/109117 WO2020062252A1 (zh) 2018-09-30 2018-09-30 运算加速器和压缩方法
CN201880098124.4A CN112771546A (zh) 2018-09-30 2018-09-30 运算加速器和压缩方法
EP18935203.2A EP3852015A4 (en) 2018-09-30 2018-09-30 OPERATIONAL ACCELERATOR AND COMPRESSION PROCESS
US17/216,476 US11960421B2 (en) 2018-09-30 2021-03-29 Operation accelerator and compression method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/109117 WO2020062252A1 (zh) 2018-09-30 2018-09-30 运算加速器和压缩方法

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/216,476 Continuation US11960421B2 (en) 2018-09-30 2021-03-29 Operation accelerator and compression method

Publications (1)

Publication Number Publication Date
WO2020062252A1 true WO2020062252A1 (zh) 2020-04-02

Family

ID=69949799

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/109117 WO2020062252A1 (zh) 2018-09-30 2018-09-30 运算加速器和压缩方法

Country Status (4)

Country Link
US (1) US11960421B2 (zh)
EP (1) EP3852015A4 (zh)
CN (1) CN112771546A (zh)
WO (1) WO2020062252A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112559043A (zh) * 2020-12-23 2021-03-26 苏州易行电子科技有限公司 一种轻量级人工智能加速模块
KR102383962B1 (ko) * 2020-11-19 2022-04-07 한국전자기술연구원 가변 데이터 압축/복원기를 포함하는 딥러닝 가속 장치

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11507831B2 (en) * 2020-02-24 2022-11-22 Stmicroelectronics International N.V. Pooling unit for deep learning acceleration

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1947107A (zh) * 2004-01-30 2007-04-11 英飞凌科技股份公司 用于在存储器间传输数据的装置
CN106954002A (zh) * 2016-01-07 2017-07-14 深圳市汇顶科技股份有限公司 一种指纹数据的压缩方法及装置
CN107341544A (zh) * 2017-06-30 2017-11-10 清华大学 一种基于可分割阵列的可重构加速器及其实现方法
CN108416434A (zh) * 2018-02-07 2018-08-17 复旦大学 针对神经网络的卷积层与全连接层进行加速的电路结构

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007226615A (ja) * 2006-02-24 2007-09-06 Matsushita Electric Ind Co Ltd 情報処理装置、圧縮プログラム生成方法及び情報処理システム
US9153230B2 (en) * 2012-10-23 2015-10-06 Google Inc. Mobile speech recognition hardware accelerator
US10540588B2 (en) * 2015-06-29 2020-01-21 Microsoft Technology Licensing, Llc Deep neural network processing on hardware accelerators with stacked memory
US10733505B2 (en) * 2016-11-10 2020-08-04 Google Llc Performing kernel striding in hardware
US10096134B2 (en) * 2017-02-01 2018-10-09 Nvidia Corporation Data compaction and memory bandwidth reduction for sparse neural networks
US20190228037A1 (en) * 2017-08-19 2019-07-25 Wave Computing, Inc. Checkpointing data flow graph computation for machine learning
US10846363B2 (en) * 2018-11-19 2020-11-24 Microsoft Technology Licensing, Llc Compression-encoding scheduled inputs for matrix computations

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1947107A (zh) * 2004-01-30 2007-04-11 英飞凌科技股份公司 用于在存储器间传输数据的装置
CN106954002A (zh) * 2016-01-07 2017-07-14 深圳市汇顶科技股份有限公司 一种指纹数据的压缩方法及装置
CN107341544A (zh) * 2017-06-30 2017-11-10 清华大学 一种基于可分割阵列的可重构加速器及其实现方法
CN108416434A (zh) * 2018-02-07 2018-08-17 复旦大学 针对神经网络的卷积层与全连接层进行加速的电路结构

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102383962B1 (ko) * 2020-11-19 2022-04-07 한국전자기술연구원 가변 데이터 압축/복원기를 포함하는 딥러닝 가속 장치
WO2022107929A1 (ko) * 2020-11-19 2022-05-27 한국전자기술연구원 가변 데이터 압축/복원기를 포함하는 딥러닝 가속 장치
CN112559043A (zh) * 2020-12-23 2021-03-26 苏州易行电子科技有限公司 一种轻量级人工智能加速模块

Also Published As

Publication number Publication date
US20210216483A1 (en) 2021-07-15
US11960421B2 (en) 2024-04-16
EP3852015A4 (en) 2021-09-01
CN112771546A (zh) 2021-05-07
EP3852015A1 (en) 2021-07-21

Similar Documents

Publication Publication Date Title
US11960421B2 (en) Operation accelerator and compression method
WO2019127838A1 (zh) 卷积神经网络实现方法及装置、终端、存储介质
WO2020073211A1 (zh) 运算加速器、处理方法及相关设备
Fox et al. Training deep neural networks in low-precision with high accuracy using FPGAs
CN110546611A (zh) 通过跳过处理操作来减少神经网络处理器中的功耗
WO2023236365A1 (zh) 数据处理方法、装置、ai芯片、电子设备及存储介质
US11570477B2 (en) Data preprocessing and data augmentation in frequency domain
CN113570033B (zh) 神经网络处理单元、神经网络的处理方法及其装置
US11960467B2 (en) Data storage method, data obtaining method, and apparatus
WO2023124428A1 (zh) 芯片、加速卡以及电子设备、数据处理方法
WO2023284745A1 (zh) 一种数据处理方法、系统及相关设备
WO2022246986A1 (zh) 数据处理方法、装置、设备及计算机可读存储介质
CN105068875A (zh) 一种智能数据处理方法及装置
US20200242467A1 (en) Calculation method and calculation device for sparse neural network, electronic device, computer readable storage medium, and computer program product
US20230088915A1 (en) Method and apparatus for sequence processing
CN113641674B (zh) 一种自适应全局序号发生方法和装置
WO2021179117A1 (zh) 神经网络通道数搜索方法和装置
US20220101100A1 (en) Load distribution for a distributed neural network
CN111783958A (zh) 一种数据处理系统、方法、装置和存储介质
US11726544B2 (en) Dynamic agent for multiple operators optimization
CN113570034B (zh) 处理装置、神经网络的处理方法及其装置
CN113095211B (zh) 一种图像处理方法、系统及电子设备
WO2024066547A1 (zh) 数据压缩方法、装置、计算设备及存储系统
WO2024021827A1 (zh) 数据处理方法及装置
EP4310671A1 (en) Dynamic agent for multiple operators optimization

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18935203

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2018935203

Country of ref document: EP

Effective date: 20210415