WO2022257920A1 - Processing system, integrated circuit and board for optimizing parameters of a deep neural network - Google Patents

Processing system, integrated circuit and board for optimizing parameters of a deep neural network

Info

Publication number
WO2022257920A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
quantization
parameters
processing system
buffer
Prior art date
Application number
PCT/CN2022/097372
Other languages
English (en)
French (fr)
Inventor
喻歆
俞烨昊
王楠
赵彦君
邬领东
赵永威
庄毅敏
陈小兵
Original Assignee
中科寒武纪科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202110637685.0A, external-priority patent CN113238987B
Priority claimed from CN202110639079.2A, external-priority patent CN113238989A
Priority claimed from CN202110639072.0A, external-priority patent CN113238976B
Priority claimed from CN202110639078.8A, external-priority patent CN113238988B
Priority claimed from CN202110637698.8A, external-priority patent CN113238975A
Application filed by 中科寒武纪科技股份有限公司
Publication of WO2022257920A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit

Definitions

  • the present invention relates generally to the field of neural networks. More specifically, the present invention relates to a processing system, an integrated circuit and a board for optimizing parameters of a deep neural network.
  • the deep neural network model tends to be complex, and some models include hundreds of layers of operators, which makes the calculation amount increase rapidly.
  • Quantization refers to the conversion of weights and activation values represented by high-precision floating-point numbers to approximate representations with low-precision integers. Its advantages include low memory bandwidth, low power consumption, low computing resource usage, and low model storage requirements.
  • Quantization is currently a commonly used method to reduce the amount of data, but quantization operations still lack hardware support.
  • the solution of the present invention provides a processing system, an integrated circuit and a board for optimizing parameters of a deep neural network.
  • the present invention discloses a processing system for optimizing parameters of a deep neural network, including a near-data processing device and an acceleration device.
  • the near data processing device is used to store and quantize the raw data used to run the deep neural network, so as to generate quantized data.
  • the acceleration device is used for training the deep neural network based on the quantized data, so as to generate and quantize the training results.
  • the near data processing device updates the parameters based on the quantized training results, and the deep neural network then performs inference on image data based on the updated parameters.
  • the present invention discloses an integrated circuit device including the aforementioned elements, and also discloses a board including the aforementioned integrated circuit device.
  • the invention realizes quantization with online dynamic statistics, reduces unnecessary data access, achieves the technical effect of high-precision parameter updates, makes the neural network model more accurate and lighter, and quantizes the data directly on the memory side, suppressing errors caused by long-tail distribution data.
  • Fig. 1 is a structural diagram showing the board of an embodiment of the present invention.
  • FIG. 2 is a structural diagram showing an integrated circuit device according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram showing the internal structure of a computing device according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram showing the internal structure of a processor core according to an embodiment of the present invention.
  • Fig. 5 is a schematic diagram showing a processor core intending to write data to a processor core of another cluster.
  • FIG. 6 is a schematic diagram showing hardware related to quantization operations according to an embodiment of the present invention.
  • FIG. 7 is a schematic diagram illustrating a statistical quantizer of an embodiment of the present invention.
  • FIG. 8 is a schematic diagram showing a cache controller and a cache array according to an embodiment of the present invention.
  • Fig. 9 is a schematic diagram showing a near data processing device according to an embodiment of the present invention.
  • Figure 10 is a schematic diagram illustrating an optimizer of an embodiment of the present invention.
  • Fig. 11 is a flow chart showing a method for quantizing raw data according to another embodiment of the present invention.
  • the term “if” may be interpreted as “when” or “once” or “in response to determining” or “in response to detecting” depending on the context.
  • Deep learning has been proven to work well on tasks including image classification, object detection, natural language processing, and more.
  • a large number of applications today are equipped with image (computer vision) related deep learning algorithms.
  • Deep learning is generally implemented using neural network models. As model predictions become more accurate and networks become deeper, the memory capacity and memory bandwidth required to run neural networks are considerable, making it expensive for devices to become smart.
  • Quantization is one of the most widely used compression methods.
  • the so-called quantization refers to converting high-precision floating-point data (such as FP32) into low-precision fixed-point data (such as INT8).
  • High-precision floating-point numbers require more bits to represent, while low-precision fixed-point numbers can be fully represented with fewer bits; by reducing the number of data bits, the burden on smart devices can be effectively relieved.
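  • To make the idea concrete, the following is a minimal sketch (in Python/NumPy, not part of the patent) of symmetric linear quantization from FP32 to INT8; deriving the per-tensor scale from max(|x|) is an illustrative assumption, not the patent's exact scheme.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Quantize FP32 data to INT8 with a single per-tensor scale."""
    scale = np.abs(x).max() / 127.0          # map max |x| onto the INT8 limit
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP32 approximation of the original data."""
    return q.astype(np.float32) * scale

x = np.random.randn(1024).astype(np.float32)
q, s = quantize_int8(x)
print("max abs error:", np.abs(x - dequantize_int8(q, s)).max())
```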
  • FIG. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present invention.
  • the board card 10 includes a chip 101, which is a system-on-chip (SoC) integrated with one or more combined processing devices. The combined processing device is an artificial intelligence computing unit that can use quantization and optimization processing methods to support various deep learning and machine learning algorithms, meeting the intelligent processing needs of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining.
  • deep learning technology is widely used in the field of cloud intelligence.
  • a notable feature of cloud intelligence applications is the large amount of input data, which has high requirements for the storage capacity and computing power of the platform.
  • the board 10 of this embodiment is suitable for cloud intelligence applications, with large off-chip storage, large on-chip storage, and powerful computing capabilities.
  • the chip 101 is connected to an external device 103 through an external interface device 102 .
  • the external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card or a wifi interface, and the like.
  • the data to be processed can be transmitted to the chip 101 by the external device 103 through the external interface device 102 .
  • the calculation result of the chip 101 can be sent back to the external device 103 via the external interface device 102 .
  • the external interface device 102 may have different interface forms, such as a PCIe interface and the like.
  • the board 10 also includes a memory device 104 for storing data, which includes one or more memory elements 105 .
  • the storage device 104 is connected to the control device 106 and the chip 101 through a bus and transmits data with them.
  • the control device 106 in the board 10 is configured to regulate the state of the chip 101 .
  • the control device 106 may include a microcontroller (Micro Controller Unit, MCU).
  • FIG. 2 is a block diagram showing the combined processing means in the chip 101 of this embodiment.
  • the combined processing device 20 includes a computing device 201 , an interface device 202 , a processing device 203 and a near data processing device 204 .
  • the computing device 201 is configured to perform operations specified by the user, and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor for performing deep learning or machine learning calculations; it can interact with the processing device 203 through the interface device 202 to work together to complete user-specified operations.
  • the interface device 202 is used to transmit data and control instructions between the computing device 201 and the processing device 203 .
  • the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it into a storage device on the computing device 201 .
  • the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into the control cache on the chip of the computing device 201 .
  • the interface device 202 may also read data in the storage device of the computing device 201 and transmit it to the processing device 203 .
  • the processing device 203 performs basic control including but not limited to data transfer, starting and/or stopping the computing device 201 .
  • the processing device 203 may be one or more types of a central processing unit (central processing unit, CPU), a graphics processing unit (graphics processing unit, GPU) or other general-purpose and/or special-purpose processors.
  • exemplary processors include but are not limited to digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs.
  • the computing device 201 of the present invention can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when considering the integration of the computing device 201 and the processing device 203 together, they are considered to form a heterogeneous multi-core structure.
  • the near data processing device 204 is a memory with processing capability for storing data to be processed.
  • the size of the memory is usually 16G or larger, and is used for storing data of the computing device 201 and/or the processing device 203 .
  • FIG. 3 shows a schematic diagram of the internal structure of the computing device 201 .
  • the computing device 201 is used to process input data such as computer vision, speech, natural language, data mining, etc.
  • the computing device 201 in the figure adopts a multi-core layered structure design: the computing device 201 is a system on a chip that includes a plurality of clusters, and each cluster includes a plurality of processor cores.
  • in other words, the computing device 201 is organized in a three-level hierarchy: system-on-chip, cluster, and processor core.
  • the computing device 201 includes an external storage controller 301 , a peripheral communication module 302 , an on-chip interconnection module 303 , a synchronization module 304 and multiple clusters 305 .
  • the peripheral communication module 302 is used for receiving a control signal from the processing device 203 through the interface device 202 to start the computing device 201 to execute tasks.
  • the on-chip interconnection module 303 connects the external memory controller 301 , the peripheral communication module 302 and multiple clusters 305 to transmit data and control signals among the various modules.
  • the synchronization module 304 is a global synchronization barrier controller (global barrier controller, GBC), which is used to coordinate the work progress of each cluster and ensure the synchronization of information.
  • a plurality of clusters 305 are the computing cores of the computing device 201, four of which are exemplarily shown in the figure; with the development of hardware, the computing device 201 of the present invention may also include 8, 16, 64, or even more clusters 305. The clusters 305 are used to efficiently execute deep learning algorithms.
  • each cluster 305 includes a plurality of processor cores (IPU core) 306 and a storage core (MEM core) 307.
  • a number of processor cores 306 are exemplarily shown in the figure, and the present invention does not limit the number of processor cores 306. The internal architecture of a processor core is shown in Figure 4. Each processor core 306 includes three modules: a control module 41, an operation module 42 and a storage module 43.
  • the control module 41 is used to coordinate and control the work of the operation module 42 and the storage module 43 to complete the task of deep learning, which includes an instruction fetch unit (instruction fetch unit, IFU) 411 and an instruction decoding unit (instruction decode unit, IDU) 412.
  • the instruction fetching unit 411 is used to obtain instructions from the processing device 203 , and the instruction decoding unit 412 decodes the obtained instructions and sends the decoding results to the computing module 42 and the storage module 43 as control information.
  • the operation module 42 includes a vector operation unit 421 and a matrix operation unit 422 .
  • the vector operation unit 421 is used to perform vector operations, and can support complex operations such as vector multiplication, addition, and nonlinear transformation;
  • the matrix operation unit 422 is responsible for the core calculation of the deep learning algorithm, namely matrix multiplication and convolution.
  • the storage module 43 is used to store or transport related data, including a neuron cache element (neuron RAM, NRAM) 431, a weight cache element (weight RAM, WRAM) 432, an input/output direct memory access module (input/output direct memory access , IODMA) 433, moving direct memory access module (move direct memory access, MVDMA) 434.
  • NRAM 431 is used to store feature maps and intermediate results calculated by the processor core 306; WRAM 432 is used to store the weights of the deep learning network; IODMA 433 controls memory access between NRAM 431/WRAM 432 and the near data processing device 204 through the broadcast bus 309; MVDMA 434 is used to control memory access between NRAM 431/WRAM 432 and the SRAM 308.
  • the storage core 307 is mainly used for storage and communication, that is, to store shared data or intermediate results between the processor cores 306, and to execute the communication between the cluster 305 and the near data processing device 204, and the communication between the clusters 305 , communication between the processor cores 306 and the like.
  • the storage core 307 has a scalar operation capability, and is used for performing scalar operations.
  • the storage core 307 includes a shared cache element (SRAM) 308, a broadcast bus 309, a cluster direct memory access module (cluster direct memory access, CDMA) 310 and a global direct memory access module (global direct memory access, GDMA) 311.
  • the data multiplexed between different processor cores 306 in the same cluster 305 does not need to be obtained from the near data processing device 204 through the processor cores 306 respectively, but through the SRAM 308 Transferring between processor cores 306, the storage core 307 only needs to quickly distribute the multiplexed data from the SRAM 308 to multiple processor cores 306, so as to improve the communication efficiency between cores and greatly reduce on-chip and off-chip input/output access.
  • the broadcast bus 309, the CDMA 310 and the GDMA 311 are respectively used for communication between the processor cores 306, communication between the clusters 305, and data transmission between the cluster 305 and the near data processing device 204. They will be described separately below.
  • the broadcast bus 309 is used to complete high-speed communication among the processor cores 306 in the cluster 305 .
  • the broadcast bus 309 in this embodiment supports inter-core communication methods including unicast, multicast and broadcast.
  • Unicast refers to point-to-point data transmission (from a single processor core to a single processor core); multicast is a communication method that transmits a piece of data from the SRAM 308 to specific processor cores 306; and broadcast, which transmits a piece of data from the SRAM 308 to all processor cores 306, is a special case of multicast.
  • the CDMA 310 is used to control the memory access of the SRAM 308 between different clusters 305 in the same computing device 201.
  • FIG. 5 shows a schematic diagram of a processor core writing data to a processor core of another cluster, to illustrate the working principle of the CDMA 310.
  • the same computing device includes multiple clusters.
  • Cluster 0 and cluster 1 respectively include multiple processor cores.
  • for simplicity, cluster 0 is drawn with only processor core 0, and cluster 1 with only processor core 1.
  • Processor core 0 intends to write data to processor core 1.
  • first, processor core 0 sends a unicast write request to write the data into the local SRAM 0. CDMA 0 acts as the master end and CDMA 1 acts as the slave end. The master end pushes the write request to the slave end, that is, the master end sends the write address AW and the write data W, transferring the data to SRAM 1 of cluster 1; the slave end then sends a write response B in reply. Finally, processor core 1 of cluster 1 sends a unicast read request to read the data out of SRAM 1.
  • the GDMA 311 cooperates with the external storage controller 301 to control the memory access from the SRAM 308 of the cluster 305 to the near data processing device 204, or to read data from the near data processing device 204 into the SRAM 308.
  • the communication between the near data processing device 204 and the NRAM 431 or WRAM 432 can be realized through two channels.
  • the first channel is to directly connect the near data processing device 204 and the NRAM 431 or WRAM 432 through the IODMA 433; the second channel is to transmit data between the near data processing device 204 and the SRAM 308 through the GDMA 311, and then transfer the data between the SRAM 308 and the NRAM 431 or WRAM 432 through the MVDMA 434.
  • the bandwidth of the second channel is much larger than that of the first channel, so communication between the near data processing device 204 and the NRAM 431 or WRAM 432 may be more efficient through the second channel.
  • the embodiment of the present invention can select a data transmission channel according to its own hardware conditions.
  • the functionality of the GDMA 311 and the functionality of the IODMA 433 may be integrated in the same component.
  • the present invention regards GDMA 311 and IODMA 433 as different components for convenience of description, and for those skilled in the art, as long as the functions realized and the technical effects achieved are similar to those of the present invention, they belong to the protection scope of the present invention.
  • the function of GDMA 311, the function of IODMA 433, the function of CDMA 310, and the function of MVDMA 434 can also be realized by the same part.
  • This processing system can optimize the parameters of the deep neural network during the training process. It includes a near data processing device 204 and a computing device 201, wherein the near data processing device 204 is used to store and quantize the raw data used to run the deep neural network to generate quantized data, and the computing device 201 is an acceleration device for training the deep neural network based on the quantized data to generate and quantize the training results.
  • the near data processing device 204 updates parameters based on the quantized training results, and the computing device 201 runs the trained deep neural network for various types of data based on the updated parameters to obtain calculation results (prediction results).
  • the near data processing device 204 not only has storage capability but also basic computing capability. As shown in FIG. 6, it includes a memory 601, a statistical quantizer 602 and an optimizer 603.
  • Memory 601 can be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high bandwidth memory (HBM), hybrid memory cube (HMC), ROM, RAM, etc.
  • the input data required to run the deep neural network is stored in memory 601 .
  • the statistical quantizer 602 is used for quantizing the input data.
  • FIG. 7 shows a schematic diagram of the statistical quantizer 602 of this embodiment.
  • the statistical quantizer 602 includes a buffer element 701 , a statistical element 702 and a filtering element 703 .
  • the buffer element 701 is used for temporarily storing a plurality of input data from the memory 601 .
  • the input data here refers to the original data for training, such as weights, biases or other parameters used for training.
  • in the parameter updating phase, the input data refers to the training results, that is, the updated weights, biases or other parameters, from which the trained deep neural network model is obtained; the trained deep neural network model is then used for inference.
  • the buffer element 701 includes a plurality of buffer components; for the convenience of description, a first buffer component and a second buffer component are taken as an example.
  • a plurality of input data from the memory 601 are first temporarily stored in the first buffer component sequentially; when the space of the first buffer component is filled, the buffer element 701 switches, so that subsequent input data are temporarily stored sequentially in the second buffer component. While the input data are being temporarily stored in the second buffer component, the filtering element 703 reads the temporarily stored input data from the first buffer component. When the space of the second buffer component is full, the buffer element 701 switches again, and subsequent input data are temporarily stored in the first buffer component, overwriting the input data originally stored there.
  • each buffer component is 4KB.
  • the size of the buffer component in this embodiment is only an example, and the size can be planned according to actual conditions.
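  • As a behavioral illustration (not from the patent), the following Python sketch mimics this ping-pong scheme: a producer fills one buffer component while the previously filled one is drained by a consumer standing in for the filtering element 703. The buffer capacity and function interface are assumptions.

```python
BUF_WORDS = 1024  # illustrative capacity per buffer component (e.g. 4KB / 4-byte words)

def ping_pong(stream, consume):
    """Fill one buffer component while the other is drained by `consume`."""
    bufs = ([], [])
    active = 0                          # index of the component being filled
    for word in stream:
        bufs[active].append(word)
        if len(bufs[active]) == BUF_WORDS:
            active ^= 1                 # switch to the other component
            consume(bufs[active ^ 1])   # drain the component just filled
            bufs[active ^ 1].clear()    # its contents will be overwritten
    if bufs[active]:
        consume(bufs[active])           # flush any partial tail

# example: verify all words pass through exactly once
total = []
ping_pong(iter(range(5000)), total.extend)
assert len(total) == 5000
```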
  • the statistical component 702 is used for generating statistical parameters according to a plurality of input data from the memory 601 .
  • This embodiment quantizes based on statistical quantization methods, which have been widely used in deep neural networks and which need to calculate statistical parameters from the data to be quantized. Several statistical quantization methods are introduced below.
  • the first statistical quantization method was disclosed in "N. Wang, J. Choi, D. Brand, C. Chen, and K. Gopalakrishnan, 'Training deep neural networks with 8-bit floating point numbers,' in NeurIPS, 2018".
  • This method can quantize the input data into FP8 intermediate data, and the required statistical parameter is the maximum value of the absolute value of the input data, max(|x|).
  • the second statistical quantization method is disclosed in "Y. Yang, S. Wu, L. Deng, T. Yan, Y. Xie, and G. Li, 'Training high-performance and large-scale deep neural networks with full 8-bit integers,' Neural Networks, 2020".
  • This method can quantize the input data into INT8 intermediate data, and the required statistical parameter is the maximum value of the absolute value of the input data, max(|x|).
  • the third statistical quantization method is disclosed in "X. Zhang, S. Liu, R. Zhang, C. Liu, D. Huang, S. Zhou, J. Guo, Y. Kang, Q. Guo, Z. Du et al., 'Fixed-point back-propagation training,' in CVPR, 2020".
  • This method uses a dynamically selected data format to estimate the quantization error between INT8 and INT16 as needed to cover different distributions, and quantizes the input data into INT8 or INT16 intermediate data; its required statistical parameters are statistics of the input data x, such as the maximum value of its absolute value.
  • the fourth statistical quantization method is disclosed in "K. Zhong, T. Zhao, X. Ning, S. Zeng, K. Guo, Y. Wang, and H. Yang, 'Towards lower bit multiplication for convolutional neural network training,' arXiv preprint arXiv:2006.02804, 2020".
  • This method uses a shiftable fixed-point data format that encodes two data with different fixed-point ranges and an additional bit, thereby covering both representable range and resolution, and quantizes the input data into adjustable INT8 intermediate data; the required statistical parameter is the maximum value of the absolute value of the input data, max(|x|).
  • the fifth statistical quantization method is disclosed in "Zhu, R. Gong, F. Yu, X. Liu, Y. Wang, Z. Li, X. Yang, and J. Yan, 'Towards unified int8 training for convolutional neural network,' arXiv preprint arXiv:1912.12607, 2019".
  • This method clips the long-tail data in the multiple input data with minimal precision penalty, and then quantizes the input data into INT8 intermediate data; the required statistical parameters are statistics of the input data x, such as the cosine distance cos(x, x') between the input data and its quantized counterpart.
  • the statistical element 702 can be a processor with basic computing capability or an ASIC logic circuit, and is used to generate the statistical parameters required by the aforementioned statistical quantization methods, such as the maximum value of the absolute value max(|x|).
  • the statistical quantization methods need to perform global statistics on all input data before quantization to obtain the statistical parameters. Performing global statistics requires moving all of the input data, which can become a bottleneck in the training process.
  • the statistical element 702 of this embodiment is installed directly on the memory 601 side, not on the computing device 201 side, so that global statistics and quantization can be done locally in the memory; this eliminates the need to transfer all input data from the memory 601 to the computing device 201 and greatly relieves the pressure on hardware capacity and bandwidth.
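  • The statistical parameters involved can be illustrated with the following sketch (Python/NumPy, illustrative only): the maximum absolute value max(|x|), the cosine distance cos(x, x') and the vector distance between the raw data x and its quantized counterpart x', matching the parameters enumerated later in this disclosure. The function names are assumptions.

```python
import numpy as np

def max_abs(x: np.ndarray) -> float:
    """max(|x|): the scale-setting statistic used by several methods above."""
    return float(np.abs(x).max())

def cosine_distance(x: np.ndarray, xq: np.ndarray) -> float:
    """1 - cos(x, x') between raw data and its (dequantized) counterpart."""
    denom = np.linalg.norm(x) * np.linalg.norm(xq) + 1e-12
    return 1.0 - float(np.dot(x, xq) / denom)

def vector_distance(x: np.ndarray, xq: np.ndarray) -> float:
    """Euclidean distance between raw data and its counterpart."""
    return float(np.linalg.norm(x - xq))
```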
  • the filtering element 703 is used to read the input data one by one from the buffer component of the buffer element 701 according to the statistical parameters to generate the output data, wherein the output data is the quantized result of the input data, that is, the quantized data.
  • the filtering element 703 includes a plurality of quantization components 704 and an error multiplexing component 705.
  • each quantization component 704 receives input data from the buffer components of the buffer element 701 and quantizes the input data (also called raw data) based on a different quantization format; each quantization component 704 performs a different quantization operation according to the statistical parameters, such as max(|x|), to obtain different intermediate data.
  • Four quantization components 704 are shown in the figure, which means that the aforementioned various statistical quantization methods can be classified into four kinds of quantization operations, and each quantization component 704 performs one quantization operation.
  • the difference between these quantization operations lies in the amount of input data clipping; that is, each quantization format corresponds to a different clipping amount of the input data. For example, one quantization operation may use 95% of the data amount of all input data while another uses 60%, and these clipping amounts are determined by the aforementioned statistical quantization methods.
  • if other statistical quantization methods are adopted, the quantization components 704 need to be adjusted accordingly.
  • according to the statistical quantization method adopted, the filtering element 703 selects and executes one or more corresponding quantization components 704 to obtain quantized intermediate data.
  • for example, the first statistical quantization method only needs to use one quantization component 704 to perform one quantization operation, while the second statistical quantization method needs to use all the quantization components 704 to perform four kinds of quantization operations.
  • These quantization components 704 can execute their respective quantization format operations synchronously, or implement the quantization format operations of each quantization component 704 one by one in time division.
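  • A sketch of such clipping-based quantization components follows (illustrative Python; the percentile values are assumptions in the spirit of the 95%/60% example above). Each dictionary entry plays the role of one quantization component 704 producing its own intermediate data.

```python
import numpy as np

def quantize_clipped(x: np.ndarray, keep_fraction: float):
    """INT8 quantization that clips long-tail values beyond a percentile of |x|."""
    clip = np.percentile(np.abs(x), keep_fraction * 100.0)  # clipping threshold
    scale = clip / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

# four components, four clipping amounts (cf. the 95% / 60% example above)
FORMATS = {"clip100": 1.00, "clip95": 0.95, "clip80": 0.80, "clip60": 0.60}

x = np.random.randn(4096).astype(np.float32)
intermediates = {tag: quantize_clipped(x, f) for tag, f in FORMATS.items()}
```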
  • the error multiplexing component 705 is used to determine the corresponding error between each intermediate data and the input data, and to select one of the multiple intermediate data as the output data; that is, the quantized data is determined according to these errors.
  • the error multiplexing component 705 includes a plurality of error calculation units 706 , a selection unit 707 , a first multiplexing unit 708 and a second multiplexing unit 709 .
  • Multiple error calculation units 706 receive input data, intermediate data and statistical parameters, and calculate the error value between the input data and intermediate data.
  • each error calculation unit 706 corresponds to one quantization component 704: the intermediate data generated by a quantization component 704 is output to its corresponding error calculation unit 706, which calculates the error value between that intermediate data and the input data, i.e. the gap between them. The gap is evaluated using the statistical parameters from the statistical element 702, such as the cosine distance cos(x, x') between the input data x and the intermediate data x'.
  • the error calculation unit 706 will also generate a label for recording the quantization format of the corresponding quantization component 704 , that is, recording the quantization format according to which the error value is generated.
  • the selection unit 707 receives all the error values from the error calculation units 706, compares them, selects the smallest of these error values, and generates a control signal corresponding to the intermediate data with the smallest error value.
  • the first multiplexing unit 708 is used to output, according to the control signal, the intermediate data with the minimum error value as the output data; in other words, the control signal controls the first multiplexing unit 708 to output the intermediate data with the smallest error among the several quantization formats as the output data, i.e. the quantized data.
  • the second multiplexing unit 709 is used to output, according to the control signal, the label of the intermediate data with the minimum error value, that is, to record the quantization format of the output data (the quantized data).
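  • The selection logic can be summarized in the following sketch (illustrative Python): given candidate intermediate data per quantization format, it computes a cosine-based error for each and returns the minimum-error candidate together with a label recording its format, mirroring units 706 to 709. The interface, shaped like the dictionary from the previous sketch, is an assumption.

```python
import numpy as np

def select_quantization(x, intermediates):
    """intermediates: {label: (q, scale)} candidate quantizations of the same x."""
    best_label, best_err, best_q = None, np.inf, None
    for label, (q, scale) in intermediates.items():
        xq = q.astype(np.float32) * scale        # dequantized view of the candidate
        denom = np.linalg.norm(x) * np.linalg.norm(xq) + 1e-12
        err = 1.0 - np.dot(x, xq) / denom        # cosine-based error; smaller is better
        if err < best_err:
            best_label, best_err, best_q = label, err, q
    return best_q, best_label                    # output data plus its format label
```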
  • Arrows in FIG. 6 represent data streams.
  • unquantized data is represented by solid-line arrows and quantized data by dotted-line arrows. For example, the input data transmitted from the memory 601 to the statistical quantizer 602 is original unquantized data, so its data flow is indicated by solid arrows, while the output data from the statistical quantizer 602 is quantized data, so its data flow is indicated by dotted arrows.
  • the data flow of the label is omitted in the figure.
  • after quantization, calculation and selection by the statistical quantizer 602, the near data processing device 204 obtains the quantized data with the smallest error value as the output data, together with a label recording the quantization format of the output data.
  • the computing device 201 of this embodiment includes DMA, a cache controller 604 and a cache array.
  • DMA is the external storage controller 301 , responsible for controlling the data transfer between the computing device 201 and the near data processing device 204 , for example, moving the output data and tags of the near data processing device 204 to the cache array of the computing device 201 .
  • the cache array includes NRAM 431 and WRAM 432.
  • FIG. 8 shows a schematic diagram of the cache controller 604 and the cache array 801 .
  • the cache controller 604 is used to temporarily store the output data and tags sent by the external storage controller 301 , and control the output data and tags to be stored in appropriate locations in the cache array 801 .
  • the cache array 801 may be an existing or customized storage space, which includes a plurality of cache elements; these cache elements form an array in physical structure, and each cache element can be identified by a row and a column of the array. Further, the cache array 801 is controlled by the row selection element 802 and the column selection element 803.
  • when it is necessary to access the cache element in the i-th row and j-th column of the cache array 801, the external memory controller 301 sends a row selection signal and a column selection signal to the row selection element 802 and the column selection element 803, respectively. The row selection element 802 and the column selection element 803 enable the cache array 801 according to these signals, so that the quantization element 807 can read data from, or write data to, the cache element in the i-th row and j-th column of the cache array 801.
  • since the quantization format of each piece of quantized data is not necessarily the same, for the convenience of storage and management the data in the same row of the cache array 801 must share the same quantization format, while different rows may store data in different quantization formats.
  • the cache controller 604 includes tag registers 804, a quantized data cache element 805, a priority cache element 806 and quantization elements 807.
  • the tag register 804 is used to store a row tag, and the row tag records the quantization format of the row in the buffer array 801 .
  • the same row of the buffer array 801 stores data in the same quantization format, but not necessarily stores data in the same quantization format between rows.
  • the tag registers 804 are used to record the quantization format of each row. The number of tag registers 804 equals the number of rows of the cache array 801, and each tag register 804 corresponds to one row of the cache array 801; that is, the i-th tag register 804 records the quantization format of the i-th row of the cache array 801.
  • the quantized data cache component 805 includes a data cache component 808 and a tag cache component 809 .
  • the data cache component 808 is used for temporarily storing the quantized data sent from the external storage controller 301
  • the tag cache component 809 is used for temporarily storing the tags sent from the external storage controller 301 .
  • when quantized data is to be stored into the cache element in row i and column j of the cache array 801, the external storage controller 301 sends a priority tag to the priority cache element 806; the priority tag indicates the specific quantization format on which this access should be based. At the same time, the external storage controller 301 sends a row selection signal to the row selection element 802, and in response the row selection element 802 fetches the row tag of the i-th row and sends it to the priority cache element 806.
  • if the priority cache element 806 judges that the priority tag is consistent with the row tag, this access is processed in the quantization format of the i-th row, and the quantization element 807 ensures that the quantization format of the quantized data is consistent with that of the i-th row.
  • if they are inconsistent, the priority tag prevails; that is, this access is processed with the quantization format recorded by the priority tag. The quantization elements 807 must then not only ensure that the quantization format of the quantized data is consistent with the format recorded by the priority tag, but also adjust the quantization format of the data originally stored in the i-th row, so that the whole row uses the specific quantization format recorded in the priority tag.
  • in more detail, the priority cache element 806 first determines whether the tag of the quantized data is the same as the priority tag. If it is the same, the quantization format of the quantized data to be stored is consistent with that of the priority tag, and the quantized data does not need to be adjusted. The priority cache element 806 then further determines whether the row tag is the same as the priority tag; if it is also the same, the quantized data already stored in the i-th row does not need to be adjusted either.
  • in this case, the row selection element 802 opens the channel of the i-th row of the cache array 801, and the quantization element 807 of the j-th column stores the quantized data into the cache element in the i-th row and j-th column.
  • if instead the row tag differs from the priority tag, the priority cache element 806 controls all the quantization elements 807 to convert the quantization format of the quantized data in the i-th row into the quantization format of the priority tag; the row selection element 802 then opens the channel of the i-th row, and the quantization elements 807 store the format-adjusted quantized data back into the i-th row of cache elements.
  • if the priority cache element 806 judges that the tag of the quantized data differs from the priority tag, format conversion of the incoming quantized data is required. The priority cache element 806 again determines whether the row tag is the same as the priority tag. If it is the same, the quantized data stored in the i-th row does not need to be adjusted; only the quantized data from the external storage controller 301 needs format conversion, so the priority cache element 806 controls the quantization element 807 of the j-th column to convert that quantized data into the quantization format of the priority tag.
  • the row selection element 802 then opens the channel of the i-th row of the cache array 801, and the quantization element 807 of the j-th column stores the converted quantized data into the cache element in the i-th row and j-th column. If the row tag also differs from the priority tag, the priority cache element 806 controls all the quantization elements 807 to convert the quantization format of the quantized data in the i-th row into the quantization format of the priority tag, the row selection element 802 opens the channel of the i-th row, and the quantization elements 807 store the format-adjusted quantized data into the i-th row of cache elements. This decision flow is condensed in the sketch below.
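  • The following illustrative Python sketch condenses that decision flow; `convert` stands in for the format conversion performed by a quantization element 807 and, like the rest of the interface, is an assumption.

```python
def store(row, row_tag, j, data, data_tag, priority_tag, convert):
    """Store `data` into column j of a cache row, honoring the priority tag."""
    if data_tag != priority_tag:                 # incoming data needs conversion
        data = convert(data, data_tag, priority_tag)
    if row_tag != priority_tag:                  # resident row needs conversion
        for k in range(len(row)):                # all quantization elements act on the row
            row[k] = convert(row[k], row_tag, priority_tag)
        row_tag = priority_tag
    row[j] = data                                # row channel opened, write the element
    return row, row_tag

# usage with an identity converter, just to exercise the flow
same = lambda x, src, dst: x
row, tag = store([0] * 32, "clip95", j=3, data=7,
                 data_tag="clip95", priority_tag="clip95", convert=same)
```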
  • the cache array 801 includes M × N cache elements, that is, M rows and N columns. Assuming the length of the quantized data is fixed at S bits, the length of each cache element is also S bits, and the length of each row equals N × S bits. Since the cache array 801 has N columns, there are N quantization elements 807, each column corresponding to one quantization element 807.
  • in this embodiment, the cache array includes 8192 × 32 cache elements, that is, 8192 rows (row 0 to row 8191 in the figure) and 32 columns, with 32 corresponding quantization elements 807 (quantization element 0 to quantization element 31 in the figure); the length of the quantized data, the width of each quantization element 807 and the size of each cache element are all set to 8 bits, so the length of each row is 32 × 8 bits.
  • the cache controller 604 can store the quantized data into the preset cache element of the NRAM 431 or WRAM 432, and ensure that the quantized format of the quantized data is consistent with the quantized format stored in a specific row of the NRAM 431 or WRAM 432.
  • the data stored in the cache array (NRAM 431 and/or WRAM 432) has already been quantized; when vector operations need to be performed, the quantized data stored in the NRAM 431 is taken out and output to the vector operation unit 421 in the operation module 42 for vector operations.
  • when matrix operations need to be performed, the quantized data stored in the NRAM 431 and the weights stored in the WRAM 432 are taken out and output to the matrix operation unit 422 in the operation module 42 for matrix operations; the calculation results are stored back into the NRAM 431.
  • the calculation device 201 may include a calculation result cache element, and the calculation result generated by the operation module 42 is not stored back into the NRAM 431, but is stored in the calculation result cache element.
  • in the inference phase, the calculation result is the output after prediction. Since the calculation result is unquantized data, processing it directly would occupy too many resources, so further quantization is required. Therefore, the computing device 201 also includes a statistical quantizer 605, which has the same structure as the statistical quantizer 602 and is used to quantize the calculation result; the quantized calculation result is sent to the memory 601 via the external storage controller 301 for storage.
  • in the training phase, the calculation result is the gradient of the weights, and these gradients need to be sent back to the near data processing device 204 to update the parameters.
  • although the gradient is also unquantized data, the gradient cannot be quantized: once quantized, gradient information would be lost and could not be used to update the parameters.
  • the external memory controller 301 fetches the gradient directly from the NRAM 431 and sends it to the near data processing device 204.
  • FIG. 9 shows a more detailed schematic diagram of the near data processing device 204 .
  • the memory 601 includes a plurality of memory particles 901 and a parameter buffer 902.
  • the plurality of memory particles 901 are storage units of the memory 601 for storing the parameters required for running the neural network.
  • when any device wants to access the memory 601, it needs to move the data of the memory particles 901 through the parameter buffer 902, which reads and caches the parameters.
  • the parameters referred to here are values that can be continuously updated to optimize the neural network model when training the neural network, such as weights and biases.
  • the optimizer 603 is used to read the parameters from the parameter buffer 902 and update the parameters according to the training result (ie the aforementioned gradient) sent by the external memory controller 301 .
  • the near data processing device 204 also includes a constant register 903 , and the constant register 903 is used to store constants related to the neural network, such as hyperparameters, for the optimizer 603 to perform various operations based on these constants to update the parameters.
  • Hyperparameters are generally variables set based on the developer’s experience, and will not automatically update values along with training.
  • for example, the learning rate, the decay rate, the number of iterations, the number of layers of the neural network, and the number of neurons in each layer are all constants.
  • the optimizer 603 stores the updated parameters into the parameter buffer 902, and the parameter buffer 902 stores the updated parameters into the memory particles 901, so as to complete the update of the parameters.
  • the optimizer 603 may perform stochastic gradient descent (SGD).
  • the stochastic gradient descent method uses derivatives from calculus to find the direction in which the loss function decreases, or its lowest point (an extreme point), by evaluating the derivative of the function and adjusting the parameters according to the learning rate and the gradient.
  • the weight value is continuously adjusted through the stochastic gradient descent method, so that the value of the loss function becomes smaller and smaller, that is, the prediction error becomes smaller and smaller.
  • the formula of the stochastic gradient descent method is as follows:

    w_t = w_{t-1} - η · g

  • where w_{t-1} is the current weight value, η is the learning rate in the constants, g is the gradient, and w_t is the updated weight value; the subscript t-1 refers to the current stage, and the subscript t refers to the next stage after one training pass, that is, after one update.
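  • In code, this update is a single multiply-subtract (illustrative Python; `lr` plays the role of the learning-rate constant η):

```python
import numpy as np

def sgd_step(w: np.ndarray, g: np.ndarray, lr: float) -> np.ndarray:
    """One SGD update: w_t = w_{t-1} - η · g."""
    return w - lr * g

w = np.ones(4)
w = sgd_step(w, g=np.full(4, 0.5), lr=0.1)  # -> array of 0.95
```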
  • the optimizer 603 can also execute the AdaGrad algorithm according to the learning rate and gradient in the parameters and constants.
  • the idea of the AdaGrad algorithm is to adapt each parameter of the model independently, that is, the parameters with larger partial derivatives correspond to a larger learning rate, and the parameters with small partial derivatives correspond to a smaller learning rate.
  • the learning rate scales each parameter inversely proportional to the square root of the sum of its historical squared gradient values. Its formula is as follows:

    m_t = m_{t-1} + g²
    w_t = w_{t-1} - η · g / √m_t

  • where w_{t-1} and m_{t-1} are the parameters, η is the learning rate in the constants, g is the gradient, and w_t and m_t are the updated parameters; the subscript t-1 refers to the current stage, and the subscript t refers to the next stage after one training pass, that is, after one update.
  • the optimizer 603 can also execute the RMSProp algorithm according to the parameters, the learning rate in the constant, the decay rate in the constant, and the gradient.
  • the RMSProp algorithm uses exponential decay averaging to discard distant histories, enabling it to converge quickly after finding a "convex" structure.
  • the RMSProp algorithm also introduces a hyperparameter (the decay rate) to control the rate of decay. Its formula is as follows:

    m_t = β · m_{t-1} + (1 - β) · g²
    w_t = w_{t-1} - η · g / √m_t

  • where w_{t-1} and m_{t-1} are the parameters, η is the learning rate in the constants, β is the decay rate in the constants, g is the gradient, and w_t and m_t are the updated parameters; the subscript t-1 refers to the current stage, and the subscript t refers to the next stage after one training pass, that is, after one update.
  • the optimizer 603 can also execute the Adam algorithm according to the parameters, the learning rate in the constant, the decay rate in the constant, and the gradient.
  • the Adam algorithm goes a step further on the basis of the RMSProp algorithm.
  • in addition to the exponentially decaying average of the squared historical gradients, it also keeps an exponentially decaying average of the historical gradients themselves. Its formula is as follows:

    m_t = β₁ · m_{t-1} + (1 - β₁) · g
    v_t = β₂ · v_{t-1} + (1 - β₂) · g²
    w_t = w_{t-1} - η · (m_t / (1 - β₁^t)) / √(v_t / (1 - β₂^t))

  • where w_{t-1}, m_{t-1} and v_{t-1} are the parameters, η is the learning rate in the constants, β₁ and β₂ are the decay rates in the constants, g is the gradient, and w_t, m_t and v_t are the updated parameters; the subscript t-1 refers to the current stage, and the subscript t refers to the next stage after one training pass, that is, after one update; the superscript t indicates that t training passes have been performed, so that β^t means β raised to the power t.
  • FIG. 10 shows a schematic diagram of the optimizer 603 .
  • the optimizer 603 utilizes simple addition circuits, subtraction circuits, multiplication circuits and multiplexers to implement the aforementioned various algorithms. After summarizing the aforementioned various algorithms, the optimizer 603 needs to implement the following operations:
  • any of the aforementioned algorithms can update parameters based on these operations, but the constants used by each algorithm are different.
  • the configuration of its constants is as follows:
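  • Since the operation list and constant-configuration table appear as figures in the original disclosure, the following sketch (illustrative Python/NumPy, assuming the standard forms of these algorithms) shows how the three adaptive update rules reduce to the same small set of multiply/add operations with different constants; the small stabilizer `eps` is an added assumption for numerical safety.

```python
import numpy as np

def adagrad(w, m, g, lr, eps=1e-10):
    m = m + g * g                              # accumulated squared gradients
    return w - lr * g / (np.sqrt(m) + eps), m

def rmsprop(w, m, g, lr, beta, eps=1e-10):
    m = beta * m + (1 - beta) * g * g          # exponentially decayed average
    return w - lr * g / (np.sqrt(m) + eps), m

def adam(w, m, v, g, lr, beta1, beta2, t, eps=1e-10):
    m = beta1 * m + (1 - beta1) * g            # decayed average of gradients
    v = beta2 * v + (1 - beta2) * g * g        # decayed average of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction via β₁^t
    v_hat = v / (1 - beta2 ** t)               # bias correction via β₂^t
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```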
  • the optimizer 603 updates the parameter 1003 to the parameter 1004 according to the gradient 1001 and the constant 1002, and then stores the parameter 1004 into the parameter buffer 902.
  • in the training phase, the parameters are taken out of the memory 601, quantized by the statistical quantizer 602, and stored into the WRAM 432 under the control of the cache controller 604; the operation module 42 then performs forward propagation and back propagation to generate gradients, the gradients are sent to the optimizer 603, and the aforementioned algorithms are executed to update the parameters.
  • once the parameters have been adjusted, the deep neural network model is mature and can be used for prediction.
  • in the inference phase, neuron data (such as image data) and trained weights are taken out of the memory 601, quantized by the statistical quantizer 602, and stored into the NRAM 431 and WRAM 432 respectively under the control of the cache controller 604; the operation module 42 then performs the calculations, the calculation results are quantized by the statistical quantizer 605, and the final quantized calculation results (i.e. prediction results) are stored into the memory 601 to complete the prediction task of the neural network model.
  • the above embodiments propose a brand-new hybrid architecture, which includes an acceleration device and a near-data processing device.
  • based on a hardware-friendly quantization technique (HQT), statistical analysis and quantization are performed on the memory side.
  • this embodiment realizes quantization with online dynamic statistics, reduces unnecessary data access, achieves the technical effect of high-precision parameter updates, and makes the neural network model more accurate and lighter.
  • since this embodiment introduces a near-data processing device and the data is quantized at the memory side, errors caused by quantizing long-tail distribution data can also be directly suppressed.
  • FIG. 11 shows a flow chart of using the statistical quantizer in FIG. 7 to implement this method.
  • in step 1101, the original data is quantized based on different quantization formats to obtain corresponding intermediate data.
  • each quantization component 704 receives input data from the buffer components of the buffer element 701 and quantizes the input data (also called raw data) based on a different quantization format; each quantization component 704 performs a different quantization operation and obtains different intermediate data.
  • the statistical parameter may be at least one of the maximum value of the absolute value of the original data, the cosine distance between the original data and the corresponding intermediate data, and the vector distance between the original data and the corresponding intermediate data.
  • in step 1102, the error between the intermediate data and the original data is calculated.
  • Multiple error calculation units 706 receive input data, intermediate data and statistical parameters, and calculate the error value between the input data and intermediate data.
  • each error calculation unit 706 corresponds to one quantization component 704: the intermediate data generated by a quantization component 704 is output to its corresponding error calculation unit 706, which calculates the error value between that intermediate data and the input data, i.e. the gap between them. The gap is evaluated using the statistical parameters from the statistical element 702, such as the cosine distance cos(x, x') between the input data x and the intermediate data x'.
  • the error calculation unit 706 will also generate a label for recording the quantization format of the corresponding quantization component 704 , that is, recording the quantization format according to which the error value is generated.
  • in step 1103, the intermediate data with the minimum error value is identified.
  • the selection unit 707 receives all the error values from the error calculation units 706, compares them, identifies the smallest of these error values, and generates a control signal corresponding to the intermediate data with the smallest error value.
  • in step 1104, the intermediate data with the minimum error value is output as the quantized data.
  • the first multiplexing unit 708 is used to output, according to the control signal, the intermediate data with the minimum error value as the output data; in other words, the control signal controls the first multiplexing unit 708 to output the intermediate data with the smallest error among the several quantization formats as the output data, i.e. the quantized data.
  • the second multiplexing unit 709 is used to output, according to the control signal, the label of the intermediate data with the minimum error value, that is, to record the quantization format of the output data (the quantized data).
  • the electronic equipment or devices of the present invention may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, visual terminals, automatic driving terminals, vehicles, household appliances, and/or medical equipment.
  • Said vehicles include airplanes, ships and/or cars; said household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves and range hoods; said medical equipment includes nuclear magnetic resonance instruments, ultrasound instruments and/or electrocardiographs.
  • the electronic equipment or device of the present invention can also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and medical treatment. Further, the electronic device or device of the present invention can also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as cloud, edge, and terminal.
  • electronic devices or devices with high computing power according to the solution of the present invention can be applied to cloud devices (such as cloud servers), while electronic devices or devices with low power consumption can be applied to terminal devices and/or Edge devices (such as smartphones or cameras).
  • the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that according to the hardware information of the terminal device and/or the edge device, the hardware resources of the cloud device can be Match appropriate hardware resources to simulate the hardware resources of terminal devices and/or edge devices, so as to complete the unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-end integration.
  • the present invention expresses some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art can understand that the solution of the present invention is not limited by the order of the described actions . Therefore, according to the disclosure or teaching of the present invention, those skilled in the art can understand that some of the steps can be performed in other order or at the same time. Further, those skilled in the art can understand that the embodiments described in the present invention can be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily necessary for the realization of one or some solutions of the present invention. In addition, according to different schemes, the description of some embodiments of the present invention also has different emphases. In view of this, those skilled in the art may understand the parts not described in detail in a certain embodiment of the present invention, and may also refer to relevant descriptions of other embodiments.
  • a unit described as a separate component may or may not be physically separated, and a component shown as a unit may or may not be a physical unit.
  • the aforementioned components or units may be located at the same location or distributed over multiple network units.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present invention.
  • multiple units in this embodiment of the present invention may be integrated into one unit, or each unit exists physically independently.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits.
  • the physical realization of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
  • various devices such as computing devices or other processing devices described herein can be implemented by appropriate hardware processors, such as central processing units, GPUs, FPGAs, DSPs, and ASICs.
  • a statistical quantizer that quantizes a plurality of raw data, including:
  • a buffer element configured to temporarily store the plurality of raw data
  • a statistics element configured to generate statistical parameters according to the plurality of raw data
  • a quantization element configured to read the plurality of raw data one by one from the buffer element according to the statistical parameters, so as to generate quantized data.
  • A2 The statistical quantizer according to A1, wherein the buffer element includes a first buffer component and a second buffer component, the plurality of raw data are sequentially stored into the first buffer component, and when the space of the first buffer component is filled, storage switches to the second buffer component in sequence.
  • A3 The statistical quantizer according to A2, wherein when the plurality of raw data are sequentially stored into the second buffer component, the quantization element reads the plurality of raw data from the first buffer component.
  • a plurality of quantization components, each quantizing the raw data based on a different quantization format, the plurality of quantization components generating a plurality of intermediate data;
  • an error multiplexing component for selecting one of the plurality of intermediate data as the quantized data according to the errors between the plurality of intermediate data and the raw data.
  • A6 The statistical quantizer according to A5, wherein the statistical parameter is at least one of the maximum absolute value of the raw data, the cosine distance between the raw data and the corresponding intermediate data, and the vector distance between the raw data and the corresponding intermediate data.
  • an error calculation unit configured to calculate the errors between the plurality of intermediate data and the raw data
  • a selection unit for generating a control signal, wherein the control signal corresponds to the intermediate data with the minimum error value
  • a multiplexing unit configured to output, according to the control signal, the intermediate data with the minimum error value as the quantized data.
  • A9 The statistical quantizer according to A1, wherein the original data is neuron data or weights of a deep neural network.
  • a storage device comprising the statistical quantizer according to any one of A1 to A9.
  • a processing device comprising the statistical quantizer according to any one of A1 to A9.
  • a board comprising the storage device according to A10 and the processing device according to A11.
  • a cache controller connected to a direct memory access and a cache array, wherein one row of the cache array stores data of the same quantization format, and the cache controller includes a quantized-data cache element for temporarily storing the quantized data and a tag sent by the direct memory access, the tag recording the quantization format of the quantized data.
  • a specific-tag cache element configured to temporarily store a specific tag of a specific row of the cache array into which the quantized data is to be stored, the specific tag recording the quantization format of the specific row;
  • a quantization element configured to determine whether the tag is the same as the specific tag and, if not, adjust the quantization format of the quantized data to the quantization format of the specific row.
  • the cache controller according to B1, further comprising a tag buffer for storing a row tag, the row tag recording the quantization format of a row of the cache array.
  • B8 An integrated circuit device comprising the cache controller according to any one of B1 to B7.
  • a board comprising the integrated circuit device according to B8.
  • a memory for optimizing parameters of a deep neural network, comprising:
  • a plurality of memory particles for storing the parameters
  • a parameter buffer for reading and buffering the parameters from the plurality of memory particles
  • an optimizer configured to read the parameters from the parameter buffer and update the parameters according to a gradient
  • wherein inference of the deep neural network is performed on image data based on the updated parameters.
  • the memory according to C1, further comprising a constant buffer for storing constants, wherein the optimizer updates the parameters according to the constants.
  • C5 The memory of C4, wherein the optimizer performs stochastic gradient descent (SGD) based on the parameters, the learning rate among the constants, and the gradient to update the parameters.
  • C6 The memory of C4, wherein the optimizer performs the AdaGrad algorithm based on the parameters, the learning rate among the constants, and the gradient to update the parameters.
  • C7 The memory of C4, wherein the optimizer performs the RMSProp algorithm based on the parameters, the learning rate among the constants, the decay rate among the constants, and the gradient to update the parameters.
  • C8 The memory of C4, wherein the optimizer performs the Adam algorithm based on the parameters, the learning rate among the constants, the decay rate among the constants, and the gradient to update the parameters.
  • An integrated circuit device comprising the memory according to any one of C1 to C8.
  • a board comprising the integrated circuit device according to C9.
  • an element for quantizing raw data, including:
  • a plurality of quantization components configured to quantize the raw data based on different quantization formats to obtain corresponding intermediate data
  • an error multiplexing component configured to determine corresponding errors according to the intermediate data and the raw data, and to determine quantized data from the intermediate data according to the errors.
  • D2 The element of D1, wherein the plurality of quantization components quantize the raw data according to statistical parameters.
  • an error calculation unit configured to calculate the error
  • a selection unit for generating a control signal, wherein the control signal corresponds to the intermediate data with the minimum error value
  • a multiplexing unit configured to output, according to the control signal, the intermediate data with the minimum error value as the quantized data.
  • D5. An integrated circuit device comprising the element according to any one of D1 to D4.
  • a board comprising the integrated circuit device according to D5.
  • a method of quantizing raw data, comprising:
  • quantizing the raw data based on different quantization formats to obtain corresponding intermediate data; calculating errors between the intermediate data and the raw data; identifying the intermediate data with the minimum error value; and outputting the intermediate data with the minimum error value as quantized data.
  • a computer-readable storage medium on which computer program code for quantizing raw data is stored.
  • when the computer program code is run by a processing device, the method described in any one of D7 to D9 is executed.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Processing (AREA)

Abstract

The present invention relates to a device for optimizing the parameters of a deep neural network, wherein the device of the present invention is included in an integrated circuit apparatus that includes a universal interconnect interface and other processing devices. The computing device interacts with the other processing devices to jointly complete a computing operation specified by a user. The integrated circuit apparatus may further include a storage device, which is connected to the computing device and the other processing devices, respectively, and is used for data storage of the computing device and the other processing devices.

Description

Processing system, integrated circuit and board for optimizing parameters of a deep neural network
CROSS-REFERENCE TO RELATED APPLICATIONS
This disclosure claims priority to Chinese patent application No. 202110639078.8, filed on June 8, 2021, entitled "Processing system, integrated circuit and board for optimizing parameters of a deep neural network".
This disclosure claims priority to Chinese patent application No. 202110639079.2, filed on June 8, 2021, entitled "Apparatus, method and computer-readable storage medium for quantizing data".
This disclosure claims priority to Chinese patent application No. 202110637685.0, filed on June 8, 2021, entitled "Statistical quantizer for quantizing data, storage device, processing device and board".
This disclosure claims priority to Chinese patent application No. 202110639072.0, filed on June 8, 2021, entitled "Cache controller, integrated circuit device and board".
This disclosure claims priority to Chinese patent application No. 202110637698.8, filed on June 8, 2021, entitled "Memory, integrated circuit and board for optimizing parameters of a deep neural network".
TECHNICAL FIELD
The present invention relates generally to the field of neural networks, and more specifically to a processing system, an integrated circuit and a board for optimizing the parameters of a deep neural network.
BACKGROUND
With the spread and development of artificial intelligence technology, deep neural network models have tended to become complex; some models include operators spanning hundreds of layers, which makes the amount of computation rise sharply.
There are many ways to reduce the amount of computation, one of which is quantization. Quantization refers to converting weights and activation values represented as high-precision floating-point numbers into approximate representations using low-precision integers. Its advantages include low memory bandwidth, low power consumption, low computing-resource occupation and low model-storage requirements.
Quantization is currently a common way to reduce the amount of data, but the quantization operation still lacks hardware support. Most existing accelerators quantize data offline and therefore need a general-purpose processor to assist, which is inefficient.
Therefore, an energy-efficient quantization hardware is urgently needed.
SUMMARY
In order to at least partially solve the technical problems mentioned in the background, the solution of the present invention provides a processing system, an integrated circuit and a board for optimizing the parameters of a deep neural network.
In one aspect, the present invention discloses a processing system for optimizing the parameters of a deep neural network, including a near-data processing device and an acceleration device. The near-data processing device is used to store and quantize raw data running in the deep neural network to generate quantized data. The acceleration device is used to train the deep neural network based on the quantized data, so as to generate and quantize training results. The near-data processing device updates the parameters based on the quantized training results, and inference of the deep neural network is performed on image data based on the updated parameters.
In another aspect, the present invention discloses an integrated circuit device including the aforementioned elements, and also discloses a board including the aforementioned integrated circuit device.
The present invention achieves online, dynamically statistical quantization, reduces unnecessary data accesses, and attains the technical effect of high-precision parameter updating, making the neural network model more accurate and more lightweight. Moreover, since the data is quantized directly at the memory side, the error caused by quantizing long-tail-distributed data is suppressed.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other objects, features and advantages of exemplary embodiments of the present invention will become easy to understand by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present invention are shown in an exemplary rather than limiting manner, and identical or corresponding reference numbers indicate identical or corresponding parts, in which:
Fig. 1 is a structural diagram showing a board according to an embodiment of the present invention;
Fig. 2 is a structural diagram showing an integrated circuit device according to an embodiment of the present invention;
Fig. 3 is a schematic diagram showing the internal structure of a computing device according to an embodiment of the present invention;
Fig. 4 is a schematic diagram showing the internal structure of a processor core according to an embodiment of the present invention;
Fig. 5 is a schematic diagram showing one processor core writing data to a processor core of another cluster;
Fig. 6 is a schematic diagram showing the hardware related to quantization operations according to an embodiment of the present invention;
Fig. 7 is a schematic diagram showing the statistical quantizer of an embodiment of the present invention;
Fig. 8 is a schematic diagram showing the cache controller and the cache array of an embodiment of the present invention;
Fig. 9 is a schematic diagram showing the near-data processing device of an embodiment of the present invention;
Fig. 10 is a schematic diagram showing the optimizer of an embodiment of the present invention; and
Fig. 11 is a flowchart showing a method of quantizing raw data according to another embodiment of the present invention.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present invention.
It should be understood that the terms "first", "second", "third", "fourth" and the like in the claims, the description and the drawings of the present invention are used to distinguish different objects rather than to describe a specific order. The terms "include" and "comprise" used in the description and the claims of the present invention indicate the existence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the existence or addition of one or more other features, wholes, steps, operations, elements, components and/or collections thereof.
It should also be understood that the terms used in this description of the present invention are only for the purpose of describing specific embodiments and are not intended to limit the present invention. As used in the description and the claims of the present invention, unless the context clearly indicates otherwise, the singular forms "a", "an" and "the" are intended to include the plural forms. It should be further understood that the term "and/or" used in the description and the claims of the present invention refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in this description and the claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining" or "in response to detecting".
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Deep learning has proved to work well on tasks including image classification, object detection and natural language processing. A large number of applications today are equipped with image (computer-vision) related deep learning algorithms.
Deep learning is generally implemented with neural network models. As model predictions become more and more accurate and networks become deeper and deeper, the memory capacity and memory bandwidth required to run a neural network are considerable, making devices pay a high price to become intelligent.
In practice, developers reduce the network scale by compressing and encoding data, and quantization is one of the most widely adopted compression methods. Quantization refers to converting high-precision floating-point data (such as FP32) into low-precision fixed-point data (INT8). High-precision floating-point numbers need more bits to describe, while low-precision fixed-point numbers can be fully described with fewer bits. By reducing the number of bits of the data, the burden on an intelligent device can be effectively relieved.
Fig. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present invention. As shown in Fig. 1, the board 10 includes a chip 101, which is a system-on-chip (SoC) integrating one or more combined processing devices. The combined processing device is an artificial-intelligence computing unit that can use quantization-optimized processing to support various deep learning and machine learning algorithms and meet the intelligent-processing needs of complex scenarios in fields such as computer vision, speech, natural language processing and data mining. Deep learning technology is applied extensively in the field of cloud intelligence, and a notable feature of cloud intelligence applications is the large amount of input data, which places high requirements on the storage and computing capabilities of the platform. The board 10 of this embodiment is suitable for cloud intelligence applications, having huge off-chip storage, huge on-chip storage and powerful computing capability.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card or a wifi interface. Data to be processed can be transferred from the external device 103 to the chip 101 through the external interface device 102, and the computation results of the chip 101 can be transmitted back to the external device 103 via the external interface device 102. According to different application scenarios, the external interface device 102 may have different interface forms, such as a PCIe interface.
The board 10 also includes a storage device 104 for storing data, which includes one or more storage elements 105. The storage device 104 is connected to, and transfers data with, the control device 106 and the chip 101 through a bus. The control device 106 in the board 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a micro controller unit (MCU).
Fig. 2 is a structural diagram showing the combined processing device in the chip 101 of this embodiment. As shown in Fig. 2, the combined processing device 20 includes a computing device 201, an interface device 202, a processing device 203 and a near-data processing device 204.
The computing device 201 is configured to perform user-specified operations and is mainly implemented as a single-core or multi-core intelligent processor for performing deep learning or machine learning computations. It can interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.
The interface device 202 is used to transfer data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it into an on-chip storage device of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into an on-chip control cache of the computing device 201. Alternatively or optionally, the interface device 202 may also read data from the storage device of the computing device 201 and transmit it to the processing device 203.
The processing device 203, as a general-purpose processing device, performs basic control including, but not limited to, data movement and starting and/or stopping the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of processors among a central processing unit (CPU), a graphics processing unit (GPU) or other general-purpose and/or special-purpose processors, including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices and discrete hardware components, and their number may be determined according to actual needs. As mentioned above, the computing device 201 of the present invention considered alone may be regarded as having a single-core structure or a homogeneous multi-core structure; however, when the computing device 201 and the processing device 203 are considered together, the two are regarded as forming a heterogeneous multi-core structure.
The near-data processing device 204 is a memory with processing capability, used to store the data to be processed. Its size is typically 16 GB or larger, and it is used to save the data of the computing device 201 and/or the processing device 203.
Fig. 3 shows a schematic diagram of the internal structure of the computing device 201. The computing device 201 is used to process input data such as computer vision, speech, natural language and data mining. The computing device 201 in the figure adopts a multi-core hierarchical design: as a system on chip, it includes multiple clusters, each of which in turn includes multiple processor cores. In other words, the computing device 201 is organized in a system-on-chip, cluster, processor-core hierarchy.
At the system-on-chip level, as shown in Fig. 3, the computing device 201 includes an external storage controller 301, a peripheral communication module 302, an on-chip interconnect module 303, a synchronization module 304 and multiple clusters 305.
There may be multiple external storage controllers 301, two of which are shown in the figure by way of example. They respond to access requests issued by the processor cores and access external storage devices, such as the near-data processing device 204 in Fig. 2, so as to read data from off-chip or write data off-chip. The peripheral communication module 302 receives control signals from the processing device 203 through the interface device 202 and starts the computing device 201 to execute tasks. The on-chip interconnect module 303 connects the external storage controllers 301, the peripheral communication module 302 and the multiple clusters 305, and transfers data and control signals between the modules. The synchronization module 304 is a global barrier controller (GBC), used to coordinate the work progress of the clusters and ensure the synchronization of information. The multiple clusters 305 are the computing cores of the computing device 201; four are shown in the figure by way of example. With the development of hardware, the computing device 201 of the present invention may also include 8, 16, 64 or even more clusters 305. The clusters 305 are used to efficiently execute deep learning algorithms.
At the cluster level, as shown in Fig. 3, each cluster 305 includes multiple processor cores (IPU cores) 306 and one memory core (MEM core) 307.
Four processor cores 306 are shown in the figure by way of example; the present invention does not limit the number of processor cores 306. Their internal architecture is shown in Fig. 4. Each processor core 306 includes three main modules: a control module 41, an operation module 42 and a storage module 43.
The control module 41 coordinates and controls the work of the operation module 42 and the storage module 43 to complete deep learning tasks, and includes an instruction fetch unit (IFU) 411 and an instruction decode unit (IDU) 412. The instruction fetch unit 411 obtains instructions from the processing device 203, and the instruction decode unit 412 decodes the obtained instructions and sends the decoding results as control information to the operation module 42 and the storage module 43.
The operation module 42 includes a vector operation unit 421 and a matrix operation unit 422. The vector operation unit 421 performs vector operations and can support complex operations such as vector multiplication, addition and nonlinear transformation; the matrix operation unit 422 is responsible for the core computations of deep learning algorithms, namely matrix multiplication and convolution.
The storage module 43 is used to store or move related data, and includes a neuron cache element (neuron RAM, NRAM) 431, a weight cache element (weight RAM, WRAM) 432, an input/output direct memory access module (IODMA) 433 and a move direct memory access module (MVDMA) 434. The NRAM 431 stores the feature maps for the processor core 306 to compute and the intermediate results after computation; the WRAM 432 stores the weights of the deep learning network; the IODMA 433 controls the memory accesses between the NRAM 431/WRAM 432 and the near-data processing device 204 through a broadcast bus 309; and the MVDMA 434 controls the memory accesses between the NRAM 431/WRAM 432 and an SRAM 308.
Returning to Fig. 3, the memory core 307 is mainly used for storage and communication, i.e., storing shared data or intermediate results among the processor cores 306, and carrying out communication between the cluster 305 and the near-data processing device 204, communication among the clusters 305, communication among the processor cores 306, and so on. In other embodiments, the memory core 307 has scalar computing capability and performs scalar operations.
The memory core 307 includes a shared cache element (SRAM) 308, the broadcast bus 309, a cluster direct memory access module (CDMA) 310 and a global direct memory access module (GDMA) 311. The SRAM 308 plays the role of a high-performance data relay station: data reused among different processor cores 306 within the same cluster 305 does not need to be obtained individually by each processor core 306 from the near-data processing device 204, but is relayed among the processor cores 306 via the SRAM 308. The memory core 307 only needs to quickly distribute the reused data from the SRAM 308 to the multiple processor cores 306, which improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses.
The broadcast bus 309, the CDMA 310 and the GDMA 311 are used to perform communication among the processor cores 306, communication among the clusters 305, and data transfer between the cluster 305 and the near-data processing device 204, respectively. These are described separately below.
The broadcast bus 309 completes high-speed communication among the processor cores 306 within a cluster 305. The broadcast bus 309 of this embodiment supports inter-core communication modes including unicast, multicast and broadcast. Unicast refers to point-to-point (i.e., single-processor-core to single-processor-core) data transfer; multicast is a communication mode that transfers one piece of data from the SRAM 308 to certain specific processor cores 306; and broadcast is a communication mode that transfers one piece of data from the SRAM 308 to all processor cores 306, being a special case of multicast.
The CDMA 310 controls memory accesses of the SRAM 308 among different clusters 305 within the same computing device 201. Fig. 5 shows a schematic diagram of one processor core writing data to a processor core of another cluster, illustrating the working principle of the CDMA 310. In this application scenario, the same computing device includes multiple clusters; for convenience of description, only cluster 0 and cluster 1 are shown in the figure, each including multiple processor cores. Likewise for convenience, cluster 0 in the figure shows only processor core 0 and cluster 1 shows only processor core 1. Processor core 0 wants to write data to processor core 1.
First, processor core 0 sends a unicast write request to write the data into the local SRAM 0. CDMA 0 acts as the master and CDMA 1 acts as the slave. The master pushes the write request to the slave, i.e., the master sends the write address AW and the write data W to transfer the data to SRAM 1 of cluster 1; the slave then sends a write response B in reply; and finally processor core 1 of cluster 1 sends a unicast read request to read the data out of SRAM 1.
Returning to Fig. 3, the GDMA 311 cooperates with the external storage controller 301 to control memory accesses from the SRAM 308 of the cluster 305 to the near-data processing device 204, or to read data from the near-data processing device 204 into the SRAM 308. From the foregoing, communication between the near-data processing device 204 and the NRAM 431 or WRAM 432 can be achieved through two channels. The first channel directly connects the near-data processing device 204 with the NRAM 431 or WRAM 432 through the IODMA 433; the second channel first transfers data between the near-data processing device 204 and the SRAM 308 via the GDMA 311, and then transfers data between the SRAM 308 and the NRAM 431 or WRAM 432 via the MVDMA 434. Although on the surface the second channel requires more elements to participate and the data path is longer, in some embodiments the bandwidth of the second channel is in fact much greater than that of the first, so communication between the near-data processing device 204 and the NRAM 431 or WRAM 432 may be more efficient through the second channel. Embodiments of the present invention may select the data transfer channel according to their own hardware conditions.
In other embodiments, the functions of the GDMA 311 and the IODMA 433 can be integrated into the same component. For convenience of description, the present invention treats the GDMA 311 and the IODMA 433 as different components; for those skilled in the art, as long as the functions realized and the technical effects achieved are similar to those of the present invention, they fall within the protection scope of the present invention. Further, the function of the GDMA 311, the function of the IODMA 433, the function of the CDMA 310 and the function of the MVDMA 434 may also be implemented by the same component.
For convenience of description, the hardware related to quantization operations shown in Figs. 1 to 4 is integrated as shown in Fig. 6. This processing system can optimize the parameters of a deep neural network during training, and includes the near-data processing device 204 and the computing device 201. The near-data processing device 204 stores and quantizes the raw data running in the deep neural network to generate quantized data; the computing device 201 is an acceleration device that trains the deep neural network based on the quantized data, so as to generate and quantize training results. The near-data processing device 204 updates the parameters based on the quantized training results, and based on the updated parameters the computing device 201 runs the trained deep neural network on various kinds of data to obtain computation results (prediction results).
As mentioned above, the near-data processing device 204 not only has storage capability but also basic computing capability. As shown in Fig. 6, the near-data processing device 204 includes a memory 601, a statistic quantization unit (SQU) 602 and an optimizer 603.
The memory 601 may be any appropriate storage medium (including a magnetic storage medium or a magneto-optical storage medium), such as a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high-bandwidth memory (HBM), a hybrid memory cube (HMC), ROM or RAM. The input data required to run the deep neural network is stored in the memory 601.
The statistical quantizer 602 is used to quantize the input data. Fig. 7 shows a schematic diagram of the statistical quantizer 602 of this embodiment; the statistical quantizer 602 includes a buffer element 701, a statistics element 702 and a screening element 703.
The buffer element 701 temporarily stores multiple input data from the memory 601. When the deep neural network model is in the training stage, the input data here refers to the raw data used for training, such as the weights, biases or other parameters to be trained. After the deep neural network model has been trained, the input data here refers to the training results, i.e., the updated weights, biases or other parameters, which constitute the trained deep neural network model and are used when the trained model performs inference.
The buffer element 701 includes multiple buffer components; for convenience of description, a first buffer component and a second buffer component are taken as an example. The multiple input data from the memory 601 are first sequentially stored into the first buffer component. When the space of the first buffer component is filled, the buffer element 701 switches so that subsequent input data are sequentially stored into the second buffer component. While the input data are being sequentially stored into the second buffer component, the screening element 703 reads the input data temporarily stored in the first buffer component. When the space of the second buffer component is filled, the buffer element 701 switches again, and subsequent input data are stored into the first buffer component, overwriting the input data originally stored there. Since the screening element 703 has already read the input data originally stored in the first buffer component, overwriting them at this point does not cause a data-access error. By repeatedly alternating writes and reads between the first and second buffer components in this synchronized way, this embodiment can accelerate data access. Specifically, in this embodiment the size of each buffer component is 4 KB. This buffer-component size is only an example and can be planned according to the actual situation.
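The alternating write/read behavior can be modeled in a few lines of code. The sketch below is a minimal software model of this ping-pong scheme, assuming the 4 KB capacity mentioned above; the class name `PingPongBuffer` and the `consume` callback are illustrative inventions, not the hardware interface.

```python
import numpy as np

class PingPongBuffer:
    """Minimal software model of the two-buffer (ping-pong) scheme:
    incoming data fills one 4 KB buffer while the reader drains the other."""

    def __init__(self, capacity_bytes=4096):
        self.buffers = [[], []]       # buffer component 0 and buffer component 1
        self.capacity = capacity_bytes
        self.write_sel = 0            # index of the buffer currently being filled

    def write(self, chunk, consume):
        buf = self.buffers[self.write_sel]
        buf.append(chunk)
        if sum(c.nbytes for c in buf) >= self.capacity:
            # the active buffer is full: switch, then let the reader
            # drain the buffer that was just filled
            self.write_sel ^= 1
            full = self.buffers[self.write_sel ^ 1]
            consume(np.concatenate(full))
            full.clear()              # safe to overwrite afterwards

# usage: stream random "input data" through the model
drained = []
buf = PingPongBuffer()
for _ in range(16):
    buf.write(np.random.randn(256).astype(np.float32),
              consume=lambda x: drained.append(x))
```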
The statistics element 702 generates statistical parameters from the multiple input data coming from the memory 601. This embodiment quantizes based on statistical quantization methods, which have been widely used in deep neural networks and which require statistical parameters to be computed from the data being quantized. Several statistical quantization methods are introduced below.
The first statistical quantization method is disclosed in "N. Wang, J. Choi, D. Brand, C. Chen, and K. Gopalakrishnan, 'Training deep neural networks with 8-bit floating point numbers,' in NeurIPS, 2018". This method can quantize the input data into FP8 intermediate data, and the statistical parameter it requires is the maximum absolute value of the input data x, i.e., max|x|.
The second statistical quantization method is disclosed in "Y. Yang, S. Wu, L. Deng, T. Yan, Y. Xie, and G. Li, 'Training high-performance and large-scale deep neural networks with full 8-bit integers,' Neural Networks, 2020". This method can quantize the input data into INT8 intermediate data, and the statistical parameter it requires is the maximum absolute value of the input data x (max|x|).
The third statistical quantization method is disclosed in "X. Zhang, S. Liu, R. Zhang, C. Liu, D. Huang, S. Zhou, J. Guo, Y. Kang, Q. Guo, Z. Du et al., 'Fixed-point back-propagation training,' in CVPR, 2020". This method uses a dynamically selected data format to estimate the quantization error between INT8 and INT16 so as to cover different distributions, quantizing the input data into INT8 or INT16 intermediate data. The statistical parameters it requires are the maximum absolute value of the input data x (max|x|) and the mean-value distance between the input data x and the corresponding intermediate data x′, $\bar{d}(x, x') = \frac{1}{n}\sum_{i=1}^{n} |x_i - x'_i|$.
The fourth statistical quantization method is disclosed in "K. Zhong, T. Zhao, X. Ning, S. Zeng, K. Guo, Y. Wang, and H. Yang, 'Towards lower bit multiplication for convolutional neural network training,' arXiv preprint arXiv:2006.02804, 2020". This method is a shiftable fixed-point data format that encodes two data with different fixed-point ranges and one additional bit, thereby covering the representable range and resolution, quantizing the input data into adjustable INT8 intermediate data. The statistical parameter it requires is the maximum absolute value of the input data x (max|x|).
The fifth statistical quantization method is disclosed in "Zhu, R. Gong, F. Yu, X. Liu, Y. Wang, Z. Li, X. Yang, and J. Yan, 'Towards unified int8 training for convolutional neural network,' arXiv preprint arXiv:1912.12607, 2019". This method clips the long-tail data among the multiple input data with a minimal precision penalty and then quantizes the input data into INT8 intermediate data. The statistical parameters it requires are the maximum absolute value of the input data x (max|x|) and the cosine distance between the input data x and the corresponding intermediate data x′, cos(x, x′).
To implement at least the statistical quantization methods disclosed in the aforementioned literature, the statistics element 702 may be a processor with basic computing capability or an ASIC logic circuit, used to generate statistical parameters such as the maximum absolute value of the input data x (max|x|), the cosine distance cos(x, x′) between the input data x and the corresponding intermediate data x′, and the mean-value distance $\bar{d}(x, x')$ between the input data x and the corresponding intermediate data x′.
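For concreteness, the three statistical parameters named above can be computed as in the following numpy sketch. The function name is assumed, and the mean-value distance is written as the mean absolute difference, which is one plausible reading of the quantity used by the third method.

```python
import numpy as np

def statistics(x, x_q=None):
    """Statistical parameters used by the statistical quantization methods:
    max|x| over the raw data, plus (optionally) the cosine distance and the
    mean-value distance between raw data x and its quantized version x_q."""
    stats = {"max_abs": np.max(np.abs(x))}
    if x_q is not None:
        stats["cosine"] = np.dot(x, x_q) / (np.linalg.norm(x) * np.linalg.norm(x_q))
        stats["mean_dist"] = np.mean(np.abs(x - x_q))   # assumed definition
    return stats
```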
As mentioned above, performing a statistical quantization method requires global statistics over all the input data before quantization in order to obtain the statistical parameters. Doing global statistics requires moving all the input data, which consumes hardware resources heavily and makes global statistics a bottleneck in the training process. The statistics element 702 of this embodiment is placed directly on the memory 601 side rather than on the computing device 201 side, so global statistics and quantization can be done locally in memory. This eliminates the procedure of moving all the input data from the memory 601 to the computing device 201 and greatly relieves the capacity and bandwidth pressure on the hardware.
The screening element 703 reads the input data one by one from the buffer components of the buffer element 701 according to the statistical parameters, so as to generate output data, where the output data is the quantized result of the input data, i.e., the quantized data. As shown in Fig. 7, the screening element 703 includes multiple quantization components 704 and an error multiplexing component 705.
The quantization components 704 receive the input data from the buffer components of the buffer element 701 and quantize the input data (also called raw data) based on different quantization formats. In more detail, sorting through the various statistical quantization methods above, several kinds of quantization operations can be identified; each quantization component 704 performs a different quantization operation and obtains different intermediate data according to the statistical parameter max|x|. In other words, the quantization formats of the quantization components 704 implement the aforementioned statistical quantization methods. Four quantization components 704 are shown in the figure, indicating that the aforementioned statistical quantization methods can be classified into four kinds of quantization operations, with each quantization component 704 performing one of them. In this embodiment, these quantization operations differ in how much of the input data is clipped, i.e., each quantization format corresponds to a different clipping amount of the input data. For example, one quantization operation keeps 95% of the amount of all input data, another keeps 60%, and so on; these clipping amounts are determined by the aforementioned statistical quantization methods. Once a statistical quantization method makes different trade-offs, the quantization components 704 must be adjusted accordingly.
Based on the different statistical quantization methods, the screening element 703 selects and executes the corresponding single or multiple quantization components 704 to obtain the quantized intermediate data. For example, the first statistical quantization method only needs one quantization component 704 to perform one quantization operation, while the second statistical quantization method needs all the quantization components 704 to perform four quantization operations. These quantization components 704 can perform their respective quantization-format operations in parallel, or implement each quantization component's operation one by one in a time-shared manner.
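A minimal sketch of one such clipping-based quantization component follows. The symmetric INT8 mapping, the helper name `quantize_int8`, and the exact set of clipping ratios are illustrative assumptions; only the 95% and 60% figures come from the text above.

```python
import numpy as np

def quantize_int8(x, clip_ratio):
    """One 'quantization component': clip the raw data at a percentile of
    |x| set by clip_ratio (the fraction of data kept), then map the
    clipped values symmetrically onto INT8."""
    threshold = np.percentile(np.abs(x), clip_ratio * 100.0)
    clipped = np.clip(x, -threshold, threshold)
    scale = threshold / 127.0 if threshold > 0 else 1.0
    q = np.round(clipped / scale).astype(np.int8)
    return q, scale

# four components with different clipping ratios produce four candidate
# intermediate results from the same raw data
x = np.random.randn(1024).astype(np.float32)
intermediates = [quantize_int8(x, r) for r in (1.0, 0.95, 0.8, 0.6)]
```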
The error multiplexing component 705 determines the corresponding errors from the intermediate data and the input data, and selects one of the multiple intermediate data as the output data, that is, it determines the quantized data according to these errors. The error multiplexing component 705 includes multiple error calculation units 706, a selection unit 707, a first multiplexing unit 708 and a second multiplexing unit 709.
The multiple error calculation units 706 receive the input data, the intermediate data and the statistical parameters, and calculate the error value between the input data and the intermediate data. In more detail, each error calculation unit 706 corresponds to one quantization component 704; the intermediate data generated by a quantization component 704 is output to the corresponding error calculation unit 706, which calculates the error value between that intermediate data and the input data. This error value represents the gap between the quantized data generated by the quantization component 704 and the pre-quantization input data, and the gap is compared against the statistical parameter cos(x, x′) or $\bar{d}(x, x')$ from the statistics element 702. Besides producing the error value, the error calculation unit 706 also generates a tag that records the quantization format of the corresponding quantization component 704, i.e., it records under which quantization format the error value was generated.
The selection unit 707 receives the error values of all the error calculation units 706, compares them against the input data, selects the smallest among these error values, and generates a control signal corresponding to the intermediate data with the minimum error value.
The first multiplexing unit 708 outputs, according to the control signal, the intermediate data with the minimum error value as the output data. In other words, the control signal controls the first multiplexing unit 708 to output, among the several quantization formats, the intermediate data with the smallest error as the output data, i.e., the quantized data.
The second multiplexing unit 709 outputs, according to the control signal, the tag of the intermediate data with the minimum error value, which records the quantization format of the output data (quantized data).
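Functionally, the error calculation units, the selection unit and the two multiplexing units reduce to an argmin over the candidate formats, as in the sketch below. The mean-absolute-error metric is an assumption made for illustration; as described above, the hardware compares against cos(x, x′) or the mean-value distance.

```python
import numpy as np

def select_best(x, intermediates):
    """Software model of error calculation units + selection unit + muxes:
    dequantize each candidate, measure its error against the raw data, and
    output the lowest-error candidate together with its format tag."""
    errors = [np.mean(np.abs(x - q.astype(np.float32) * s))   # error units
              for q, s in intermediates]
    tag = int(np.argmin(errors))          # selection unit -> control signal
    q, scale = intermediates[tag]         # first mux: the quantized data
    return q, scale, tag                  # second mux: the format tag
```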
The arrows in Fig. 6 represent data flows. To distinguish unquantized data from quantized data, unquantized data is represented by solid arrows and quantized data by dashed arrows. For example, the input data transferred from the memory 601 to the statistical quantizer 602 is original, unquantized data, so its data flow is shown with a solid arrow, while the output data of the statistical quantizer 602 is quantized data, so its data flow is shown with a dashed arrow. The data flow of the tags is omitted from the figure.
In summary, from the input data stored in the memory 601, after quantization computation and selection by the statistical quantizer 602, the near-data processing device 204 obtains the quantized data with the minimum error value as the output data, together with the tag recording the quantization format of the output data.
Continuing with Fig. 6, the computing device 201 of this embodiment includes a direct memory access, a cache controller 604 and a cache array. The direct memory access is the external storage controller 301, responsible for controlling data movement between the computing device 201 and the near-data processing device 204, for example moving the output data and tags of the near-data processing device 204 into the cache array of the computing device 201. The cache array includes the NRAM 431 and the WRAM 432.
Fig. 8 shows a schematic diagram of the cache controller 604 and a cache array 801. The cache controller 604 temporarily stores the output data and tags sent by the external storage controller 301 and controls the storage of the output data and tags into the appropriate locations in the cache array 801. The cache array 801 may be an existing or customized storage space that includes multiple cache elements. These cache elements physically form an array, and each cache element can be addressed by a row and a column of the array. More specifically, the cache array 801 is controlled by a row selection element 802 and a column selection element 803. When the cache element at row i, column j of the cache array 801 needs to be accessed, the external storage controller 301 sends a row selection signal and a column selection signal to the row selection element 802 and the column selection element 803, respectively. The row selection element 802 and the column selection element 803 enable the cache array 801 according to these signals, so that a quantization element 807 can read the data stored in the cache element at row i, column j of the cache array 801 or write data into it. In this embodiment, since the quantization formats of individual quantized data are not necessarily the same, for convenience of storage and management the data in the same row of the cache array 801 must all be of the same quantization format, but different rows may store data of different quantization formats.
The cache controller 604 includes a tag buffer 804, a quantized-data cache element 805, a priority cache element 806 and quantization elements 807.
The tag buffer 804 stores row tags, each of which records the quantization format of one row of the cache array 801. As mentioned above, the same row of the cache array 801 stores data of the same quantization format, but different rows do not necessarily store data of the same quantization format; the tag buffers 804 record the quantization format of each row. Specifically, the number of tag buffers 804 equals the number of rows of the cache array 801, and each tag buffer 804 corresponds to one row of the cache array 801, i.e., the i-th tag buffer 804 records the quantization format of the i-th row of the cache array 801.
The quantized-data cache element 805 includes a data cache component 808 and a tag cache component 809. The data cache component 808 temporarily stores the quantized data sent from the external storage controller 301, and the tag cache component 809 temporarily stores the tag sent from the external storage controller 301. When the quantized data is to be stored into the cache element at row i, column j of the cache array 801, the external storage controller 301 sends a priority tag to the priority cache element 806. The priority tag indicates that this access should be handled based on a specific quantization format. At the same time, the external storage controller 301 sends a row selection signal to the row selection element 802; in response, the row selection element 802 fetches the row tag of row i and sends it to the priority cache element 806.
If the priority cache element 806 determines that the priority tag is consistent with the row tag, this access is handled in the quantization format of row i, and the quantization element 807 ensures that the quantization format of the quantized data is consistent with that of row i.
If the priority tag is inconsistent with the row tag, the priority tag prevails, i.e., this access is handled in the quantization format recorded by the priority tag. The quantization element 807 must not only ensure that the quantization format of the quantized data is consistent with the format recorded by the priority tag, but also adjust the quantization format of the data originally stored in row i, so that the quantization format of the entire row becomes the specific format recorded by the priority tag.
In more detail, the priority cache element 806 determines whether the tag of the quantized data is the same as the priority tag. If they are the same, the quantization format of the quantized data to be stored is consistent with that of the priority tag, and the quantized data needs no adjustment. The priority cache element 806 then further determines whether the row tag is the same as the priority tag. If they are the same, the quantized data already stored in row i also needs no adjustment; the row selection element 802 opens the channel of row i of the cache array 801, and the quantization element 807 of column j stores the quantized data into the cache element at row i, column j. If the row tag differs from the priority tag, the priority cache element 806 controls all the quantization elements 807 to convert the quantization format of every quantized datum in row i into the format of the priority tag; the row selection element 802 opens the channel of row i of the cache array 801, and the quantization elements 807 store the format-adjusted quantized data into the cache elements of row i.
If the priority cache element 806 determines that the tag of the quantized data differs from the priority tag, the quantized data needs format conversion, and the priority cache element 806 further determines whether the row tag is the same as the priority tag. If they are the same, the quantized data already stored in row i needs no adjustment; only the quantized data from the external storage controller 301 needs format conversion. The priority cache element 806 controls the quantization element 807 of column j to convert the format of the quantized data from the external storage controller 301 into the format of the priority tag; the row selection element 802 opens the channel of row i of the cache array 801, and the quantization element 807 of column j stores the converted quantized data into the cache element at row i, column j. If the priority cache element 806 determines that the row tag differs from the priority tag, it controls all the quantization elements 807 to convert the quantization format of every quantized datum in row i into the format of the priority tag; the row selection element 802 opens the channel of row i of the cache array 801, and the quantization elements 807 store the format-adjusted quantized data into the cache elements of row i.
In this embodiment there are multiple quantization elements 807, whose size and number match the length of the quantized data and the row length of the cache array 801. In more detail, the cache array 801 includes M×N cache elements, that is, M rows and N columns. Assuming the length of the quantized data is fixed at S bits, the length of each cache element is also S bits, and the length of each row equals N×S. Corresponding to the N columns of the cache array 801, N quantization elements 807 are provided, one per column. Specifically, in this embodiment the cache array includes 8192×32 cache elements, i.e., 8192 rows (rows 0 through 8191 in the figure) and 32 columns; there are correspondingly 32 quantization elements 807 (quantization elements 0 through 31 in the figure), and the length of the quantized data, the space of each quantization element 807 and the space of each cache element are all set to 8 bits, with each row 32×8 bits long.
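The tag-comparison behavior of the cache controller can be summarized with a short behavioral model. This sketch makes simplifying assumptions: quantization formats are represented by per-format scale factors, and `requantize` stands in for the format-conversion logic of the quantization elements 807; it is not the actual circuit.

```python
import numpy as np

class CacheControllerModel:
    """Behavioral model: every row of the cache array holds data in one
    quantization format; incoming data whose tag differs from the priority
    tag, and rows whose row tag differs from it, are requantized."""

    def __init__(self, rows=8192, cols=32, scales=None):
        self.array = np.zeros((rows, cols), dtype=np.int8)
        self.row_tags = [0] * rows            # one tag buffer per row
        self.scales = scales or {0: 1.0, 1: 0.5, 2: 0.25, 3: 0.125}

    def requantize(self, q, src_tag, dst_tag):
        # convert between formats via their scale factors (assumed helper)
        real = q.astype(np.float32) * self.scales[src_tag]
        return np.clip(np.round(real / self.scales[dst_tag]),
                       -128, 127).astype(np.int8)

    def write(self, row, col, q, tag, priority_tag):
        if tag != priority_tag:                 # incoming datum mismatched
            q = self.requantize(np.int8(q), tag, priority_tag)
        if self.row_tags[row] != priority_tag:  # whole row mismatched
            self.array[row] = self.requantize(self.array[row],
                                              self.row_tags[row], priority_tag)
            self.row_tags[row] = priority_tag
        self.array[row, col] = q
```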
At this point, the cache controller 604 can store the quantized data into the intended cache element of the NRAM 431 or the WRAM 432, and ensure that the quantization format of the quantized data is consistent with the quantization format of the specific row of the NRAM 431 or WRAM 432 into which it is stored.
Returning to Fig. 6, the data stored in the cache array (NRAM 431 and/or WRAM 432) has all been quantized. When a vector operation needs to be performed, the quantized data stored in the NRAM 431 is fetched and output to the vector operation unit 421 in the operation module 42 for vector computation. When matrix multiplication and convolution need to be performed, the quantized data stored in the NRAM 431 and the weights stored in the WRAM 432 are fetched and output to the matrix operation unit 422 in the operation module 42 for matrix computation. The computation results are stored back into the NRAM 431. In other embodiments, the computing device 201 may include a computation-result cache element, and the computation results produced by the operation module 42 are stored into the computation-result cache element instead of back into the NRAM 431.
In the inference stage of the neural network, the computation results are the predicted outputs. Since the computation results are unquantized data, processing them directly would occupy too many resources, so they also need further quantization. The computing device 201 therefore also includes a statistical quantizer 605, which has the same structure as the statistical quantizer 602 and is used to quantize the computation results to obtain quantized computation results. The quantized computation results are transmitted to the memory 601 for storage via the external storage controller 301.
In the training stage of the neural network, the computation results are the gradients of the weights, and these gradients need to be sent back to the near-data processing device 204 to update the parameters. Although the gradients are also unquantized data, they cannot be quantized: once quantized, the gradient information would be lost and could not be used to update the parameters. In this case, the external storage controller 301 fetches the gradients directly from the NRAM 431 and sends them to the near-data processing device 204.
Fig. 9 shows a more detailed schematic diagram of the near-data processing device 204. The memory 601 includes multiple memory particles 901 and a parameter buffer 902. The multiple memory particles 901 are the storage units of the memory 601 and store the parameters needed to run the neural network; the parameter buffer 902 reads and buffers the parameters from the multiple memory particles 901. Whenever a device wants to access the memory 601, the data of the memory particles 901 must be moved through the parameter buffer 902. The parameters referred to here are values that can be continuously updated during training to optimize the neural network model, such as weights and biases. The optimizer 603 reads the parameters from the parameter buffer 902 and updates them according to the training results (i.e., the aforementioned gradients) sent by the external storage controller 301.
The near-data processing device 204 also includes a constant buffer 903, which stores constants related to the neural network, such as hyperparameters, for the optimizer 603 to perform various operations based on these constants to update the parameters. Hyperparameters are generally variables set based on the developer's experience and are not automatically updated with training; the learning rate, the decay rate, the number of iterations, the number of layers of the neural network, the number of neurons per layer, etc., are all constants. The optimizer 603 stores the updated parameters into the parameter buffer 902, and the parameter buffer 902 then stores the updated parameters into the memory particles 901, completing the parameter update.
The optimizer 603 can perform stochastic gradient descent (SGD). Based on the parameters, the learning rate among the constants and the gradient, stochastic gradient descent uses the derivative from calculus: by evaluating the derivative of the function, it finds the direction in which the function decreases, or its lowest point (extreme point). By continuously adjusting the weights through stochastic gradient descent, the value of the loss function, i.e., the prediction error, becomes smaller and smaller. The formula of stochastic gradient descent is:

$$w_t = w_{t-1} - \eta \times g$$

where $w_{t-1}$ is the weight, $\eta$ is the learning rate among the constants, $g$ is the gradient, and $w_t$ is the updated weight. The subscript $t-1$ refers to the current stage, and the subscript $t$ refers to the next stage after one round of training, i.e., after one update.
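As a minimal code counterpart (the function name is assumed; it works elementwise on numpy arrays), the SGD update is a single expression:

```python
def sgd_update(w, g, lr):
    """Plain SGD: w_t = w_{t-1} - lr * g."""
    return w - lr * g
```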
The optimizer 603 can also perform the AdaGrad algorithm based on the parameters, the learning rate among the constants and the gradient. The idea of AdaGrad is to adapt each parameter of the model independently: a parameter with a large partial derivative gets a correspondingly large learning rate, while a parameter with a small partial derivative gets a smaller learning rate; the learning rate of each parameter is scaled inversely proportional to the square root of the sum of its historical squared gradient values. Its formulas are:

$$m_t = m_{t-1} + g^2$$
$$w_t = w_{t-1} - \frac{\eta}{\sqrt{m_t}} \times g$$

where $w_{t-1}$ and $m_{t-1}$ are parameters, $\eta$ is the learning rate among the constants, $g$ is the gradient, and $w_t$ and $m_t$ are the updated parameters. The subscript $t-1$ refers to the current stage, and the subscript $t$ refers to the next stage after one round of training, i.e., after one update.
The optimizer 603 can also perform the RMSProp algorithm based on the parameters, the learning rate among the constants, the decay rate among the constants and the gradient. RMSProp uses an exponentially decayed average to discard the distant history, enabling it to converge quickly after finding a "convex" structure; in addition, RMSProp introduces a hyperparameter (the decay rate) to control the decay speed. Its formulas are:

$$m_t = \beta \times m_{t-1} + (1-\beta) \times g^2$$
$$w_t = w_{t-1} - \frac{\eta}{\sqrt{m_t}} \times g$$

where $w_{t-1}$ and $m_{t-1}$ are parameters, $\eta$ is the learning rate among the constants, $\beta$ is the decay rate among the constants, $g$ is the gradient, and $w_t$ and $m_t$ are the updated parameters. The subscript $t-1$ refers to the current stage, and the subscript $t$ refers to the next stage after one round of training, i.e., after one update.
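AdaGrad and RMSProp share the same datapath and differ only in how the accumulator m is updated, as the following sketch shows. The eps guard is an assumption added for numerical safety; the formulas above omit it.

```python
import numpy as np

def adagrad_update(w, m, g, lr, eps=1e-8):
    """AdaGrad: accumulate squared gradients without decay."""
    m = m + g * g
    w = w - lr * g / (np.sqrt(m) + eps)
    return w, m

def rmsprop_update(w, m, g, lr, decay, eps=1e-8):
    """RMSProp: exponentially decayed average of squared gradients."""
    m = decay * m + (1.0 - decay) * g * g
    w = w - lr * g / (np.sqrt(m) + eps)
    return w, m
```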
The optimizer 603 can also perform the Adam algorithm based on the parameters, the learning rate among the constants, the decay rates among the constants and the gradient. Adam goes one step further than RMSProp: besides the exponentially decayed average of the historical squared gradients, it also keeps an exponentially decayed average of the historical gradients. Its formulas are:

$$m_t = \beta_1 \times m_{t-1} + (1-\beta_1) \times g$$
$$v_t = \beta_2 \times v_{t-1} + (1-\beta_2) \times g^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}$$
$$\hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
$$w_t = w_{t-1} - \eta \times \frac{\hat{m}_t}{\sqrt{\hat{v}_t}}$$

where $w_{t-1}$, $m_{t-1}$ and $v_{t-1}$ are parameters, $\eta$ is the learning rate among the constants, $\beta_1$ and $\beta_2$ are the decay rates among the constants, $g$ is the gradient, and $w_t$, $m_t$ and $v_t$ are the updated parameters. The subscript $t-1$ refers to the current stage, the subscript $t$ refers to the next stage after one round of training (i.e., after one update), and the superscript $t$ denotes $t$ rounds of training, so $\beta^t$ means $\beta$ raised to the power $t$; $\hat{m}_t$ and $\hat{v}_t$ are the momenta $m_t$ and $v_t$ after decay correction.
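A minimal sketch of the Adam update with bias correction follows; the default beta values and the eps guard are conventional assumptions, not values fixed by the text.

```python
import numpy as np

def adam_update(w, m, v, g, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: decayed averages of g and g^2 with bias correction.
    t is the 1-based training step; eps is an assumed numerical guard."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```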
Fig. 10 shows a schematic diagram of the optimizer 603. The optimizer 603 uses simple addition circuits, subtraction circuits, multiplication circuits and multiplexers to implement the aforementioned algorithms. Summarizing the algorithms above, the optimizer 603 needs to implement the following operations:

$$m_t = c_1 \times m_{t-1} + c_2 \times g$$
$$v_t = c_3 \times v_{t-1} + c_4 \times g^2$$
$$t_1 = m_t \ \text{or} \ g$$
$$t_2 = \frac{1}{\sqrt{v_t}} \ \text{or} \ 1$$
$$w_t = w_{t-1} - c_5 \times t_1 \times t_2$$

That is, any of the aforementioned algorithms can update the parameters through these operations, but each algorithm is paired with different constants. Taking the Adam algorithm as an example, its constants are configured as:

$$c_1 = \beta_1,\quad c_2 = 1-\beta_1,\quad c_3 = \beta_2,\quad c_4 = 1-\beta_2$$
$$c_5 = \eta \times \frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t}$$
$$s_1 = s_2 = 1$$

where $s_1$ and $s_2$ are the select signals that choose $t_1 = m_t$ and $t_2 = 1/\sqrt{v_t}$, respectively. The optimizer 603 updates the parameters 1003 into the parameters 1004 according to the gradient 1001 and the constants 1002, and then stores the parameters 1004 into the parameter buffer 902.
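The shared datapath can be modeled directly from the five operations above: two multiply-add pairs, two multiplexers (here the booleans s1 and s2), and a final subtract. This is a behavioral sketch with an assumed eps guard; the per-algorithm constant settings follow the same pattern as the Adam example in the text.

```python
import numpy as np

def generic_update(w, m, v, g, c1, c2, c3, c4, c5, s1, s2, eps=1e-8):
    """One pass through the optimizer datapath of Fig. 10 (behavioral).
    s1 selects t1 = m_t (True) or g (False); s2 selects t2 = 1/sqrt(v_t)
    (True) or 1 (False)."""
    m = c1 * m + c2 * g                 # m_t = c1 * m_{t-1} + c2 * g
    v = c3 * v + c4 * g * g             # v_t = c3 * v_{t-1} + c4 * g^2
    t1 = m if s1 else g
    t2 = 1.0 / (np.sqrt(v) + eps) if s2 else 1.0
    w = w - c5 * t1 * t2                # w_t = w_{t-1} - c5 * t1 * t2
    return w, m, v

# example: one Adam step through the shared datapath (assumed shapes)
w = np.zeros(4); m = np.zeros(4); v = np.zeros(4); g = np.ones(4); t = 1
lr, b1, b2 = 1e-3, 0.9, 0.999
c5 = lr * np.sqrt(1 - b2 ** t) / (1 - b1 ** t)
w, m, v = generic_update(w, m, v, g, b1, 1 - b1, b2, 1 - b2, c5, True, True)
# SGD would instead use: s1=False, s2=False, c5=lr (m and v unused)
```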
In each round of training, the parameters are fetched from the memory 601, quantized by the statistical quantizer 602, stored into the WRAM 432 under the control of the cache controller 604, and then put through forward propagation and backward propagation by the operation module 42, producing gradients. The gradients are sent to the optimizer 603, which executes the aforementioned algorithms to update the parameters. After one or several epochs of training, the parameters are tuned, at which point the deep neural network model is mature and can be used for prediction. In the inference stage, neuron data (for example, image data) and the trained weights are fetched from the memory 601, quantized by the statistical quantizer 602, stored into the NRAM 431 and the WRAM 432, respectively, under the control of the cache controller 604, and then computed by the operation module 42. The computation results are quantized by the statistical quantizer 605, and the final quantized computation results (i.e., the prediction results) are stored into the memory 601, completing the prediction task of the neural network model.
The above embodiment proposes a brand-new hybrid architecture that includes an acceleration device and a near-data processing device. Based on a hardware-friendly quantization technique (HQT), statistical analysis and quantization are performed on the memory side. Owing to the statistical quantizer 602 and the cache controller 604, this embodiment realizes dynamically statistical quantization, reduces unnecessary data accesses and achieves the technical effect of high-precision parameter updating, making the neural network model more accurate and more lightweight. Furthermore, since this embodiment introduces the near-data processing device and the data is quantized at the memory side, the error caused by quantizing long-tail-distributed data can be directly suppressed.
Another embodiment of the present invention is a method of quantizing raw data; Fig. 11 shows a flowchart of performing this method with the statistical quantizer of Fig. 7.
In step 1101, the raw data is quantized based on different quantization formats to obtain corresponding intermediate data. The quantization components 704 receive the input data from the buffer components of the buffer element 701 and quantize the input data (also called raw data) based on different quantization formats; each quantization component 704 performs a different quantization operation and obtains different intermediate data according to the statistical parameters. The statistical parameter may be at least one of the maximum absolute value of the raw data, the cosine distance between the raw data and the corresponding intermediate data, and the vector distance between the raw data and the corresponding intermediate data.
In step 1102, the errors between the intermediate data and the raw data are calculated. The multiple error calculation units 706 receive the input data, the intermediate data and the statistical parameters, and calculate the error values between the input data and the intermediate data. In more detail, each error calculation unit 706 corresponds to one quantization component 704; the intermediate data generated by a quantization component 704 is output to the corresponding error calculation unit 706, which calculates the error value between that intermediate data and the input data. This error value represents the gap between the quantized data generated by the quantization component 704 and the pre-quantization input data, and the gap is compared against the statistical parameter cos(x, x′) or $\bar{d}(x, x')$ from the statistics element 702. Besides producing the error value, the error calculation unit 706 also generates a tag recording the quantization format of the corresponding quantization component 704, i.e., recording under which quantization format the error value was generated.
In step 1103, the intermediate data with the minimum error value is identified. The selection unit 707 receives the error values of all the error calculation units 706, compares them against the input data, identifies the smallest among these error values, and generates a control signal corresponding to the intermediate data with the minimum error value.
In step 1104, the intermediate data with the minimum error value is output as the quantized data. The first multiplexing unit 708 outputs, according to the control signal, the intermediate data with the minimum error value as the output data; in other words, the control signal controls the first multiplexing unit 708 to output, among the several quantization formats, the intermediate data with the smallest error as the output data, i.e., the quantized data. The second multiplexing unit 709 outputs, according to the control signal, the tag of the intermediate data with the minimum error value, which records the quantization format of the output data (quantized data).
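Putting steps 1101 to 1104 together, the method is a quantize, score, argmin, emit pipeline. The sketch below is an end-to-end software rendering under assumed names; it can be driven, for example, with clipping-based quantizers like the `quantize_int8` helper sketched earlier.

```python
import numpy as np

def quantize_raw_data(x, quantizers):
    """Steps 1101-1104 end to end: quantize under each format (1101),
    compute each candidate's error against the raw data (1102), identify
    the minimum-error candidate (1103), and output it with its tag (1104).
    `quantizers` is a list of callables mapping x -> (q_int8, scale)."""
    intermediates = [quant(x) for quant in quantizers]            # 1101
    errors = [np.mean(np.abs(x - q.astype(np.float32) * s))      # 1102
              for q, s in intermediates]
    tag = int(np.argmin(errors))                                  # 1103
    q, scale = intermediates[tag]                                 # 1104
    return q, scale, tag
```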
Another embodiment of the present invention is a computer-readable storage medium on which computer program code for quantizing raw data is stored; when the computer program code is run by a processing device, the method shown in Fig. 11 is executed. According to different application scenarios, the electronic equipment or devices of the present invention may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, intelligent terminals, PC devices, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, surveillance cameras, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, visual terminals, autonomous-driving terminals, vehicles, household appliances and/or medical equipment. The vehicles include aircraft, ships and/or road vehicles; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves and range hoods; the medical equipment includes MRI machines, B-mode ultrasound scanners and/or electrocardiographs. The electronic equipment or devices of the present invention can also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites and healthcare. Further, the electronic equipment or devices of the present invention can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as the cloud, the edge and terminals. In one or more embodiments, electronic equipment or devices with high computing power according to the solution of the present invention can be applied to cloud devices (e.g., cloud servers), while electronic equipment or devices with low power consumption can be applied to terminal devices and/or edge devices (e.g., smartphones or cameras). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or edge device are compatible with each other, so that, according to the hardware information of the terminal device and/or edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or edge device, in order to complete unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-device integration.
It should be noted that, for the sake of brevity, the present invention expresses some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art can understand that the solution of the present invention is not limited by the order of the described actions. Therefore, according to the disclosure or teaching of the present invention, those skilled in the art can understand that some of the steps may be performed in another order or simultaneously. Further, those skilled in the art can understand that the embodiments described in the present invention can be regarded as optional embodiments, i.e., the actions or modules involved therein are not necessarily indispensable for the realization of one or some solutions of the present invention. In addition, depending on the solution, the descriptions of some embodiments of the present invention have different emphases. In view of this, those skilled in the art can understand that, for the parts not described in detail in a certain embodiment of the present invention, reference may also be made to the relevant descriptions of other embodiments.
In terms of specific implementation, based on the disclosure and teaching of the present invention, those skilled in the art can understand that the several embodiments disclosed herein may also be realized in other ways not disclosed herein. For example, regarding the units in the aforementioned electronic equipment or device embodiments, this document splits them on the basis of logical function, while other ways of splitting may exist in actual implementation. For another example, multiple units or components may be combined or integrated into another system, or some features or functions in a unit or component may be selectively disabled. As far as the connection relationships between different units or components are concerned, the connections discussed above with reference to the drawings may be direct or indirect couplings between the units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection using an interface, where the communication interface may support electrical, optical, acoustic, magnetic or other forms of signal transmission.
In the present invention, a unit described as a separate component may or may not be physically separate, and a component shown as a unit may or may not be a physical unit. The aforementioned components or units may be located at the same location or distributed over multiple network units. In addition, according to actual needs, some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present invention. In addition, in some scenarios, multiple units in the embodiments of the present invention may be integrated into one unit, or each unit may physically exist alone.
In some other implementation scenarios, the aforementioned integrated units may also be implemented in the form of hardware, i.e., as specific hardware circuits, which may include digital circuits and/or analog circuits. The physical realization of the hardware structure of the circuits may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors. In view of this, the various devices described herein (such as computing devices or other processing devices) may be implemented by appropriate hardware processors, such as central processing units, GPUs, FPGAs, DSPs and ASICs.
A1. A statistical quantizer for quantizing a plurality of raw data, including:
a buffer element for temporarily storing the plurality of raw data;
a statistics element for generating statistical parameters according to the plurality of raw data; and
a quantization element for reading the plurality of raw data one by one from the buffer element according to the statistical parameters, so as to generate quantized data.
A2. The statistical quantizer of A1, wherein the buffer element includes a first buffer component and a second buffer component, the plurality of raw data are sequentially stored into the first buffer component, and when the space of the first buffer component is filled, storage switches to the second buffer component in sequence.
A3. The statistical quantizer of A2, wherein when the plurality of raw data are sequentially stored into the second buffer component, the quantization element reads the plurality of raw data from the first buffer component.
A4. The statistical quantizer of A1, wherein the quantization element includes:
a plurality of quantization components, each quantizing the raw data based on a different quantization format, the plurality of quantization components generating a plurality of intermediate data; and
an error multiplexing component for selecting one of the plurality of intermediate data as the quantized data according to the errors between the plurality of intermediate data and the raw data.
A5. The statistical quantizer of A4, wherein the plurality of quantization components implement the different quantization formats in a time-shared manner.
A6. The statistical quantizer of A5, wherein the statistical parameter is at least one of the maximum absolute value of the raw data, the cosine distance between the raw data and the corresponding intermediate data, and the vector distance between the raw data and the corresponding intermediate data.
A7. The statistical quantizer of A5, wherein the error multiplexing component includes:
an error calculation unit for calculating the errors between the plurality of intermediate data and the raw data;
a selection unit for generating a control signal, wherein the control signal corresponds to the intermediate data with the minimum error value; and
a multiplexing unit for outputting, according to the control signal, the intermediate data with the minimum error value as the quantized data.
A8. The statistical quantizer of A1, wherein the quantization element further generates a tag recording the quantization format of the quantized data.
A9. The statistical quantizer of A1, wherein the raw data is neuron data or weights of a deep neural network.
A10. A storage device including the statistical quantizer of any one of A1 to A9.
A11. A processing device including the statistical quantizer of any one of A1 to A9.
A12. A board including the storage device of A10 and the processing device of A11.
B1. A cache controller connected to a direct memory access and a cache array, wherein one row of the cache array stores data of the same quantization format, and the cache controller includes a quantized-data cache element for temporarily storing the quantized data and a tag sent by the direct memory access, the tag recording the quantization format of the quantized data.
B2. The cache controller of B1, further including:
a specific-tag cache element for temporarily storing a specific tag of a specific row of the cache array into which the quantized data is to be stored, the specific tag recording the quantization format of the specific row; and
a quantization element for determining whether the tag is the same as the specific tag and, if not, adjusting the quantization format of the quantized data to the quantization format of the specific row.
B3. The cache controller of B2, wherein the quantization element stores the adjusted quantized data into the specific row.
B4. The cache controller of B2, wherein the cache array includes M×N cache elements, each of length S bits.
B5. The cache controller of B4, wherein the cache controller includes N quantization elements.
B6. The cache controller of B1, further including a tag buffer for storing a row tag, the row tag recording the quantization format of a row of the cache array.
B7. The cache controller of B1, wherein the cache array is used to store neuron data or weights of a deep neural network.
B8. An integrated circuit device including the cache controller of any one of B1 to B7.
B9. A board including the integrated circuit device of B8.
C1. A memory for optimizing parameters of a deep neural network, including:
a plurality of memory particles for storing the parameters;
a parameter buffer for reading and buffering the parameters from the plurality of memory particles; and
an optimizer for reading the parameters from the parameter buffer and updating the parameters according to a gradient;
wherein inference of the deep neural network is performed on image data based on the updated parameters.
C2. The memory of C1, wherein the optimizer stores the updated parameters into the parameter buffer, and the parameter buffer stores the updated parameters into the plurality of memory particles.
C3. The memory of C1, wherein the gradient is obtained by training the deep neural network.
C4. The memory of C1, further including a constant buffer for storing constants, wherein the optimizer updates the parameters according to the constants.
C5. The memory of C4, wherein the optimizer performs stochastic gradient descent (SGD) according to the parameters, the learning rate among the constants and the gradient, to update the parameters.
C6. The memory of C4, wherein the optimizer performs the AdaGrad algorithm according to the parameters, the learning rate among the constants and the gradient, to update the parameters.
C7. The memory of C4, wherein the optimizer performs the RMSProp algorithm according to the parameters, the learning rate among the constants, the decay rate among the constants and the gradient, to update the parameters.
C8. The memory of C4, wherein the optimizer performs the Adam algorithm according to the parameters, the learning rate among the constants, the decay rate among the constants and the gradient, to update the parameters.
C9. An integrated circuit device including the memory of any one of C1 to C8.
C10. A board including the integrated circuit device of C9.
D1. An element for quantizing raw data, including:
a plurality of quantization components for quantizing the raw data based on different quantization formats to obtain corresponding intermediate data; and
an error multiplexing component for determining corresponding errors according to the intermediate data and the raw data, and determining the quantized data from the intermediate data according to the errors.
D2. The element of D1, wherein the plurality of quantization components quantize the raw data according to statistical parameters.
D3. The element of D2, wherein the statistical parameter is at least one of the maximum absolute value of the raw data, the cosine distance between the raw data and the corresponding intermediate data, and the vector distance between the raw data and the corresponding intermediate data.
D4. The element of D1, wherein the error multiplexing component includes:
an error calculation unit for calculating the errors;
a selection unit for generating a control signal, wherein the control signal corresponds to the intermediate data with the minimum error value; and
a multiplexing unit for outputting, according to the control signal, the intermediate data with the minimum error value as the quantized data.
D5. An integrated circuit device including the element of any one of D1 to D4.
D6. A board including the integrated circuit device of D5.
D7. A method of quantizing raw data, including:
quantizing the raw data based on different quantization formats to obtain corresponding intermediate data;
calculating the errors between the intermediate data and the raw data;
identifying the intermediate data with the minimum error value; and
outputting the intermediate data with the minimum error value as the quantized data.
D8. The method of D7, wherein the quantizing step quantizes the raw data according to statistical parameters.
D9. The method of D8, wherein the statistical parameter is at least one of the maximum absolute value of the raw data, the cosine distance between the raw data and the corresponding intermediate data, and the vector distance between the raw data and the corresponding intermediate data.
D10. A computer-readable storage medium on which computer program code for quantizing raw data is stored; when the computer program code is run by a processing device, the method of any one of D7 to D9 is executed.
The embodiments of the present invention have been introduced in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the descriptions of the above embodiments are only intended to help understand the method of the present invention and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementations and the scope of application based on the idea of the present invention. In summary, the contents of this description should not be construed as limiting the present invention.

Claims (25)

  1. A processing system for optimizing parameters of a deep neural network, comprising:
    a near-data processing device for storing and quantizing raw data running in the deep neural network to generate quantized data; and
    an acceleration device for training the deep neural network based on the quantized data, so as to generate and quantize training results;
    wherein the near-data processing device updates the parameters based on the quantized training results, and inference of the deep neural network is performed on image data based on the updated parameters.
  2. The processing system of claim 1, wherein the near-data processing device and the acceleration device each comprise a statistical quantizer, the statistical quantizer comprising:
    a buffer element for temporarily storing a plurality of input data, wherein the plurality of input data are the raw data or the training results;
    a statistics element for generating statistical parameters according to the plurality of input data; and
    a quantization element for reading the plurality of input data one by one from the buffer element according to the statistical parameters, so as to generate output data, wherein the output data is the quantized data or the quantized training results.
  3. The processing system of claim 2, wherein the buffer element comprises a first buffer component and a second buffer component, the plurality of input data are sequentially stored into the first buffer component, and when the space of the first buffer component is filled, storage switches to the second buffer component in sequence.
  4. The processing system of claim 3, wherein when the plurality of input data are sequentially stored into the second buffer component, the quantization element reads the plurality of input data from the first buffer component.
  5. The processing system of claim 2, wherein the quantization element comprises:
    a plurality of quantization components for quantizing the raw data based on different quantization formats to obtain corresponding intermediate data; and
    an error multiplexing component for determining corresponding errors according to the intermediate data and the raw data, and determining the quantized data from the intermediate data according to the errors.
  6. The processing system of claim 5, wherein the plurality of quantization components implement the different quantization formats in a time-shared manner.
  7. The processing system of claim 5, wherein the statistical parameter is at least one of the maximum absolute value of the input data, the cosine distance between the input data and the corresponding intermediate data, and the vector distance between the input data and the corresponding intermediate data.
  8. The processing system of claim 5, wherein the error multiplexing component comprises:
    an error calculation unit for calculating the errors;
    a selection unit for generating a control signal, wherein the control signal corresponds to the intermediate data with the minimum error value; and
    a multiplexing unit for outputting, according to the control signal, the intermediate data with the minimum error value as the output data.
  9. The processing system of claim 2, wherein the quantization element further generates a tag recording the quantization format of the output data.
  10. The processing system of claim 9, wherein the acceleration device comprises:
    a cache array, one row of which stores data of the same quantization format;
    a direct memory access for controlling storage of the output data and the tag into the cache array; and
    a cache controller comprising a quantized-data cache element for temporarily storing the output data and the tag sent by the direct memory access.
  11. The processing system of claim 10, wherein the cache controller further comprises:
    a specific-tag cache element for temporarily storing a specific tag of a specific row of the cache array into which the output data is to be stored, the specific tag recording the quantization format of the specific row; and
    a quantization element for determining whether the tag is the same as the specific tag and, if not, adjusting the quantization format of the output data to the quantization format of the specific row.
  12. The processing system of claim 11, wherein the quantization element stores the adjusted output data into the specific row.
  13. The processing system of claim 11, wherein the cache array comprises M×N cache elements, each of length S bits.
  14. The processing system of claim 13, wherein the cache controller comprises N quantization elements.
  15. The processing system of claim 10, wherein the cache controller further comprises a tag buffer for storing a row tag, the row tag recording the quantization format of a row of the cache array.
  16. The processing system of claim 1, wherein the near-data processing device comprises:
    a plurality of memory particles for storing the parameters;
    a parameter buffer for reading and buffering the parameters from the plurality of memory particles; and
    an optimizer for reading the parameters from the parameter buffer and updating the parameters according to a gradient.
  17. The processing system of claim 16, wherein the optimizer stores the updated parameters into the parameter buffer, and the parameter buffer stores the updated parameters into the plurality of memory particles.
  18. The processing system of claim 16, wherein the training results comprise the gradient.
  19. The processing system of claim 16, wherein the near-data processing device further comprises a constant buffer for storing constants, and the optimizer updates the parameters according to the constants.
  20. The processing system of claim 19, wherein the optimizer performs stochastic gradient descent according to the parameters, the learning rate among the constants and the gradient, to update the parameters.
  21. The processing system of claim 19, wherein the optimizer performs the AdaGrad algorithm according to the parameters, the learning rate among the constants and the gradient, to update the parameters.
  22. The processing system of claim 19, wherein the optimizer performs the RMSProp algorithm according to the parameters, the learning rate among the constants, the decay rate among the constants and the gradient, to update the parameters.
  23. The processing system of claim 19, wherein the optimizer performs the Adam algorithm according to the parameters, the learning rate among the constants, the decay rate among the constants and the gradient, to update the parameters.
  24. An integrated circuit device comprising the processing system of any one of claims 1 to 23.
  25. A board comprising the integrated circuit device of claim 24.
PCT/CN2022/097372 2021-06-08 2022-06-07 Processing system, integrated circuit and board for optimizing parameters of a deep neural network WO2022257920A1 (zh)

Applications Claiming Priority (10)

Application Number Priority Date Filing Date Title
CN202110637685.0A CN113238987B (zh) 2021-06-08 2021-06-08 Statistical quantizer for quantizing data, storage device, processing device and board
CN202110639078.8 2021-06-08
CN202110637698.8 2021-06-08
CN202110639079.2A CN113238989A (zh) 2021-06-08 2021-06-08 Apparatus, method and computer-readable storage medium for quantizing data
CN202110639072.0A CN113238976B (zh) 2021-06-08 2021-06-08 Cache controller, integrated circuit device and board
CN202110639079.2 2021-06-08
CN202110637685.0 2021-06-08
CN202110639078.8A CN113238988B (zh) 2021-06-08 2021-06-08 Processing system, integrated circuit and board for optimizing parameters of a deep neural network
CN202110639072.0 2021-06-08
CN202110637698.8A CN113238975A (zh) 2021-06-08 2021-06-08 Memory, integrated circuit and board for optimizing parameters of a deep neural network

Publications (1)

Publication Number Publication Date
WO2022257920A1 true WO2022257920A1 (zh) 2022-12-15

Family

ID=84425700

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/097372 WO2022257920A1 (zh) Processing system, integrated circuit and board for optimizing parameters of a deep neural network

Country Status (1)

Country Link
WO (1) WO2022257920A1 (zh)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109754066A (zh) * 2017-11-02 2019-05-14 Samsung Electronics Co., Ltd. Method and apparatus for generating a fixed-point neural network
CN112651485A (zh) * 2019-10-11 2021-04-13 Samsung Electronics Co., Ltd. Method and apparatus for recognizing an image, and method and apparatus for training a neural network
US20210110260A1 (en) * 2018-05-14 2021-04-15 Sony Corporation Information processing device and information processing method
CN113238987A (zh) * 2021-06-08 2021-08-10 Cambricon Technologies Corporation Limited Statistical quantizer for quantizing data, storage device, processing device and board
CN113238988A (zh) * 2021-06-08 2021-08-10 Cambricon Technologies Corporation Limited Processing system, integrated circuit and board for optimizing parameters of a deep neural network
CN113238989A (zh) * 2021-06-08 2021-08-10 Cambricon Technologies Corporation Limited Apparatus, method and computer-readable storage medium for quantizing data
CN113238975A (zh) * 2021-06-08 2021-08-10 Cambricon Technologies Corporation Limited Memory, integrated circuit and board for optimizing parameters of a deep neural network
CN113238976A (zh) * 2021-06-08 2021-08-10 Cambricon Technologies Corporation Limited Cache controller, integrated circuit device and board

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22819518

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE