CN113238988B - Processing system, integrated circuit and board for optimizing parameters of deep neural network - Google Patents

Processing system, integrated circuit and board for optimizing parameters of deep neural network Download PDF

Info

Publication number
CN113238988B
CN113238988B (application CN202110639078.8A)
Authority
CN
China
Prior art keywords
data
quantization
parameters
buffer
processing system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110639078.8A
Other languages
Chinese (zh)
Other versions
CN113238988A (en)
Inventor
Name not disclosed at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority to CN202110639078.8A priority Critical patent/CN113238988B/en
Publication of CN113238988A publication Critical patent/CN113238988A/en
Priority to PCT/CN2022/097372 priority patent/WO2022257920A1/en
Application granted granted Critical
Publication of CN113238988B publication Critical patent/CN113238988B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7803System on board, i.e. computer system on one or more PCB, e.g. motherboards, daughterboards or blades
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computer Hardware Design (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention relates to an apparatus for optimizing parameters of a deep neural network. The apparatus is included in an integrated circuit device that comprises a universal interconnect interface and other processing devices. The computing device interacts with the other processing devices to jointly complete computing operations specified by the user. The integrated circuit device may further comprise a storage device coupled to the computing device and the other processing devices for storing data of the computing device and the other processing devices.

Description

Processing system, integrated circuit and board for optimizing parameters of deep neural network
Technical Field
The present invention relates generally to the field of neural networks. More particularly, the present invention relates to processing systems, integrated circuits, and boards that optimize parameters of deep neural networks.
Background
With the popularization and development of artificial intelligence technology, deep neural network models have become increasingly complex; some models comprise hundreds of layers of operators, so the amount of computation grows rapidly.
There are several ways to reduce the amount of computation, one of which is quantization. Quantization refers to approximating weights and activation values originally expressed as high-precision floating point numbers with low-precision integers. Its advantages include lower memory bandwidth, lower power consumption, lower computational resource usage, and lower model storage requirements.
Quantization is currently a common way to reduce data size, but the quantization operation itself is generally not supported by hardware. Existing accelerators mostly rely on offline quantized data, so a general-purpose processor is required to assist in processing, and the efficiency is poor.
Therefore, energy-efficient quantization hardware is highly desirable.
Disclosure of Invention
In order to at least partially solve the technical problems mentioned in the background, the present invention provides a processing system, an integrated circuit, and a board for optimizing parameters of a deep neural network.
In one aspect, the invention discloses a processing system for optimizing parameters of a deep neural network, which comprises a near data processing device and an acceleration device. The near data processing device is used for storing and quantizing the raw data run on the deep neural network to generate quantized data. The acceleration device is used for training the deep neural network based on the quantized data so as to generate and quantize training results. The near data processing device updates the parameters based on the quantized training results, and inference is then performed on image data with the deep neural network using the updated parameters.
In another aspect, the present invention discloses an integrated circuit device including the above-mentioned processing system, and also discloses a board including the above-mentioned integrated circuit device.
The invention realizes quantization with online dynamic statistics, reduces unnecessary data access, and achieves high-precision parameter updating, making the neural network model more accurate and lighter. Because data is quantized directly at the memory side, errors caused by quantizing long-tail-distributed data are suppressed.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. In the drawings, several embodiments of the invention are illustrated by way of example and not by way of limitation, and like or corresponding reference numerals indicate like or corresponding parts and in which:
fig. 1 is a block diagram showing a board of an embodiment of the present invention;
fig. 2 is a block diagram showing an integrated circuit device of an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating the internal architecture of a computing device according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating the internal architecture of a processor core according to an embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating one processor core writing data to a processor core of another cluster;
FIG. 6 is a schematic diagram showing hardware associated with quantization operations according to an embodiment of the present invention;
FIG. 7 is a schematic diagram illustrating a statistical quantizer of an embodiment of the present invention;
FIG. 8 is a schematic diagram illustrating a cache controller and a cache array according to an embodiment of the present invention;
FIG. 9 is a schematic diagram illustrating a near data processing apparatus of an embodiment of the invention;
FIG. 10 is a schematic diagram illustrating an optimizer of an embodiment of the present invention; and
fig. 11 is a flowchart illustrating a method of quantizing raw data according to another embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, specification and drawings of the present invention are used for distinguishing between different objects and not for describing a particular sequential order. The terms "comprises" and "comprising" when used in the specification and claims of the present invention are taken to specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification and claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present specification and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to a determination", or "in response to detecting", depending on the context.
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Deep learning has proven to work well on tasks including image classification, object detection, natural language processing, and the like. A large number of applications today are equipped with deep learning algorithms that depend on images (computer vision).
Deep learning is typically implemented using neural network models. As model predictions become more accurate and networks become deeper, the memory capacity and memory bandwidth required to run neural networks become quite large, which makes it expensive to make devices intelligent.
In practice, developers reduce the network size by compression, data encoding, and the like, and quantization is one of the most widely used compression methods. Quantization refers to converting high-precision floating point data (such as FP32) into low-precision fixed point data (such as INT8); high-precision floating point data requires more bits to describe, whereas low-precision fixed point data can be fully described with fewer bits, so reducing the number of bits per datum effectively relieves the burden on the intelligent device.
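As a rough illustration of this idea (not the hardware scheme of the invention), the following Python sketch shows symmetric, max-absolute-value based quantization of FP32 data to INT8; the function names and the 8-bit range handling are assumptions made for illustration only.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric max-abs quantization of FP32 data to INT8 (illustrative only)."""
    scale = np.max(np.abs(x)) / 127.0          # statistical parameter: max|x|
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.random.randn(1024).astype(np.float32)   # e.g. a weight tensor
q, s = quantize_int8(x)
x_hat = dequantize(q, s)                        # low-precision approximation of x
```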
Fig. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the invention. As shown in fig. 1, the board 10 includes a chip 101, which is a system-on-chip (SoC) integrating one or more combined processing devices. The combined processing device is an artificial intelligence computing unit that uses a quantization-optimized processing approach to support various deep learning and machine learning algorithms, meeting the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. Deep learning technology is applied in large volume in the cloud intelligence field in particular, where a notable characteristic is the large amount of input data and the high requirements on the storage capacity and computing capability of the platform. The board 10 of this embodiment is suitable for cloud intelligence applications, having huge off-chip storage, large on-chip storage, and powerful computing capability.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface means 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface means 102. The external interface device 102 may have different interface forms, such as PCIe interfaces, etc., according to different application scenarios.
The board 10 also includes a memory device 104 for storing data, which includes one or more memory elements 105. The memory device 104 is connected to the control device 106 and the chip 101 via a bus and transmits data. The control device 106 in the board 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may comprise a single chip microcomputer (Micro Controller Unit, MCU).
Fig. 2 is a block diagram showing a combination processing apparatus in the chip 101 of this embodiment. As shown in fig. 2, the combined processing means 20 comprises computing means 201, interface means 202, processing means 203 and near data processing means 204.
The computing device 201 is configured to perform user-specified operations and is mainly implemented as a single-core or multi-core intelligent processor for performing deep learning or machine learning computations; it may interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.
The interface means 202 are used for transmitting data and control instructions between the computing means 201 and the processing means 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, writing to a storage device on the chip of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202, and write the control instructions into a control cache on the chip of the computing device 201. Alternatively or in addition, the interface device 202 may also read data in the memory device of the computing device 201 and transmit it to the processing device 203.
The processing device 203 is a general-purpose processing device that performs basic control including, but not limited to, data handling and starting and/or stopping the computing device 201. Depending on the implementation, the processing device 203 may be a central processing unit (CPU), a graphics processing unit (GPU), or another general-purpose and/or special-purpose processor, including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, and the like, and their number may be determined according to actual needs. As previously mentioned, when considered on its own, the computing device 201 of the present invention may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, they form a heterogeneous multi-core structure.
The near data processing device 204 is a memory with processing capability, typically 16 GB or larger, used for storing data to be processed by the computing device 201 and/or the processing device 203.
Fig. 3 shows a schematic diagram of the internal structure of the computing device 201. The computing device 201 is configured to process input data in fields such as computer vision, speech, natural language, and data mining, and is organized as a multi-core hierarchy: it is a system-on-chip comprising a plurality of clusters, each of which in turn comprises a plurality of processor cores. In other words, the computing device 201 is organized in a system-on-chip, cluster, processor-core hierarchy.
At the system-on-chip level, as shown in FIG. 3, computing device 201 includes an external storage controller 301, a peripheral communication module 302, an on-chip interconnect module 303, a synchronization module 304, and a plurality of clusters 305.
There may be a plurality of external memory controllers 301, of which 2 are shown by way of example, for accessing external memory devices, such as the near data processing apparatus 204 of fig. 2, to read data from or write data to the off-chip in response to an access request issued by the processor core. The peripheral communication module 302 is configured to receive a control signal from the processing device 203 through the interface device 202, and activate the computing device 201 to perform a task. The on-chip interconnect module 303 connects the external memory controller 301, the peripheral communication module 302, and the plurality of clusters 305 for transferring data and control signals between the respective modules. The synchronization module 304 is a global synchronization barrier controller (global barrier controller, GBC) for coordinating the working progress of each cluster to ensure synchronization of information. The plurality of clusters 305 are the computing cores of the computing device 201, 4 being shown by way of example in the figure, and the computing device 201 of the present invention may also include 8, 16, 64, or even more clusters 305 as hardware progresses. The cluster 305 is used to efficiently execute the deep learning algorithm.
At the cluster level, as shown in FIG. 3, each cluster 305 includes a plurality of processor cores (IPU cores) 306 and a memory core (MEM core) 307.
The number of processor cores 306 is illustratively shown as 4, and the present invention is not limited to the number of processor cores 306. The internal architecture is shown in fig. 4. Each processor core 306 includes three major modules: a control module 41, an operation module 42 and a storage module 43.
The control module 41 is used for coordinating and controlling the operation of the operation module 42 and the storage module 43 to complete the task of deep learning, and comprises a fetch unit (instruction fetch unit, IFU) 411 and an instruction decode unit (instruction decode unit, IDU) 412. The instruction fetching unit 411 is configured to fetch an instruction from the processing device 203, and the instruction decoding unit 412 decodes the fetched instruction and sends the decoded result to the operation module 42 and the storage module 43 as control information.
The operation module 42 includes a vector operation unit 421 and a matrix operation unit 422. The vector operation unit 421 is used for performing vector operations and can support complex operations such as vector multiplication, addition, nonlinear transformation, etc.; the matrix operation unit 422 is responsible for the core computation of the deep learning algorithm, i.e. matrix multiplication and convolution.
The storage module 43 is used for storing or transferring related data, and includes a neuron cache element (NRAM) 431, a weight cache element (WRAM) 432, an input/output direct memory access module (IODMA) 433, and a move direct memory access module (MVDMA) 434. NRAM 431 is used to store the feature maps to be computed by the processor core 306 and the intermediate results after computation; WRAM 432 is used to store the weights of the deep learning network; IODMA 433 controls access between NRAM 431/WRAM 432 and the near data processing device 204 via broadcast bus 309; MVDMA 434 controls access between NRAM 431/WRAM 432 and SRAM 308.
Returning to FIG. 3, the storage cores 307 are mainly used for storing and communicating, i.e., storing shared data or intermediate results between the processor cores 306, and performing communication between the clusters 305 and the near data processing apparatus 204, communication between the clusters 305, communication between the processor cores 306, etc. In other embodiments, the memory core 307 has scalar operation capabilities to perform scalar operations.
The memory core 307 includes a shared cache element (SRAM) 308, a broadcast bus 309, a clustered direct memory access module (CDMA) 310, and a global direct memory access module (GDMA) 311. The SRAM 308 serves as a high-performance data transfer station: data reused by different processor cores 306 within the same cluster 305 does not need to be fetched individually by each processor core 306 from the near data processing device 204, but is instead exchanged among the processor cores 306 through the SRAM 308. The memory core 307 only needs to quickly distribute the reused data from the SRAM 308 to the processor cores 306, which improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses.
Broadcast bus 309, CDMA 310, and GDMA 311 are used, respectively, for communication among the processor cores 306, communication among the clusters 305, and data transmission between a cluster 305 and the near data processing device 204. Each is described below.
The broadcast bus 309 is used to perform high-speed communication between the processor cores 306 in the cluster 305. The broadcast bus 309 of this embodiment supports inter-core communication modes including unicast, multicast and broadcast. Unicast refers to the transmission of data from point to point (i.e., single processor core to single processor core), multicast is a communication scheme that transfers a piece of data from SRAM 308 to a specific number of processor cores 306, and broadcast is a communication scheme that transfers a piece of data from SRAM 308 to all processor cores 306, a special case of multicast.
CDMA 310 is used to control access to the SRAM 308 between different clusters 305 within the same computing device 201. Fig. 5 shows a schematic diagram of one processor core writing data to a processor core of another cluster, to illustrate the operation of CDMA 310. In this application scenario, the same computing device includes a plurality of clusters; for convenience of illustration, only cluster 0 and cluster 1 are shown, each including a plurality of processor cores, of which only processor core 0 of cluster 0 and processor core 1 of cluster 1 are shown. Processor core 0 wants to write data to processor core 1.
First, processor core 0 sends a unicast write request to write the data into the local SRAM 0. CDMA 0 acts as the master and CDMA 1 acts as the slave. The master pushes the write request to the slave, that is, the master sends the write address AW and the write data W, and the data is transferred to SRAM 1 of cluster 1. The slave then returns a write response B as an acknowledgment. Finally, processor core 1 of cluster 1 sends a unicast read request to read the data from SRAM 1.
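The following Python sketch models this write-then-read sequence at a purely functional level; the class and function names are illustrative assumptions and do not correspond to the actual CDMA hardware interface.

```python
class Cluster:
    def __init__(self, name: str):
        self.name = name
        self.sram = {}          # models the cluster-local SRAM

def cdma_write(slave: Cluster, aw: int, w: bytes) -> str:
    """Master pushes write address AW and write data W into the slave cluster's SRAM."""
    slave.sram[aw] = w
    return "B"                  # slave returns write response B

cluster0, cluster1 = Cluster("cluster0"), Cluster("cluster1")
cluster0.sram[0x100] = b"payload"                       # core 0 first writes into local SRAM 0
resp = cdma_write(cluster1, 0x200, cluster0.sram[0x100])  # master side pushes the data to SRAM 1
assert resp == "B"
data = cluster1.sram[0x200]                             # core 1 then reads the data from SRAM 1
```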
Returning to FIG. 3, GDMA 311 cooperates with the external memory controller 301 to control access from the SRAM 308 of a cluster 305 to the near data processing device 204, or to read data from the near data processing device 204 into the SRAM 308. From the foregoing, it can be appreciated that communication between the near data processing device 204 and NRAM 431 or WRAM 432 can be achieved via two channels. The first channel connects the near data processing device 204 directly with NRAM 431 or WRAM 432 through the IODMA 433; the second channel transfers data between the near data processing device 204 and the SRAM 308 via the GDMA 311, and then between the SRAM 308 and NRAM 431 or WRAM 432 via the MVDMA 434. Although the second channel seemingly requires more elements to participate and the data path is longer, in practice the bandwidth of the second channel is much greater than that of the first channel in some embodiments, so communication between the near data processing device 204 and NRAM 431 or WRAM 432 may be more efficient through the second channel. Embodiments of the present invention may select the data transmission channel according to their own hardware conditions.
In other embodiments, the functionality of the GDMA 311 and the functionality of the IODMA 433 may be integrated in the same component. For convenience of description, the GDMA 311 and the IODMA 433 are regarded as different components; as long as the functions realized and the technical effects achieved are similar to those of the present invention, such variants fall within the protection scope of the present invention. Further, the functions of the GDMA 311, the IODMA 433, the CDMA 310, and the MVDMA 434 may be implemented by the same component.
For convenience of description, the hardware related to quantization operations shown in figs. 1 to 4 is summarized in fig. 6. The processing system, which includes the near data processing device 204 and the computing device 201, optimizes the parameters of the deep neural network during training: the near data processing device 204 stores and quantizes the raw data run on the deep neural network to generate quantized data, and the computing device 201 serves as an acceleration device that trains the deep neural network based on the quantized data so as to generate and quantize training results. The near data processing device 204 updates the parameters based on the quantized training results, and the computing device 201 then runs various types of data on the trained deep neural network with the updated parameters to obtain calculation results (prediction results).
As described above, the near data processing device 204 has not only storage capability but also basic operation capability. As shown in fig. 6, the near data processing device 204 includes a memory 601, a statistical quantizer (statistic quantization unit, SQU) 602, and an optimizer 603.
The memory 601 may be any suitable storage medium (including magnetic or magneto-optical storage media), such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high bandwidth memory (HBM), hybrid memory cube (HMC), ROM, RAM, and the like. The input data required to run the deep neural network are stored in the memory 601.
The statistical quantizer 602 is configured to quantize the input data, and fig. 7 shows a schematic diagram of the statistical quantizer 602 in this embodiment, where the statistical quantizer 602 includes a buffering element 701, a statistical element 702, and a filtering element 703.
The buffering element 701 temporarily stores a plurality of input data from the memory 601. When the deep neural network model is in the training phase, the input data here refers to raw data used for training, such as weights, biases, or other training parameters. After the deep neural network model has been trained, the input data refers to the training results, namely the updated weights, biases, or other parameters, from which the trained deep neural network model is obtained and then used for inference.
The buffering element 701 includes a plurality of buffer components; for convenience of description, a first buffer component and a second buffer component are taken as an example. The plurality of input data from the memory 601 are first buffered in sequence into the first buffer component. When the first buffer component is full, the buffering element 701 switches, and subsequent input data are buffered in sequence into the second buffer component. While input data are being buffered into the second buffer component, the filtering element 703 reads the already-buffered input data from the first buffer component. When the second buffer component is full, the buffering element 701 switches again, and subsequent input data are temporarily stored in the first buffer component, overwriting the input data originally stored there. Since the filtering element 703 has already read the input data previously stored in the first buffer component, overwriting them does not cause data access errors. Input data are written and read alternately between the first and second buffer components in this manner, which accelerates data access. In this embodiment, each buffer component is 4 KB in size; this size is only an example and may be planned according to the actual situation.
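The following Python sketch illustrates the ping-pong (double-buffering) idea described above in software form; the buffer size, container types, and method names are illustrative assumptions rather than the hardware design itself.

```python
from collections import deque

BUFFER_SIZE = 4 * 1024          # 4 KB per buffer component (example value from the text)

class PingPongBuffer:
    def __init__(self):
        self.buffers = [deque(), deque()]   # first and second buffer components
        self.write_idx = 0                  # component currently being filled

    def write(self, item):
        buf = self.buffers[self.write_idx]
        buf.append(item)
        if len(buf) >= BUFFER_SIZE:         # component full: switch the write target
            self.write_idx ^= 1

    def read_ready(self):
        """Drain the component that is NOT being written (read side, i.e. the filtering element)."""
        buf = self.buffers[self.write_idx ^ 1]
        while buf:
            yield buf.popleft()
```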
The statistical element 702 is configured to generate statistical parameters from the plurality of input data from the memory 601. This embodiment quantizes based on statistical quantization, which has been widely used in deep neural networks and requires the calculation of statistical parameters from the quantization history data. Several statistical quantization methods are described below.
The first statistical quantization method is disclosed in N. Wang, J. Choi, D. Brand, C. Chen, and K. Gopalakrishnan, "Training deep neural networks with 8-bit floating point numbers," in NeurIPS, 2018. This method quantizes the input data into FP8 intermediate data, and its required statistical parameter is the maximum absolute value of the input data x (max|x|).
A second statistical quantification method is disclosed in Y.Yang, S.Wu, L.Deng, T.Yan, Y.Xie, and G.Li, "Training high-performance and large-scale deep Neural Networks with full 8-bit integers," Neural Networks, 2020. This method can quantize the input data into intermediate data of INT8, whose required statistical parameter is the maximum value of the absolute value of the input data x (max|x|).
A third statistical quantization method is disclosed in X. Zhang, S. Liu, R. Zhang, C. Liu, D. Huang, S. Zhou, J. Guo, Y. Kang, Q. Guo, Z. Du et al., "Fixed-point back-propagation training," in CVPR, 2020. This method dynamically selects the data format, estimating the quantization error to choose between INT8 and INT16 as needed to cover different distributions, and quantizes the input data into INT8 or INT16 intermediate data. The required statistical parameters are the maximum absolute value of the input data x (max|x|) and the average distance between the input data x and the corresponding intermediate data x'.
A fourth statistical quantization method is disclosed in K. Zhong, T. Zhao, X. Ning, S. Zeng, K. Guo, Y. Wang, and H. Yang, "Towards lower bit multiplication for convolutional neural network training," arXiv preprint arXiv:2006.02804, 2020. This method uses a shiftable fixed-point data format: two data are encoded with different fixed-point ranges plus one additional bit, thereby covering the required representable range and resolution. It quantizes the input data into adjustable INT8 intermediate data, and the required statistical parameter is the maximum absolute value of the input data x (max|x|).
A fifth statistical quantization method is disclosed in Zhu, R. Gong, F. Yu, X. Liu, Y. Wang, Z. Li, X. Yang, and J. Yan, "Towards unified int8 training for convolutional neural network," arXiv preprint arXiv:1912.12607, 2019. This method clips the long-tail data among the input data in a minimal-precision-penalty manner and quantizes the input data into INT8 intermediate data. The required statistical parameters are the maximum absolute value of the input data x (max|x|) and the cosine distance between the input data x and the corresponding intermediate data x' (cos(x, x')).
In order to implement at least the statistical quantization methods disclosed in the foregoing documents, the statistical element 702 may be a processor or an ASIC logic circuit with basic operation capability, used to generate statistical parameters such as the maximum absolute value of the input data x (max|x|), the cosine distance between the input data x and the corresponding intermediate data x' (cos(x, x')), and the average distance between the input data x and the corresponding intermediate data x'.
As described above, performing a statistical quantization method requires global statistics over all input data before quantization in order to obtain the statistical parameters, and performing global statistics requires moving all of the input data, which consumes considerable hardware resources, so global statistics becomes a bottleneck in the training process. The statistical element 702 of this embodiment is placed directly at the memory 601 rather than in the computing device 201, so global statistics and quantization can be performed locally in the memory; this eliminates the need to transfer all input data from the memory 601 to the computing device 201 and greatly relieves the capacity and bandwidth pressure on the hardware.
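For illustration, the statistical parameters named above could be computed as in the following Python sketch, a software analogue of the statistical element; the function name and the interpretation of "average distance" as the mean absolute difference are assumptions.

```python
import numpy as np

def statistical_parameters(x: np.ndarray, x_quant: np.ndarray) -> dict:
    """Compute the statistics used by the quantization methods described above."""
    max_abs = float(np.max(np.abs(x)))                                  # max|x|
    cosine = float(np.dot(x, x_quant) /
                   (np.linalg.norm(x) * np.linalg.norm(x_quant)))       # cos(x, x')
    mean_dist = float(np.mean(np.abs(x - x_quant)))                     # average distance (assumed mean|x - x'|)
    return {"max_abs": max_abs, "cosine": cosine, "mean_distance": mean_dist}
```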
The filtering element 703 is configured to read the input data from the buffer components of the buffering element 701 one by one according to the statistical parameter, so as to generate output data, wherein the output data is a quantized result of the input data, that is, quantized data. As shown in fig. 7, the filtering element 703 includes a plurality of quantization components 704 and an error multiplexing component 705.
The quantization components 704 receive the input data from the buffer components of the buffering element 701 and quantize the input data (also called raw data) using different quantization formats. More specifically, the various statistical quantization methods described above can be sorted into several quantization operations, and each quantization component 704 performs a different quantization operation according to the statistical parameter max|x| to obtain different intermediate data; in other words, the quantization formats of the quantization components 704 implement the various statistical quantization methods. There are 4 quantization components 704, which means the above statistical quantization methods can be categorized into 4 quantization operations, each performed by one quantization component 704. In this embodiment, the quantization operations differ in the amount of clipping applied to the input data, i.e., each quantization format corresponds to a different clipping amount; for example, one quantization operation uses 95% of all input data, another uses 60%, and so on, as determined by the various statistical quantization methods above. Once a statistical quantization method is chosen, the quantization components 704 are adjusted accordingly.
Based on the chosen statistical quantization method, the filtering element 703 selectively runs one or more corresponding quantization components 704 to obtain quantized intermediate data; for example, the first statistical quantization method only needs one quantization component 704 to perform one quantization operation, whereas the second statistical quantization method needs all 4 quantization components 704 to perform 4 quantization operations. The quantization components 704 may perform their respective quantization operations in parallel, or the operations of the individual quantization components 704 may be performed one after another in time.
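The following Python sketch illustrates how several quantization formats differing only in their clipping ratio might produce different intermediate data from the same input; the percentile-based clipping and the specific ratios are assumptions for illustration.

```python
import numpy as np

def quantize_with_clipping(x: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Clip to the given fraction of |x|, then quantize symmetrically to INT8 and dequantize."""
    threshold = np.percentile(np.abs(x), keep_ratio * 100)   # e.g. keep 95% or 60% of |x|
    clipped = np.clip(x, -threshold, threshold)
    scale = threshold / 127.0
    q = np.round(clipped / scale).astype(np.int8)
    return q.astype(np.float32) * scale                      # intermediate data x'

x = np.random.randn(4096).astype(np.float32)
intermediates = [quantize_with_clipping(x, r) for r in (1.0, 0.95, 0.8, 0.6)]  # 4 formats
```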
The error multiplexing component 705 is configured to determine corresponding errors according to the intermediate data and the input data, and select one of the plurality of intermediate data as the output data, that is, determine quantized data according to the errors. The error multiplexing component 705 includes a plurality of error calculation units 706, a selection unit 707, a first multiplexing unit 708, and a second multiplexing unit 709.
The error calculation units 706 receive the input data, the intermediate data, and the statistical parameters, and calculate the error value between the input data and the intermediate data. In more detail, each error calculation unit 706 corresponds to one quantization component 704; the intermediate data generated by that quantization component 704 is output to the corresponding error calculation unit 706, which calculates the error value between the intermediate data generated by the quantization component 704 and the input data. This error value represents the difference between the quantized data generated by the quantization component 704 and the input data before quantization, and is evaluated against the statistical parameters from the statistical element 702, such as cos(x, x') or the average distance between x and x'. In addition to generating the error value, the error calculation unit 706 generates a tag that records the quantization format of the corresponding quantization component 704, i.e., the quantization format according to which the error value was generated.
The selecting unit 707 receives the error values from all the error calculation units 706, compares them, selects the smallest error value, and generates a control signal corresponding to the intermediate data having the smallest error value.
The first multiplexing unit 708 is configured to output the intermediate data with the smallest error value as the output data according to the control signal; in other words, the control signal controls the first multiplexing unit 708 to output, among the several quantization formats, the intermediate data with the smallest error as the output data, that is, the quantized data.
The second multiplexing unit 709 is configured to output, according to the control signal, the tag of the intermediate data with the smallest error value, that is, a record of the quantization format of the output data (quantized data).
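A software analogue of this selection logic is sketched below in Python; the function names are assumptions, and the cosine-distance error is only one of the metrics the chosen statistical quantization method might prescribe.

```python
import numpy as np

def cosine_error(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_best(x: np.ndarray, intermediates: list, tags: list):
    """Error multiplexing: keep the intermediate data with the smallest error, plus its format tag."""
    errors = [cosine_error(x, xq) for xq in intermediates]  # one error calculation unit per format
    best = int(np.argmin(errors))                           # control signal from the selecting unit
    return intermediates[best], tags[best]                  # outputs of the two multiplexing units

x = np.random.randn(4096).astype(np.float32)
candidates = [np.round(x * s) / s for s in (8, 16, 32, 64)]  # stand-ins for 4 quantization formats
output_data, tag = select_best(x, candidates, ["fmt0", "fmt1", "fmt2", "fmt3"])
```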
In fig. 6, arrows represent data flows. To distinguish the two kinds of data, unquantized data is represented by solid arrows and quantized data by dashed arrows: for example, the input data transmitted from the memory 601 to the statistical quantizer 602 is original unquantized data and is drawn with a solid arrow, while the output data of the statistical quantizer 602 is quantized data and is drawn with a dashed arrow. The data flow of the tag is omitted from the figure.
In summary, the near data processing device 204 performs quantization calculation and selection through the statistical quantizer 602 on the input data stored in the memory 601, and obtains the quantized data with the smallest error value as the output data, together with the tag recording the quantization format of that output data.
With continued reference to FIG. 6, the computing device 201 of this embodiment includes a direct memory access module, a cache controller 604, and a cache array. The external memory controller 301 is responsible for controlling data movement between the computing device 201 and the near data processing device 204, such as moving the output data and tags of the near data processing device 204 into the cache array of the computing device 201. The cache array includes NRAM 431 and WRAM 432.
Fig. 8 shows a schematic diagram of the cache controller 604 and a cache array 801. The cache controller 604 temporarily stores the output data and the tag sent by the external storage controller 301 and controls where they are stored in the cache array 801. The cache array 801 may be an existing or customized memory space and includes a plurality of cache elements physically organized as an array, each of which can be addressed by a row and a column of the array. The cache array 801 is controlled by a row selection element 802 and a column selection element 803: when the cache element in row i, column j of the cache array 801 needs to be accessed, the external memory controller 301 sends a row selection signal and a column selection signal to the row selection element 802 and the column selection element 803, respectively, which enable the cache array 801 accordingly, so that the quantization element 807 can read the data stored in the cache element in row i, column j, or write data into it. In this embodiment, since the quantization formats of individual quantized data are not necessarily the same, for convenience of storage and management, data in the same row of the cache array 801 must share the same quantization format, while different rows may store data in different quantization formats.
The buffer controller 604 includes a tag buffer 804, a quantized data buffer element 805, a priority buffer element 806, and a quantization element 807.
The tag buffer 804 stores a row tag that records the quantization format of a row of the cache array 801. As described above, each row of the cache array 801 stores data in a single quantization format, and the tag buffers 804 record the quantization format of each row. Specifically, the number of tag buffers 804 equals the number of rows of the cache array 801, and each tag buffer 804 corresponds to one row of the cache array 801, i.e., the i-th tag buffer 804 records the quantization format of the i-th row of the cache array 801.
The quantized data buffer element 805 includes a data buffer component 808 and a tag buffer component 809. The data buffer component 808 temporarily stores the quantized data sent from the external memory controller 301, and the tag buffer component 809 temporarily stores the tag sent from the external memory controller 301. When quantized data is to be stored into the cache element in row i, column j of the cache array 801, the external memory controller 301 sends a priority tag to the priority cache element 806; the priority tag indicates the specific quantization format in which this access should be processed. The external memory controller 301 also sends a row selection signal to the row selection element 802, and in response, the row selection element 802 fetches the row tag of row i and sends it to the priority cache element 806.
If the priority cache element 806 determines that the priority tag matches the row tag, indicating that this access is to be processed in the quantization format of row i, the quantization element 807 only has to ensure that the quantization format of the incoming quantized data matches the quantization format of row i.
If the priority tag is inconsistent with the row tag, the priority tag prevails, that is, the access is processed in the quantization format recorded by the priority tag. The quantization element 807 must then ensure that the quantization format of the incoming quantized data matches the format recorded by the priority tag, and the quantization format of the data already stored in row i must also be adjusted, so that the whole row ends up in the specific quantization format recorded by the priority tag.
In more detail, the priority cache element 806 first determines whether the tag of the incoming quantized data is identical to the priority tag. If they are identical, the quantization format of the quantized data to be stored already matches that of the priority tag, and the quantized data does not need to be adjusted. The priority cache element 806 then further determines whether the row tag is the same as the priority tag. If it is, the data already stored in row i needs no adjustment either; the row selection element 802 opens the channel of row i of the cache array 801, and the quantization element 807 of column j stores the quantized data into the cache element in row i, column j. If the row tag differs from the priority tag, the priority cache element 806 controls all the quantization elements 807 to convert the quantization format of the data already stored in row i into the quantization format of the priority tag; the row selection element 802 then opens the channel of row i of the cache array 801, and the quantization elements 807 store the format-converted data back into the cache elements of row i.
If the priority cache element 806 determines that the tag of the incoming quantized data differs from the priority tag, it further determines whether the row tag is the same as the priority tag. If the row tag matches the priority tag, the data already stored in row i needs no adjustment; only the quantized data from the external memory controller 301 needs format conversion, so the priority cache element 806 controls the quantization element 807 of column j to convert that data into the quantization format of the priority tag. The row selection element 802 opens the channel of row i of the cache array 801, and the quantization element 807 of column j stores the converted quantized data into the cache element in row i, column j. If the row tag also differs from the priority tag, the priority cache element 806 controls all the quantization elements 807 to convert the quantization format of the data in row i into the quantization format of the priority tag; the row selection element 802 opens the channel of row i of the cache array 801, and the quantization elements 807 store the format-converted quantized data into the cache elements of row i.
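The decision logic above can be summarized by the following Python sketch, a behavioural model with assumed names rather than the hardware implementation; `convert` stands for whatever format conversion the quantization elements apply.

```python
def store_quantized(cache_row, row_tag, data, data_tag, priority_tag, col, convert):
    """Store one quantized datum into a cache row while honouring the priority tag."""
    if row_tag != priority_tag:
        # the whole row must first be converted to the priority format
        for j, value in enumerate(cache_row):
            cache_row[j] = convert(value, row_tag, priority_tag)
        row_tag = priority_tag
    if data_tag != priority_tag:
        # the incoming datum is converted by the quantization element of its column
        data = convert(data, data_tag, priority_tag)
    cache_row[col] = data
    return cache_row, row_tag
```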
In this embodiment, the number and width of the quantization elements 807 match the length of the quantized data and the length of a row of the cache array 801. In more detail, the cache array 801 includes M×N cache elements, that is, M rows and N columns; if the length of the quantized data is fixed at S bits, the length of each cache element is also S bits, and the length of each row equals N×S. The cache array 801 accordingly has N columns, and the quantization elements 807 number N, one per column. Specifically, in this embodiment the cache array includes 8192×32 cache elements, that is, 8192 columns (column 0 to column 8191 in the figure) and 32 rows, with 32 corresponding quantization elements 807 (quantization element 0 to quantization element 31 in the figure); the length of the quantized data, the width of each quantization element 807, and the width of each cache element are all set to 8 bits, and the length of each column is 32×8 bits.
The buffer controller 604 stores the quantized data into the buffer memory element of the NRAM 431 or WRAM 432, and ensures that the quantized data has a quantization format consistent with the quantization format stored in the specific line of the NRAM 431 or WRAM 432.
Returning to fig. 6, the data stored in the buffer array (NRAM 431 and/or WRAM 432) are quantized, and when vector operation is required, the quantized data stored in NRAM 431 is fetched and output to the vector operation unit 421 in the operation module 42 for vector operation. When matrix multiplication and convolution operations need to be performed, the quantized data stored in the NRAM 431 and the weights stored in the WRAM 432 are fetched and output to the matrix operation unit 422 in the operation module 42 for matrix operation. The calculation result will be stored back in NRAM 431. In other embodiments, the computing device 201 may include a computation result buffer element, and the computation result generated by the computing module 42 is not stored back into the NRAM 431, but is stored into the computation result buffer element.
In the inference stage of the neural network, the calculation result is the predicted output. Since the calculation result is unquantized data, processing it directly would occupy too many resources, so it needs further quantization. The computing device 201 therefore further includes a statistical quantizer 605, which has the same structure as the statistical quantizer 602 and quantizes the calculation result to obtain a quantized calculation result. The quantized calculation result is transferred to the memory 601 for storage via the external storage controller 301.
During the training phase of the neural network, the calculation results are weight gradients that need to be transferred back to the near data processing device 204 to update the parameters. Although the gradients are also unquantized data, they must not be quantized: quantizing them would lose gradient information and make them unusable for updating the parameters. In this case, the external memory controller 301 takes the gradients directly from the NRAM 431 and sends them to the near data processing device 204.
Fig. 9 shows a more detailed schematic diagram of the near data processing device 204. The memory 601 includes a plurality of memory granules 901 and a parameter buffer 902. The memory granules 901 are the storage units of the memory 601 and store the parameters required to run the neural network; the parameter buffer 902 reads and buffers parameters from the memory granules 901, and whenever any device accesses the memory 601, data of the memory granules 901 must move through the parameter buffer 902. The parameters referred to here are values, such as weights and biases, that are continually updated as the neural network is trained in order to optimize the neural network model. The optimizer 603 reads the parameters from the parameter buffer 902 and updates them according to the training result (i.e., the gradient) sent from the external memory controller 301.
The near data processing device 204 further comprises a constant buffer 903 for storing constants related to the neural network, such as hyperparameters, which the optimizer 603 uses when updating parameters. A hyperparameter is generally a value set based on developer experience and is not automatically updated during training; the learning rate, decay rate, number of iterations, number of layers of the neural network, number of neurons per layer, and the like all belong to these constants. The optimizer 603 stores the updated parameters into the parameter buffer 902, and the parameter buffer 902 writes them back into the memory granules 901 to complete the parameter update.
The optimizer 603 may perform stochastic gradient descent (SGD). Stochastic gradient descent uses derivatives from calculus to find the descent direction of a function, or its lowest point (extreme point), based on the parameters, the learning rate among the constants, and the gradient. By continuously adjusting the weights with stochastic gradient descent, the value of the loss function becomes smaller and smaller, i.e., the prediction error shrinks. The formula of stochastic gradient descent is:

w_t = w_{t-1} - η × g

where w_{t-1} is the weight, η is the learning rate among the constants, g is the gradient, and w_t is the updated weight; the subscript t-1 refers to the current stage, and the subscript t refers to the next stage after one round of training, i.e., after one update.
The optimizer 603 may also perform the AdaGrad algorithm based on the parameters, the learning rate among the constants, and the gradient. The idea of AdaGrad is to adapt each parameter of the model independently: the learning rate of each parameter is scaled inversely proportional to the square root of the sum of that parameter's squared historical gradients, so parameters with larger accumulated gradients receive smaller effective learning rates and parameters with smaller accumulated gradients receive larger ones. The formulas are:

m_t = m_{t-1} + g^2
w_t = w_{t-1} - (η / √(m_t)) × g

where w_{t-1} and m_{t-1} are parameters, η is the learning rate among the constants, g is the gradient, and w_t and m_t are the updated parameters; the subscript t-1 refers to the current stage, and the subscript t refers to the next stage after one round of training, i.e., after one update.
The optimizer 603 may also perform the RMSProp algorithm based on the parameters, the learning rate among the constants, the decay rate among the constants, and the gradient. The RMSProp algorithm uses an exponentially decaying average to discard distant history, enabling rapid convergence once a locally convex structure is found, and introduces a hyperparameter (the decay rate) to control the rate of decay. The formulas are:

m_t = β × m_{t-1} + (1 - β) × g^2
w_t = w_{t-1} - (η / √(m_t)) × g

where w_{t-1} and m_{t-1} are parameters, η is the learning rate among the constants, β is the decay rate among the constants, g is the gradient, and w_t and m_t are the updated parameters; the subscript t-1 refers to the current stage, and the subscript t refers to the next stage after one round of training, i.e., after one update.
The optimizer 603 may also perform Adam's algorithm based on parameters, learning rate in constants, decay rate in constants, and gradients. The Adam algorithm further maintains the exponential decay average of the historical gradient in addition to the exponential decay average of the square of the historical gradient based on the RMSProp algorithm. The formula is as follows:
m_t = β_1 × m_{t-1} + (1 − β_1) × g

v_t = β_2 × v_{t-1} + (1 − β_2) × g²

m̂_t = m_t / (1 − β_1^t)

v̂_t = v_t / (1 − β_2^t)

w_t = w_{t-1} − η × m̂_t / (√(v̂_t) + ε)

where w_{t-1}, m_{t-1}, and v_{t-1} are parameters, η is the learning rate among the constants, β_1 and β_2 are the decay rates among the constants, g is the gradient, ε is a small constant that prevents division by zero, and w_t, m_t, and v_t are the updated parameters; the subscript t−1 refers to the current stage, and the subscript t refers to the next stage after undergoing one training, i.e., after one update, while the superscript t indicates that t trainings have been performed, so β^t denotes β raised to the power t; m̂_t and v̂_t are the decayed (bias-corrected) momenta of m_t and v_t.
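The full Adam update, including the bias correction of both momenta, can be sketched as follows; the default values of beta1, beta2, and eps are common conventions rather than values taken from this document:

```python
import numpy as np

def adam_update(w, m, v, g, t, eta, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: first and second momenta with bias correction."""
    m = beta1 * m + (1 - beta1) * g          # first momentum
    v = beta2 * v + (1 - beta2) * g ** 2     # second momentum
    m_hat = m / (1 - beta1 ** t)             # bias-corrected momenta
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = adam_update(np.zeros(3), np.zeros(3), np.zeros(3),
                      np.array([0.1, -0.2, 0.3]), t=1, eta=0.001)
```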
Fig. 10 shows a schematic diagram of the optimizer 603. The optimizer 603 implements all of the algorithms above using simple addition circuits, subtraction circuits, multiplication circuits, and multiplexers. Summarizing these algorithms, the optimizer 603 needs to implement the following operations:
m_t = c_1 × m_{t-1} + c_2 × g

v_t = c_3 × v_{t-1} + c_4 × g²

t_1 = m_t or g

t_2 = 1 / (√(v_t) + ε) or 1

w_t = w_{t-1} − c_5 × t_1 × t_2
That is, any of the foregoing algorithms can update the parameters through these operations; only the constant configuration differs from algorithm to algorithm. Taking the Adam algorithm as an example, the constants are configured as follows:
c_1 = β_1

c_2 = 1 − β_1

c_3 = β_2

c_4 = 1 − β_2

c_5 = η × √(1 − β_2^t) / (1 − β_1^t)

s_1 = s_2 = 1
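A software analogue of the generalized datapath of Fig. 10 may look as follows; the exact form of t_2 and the role of s_1/s_2 in the hardware are not fully specified here, so the use_scaling selector and the constant mappings in the comments are assumptions:

```python
import numpy as np

def unified_step(w, m, v, g, c1, c2, c3, c4, c5,
                 use_momentum, use_scaling, eps=1e-8):
    """One pass through the generalized optimizer datapath (assumed mapping)."""
    m = c1 * m + c2 * g                        # m_t = c1*m_{t-1} + c2*g
    v = c3 * v + c4 * g ** 2                   # v_t = c3*v_{t-1} + c4*g^2
    t1 = m if use_momentum else g              # t1 = m_t or g
    t2 = 1.0 / (np.sqrt(v) + eps) if use_scaling else 1.0
    w = w - c5 * t1 * t2                       # w_t = w_{t-1} - c5*t1*t2
    return w, m, v

# Assumed constant configurations (t = number of updates performed, for Adam):
# SGD:     c1=c2=c3=c4=0, c5=eta,                use_momentum=False, use_scaling=False
# AdaGrad: c1=c2=0, c3=1, c4=1, c5=eta,          use_momentum=False, use_scaling=True
# RMSProp: c1=c2=0, c3=beta, c4=1-beta, c5=eta,  use_momentum=False, use_scaling=True
# Adam:    c1=beta1, c2=1-beta1, c3=beta2, c4=1-beta2,
#          c5=eta*np.sqrt(1-beta2**t)/(1-beta1**t), use_momentum=True, use_scaling=True
w, m, v = unified_step(np.zeros(3), np.zeros(3), np.zeros(3),
                       np.array([0.1, -0.2, 0.3]),
                       c1=0.9, c2=0.1, c3=0.999, c4=0.001, c5=0.001,
                       use_momentum=True, use_scaling=True)
```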
the optimizer 603 updates the parameters 1003 to parameters 1004 based on the gradient 1001 and the constant 1002, and stores the parameters 1004 in the parameter buffer 902.
In each training iteration, the parameters are fetched from the memory 601, quantized by the statistical quantizer 602, and stored in the WRAM 432 under the control of the buffer controller 604; the operation module 42 then performs the forward-propagation and back-propagation computations to generate gradients, which are sent to the optimizer 603, and the algorithms described above are performed to update the parameters. After one or more epochs of training, the parameters are fully tuned, so that the deep neural network model is mature and can be used for prediction. In the inference stage, the neuron data (e.g., image data) and the trained weights are fetched from the memory 601, quantized by the statistical quantizer 602, stored in the NRAM 431 and the WRAM 432 respectively under the control of the buffer controller 604, and computed by the operation module 42; the computation results are quantized by the statistical quantizer, and the quantized results (i.e., the prediction results) are finally stored in the memory 601 to complete the prediction task of the neural network model.
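Purely as an illustration of this dataflow, not of the hardware itself, one training iteration can be mimicked in software: parameters are quantized before the forward/backward pass, while the update is applied to the full-precision copy kept at the memory side. All names below are hypothetical stand-ins for the hardware units:

```python
import numpy as np

def fake_quantize(x, n_bits=8):
    """Stand-in for the statistical quantizer 602: symmetric fixed-point quantization."""
    scale = (2 ** (n_bits - 1) - 1) / max(float(np.max(np.abs(x))), 1e-12)
    return np.round(x * scale) / scale

def train_step(params, grad_fn, eta=0.01):
    q_params = fake_quantize(params)   # quantized parameters loaded into WRAM
    grad = grad_fn(q_params)           # forward + backward pass (operation module 42)
    return params - eta * grad         # high-precision update at the memory side (optimizer 603)

params = np.random.randn(16)
params = train_step(params, lambda p: 2 * p)   # toy gradient function
```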
The above embodiments provide a completely new hybrid architecture comprising an acceleration device and a near data processing device, in which statistical analysis and quantization are performed at the memory side based on a hardware-friendly quantization technique (HQT). Owing to the statistical quantizer 602 and the buffer controller 604, this embodiment achieves dynamically statistical quantization, reduces unnecessary data accesses, and attains the technical effect of high-precision parameter updating, so that the neural network model becomes both more accurate and more lightweight. Furthermore, because the near data processing device quantizes data at the memory side, errors caused by quantizing long-tail-distributed data can be suppressed directly.
Another embodiment of the present invention is a method of quantizing raw data, and fig. 11 shows a flowchart of performing the method using the statistical quantizer of fig. 7.
In step 1101, the raw data is quantized based on different quantization formats to obtain corresponding intermediate data. The quantization components 704 receive input data (i.e., the raw data) from the buffer components of the buffer element 701, and each quantization component 704 performs a different quantization operation according to the statistical parameter to obtain different intermediate data. The statistical parameter may be at least one of the maximum of the absolute values of the raw data, the cosine distance between the raw data and the corresponding intermediate data, and the vector distance between the raw data and the corresponding intermediate data.
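For intuition, a software sketch of step 1101 under the assumption of symmetric fixed-point formats driven by the max-abs statistic; the bit widths and the name quantize_fixed are illustrative, not taken from this embodiment:

```python
import numpy as np

def quantize_fixed(x, n_bits):
    """Symmetric fixed-point quantization driven by the max-abs statistic."""
    max_abs = max(float(np.max(np.abs(x))), 1e-12)
    scale = (2 ** (n_bits - 1) - 1) / max_abs
    return np.round(x * scale) / scale     # de-quantized intermediate data

# Each hypothetical quantization component tries a different format.
raw = np.random.randn(1024).astype(np.float32)
intermediates = {f"int{b}": quantize_fixed(raw, b) for b in (4, 8, 16)}
```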
In step 1102, the error between the intermediate data and the raw data is calculated. The error calculation units 706 receive the input data, the intermediate data, and the statistical parameters, and calculate the error value between the input data and the intermediate data. More specifically, each error calculation unit 706 corresponds to one quantization component 704: the intermediate data generated by that quantization component 704 is output to the corresponding error calculation unit 706, which calculates the error value between this intermediate data and the input data. The error value represents the difference between the quantized data generated by the quantization component 704 and the input data before quantization, obtained by comparison against the statistical parameter from the statistical component 702, such as the cosine distance cos(x, x′) or the vector distance between x and x′. In addition to generating the error value, the error calculation unit 706 generates a tag that records the quantization format of the corresponding quantization component 704, i.e., the quantization format according to which the error value was generated.
In step 1103, the intermediate data with the minimum error is identified. The selection unit 707 receives the error values from all of the error calculation units 706, compares them, identifies the smallest error value, and generates a control signal corresponding to the intermediate data having that smallest error value.
In step 1104, the intermediate data with the minimum error is output as the quantized data. The first multiplexing unit 708 outputs, according to the control signal, the intermediate data with the smallest error value as the output data; in other words, the control signal controls the first multiplexing unit 708 to output, among the intermediate data in the several quantization formats, the one with the smallest error as the output data, i.e., the quantized data. The second multiplexing unit 709 outputs, according to the control signal, the tag of the intermediate data with the minimum error value, i.e., the tag recording the quantization format of the output data (quantized data).
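A compact software sketch of steps 1102 to 1104 under the assumption that the cosine distance serves as the error metric; the candidate formats and all function names are illustrative:

```python
import numpy as np

def cosine_distance(x, x_q):
    """1 - cosine similarity, used here as the error metric."""
    denom = float(np.linalg.norm(x) * np.linalg.norm(x_q)) + 1e-12
    return 1.0 - float(np.dot(x, x_q)) / denom

def select_best_format(x, intermediates):
    """Score each candidate, pick the minimum error, and return data with its format tag."""
    errors = {tag: cosine_distance(x, x_q) for tag, x_q in intermediates.items()}
    best_tag = min(errors, key=errors.get)        # step 1103: identify minimum error
    return intermediates[best_tag], best_tag      # step 1104: output data + format tag

raw = np.random.randn(256)
scales = {"int8": 127 / np.max(np.abs(raw)), "int4": 7 / np.max(np.abs(raw))}
intermediates = {tag: np.round(raw * s) / s for tag, s in scales.items()}
quantized, tag = select_best_format(raw, intermediates)
```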
Another embodiment of the invention is a computer readable storage medium having stored thereon computer program code for quantizing raw data, which when executed by a processing device, performs a method as shown in fig. 11. According to different application scenarios, the electronic device or apparatus of the present invention may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a PC device, an internet of things terminal, a mobile phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an aircraft, a ship and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus. The electronic device or apparatus of the present invention may also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical, etc. Furthermore, the electronic equipment or the electronic device can be used in cloud end, edge end, terminal and other application scenes related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, the high-power electronic device or apparatus according to the present invention may be applied to a cloud device (e.g., a cloud server), and the low-power electronic device or apparatus may be applied to a terminal device and/or an edge device (e.g., a smart phone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, so as to complete unified management, scheduling and collaborative work of an end cloud entity or an edge cloud entity.
It should be noted that, for the sake of simplicity, the present invention represents some methods and embodiments thereof as a series of acts and combinations thereof, but it will be understood by those skilled in the art that the aspects of the present invention are not limited by the order of acts described. Thus, those skilled in the art will appreciate, in light of the present disclosure or teachings, that certain steps thereof may be performed in other sequences or concurrently. Further, those skilled in the art will appreciate that the embodiments described herein may be considered as alternative embodiments, i.e., wherein the acts or modules involved are not necessarily required for the implementation of some or all aspects of the present invention. In addition, the description of some embodiments of the present invention is also focused on according to the different schemes. In view of this, those skilled in the art will appreciate that portions of one embodiment of the invention that are not described in detail may be referred to in connection with other embodiments.
In particular implementations, based on the disclosure and teachings of the present invention, those skilled in the art will appreciate that several embodiments of the present disclosure may be implemented in other ways not disclosed herein. For example, in terms of the foregoing embodiments of the electronic device or apparatus, the units are split in consideration of the logic function, and there may be another splitting manner when actually implemented. For another example, multiple units or components may be combined or integrated into another system, or some features or functions in the units or components may be selectively disabled. In terms of the connection relationship between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between the units or components. In some scenarios, the foregoing direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustical, magnetic, or other forms of signal transmission.
In the present invention, units described as separate parts may or may not be physically separated, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, some or all of the units may be selected to achieve the purposes of the solution according to the embodiments of the present invention. In addition, in some scenarios, multiple units in embodiments of the invention may be integrated into one unit or each unit may physically reside separately.
In other implementation scenarios, the integrated units may also be implemented in hardware, i.e. as specific hardware circuits, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, devices such as transistors or memristors. In view of this, various types of devices described herein (e.g., computing devices or other processing devices) may be implemented by appropriate hardware processors, such as central processing units, GPU, FPGA, DSP, ASICs, and the like.
The foregoing has outlined rather broadly the more detailed description of embodiments of the invention, wherein the principles and embodiments of the invention are explained in detail using specific examples, the above examples being provided solely to facilitate the understanding of the method and core concepts of the invention; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in accordance with the ideas of the present invention, the present description should not be construed as limiting the present invention in view of the above.

Claims (20)

1. A processing system for optimizing parameters of a deep neural network, comprising:
the near data processing device is used for storing and quantizing the original data running on the deep neural network so as to generate quantized data; and
acceleration means for training the deep neural network based on the quantized data to generate and quantize a training result;
the near data processing device updates the parameters based on the quantized training result, and the deep neural network performs inference on image data based on the updated parameters; the near data processing device and the acceleration device each comprise a statistical quantizer; the statistical quantizer comprises: a buffer element for temporarily storing a plurality of input data, wherein the plurality of input data are the original data or the training result; a statistical element for generating a statistical parameter from the plurality of input data; and a screening element for reading the plurality of input data one by one from the buffer element according to the statistical parameter to generate output data, wherein the output data is the quantized data or the quantized training result; the screening element is further configured to generate a tag for recording the quantization format of the output data, wherein the tag comprises a row tag for recording the quantization format of a row of the cache array;
wherein the acceleration device comprises: a cache array, wherein a row of the cache array stores data in the same quantization format; a direct memory access for controlling the output data and the tag to be stored into the cache array; and a buffer controller comprising a quantized data buffer element for temporarily storing the output data and the tag sent by the direct memory access; wherein the buffer controller further comprises: a priority cache element for temporarily storing a priority tag of the output data to be stored into the cache array and for judging whether the row tag is the same as the priority tag, wherein the priority tag indicates that the access should be processed based on a specific quantization format; and a quantization element for adjusting the quantization format of the output data to the quantization format recorded by the priority tag in response to the priority tag being different from the row tag.
2. The processing system of claim 1, wherein the buffer element comprises a first buffer component and a second buffer component, the plurality of input data is buffered sequentially into the first buffer component, and when the space of the first buffer component is full, buffering switches to the second buffer component sequentially.
3. The processing system of claim 2, wherein the screening element reads the plurality of input data from the first buffer component while the plurality of input data is sequentially buffered into the second buffer component.
4. The processing system of claim 1, wherein the screening element comprises:
a plurality of quantization components for quantizing the original data based on different quantization formats to obtain corresponding intermediate data; and
an error multiplexing component for determining a corresponding error according to the intermediate data and the original data, and for determining the quantized data from the intermediate data according to the error.
5. The processing system of claim 4, wherein the plurality of quantization components time-share the different quantization formats.
6. The processing system of claim 4, wherein the statistical parameter is at least one of a maximum value of an absolute value of the input data, a cosine distance of the input data from corresponding intermediate data, and a vector distance of the input data from corresponding intermediate data.
7. The processing system of claim 4, wherein the error multiplexing component comprises:
an error calculation unit configured to calculate the error;
A selection unit for generating a control signal, wherein the control signal corresponds to intermediate data of a minimum error value; and
and the multiplexing unit is used for outputting intermediate data with the minimum error value as the output data according to the control signal.
8. The processing system of claim 1, wherein the quantization element stores the adjusted output data to a particular row.
9. The processing system of claim 1, wherein the cache array comprises M x N cache elements, the cache elements being S bits in length.
10. The processing system of claim 9, wherein the buffer controller comprises N quantization elements.
11. The processing system of claim 1, wherein the cache controller further comprises a tag cache to store the row tags.
12. The processing system of claim 1, wherein the near data processing device comprises a plurality of memory granules, a parameter buffer, and an optimizer, the plurality of memory granules being used to store the parameters; the parameter buffer is used for reading and caching the parameters from the memory granules; the optimizer is used for reading the parameters from the parameter buffer and updating the parameters according to a gradient; the optimizer stores the updated parameters in the parameter buffer, and the parameter buffer stores the updated parameters back into the memory granules.
13. The processing system of claim 12, wherein the training results comprise the gradient.
14. The processing system of claim 13, wherein the near data processing device further comprises a constant buffer for storing constants, and the optimizer updates the parameters according to the constants.
15. The processing system of claim 14, wherein the optimizer performs a stochastic gradient descent method according to the parameters, the learning rate in the constants, and the gradient to update the parameters.
16. The processing system of claim 14, wherein the optimizer performs an AdaGrad algorithm according to the parameters, the learning rate in the constants, and the gradient to update the parameters.
17. The processing system of claim 14, wherein the optimizer performs an RMSProp algorithm according to the parameters, the learning rate in the constants, the decay rate in the constants, and the gradient to update the parameters.
18. The processing system of claim 14, wherein the optimizer performs an Adam algorithm according to the parameters, the learning rate in the constants, the decay rates in the constants, and the gradient to update the parameters.
19. An integrated circuit device comprising a processing system according to any of claims 1 to 18.
20. A board card comprising the integrated circuit device of claim 19.
CN202110639078.8A 2021-06-08 2021-06-08 Processing system, integrated circuit and board for optimizing parameters of deep neural network Active CN113238988B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110639078.8A CN113238988B (en) 2021-06-08 2021-06-08 Processing system, integrated circuit and board for optimizing parameters of deep neural network
PCT/CN2022/097372 WO2022257920A1 (en) 2021-06-08 2022-06-07 Processing system, integrated circuit, and printed circuit board for optimizing parameters of deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110639078.8A CN113238988B (en) 2021-06-08 2021-06-08 Processing system, integrated circuit and board for optimizing parameters of deep neural network

Publications (2)

Publication Number Publication Date
CN113238988A CN113238988A (en) 2021-08-10
CN113238988B true CN113238988B (en) 2023-05-30

Family

ID=77137185

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110639078.8A Active CN113238988B (en) 2021-06-08 2021-06-08 Processing system, integrated circuit and board for optimizing parameters of deep neural network

Country Status (1)

Country Link
CN (1) CN113238988B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022257920A1 (en) * 2021-06-08 2022-12-15 中科寒武纪科技股份有限公司 Processing system, integrated circuit, and printed circuit board for optimizing parameters of deep neural network

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019220755A1 (en) * 2018-05-14 2019-11-21 ソニー株式会社 Information processing device and information processing method
US20200202213A1 (en) * 2018-12-19 2020-06-25 Microsoft Technology Licensing, Llc Scaled learning for training dnn
US20200364552A1 (en) * 2019-05-13 2020-11-19 Baidu Usa Llc Quantization method of improving the model inference accuracy
CN112085182A (en) * 2019-06-12 2020-12-15 安徽寒武纪信息科技有限公司 Data processing method, data processing device, computer equipment and storage medium
GB2581546B (en) * 2019-08-22 2021-03-31 Imagination Tech Ltd Methods and systems for converting weights of a deep neural network from a first number format to a second number format
JP7419711B2 (en) * 2019-09-09 2024-01-23 株式会社ソシオネクスト Quantization parameter optimization method and quantization parameter optimization device
KR20210043295A (en) * 2019-10-11 2021-04-21 삼성전자주식회사 Method and apparatus for quantizing data of neural network
CN110889503B (en) * 2019-11-26 2021-05-04 中科寒武纪科技股份有限公司 Data processing method, data processing device, computer equipment and storage medium
CN111178518A (en) * 2019-12-24 2020-05-19 杭州电子科技大学 Software and hardware cooperative acceleration method based on FPGA
CN112766484A (en) * 2020-12-30 2021-05-07 上海熠知电子科技有限公司 Floating point neural network model quantization system and method

Also Published As

Publication number Publication date
CN113238988A (en) 2021-08-10

Similar Documents

Publication Publication Date Title
CN110298443B (en) Neural network operation device and method
CN111652368B (en) Data processing method and related product
CN113238989A (en) Apparatus, method and computer-readable storage medium for quantizing data
CN109478144A (en) A kind of data processing equipment and method
CN113238987B (en) Statistic quantizer, storage device, processing device and board card for quantized data
CN111027691B (en) Device, equipment and board card for neural network operation and training
WO2022111002A1 (en) Method and apparatus for training neural network, and computer readable storage medium
CN113238988B (en) Processing system, integrated circuit and board for optimizing parameters of deep neural network
CN111930681B (en) Computing device and related product
CN113850362A (en) Model distillation method and related equipment
CN113238976B (en) Cache controller, integrated circuit device and board card
US20200242455A1 (en) Neural network computation device and method
EP3444758B1 (en) Discrete data representation-supporting apparatus and method for back-training of artificial neural network
CN101795408B (en) Dual stage intra-prediction video encoding system and method
CN113238975A (en) Memory, integrated circuit and board card for optimizing parameters of deep neural network
US11709783B1 (en) Tensor data distribution using grid direct-memory access (DMA) controller
WO2022257920A1 (en) Processing system, integrated circuit, and printed circuit board for optimizing parameters of deep neural network
US11704562B1 (en) Architecture for virtual instructions
CN112766475B (en) Processing component and artificial intelligence processor
CN116644783A (en) Model training method, object processing method and device, electronic equipment and medium
CN114580625A (en) Method, apparatus, and computer-readable storage medium for training neural network
CN113469328B (en) Device, board, method and readable storage medium for executing revolution passing
CN113469327B (en) Integrated circuit device for performing rotation number advance
Chang et al. Attention-Based Deep Reinforcement Learning for Edge User Allocation
CN117077740B (en) Model quantization method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant