CN117519802B - Data processing device based on integrated memory and calculation unit - Google Patents

Data processing device based on integrated memory and calculation unit

Info

Publication number
CN117519802B
CN117519802B (application CN202410025725.XA)
Authority
CN
China
Prior art keywords
data
unit
calculation
memory
integrated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410025725.XA
Other languages
Chinese (zh)
Other versions
CN117519802A (en)
Inventor
顾子熙
时拓
陈美文
刘津畅
卢建
唐双柱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202410025725.XA priority Critical patent/CN117519802B/en
Publication of CN117519802A publication Critical patent/CN117519802A/en
Application granted granted Critical
Publication of CN117519802B publication Critical patent/CN117519802B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3818 Decoding for concurrent execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52 Multiplying; Dividing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Analysis (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Complex Calculations (AREA)

Abstract

The application provides a data processing device based on an integrated storage and computation (compute-in-memory) unit, comprising a control unit, a data memory, a preload unit, an input unit, an integrated storage and computation processing unit, and an output unit. The control unit controls the global and local modules of the device; the data memory stores initial data and result data; the preload unit preloads the data to be computed; the input unit is connected with the data memory, the preload unit, and the integrated storage and computation processing unit, and loads and outputs data during read, write, and compute operations; the integrated storage and computation processing unit is composed of several integrated storage and computation units and is connected with the output unit, which outputs the results. By building the structural units required for complete computation data flows and control flows into the device, matrix computation under different computation structures is realized, the data flow in convolution computation mode is optimized, and memory access overhead is reduced.

Description

Data processing device based on integrated memory and calculation unit
Technical Field
The present application relates to the field of data computation, and in particular to a data processing apparatus based on an integrated storage and computation unit.
Background
Mainstream computers are still built on the von Neumann architecture, in which storage and computation are separated. Because of this separation, every computation requires fetching the two operands from memory into the computing unit. When a large number of matrix computations are performed, the data of the two matrices must be moved out of memory continuously, and the power consumed by this data movement is considerable. Furthermore, neural network computation involves not only the movement of a large amount of matrix data but also extensive data reuse; if the data must still be moved from memory each time a computation is needed, the resulting losses in computing efficiency and energy consumption are not negligible.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a data processing apparatus based on an integrated storage and computation unit.
In a first aspect, the present application provides a data processing device based on an integrated storage and computation unit, comprising:
a preload unit for performing a preload operation on data;
an input unit, connected with the preload unit, for performing an input operation on the data;
an integrated storage and computation processing unit, connected with the input unit, for performing a computation operation on the input data to obtain an output result;
an output unit, connected with the integrated storage and computation processing unit, for outputting the output result;
a data memory, connected with the preload unit and the output unit, for providing data to the preload unit and storing the output result;
and a control unit, connected with the preload unit, the input unit, the integrated storage and computation processing unit, the output unit, and the data memory, for controlling the processing of the data by sending different control signals according to the parameter configuration stored in a configuration register.
In one embodiment, the integrated storage and computation unit comprises:
an integrated storage and computation control module for converting the input data and feeding the data processing result back to the control unit;
an integrated storage and computation module for storing the converted result;
and a shift accumulator for shifting the data input over at least two cycles and accumulating it to obtain the computation result.
In one embodiment, the integrated storage and computation module comprises:
a power gating unit for selecting the power category in different operating modes and outputting power;
a row decoding unit for generating the row strobe signals of the integrated storage and computation array in combination with the row channel number;
a column decoding unit for generating the column strobe signals of the integrated storage and computation array in combination with the column channel number;
and an integrated storage and computation array formed by integrated storage and computation devices arranged in a crossbar.
In one embodiment, within the integrated storage and computation module:
the row decoding unit comprises a row decoding circuit, a WL gating circuit, and a BL gating circuit;
the column decoding unit comprises a column decoding circuit and an SL gating circuit.
In one embodiment, the processing comprises data flow control and memory access mode control;
the data flow control comprises a conventional matrix computation mode and a convolution computation mode, and the memory access mode control comprises a direct mode and a preload mode.
In one embodiment, the control unit is configured such that:
for the conventional matrix computation mode, matrix multiplications of different sizes are realized by configuring the data computation structure and the data channel parameters, and the structure is reconstructed after each single computation completes;
for the convolution computation mode, the data reuse mode is additionally configured by parameters, and the reusable portion of the data within the convolution sliding window is gated and loaded through the shift registers.
In one embodiment, the control unit is configured such that:
for the direct mode, the data is read out from the data memory and loaded directly into the input unit;
for the preload mode, the data is read out from the data memory, loaded into the preload unit, and flows into the input unit under the control of the control unit.
In one embodiment, the preload unit is configured such that:
in the process of loading data from the data memory, the data is split into several pieces of single-data bit width, and the control unit generates the strobe signals of the channels between the data memory and the register groups, together with the data enable signals, to complete the preloading.
In one embodiment, the input unit comprises:
a computation input channel for inputting data during computation, in which data of single-data bit width is input into the integrated storage and computation unit over several cycles through the shift registers for computation;
and a read-write input channel for memory access to the integrated storage and computation array in the integrated storage and computation unit.
In one embodiment, the output unit comprises:
a computation output channel for post-processing the computation result value and then outputting it;
and a read-out channel for reading out the original data stored at the corresponding address of the integrated storage and computation array in the integrated storage and computation unit.
The data processing device based on an integrated storage and computation unit described above uses the compute-in-memory property to address the problems caused by the von Neumann architecture: it reduces the amount of data movement in the computation process and the energy that movement consumes, thereby improving computing efficiency. Moreover, the array structure of the integrated storage and computation devices greatly increases the parallelism of computation, further improving efficiency in matrix computation scenarios. The device realizes read, write, and matrix computation functions on top of several integrated storage and computation units, accelerates matrix multiplication, and optimizes part of the data flow so that the number of memory accesses during convolution computation is reduced; it is highly general and can be applied flexibly to accelerate different neural networks.
Drawings
FIG. 1 is a schematic diagram of the data processing apparatus based on an integrated storage and computation unit;
FIG. 2 is a schematic diagram of the row/column decoding units of the data processing apparatus;
FIG. 3 is a schematic diagram of the decoding circuit of the data processing apparatus;
FIG. 4 is an example of the decoding circuit of the data processing apparatus;
FIG. 5 is a schematic diagram of the integrated storage and computation array;
FIG. 6 is an example of computation with the integrated storage and computation array;
FIG. 7 is a schematic diagram of the preload unit;
FIG. 8 is a schematic diagram of the input unit of the data processing apparatus;
FIG. 9 is a schematic diagram of a shift register group of the data processing apparatus;
FIG. 10 is an example of the shift register group performing convolution calculations;
FIG. 11 is an example of the output unit of the data processing apparatus.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In one embodiment, as shown in fig. 1, a data processing device based on an integrated storage and computation unit is provided, comprising:
a preload unit for performing a preload operation on data;
an input unit, connected with the preload unit, for performing an input operation on the data;
an integrated storage and computation processing unit, connected with the input unit, for performing a computation operation on the input data to obtain an output result, the integrated storage and computation processing unit comprising a preset number of integrated storage and computation units;
an output unit, connected with the integrated storage and computation processing unit, for outputting the output result;
a data memory, connected with the preload unit and the output unit, for providing data to the preload unit and storing the output result;
and a control unit, connected with the preload unit, the input unit, the integrated storage and computation processing unit, the output unit, and the data memory, for controlling the processing of the data by sending different control signals according to the parameter configuration stored in a configuration register.
In implementation, by adopting the compute-in-memory approach, the data processing device of this embodiment stores part of the data required for matrix computation directly in the integrated storage and computation unit; since that data no longer needs to be moved during subsequent computation, a large amount of data-movement time and power consumption is saved.
In convolution computation in a convolutional neural network, much of the data can be reused as the convolution kernel slides over the feature map. Retaining this data for reuse further reduces the number of data transfers; the overall data flow can then be optimized so that data need not be repeatedly written to and read from memory, improving computing efficiency. Compared with a separate storage and computation mode, compute-in-memory therefore eliminates a large amount of data access time and the power consumed by those accesses.
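To make the reuse concrete, here is a minimal Python sketch, purely illustrative and not taken from the patent (the window function and sizes are assumptions), that counts how many feature-map values two consecutive stride-1 positions of a 3x3 kernel share:

```python
# Hypothetical illustration (not from the patent): count the feature-map
# values shared by two consecutive stride-1 positions of a K x K kernel.
K, STRIDE = 3, 1

def window(row, col, k):
    """Coordinates covered by a k x k window with top-left corner (row, col)."""
    return {(r, c) for r in range(row, row + k) for c in range(col, col + k)}

prev = window(0, 0, K)
curr = window(0, STRIDE, K)   # kernel slid one step to the right
print(f"reused {len(prev & curr)} of {K * K} values")  # reused 6 of 9 values
```

Six of the nine values in the new window are already on chip, so only three new values need to be fetched per step.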
The data processing device based on an integrated storage and computation unit provided by this embodiment thus adopts the compute-in-memory mode of computation and stores part of the data needed for matrix computation inside the integrated storage and computation unit, so that this data need not be moved during subsequent computation, saving at least half of the movement time and power consumption.
In one embodiment, the integrated storage and computation unit comprises:
an integrated storage and computation control module for converting the input data and feeding the data processing result back to the control unit;
an integrated storage and computation module for storing the converted result;
and a shift accumulator for shifting the data input over at least two cycles and accumulating it to obtain the computation result.
In implementation, the integrated storage and computation control module converts the data and control signals entering the unit, passes them on to the integrated storage and computation module and the shift accumulator, and feeds signals back to the control unit indicating whether computation and value writing have completed. The integrated storage and computation module implements the compute-in-memory data processing and storage functions. The shift accumulator shifts the input values received over several cycles and then adds or subtracts them to obtain the actual computation result value.
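The following minimal Python sketch is an assumed model of the shift accumulator's principle, not the patent's circuit: the array returns one partial dot product per input bit plane, and the accumulator shifts each partial by its bit position before adding.

```python
# Assumed model of the shift accumulator (not the patent's circuit): inputs
# arrive bit-serially, the array yields one partial dot product per bit plane,
# and the accumulator shifts each partial by its bit position before adding.
def shift_accumulate(inputs, weights, bit_width=4):
    acc = 0
    for t in range(bit_width):                  # one cycle per bit plane
        bits = [(x >> t) & 1 for x in inputs]   # bit t of every input value
        partial = sum(b * w for b, w in zip(bits, weights))  # array output
        acc += partial << t                     # shift by bit position, add
    return acc

inputs, weights = [1, 2, 3], [5, 6, 7]
assert shift_accumulate(inputs, weights) == sum(x * w for x, w in zip(inputs, weights))
print(shift_accumulate(inputs, weights))        # 38
```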
In one embodiment, the integrated storage and computation module comprises:
a power gating unit for selecting the power category in different operating modes and outputting power;
a row decoding unit for generating the row strobe signals of the integrated storage and computation array in combination with the row channel number;
a column decoding unit for generating the column strobe signals of the integrated storage and computation array in combination with the column channel number;
and an integrated storage and computation array formed by integrated storage and computation devices arranged in a crossbar.
In implementation, the power gating unit selects the power supply for the various operating modes and outputs it to the corresponding WL, SL, and BL ports. The row decoding unit comprises a row decoding circuit, a WL gating circuit, and a BL gating circuit; it decodes the strobe signals of the integrated storage and computation array in the row direction and, combined with control of the row channel number, can generate more than one row strobe signal. The column decoding unit comprises a column decoding circuit and an SL gating circuit; it decodes the strobe signals in the column direction and, combined with control of the column channel number, can generate more than one column strobe signal. The integrated storage and computation array is composed of integrated storage and computation devices; a unit contains at least one array, and when there are several they can be regarded as an array group with array index selection, without changing the operating mode. The kind of device used in the array is not limited and includes FLASH, RRAM, and MRAM devices capable of forming an array.
As shown in fig. 2, the row and column decoding units of the device comprise, respectively, a row decoding circuit with WL and BL gating circuits, and a column decoding circuit with an SL gating circuit. The WL, SL, and BL gating circuits control access to the WL, SL, and BL ports of the integrated storage and computation array; once the power gating circuit is enabled, the voltage it outputs is fed into the array. The row and column decoding circuits share the same basic working principle, and the number of strobe signals each outputs is determined by the row and column sizes of the array currently in use.
As shown in fig. 3, the decoding circuit takes an address and a channel number as inputs and outputs strobe signals. When an address is input and the channel number is 0, a single strobe signal, unique to that address index, is generated. When the channel number is not 0, the strobe signal corresponding to the address index is generated, and on top of it a number of additional strobe signals equal to the channel number are enabled.
As shown in fig. 4, for example, with an address bit width of 3 and an input channel bit width of 3, an input address value of 2 and an input channel number of 2 first produce a strobe signal at index position 2; based on the channel number of 2, the strobe signals at indices 3 and 4 are then also enabled, so the final output is {0,0,1,1,1,0,0,0}. In the row decoding unit, increasing the channel number raises the number of channels in the input dimension, improving the parallelism of data computation on the input side; in the column decoding unit, it raises the number of channels in the output dimension, improving parallelism on the output side. By adjusting the row and column decoding units, the matrix structure of a single data computation can be changed to build computation structures for different scenarios.
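A minimal Python sketch of this decoding behaviour may help; it is a functional model only, and the function name and widths are assumptions rather than the patent's circuit:

```python
# Functional model of the decoder described above (names and widths assumed):
# the address selects a base strobe line, and a non-zero channel number
# additionally enables that many consecutive lines after it.
def decode(address, channels, addr_width=3):
    lines = [0] * (1 << addr_width)             # one strobe line per index
    for i in range(address, address + channels + 1):
        lines[i] = 1                            # base line plus `channels` extras
    return lines

print(decode(address=2, channels=2))  # [0, 0, 1, 1, 1, 0, 0, 0], as in fig. 4
print(decode(address=5, channels=0))  # unique strobe: [0, 0, 0, 0, 0, 1, 0, 0]
```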
As shown in fig. 5, the integrated storage and computation array in the integrated storage and computation unit is formed by storage and computation devices arranged in a crossbar. Referring to fig. 6, consider for example an array formed by memristors of size \(M \times N\), where \(M\) is the number of rows and \(N\) the number of columns. Each memristor has a conductance \(G_{m,n}\) corresponding to a weight value \(w_{m,n}\), and the applied voltage \(V_m\), provided by the power supply circuit, corresponds to an input value \(x_m\). By Ohm's law, each memristor passes a current \(I_{m,n} = G_{m,n} V_m\), the product of the input value and the weight value. By Kirchhoff's current law, the total current flowing out of each column is the sum of the currents through all memristors in that column, \(I_n = \sum_{m=1}^{M} G_{m,n} V_m\), which is exactly the result value \(y_n\) of the corresponding matrix multiply-add. The output is finally read out by the read-out circuit and passed to the shifter or the write-verification module. Here \(m \in [1, M]\) and \(n \in [1, N]\). The array may be formed of any storage and computation device capable of forming an array, such as FLASH, RRAM, or MRAM.
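The analog computation can be mirrored in a few lines of NumPy; this is a functional sketch of the Ohm/Kirchhoff arithmetic above, not a device model:

```python
# Functional sketch of the Ohm/Kirchhoff arithmetic above (not a device model):
# conductances G act as weights, row voltages V as inputs, and the per-column
# current sums equal the matrix-vector product in a single analog step.
import numpy as np

M, N = 4, 3
G = np.random.rand(M, N)             # conductance / weight matrix
V = np.random.rand(M)                # applied voltages / input vector

I_device = G * V[:, None]            # Ohm's law: current through each device
I_column = I_device.sum(axis=0)      # Kirchhoff: total current per column
assert np.allclose(I_column, V @ G)  # equals the matrix multiply-add result
print(I_column)
```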
In one embodiment, the processing performed by the control unit includes data flow control and memory access mode control.
In an implementation, the control unit contains several configuration registers for data access configuration and for the parameter configuration of the different matrix computation structures, data reuse modes, and data channels, including parameters for the read start address, read end address, write-back start address, write-back end address, data bit width, data count, data input size, data output size, storage and computation unit enable, data reuse enable, data shift, row channel number, column channel number, quantization mode, and addition tree.
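As a rough illustration, this register set could be modeled as follows; the field names follow the list above, while the types and example values are assumptions:

```python
# Rough model of the configuration registers; field names follow the list
# above, while the types and example values are assumptions.
from dataclasses import dataclass

@dataclass
class ConfigRegisters:
    read_start_addr: int
    read_end_addr: int
    writeback_start_addr: int
    writeback_end_addr: int
    data_bit_width: int
    data_count: int
    data_input_size: int
    data_output_size: int
    unit_enable: int          # integrated storage and computation unit enable
    data_reuse_enable: bool   # convolution-mode data reuse
    data_shift: int
    row_channels: int
    col_channels: int
    quantization_mode: int
    adder_tree_cfg: int

cfg = ConfigRegisters(0, 63, 64, 127, 4, 3, 3, 1, 0b1111, True, 0, 2, 0, 0, 0b111)
print(cfg.row_channels)       # e.g. two extra row strobes, as in fig. 4
```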
During data computation by the integrated storage and computation processing unit, the control unit sends control signals to the different modules at different stages, realizing control of the data flow and of the data's memory access mode; during reads and writes of the integrated storage and computation unit, it generates read-write control signals and unit index signals to realize reading from and writing to the unit.
For the conventional matrix computation mode in data flow control, the control unit realizes matrix multiplications of different sizes by configuring the data computation structure and the data channel parameters, and reconstructs the structure after each single computation completes.
For the convolution computation mode in data flow control, beyond the configuration of the data computation structure and data channel parameters used in the conventional matrix computation mode, the control unit also configures the data reuse mode by parameters; the reusable portion of the data within the convolution sliding window is gated and loaded through the shift registers.
In one embodiment, the control unit is configured such that:
for the direct mode among the memory access modes, the data is read out from the data memory and loaded directly into the input unit;
for the preload mode among the memory access modes, the data is read out from the data memory, loaded into the preload unit, and flows into the input unit under the control of the control unit.
In one embodiment, the preload unit is configured such that:
in the process of loading data from the data memory, the data is split into several pieces of single-data bit width, and the control unit generates the strobe signals of the channels between the data memory and the register groups, together with the data enable signals, to complete the preloading.
As shown in fig. 7, the number of data items that a single address of the data memory can store can be adjusted according to the number of intermediate register groups and the number of registers in the preload unit; data access efficiency is highest when the number of data items per address equals the number of register groups multiplied by the number of registers per group. Accordingly, the data bit width storable at a single address equals the data count multiplied by the single-data bit width. In the process of loading data from the data memory, the preload unit splits the data into several pieces of single-data bit width, and the control unit generates the strobe signals of the channels between the data memory and the register groups, together with the data enable signals, to complete the data preloading.
The preload unit comprises at least one register group, with the number of register groups equal to the number of integrated storage and computation units; each register group contains at least one register, with the number of registers equal to the maximum channel number supported by the row decoding unit plus one, and the single-register bit width equal to the single-data bit width.
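A minimal Python sketch of the splitting step, with hypothetical widths (4-bit data, 3 items per address), is:

```python
# Hypothetical widths: 4-bit single-data fields, 3 data items per address.
DATA_BITS, PER_ADDR = 4, 3

def split_word(word, data_bits=DATA_BITS, count=PER_ADDR):
    """Slice one wide memory word into `count` single-data-width fields."""
    mask = (1 << data_bits) - 1
    return [(word >> (data_bits * i)) & mask for i in range(count)]

word = 0b0011_0010_0001       # packs decimal 3, 2, 1 into 4-bit fields
print(split_word(word))       # [1, 2, 3], one field per register-group channel
```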
In one embodiment, as shown in fig. 8, the input unit includes:
a computation input channel for inputting data during computation, in which data of single-data bit width is input into the integrated storage and computation unit over several cycles through the shift registers for computation;
and a read-write input channel for memory access to the integrated storage and computation array in the integrated storage and computation unit.
In practice, the shift registers here can support input and output modes such as parallel-in, parallel-out, and serial-out. The computation input channel of the input unit comprises at least one shift register group, with the number of groups equal to the number of integrated storage and computation units; each group contains at least one shift register, with the number of shift registers equal to the maximum channel number supported by the row decoding unit plus one, and the single-shift-register bit width equal to the single-data bit width. The read-write input channel comprises a multiplexer group for the read and write control signals; the integrated storage and computation units are gated by unit index signals, and the control signals are input to a single integrated storage and computation unit.
As shown in fig. 9, this is the configuration of the input unit's shift registers when the corresponding row decoding unit supports a maximum channel number of 2, in which case the shift register group needs three shift registers: a first, a second, and a third. The first and second shift registers each need an additional multiplexer to select their input data from one of two sources: the first shift register selects between the external first input data and the parallel output of the second shift register, and the second shift register selects between the external second input data and the parallel output of the third shift register, while the third shift register is loaded only through the third input data. Through load enable and select signals the shift register group can reuse data, moving it to the previous-stage shift register after a computation completes; through shift enable it can feed data into the integrated storage and computation unit over several cycles.
As illustrated in fig. 10, consider the shift register group of fig. 9 performing a convolution with a step size of 1, where the single-data bit width is 4 and the group contains three shift registers. Ports with no marked signal value are don't-cares and may take any value. When a 3 x 3 convolution kernel begins sliding over the 4 x 4 feature map in the figure, the decimal data {1, 2, 3} of the first row is first fetched from the data memory and input to the shift register group, either directly or through the preload unit. At time T0, the shift registers are loaded with the binary values {0001, 0010, 0011} corresponding to the decimal data {1, 2, 3}. At time T1, the lowest bits {1, 0, 1} of the data are output from the serial output ports and a shift is performed; at T2, the second-lowest bits {0, 1, 1} are output and a shift is performed; at T3, the second-highest bits {0, 0, 0} are output and a shift is performed; at T4, the highest bits {0, 0, 0} are output and a shift is performed. At time T5, the shift register group has returned to its original values through the shifts and the computation for this group of data is complete; the convolution kernel slides right by one step. The input of the first shift register is now selected to be the parallel output of the second, the input of the second the parallel output of the third, and the input of the third is the newly appearing data {4} within the sliding window; the data loading operation is then performed. At time T6, the group has finished loading the decimal data {2, 3, 4}, i.e. the binary values {0010, 0011, 0100}; the lowest bits {0, 1, 0} are output from the serial output ports and a shift is performed. Subsequent times proceed in the same manner.
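A minimal Python sketch of this fig. 10 behaviour, a functional model rather than the patent's register-transfer logic, is:

```python
# Functional model of fig. 10 (not the patent's logic): each register emits
# one bit per cycle, LSB first; after a full sweep the window slides by
# reusing two registers' contents and loading one new value.
BITS = 4

def bit_serial_sweep(regs, bits=BITS):
    """Yield the per-cycle serial outputs (LSB first) of every register."""
    for t in range(bits):
        yield [(r >> t) & 1 for r in regs]

regs = [1, 2, 3]                              # first window: decimal {1, 2, 3}
for t, out in enumerate(bit_serial_sweep(regs), start=1):
    print(f"T{t}: serial outputs {out}")      # T1: [1, 0, 1] ... T4: [0, 0, 0]

regs = regs[1:] + [4]                         # slide right: reuse {2, 3}, load {4}
print("T6 lowest bits:", [r & 1 for r in regs])  # [0, 1, 0]
```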
In one embodiment, the output unit comprises:
a computation output channel for post-processing the computation result value and then outputting it;
and a read-out channel for reading out the original data stored at the corresponding address of the integrated storage and computation array in the integrated storage and computation unit.
In practice, the data flowing out of the output unit is written back to the data memory by default, or may be output directly outside the processing device. The computation output channel of the output unit comprises quantization units and an addition tree: the quantization units scale and map the data output by the shift accumulators, and the addition tree is structured according to the addition tree configuration parameters in the control unit's configuration register, with its reconstruction realized through the addition enable signals of the different nodes.
As shown in fig. 11, when the device has 4 integrated storage and computation units, the computation output channel comprises 4 quantization units for the different data outputs, providing data scaling and mapping, and a reconfigurable addition tree with 4 input channels. After computation, the output data is quantized by the quantization units and flows into the addition tree; addition enable signals are issued to the tree according to the current configuration register in the control unit, controlling which values are accumulated. The configurability of the quantization units and the addition tree further improves the flexibility of the device's computation structures and modes.
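A minimal Python sketch of this output path, with assumed scales and enable encodings, is:

```python
# Assumed scales and enable encoding: each unit output is quantized, then
# per-node enable bits decide which partial sums the 4-input tree merges.
def quantize(x, scale):
    return round(x * scale)                   # simple scale-and-map stand-in

def adder_tree(values, enables):
    """4-input tree; enables = (left node, right node, root node)."""
    a = values[0] + values[1] if enables[0] else values[0]
    b = values[2] + values[3] if enables[1] else values[2]
    return a + b if enables[2] else (a, b)

outs = [quantize(v, scale=0.5) for v in (10, 20, 30, 40)]   # unit outputs
print(adder_tree(outs, enables=(1, 1, 1)))    # fully merged sum: 50
print(adder_tree(outs, enables=(1, 1, 0)))    # two independent sums: (15, 35)
```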
It should be understood that, although the steps in the flowcharts referred to in the above embodiments are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps may comprise several sub-steps or stages that are not necessarily executed at the same moment but may be executed at different moments; these sub-steps or stages are not necessarily executed sequentially, but may be executed in turn or alternately with at least part of the other steps or their sub-steps or stages.
Those skilled in the art will appreciate that all or part of the above methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium which, when executed, may include the flows of the method embodiments above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory may include Random Access Memory (RAM) or external cache memory, among others. By way of illustration and not limitation, RAM is available in many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of relational and non-relational databases; non-relational databases may include, without limitation, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, and the like, without being limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination of them that involves no contradiction should be considered within the scope of this description.
The above examples merely express several embodiments of the application, and their descriptions, while specific and detailed, are not therefore to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and improvements without departing from the concept of the application, and these all fall within its protection scope. Accordingly, the scope of protection of the application shall be subject to the appended claims.

Claims (9)

1. A data processing device based on an integrated storage and computation unit, characterized in that the data processing device comprises:
a preload unit for performing a preload operation on data;
an input unit, connected with the preload unit, for performing an input operation on the data;
an integrated storage and computation processing unit, connected with the input unit, for performing a computation operation on the input data to obtain an output result;
an output unit, connected with the integrated storage and computation processing unit, for outputting the output result;
a data memory, connected with the preload unit and the output unit, for providing data to the preload unit and storing the output result;
a control unit, connected with the preload unit, the input unit, the integrated storage and computation processing unit, the output unit, and the data memory, for controlling the processing of the data by sending different control signals according to the parameter configuration stored in a configuration register;
wherein the output unit comprises:
a computation output channel for post-processing the computation result value and then outputting it;
a read-out channel for reading out the original data stored at the corresponding address of the integrated storage and computation array in the integrated storage and computation unit;
the computation output channel comprises quantization units and an addition tree, the quantization units being used for scaling and mapping the data output by the shift accumulator, the addition tree being structured based on the addition tree configuration parameters in the configuration register of the control unit, with its reconstruction realized through the addition enable signals of different nodes.
2. The data processing device based on an integrated storage and computation unit according to claim 1, wherein the integrated storage and computation unit comprises:
an integrated storage and computation control module for converting the input data and feeding the data processing result back to the control unit;
an integrated storage and computation module for storing the converted result;
and a shift accumulator for shifting the data input over at least two cycles and accumulating it to obtain the computation result.
3. The data processing device based on an integrated storage and computation unit according to claim 2, wherein the integrated storage and computation module comprises:
a power gating unit for selecting the power category in different operating modes and outputting power;
a row decoding unit for generating the row strobe signals of the integrated storage and computation array in combination with the row channel number;
a column decoding unit for generating the column strobe signals of the integrated storage and computation array in combination with the column channel number;
and an integrated storage and computation array formed by integrated storage and computation devices arranged in a crossbar.
4. The data processing device based on an integrated storage and computation unit according to claim 3, wherein, in the integrated storage and computation module:
the row decoding unit comprises a row decoding circuit, a WL gating circuit, and a BL gating circuit;
the column decoding unit comprises a column decoding circuit and an SL gating circuit.
5. The data processing device based on an integrated storage and computation unit according to claim 1, wherein the processing comprises data flow control and memory access mode control;
the data flow control comprises a conventional matrix computation mode and a convolution computation mode, and the memory access mode control comprises a direct mode and a preload mode.
6. The data processing device based on an integrated storage and computation unit according to claim 5, wherein the control unit is configured such that:
for the conventional matrix computation mode, matrix multiplications of different sizes are realized by configuring the data computation structure and the data channel parameters, and the structure is reconstructed after each single computation completes;
for the convolution computation mode, the data reuse mode is additionally configured by parameters, and the reusable portion of the data within the convolution sliding window is gated and loaded through the shift registers.
7. The data processing device based on an integrated storage and computation unit according to claim 5, wherein the control unit is configured such that:
for the direct mode, the data is read out from the data memory and loaded directly into the input unit;
for the preload mode, the data is read out from the data memory, loaded into the preload unit, and flows into the input unit under the control of the control unit.
8. The data processing device based on an integrated storage and computation unit according to claim 1, wherein the preload unit is configured such that:
in the process of loading data from the data memory, the data is split into several pieces of single-data bit width, and the control unit generates the strobe signals of the channels between the data memory and the register groups, together with the data enable signals, to complete the preloading.
9. The data processing device based on an integrated storage and computation unit according to claim 1, wherein the input unit comprises:
a computation input channel for inputting data during computation, in which data of single-data bit width is input into the integrated storage and computation unit over several cycles through the shift registers for computation;
and a read-write input channel for memory access to the integrated storage and computation array in the integrated storage and computation unit.
CN202410025725.XA 2024-01-08 2024-01-08 Data processing device based on integrated memory and calculation unit Active CN117519802B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410025725.XA CN117519802B (en) 2024-01-08 2024-01-08 Data processing device based on integrated memory and calculation unit

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410025725.XA CN117519802B (en) 2024-01-08 2024-01-08 Data processing device based on integrated memory and calculation unit

Publications (2)

Publication Number Publication Date
CN117519802A (en) 2024-02-06
CN117519802B (en) 2024-04-30

Family

ID=89742478

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410025725.XA Active CN117519802B (en) 2024-01-08 2024-01-08 Data processing device based on integrated memory and calculation unit

Country Status (1)

Country Link
CN (1) CN117519802B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101611387A (en) * 2007-01-10 2009-12-23 移动半导体公司 Adaptive memory system for enhancing the performance of an external computing device
CN112183732A (en) * 2020-10-22 2021-01-05 中国人民解放军国防科技大学 Convolutional neural network acceleration method and device and computer equipment
CN115496190A (en) * 2021-06-18 2022-12-20 南京大学 Efficient reconfigurable hardware accelerator for convolutional neural network training
WO2023045114A1 (en) * 2021-09-22 2023-03-30 清华大学 Storage and computation integrated chip and data processing method
CN115394336A (en) * 2022-06-02 2022-11-25 浙江大学 Storage and computation integrated FPGA (Field Programmable Gate Array) architecture
CN115495152A (en) * 2022-09-15 2022-12-20 北京后摩智能科技有限公司 Memory computing circuit with variable length input
CN115587622A (en) * 2022-09-26 2023-01-10 南京大学 Multilayer perceptron device based on photoelectric storage and computation integrated devices
CN116911365A (en) * 2023-06-21 2023-10-20 南京大学 Data flow path device and method suitable for memory and calculation integrated array
CN117217272A (en) * 2023-09-14 2023-12-12 西安交通大学 Memory chip and memory method supporting convolution operation and sine and cosine function operation
CN117289896A (en) * 2023-11-20 2023-12-26 之江实验室 Storage and computation integrated basic operation device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research progress on memristor-based integrated sensing-storage-computing technology; Li Kun; Cao Rongrong; Sun Yi; Liu Sen; Li Qingjiang; Xu Hui; Micro-Nano Electronics and Intelligent Manufacturing; 2019-12-15 (04); full text *

Also Published As

Publication number Publication date
CN117519802A (en) 2024-02-06

Similar Documents

Publication Publication Date Title
US10153042B2 (en) In-memory computational device with bit line processors
CN108701473B (en) Apparatus and method for data movement
Peng et al. Optimizing weight mapping and data flow for convolutional neural networks on processing-in-memory architectures
US10210935B2 (en) Associative row decoder
Ji et al. ReCom: An efficient resistive accelerator for compressed deep neural networks
Luo et al. Accelerating deep neural network in-situ training with non-volatile and volatile memory based hybrid precision synapses
US9727459B2 (en) Non-volatile, solid-state memory configured to perform logical combination of two or more blocks sharing series-connected bit lines
Talati et al. mMPU—A real processing-in-memory architecture to combat the von Neumann bottleneck
CN102541749B (en) Multi-granularity parallel storage system
Angizi et al. Parapim: a parallel processing-in-memory accelerator for binary-weight deep neural networks
CN113688984A (en) In-memory binarization neural network computing circuit based on magnetic random access memory
US20220391128A1 (en) Techniques to repurpose static random access memory rows to store a look-up-table for processor-in-memory operations
US11211115B2 (en) Associativity-agnostic in-cache computing memory architecture optimized for multiplication
Liu et al. Bit-transformer: Transforming bit-level sparsity into higher performance in reram-based accelerator
CN111798896A (en) Memory computing system supporting general computing based on magnetic random access memory
Qiao et al. A 65 nm 73 kb SRAM-based computing-in-memory macro with dynamic-sparsity controlling
CN117289896B (en) Deposit and calculate integrative basic operation device
CN117519802B (en) Data processing device based on integrated memory and calculation unit
CN114514502A (en) Space-time fused sum and related systems, methods and apparatus
Mao et al. A versatile ReRAM-based accelerator for convolutional neural networks
US20220100941A1 (en) Memory device with programmable circuitry
TWI771014B (en) Memory circuit and operating method thereof
Zhang et al. Xbm: A crossbar column-wise binary mask learning method for efficient multiple task adaption
Qiu et al. MNSIM-TIME: Performance modeling framework for training-in-memory architectures
CN112767981A (en) Read-write control circuit for STT-MRAM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant