CN110647983B - Self-supervised learning acceleration system and method based on a storage-computation integrated device array


Info

Publication number: CN110647983B (granted patent); application publication CN110647983A
Application number: CN201910944467.4A
Authority: China (CN)
Original language: Chinese (zh)
Inventors: 潘红兵 (Pan Hongbing), 娄胜 (Lou Sheng), 王宇宣 (Wang Yuxuan)
Assignee (original and current): Nanjing University
Priority and filing date: 2019-09-30
Publication dates: 2020-01-03 (CN110647983A), 2023-03-24 (CN110647983B, grant)
Legal status: Active (granted)

Classifications

    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06N3/088: Non-supervised learning, e.g. competitive learning
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a self-supervised learning acceleration system and method based on a storage-computation integrated device array. The acceleration system comprises a cache module, a calculation array, a weight input module, an auxiliary circuit, a control module and a parameter updating module. The cache module, the calculation array and the parameter updating module are connected in sequence; the weight input module is connected with the calculation array and is used for updating it; the control module is connected to the cache module, the weight input module, the calculation array and the parameter updating module respectively; and the calculation array and the auxiliary circuit together complete the operations of the self-supervised neural network. By exploiting the area and power-consumption advantages of the storage-computation integrated calculation array, the invention realizes a self-supervised learning acceleration system and method that saves a large amount of energy and product volume compared with existing processing systems built from graphics cards (GPUs) and conventional digital circuits.

Description

Self-supervised learning acceleration system and method based on a storage-computation integrated device array
Technical Field
The invention relates to a system and a method for accelerating self-supervised learning with a storage-computation integrated device array, and belongs to the field of machine learning.
Background
Most conventional computers adopt the von Neumann architecture. Because the storage unit and the arithmetic unit are separated in this architecture, a great deal of energy is consumed in data transfer and the operation speed is limited.
Self-supervised learning is a form of unsupervised learning that can train a general-purpose system without labeled data. When a neural network is trained in the usual way, a graphics card (GPU) or a central processing unit performs all of the network computation as well as the parameter-update computation; as typical von Neumann processors, both the GPU and the CPU have a very low energy-efficiency ratio for this workload.
When a neural network performs inference, a conventional digital circuit generally unrolls the convolution operation into matrix-vector multiplication and completes it with multiply-accumulate units. However, a single multiplier occupies considerable resources (area) and consumes substantial power; memory accesses during operation raise power consumption further, and the memory wall limits any further increase in operation speed.
A storage-computation integrated (computing-in-memory) device can combine storage and computation with a certain precision: a single device can store a numerical value, retain it for a long time, and perform multiplication inside the device by an analog method. In computation-intensive tasks such as machine learning, such devices offer large area and power-consumption advantages. The main types of storage-computation integrated devices at present are phase-change memories (PCM), resistive random access memories (RRAM/ReRAM), floating-gate devices and the like. Industry has already released numerous products with capacities at the Gb level, such as the 8 Gb PCM introduced by Samsung in 2012 on a 20 nm process and the 128 Gb 3D XPoint technology introduced by Micron and Intel in 2015. The challenges currently facing storage-computation integrated devices include: the precision of the value stored in a device is not high, the fabrication process of some of these devices is not yet mature, and the manufacturing processes of memory and processors are not compatible.
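Because the values stored in such devices have limited precision, weights and inputs are in practice quantized to a small number of bits before being mapped onto an array (the patent itself only states that inputs are quantized bit by bit). A minimal sketch of one possible uniform quantization scheme, with the 8-bit width and min-max scaling as illustrative assumptions:

```python
import numpy as np

def quantize_uniform(x, n_bits=8):
    """Uniformly quantize an array to n_bits unsigned integer codes plus a scale.

    Returns the integer codes (suitable for bit-serial input to an array)
    together with the scale and offset needed to recover approximate values.
    """
    levels = 2 ** n_bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((x - lo) / scale).astype(np.uint32)
    return codes, scale, lo

# x is approximately codes * scale + lo; the codes can then be streamed into
# the array one bit plane at a time, as described later in the document.
```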
Disclosure of Invention
In order to overcome the technical defects of traditional processing systems in self-supervised learning, the invention provides a system and a method for accelerating self-supervised learning based on a storage-computation integrated device array.
The technical scheme adopted by the system of the invention is as follows:
A self-supervised learning acceleration system based on a storage-computation integrated device array comprises a cache module, a calculation array, a weight input module, an auxiliary circuit, a control module and a parameter updating module; the cache module, the calculation array and the parameter updating module are connected in sequence; the weight input module is connected with the calculation array and is used for updating the calculation array; the control module is respectively connected with the cache module, the weight input module, the calculation array and the parameter updating module; and the calculation array and the auxiliary circuit are used for completing the operations of the self-supervised neural network.
Further, the parameter updating module is a digital circuit or a graphics card (GPU).
The acceleration method of the invention, using the above acceleration system, proceeds as follows: the control module, by controlling the weight input module, writes the initialized network parameters into the calculation array; the upper computer sends the training data to the cache module through an interface; the control module sends the training data in the cache module to the auxiliary circuit while the training data are still retained in the cache module; one part of the auxiliary circuit quantizes the training data bit by bit and then inputs them into the calculation array; the calculation array completes the convolution and fully-connected operations of the self-supervised neural network, and the other part of the auxiliary circuit completes the activation and pooling operations; the parameter updating module computes updated network parameters by a gradient-descent method from the calculation result of the neural network and the training data in the cache module, and sends the parameters to the control module; after receiving the parameters, the control module controls the calculation array to erase the original parameters and then controls the weight input module to write the updated parameters into the calculation array, completing one iteration; the iteration process is repeated to complete the self-supervised training.
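A minimal host-side sketch of one such iteration, assuming the calculation array behaves as an in-memory matrix multiplier and the network is reduced to a single linear layer with a squared-error objective; the class and function names (ComputeArray, program, erase, matvec, train) and the gradient formula are illustrative assumptions, not taken from the patent:

```python
import numpy as np

class ComputeArray:
    """Stand-in for the storage-computation integrated calculation array."""
    def __init__(self, shape):
        self.G = np.zeros(shape)          # stored (non-volatile) weights

    def program(self, W):                 # weight input module writes W
        self.G = W.copy()

    def erase(self):                      # control module erases the old weights
        self.G[:] = 0.0

    def matvec(self, a):                  # in-array matrix multiply
        return a @ self.G

def train(array, data_loader, lr=1e-2, iterations=100):
    W = 0.01 * np.random.randn(*array.G.shape)   # initialized network parameters
    array.program(W)                             # write initial weights to the array
    for _ in range(iterations):
        x, target = next(data_loader)            # training data kept in the cache
        y = array.matvec(x)                      # forward pass on the array
        grad = x.T @ (y - target) / len(x)       # parameter-update module: gradient
        W = W - lr * grad                        # gradient-descent step
        array.erase()                            # erase original parameters
        array.program(W)                         # store updated parameters
    return W
```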
Further, the parameter updating module calculates the updated network parameters from the calculation result output by the neural network and the stored training data, using a preset loss function and the back-propagation algorithm.
Further, the calculation array completes the convolution operations in the self-supervised neural network; the specific computation of each convolutional layer is as follows (a sketch of the bit-serial scheme of steps (3) and (4) is given after this list):
(1) for the m convolution kernels of the current convolutional layer, each kernel is unrolled column-wise and concatenated into a column vector, and the m column vectors corresponding to the m kernels are concatenated into a matrix; for an input image feature map with n channels, the n such matrices are stacked vertically into one new large matrix, and a calculation array of the same size as this large matrix is used, with the non-volatile value stored at each array location set to the corresponding entry of the large matrix;
(2) the input of the current convolutional layer consists of the image feature maps of the n channels; for each feature map the following operations are performed, yielding n matrices (one per channel): a window of the same size as the convolution kernel is selected and slid over the feature map p times with the specified stride; at each position the covered values of the feature map are taken out and unrolled, in column order, into a row vector; after the sliding is finished, the p row vectors are stacked top to bottom into a matrix;
(3) the n matrices are concatenated side by side to obtain the final electrical input matrix; the rows of the electrical input matrix are fed into the calculation array one after another from top to bottom, each element of a row vector corresponding to one row of the calculation array;
(4) a row vector is input into the calculation array bit by bit, i.e. one bit at a time; when the calculation of the array is complete, the result of each column is converted by an analog-to-digital converter into a digital signal, and the digital signals are shifted according to their bit positions and accumulated to obtain a result in the form of a vector of length m; this is the result produced by one row vector of the electrical input matrix entering the calculation array, i.e., for each of the m convolution kernels, the summed result of convolving the corresponding window of the n image feature maps;
(5) following the methods of steps (3) and (4), the p row vectors of the electrical input matrix are processed in turn to obtain p result vectors, which are stacked top to bottom into a matrix; each column of this matrix is reshaped into a feature map according to the order in which values were taken from the feature map in step (2), giving the m feature maps that correspond to the result of convolution with each kernel;
(6) the auxiliary circuit adds the bias to the m feature maps and applies the activation operation, giving the final result of the current convolutional layer.
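As noted above, steps (3) and (4) amount to a bit-serial multiply with shift-and-accumulate. A minimal Python sketch of this scheme, with an 8-bit input width and an ideal (noise-free, full-precision) array as illustrative assumptions:

```python
import numpy as np

def bit_serial_matvec(a_row, G, n_bits=8):
    """Bit-serial in-array multiply for one row of the electrical input matrix.

    a_row : unsigned-integer input vector (one row of the electrical input
            matrix), one element per array row.
    G     : weights stored in the array, shape (len(a_row), m).
    Each cycle drives one bit plane into the array; the per-column analog
    sums are digitized (ADC), then shifted and accumulated.
    """
    acc = np.zeros(G.shape[1])
    for b in range(n_bits):
        bit_plane = (a_row >> b) & 1          # one bit of every input element
        column_sums = bit_plane @ G           # analog column sums, after the ADC
        acc += column_sums * (1 << b)         # shift by the bit weight, accumulate
    return acc                                # length-m result vector

# The accumulated result equals the full-precision product, e.g.:
# a_row = np.array([3, 200, 17], dtype=np.uint32); G = np.random.rand(3, 4)
# np.allclose(bit_serial_matvec(a_row, G), a_row @ G)  -> True
```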
Further, the calculation array completes the fully-connected operations in the self-supervised neural network; the computation of each fully-connected layer is as follows (a corresponding sketch follows this list):
(1) assuming the previous layer has m neurons and the current layer has n neurons, there are m × n weights in total; the m × n weights are arranged in order into a matrix, and a calculation array of the same size as the matrix is used, with the optical input quantity of the array set to the corresponding value in the matrix;
(2) the m values output by the previous layer are used as the electrical input quantity of the calculation array;
(3) the electrical input quantity is fed into the storage-computation integrated device array bit by bit, i.e. one bit at a time; when the calculation of the array is complete, the result of each column is converted by an analog-to-digital converter into a digital signal, and the digital signals are shifted according to their bit positions and accumulated to obtain a result in the form of a vector of length n;
(4) the bias is added to the length-n vector and the activation is applied, giving the final result of the current fully-connected layer.
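The corresponding sketch for one fully-connected layer reuses the same bit-serial idea; the ReLU activation and 8-bit width are illustrative assumptions (the patent does not specify an activation function):

```python
import numpy as np

def fc_layer_on_array(prev_out, W, bias, n_bits=8):
    """Sketch of one fully-connected layer mapped onto the array.

    prev_out : m outputs of the previous layer, quantized to unsigned integers.
    W        : m x n weight matrix stored in the calculation array.
    bias     : length-n bias added by the auxiliary circuit.
    """
    acc = np.zeros(W.shape[1])
    for b in range(n_bits):                       # electrical input, one bit per cycle
        bit_plane = (prev_out >> b) & 1
        acc += (bit_plane @ W) * (1 << b)         # ADC result, shifted and accumulated
    return np.maximum(acc + bias, 0.0)            # bias + activation (ReLU assumed)
```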
By exploiting the area and power-consumption advantages of the storage-computation integrated calculation array, the invention realizes an acceleration system and method for self-supervised learning that saves a large amount of energy and product volume compared with existing processing systems built from graphics cards (GPUs) and conventional digital circuits.
Drawings
Fig. 1 is a structural block diagram of the self-supervised learning acceleration system of the present invention.
Fig. 2 is a schematic structural diagram of a calculation unit in embodiment 1 of the present invention.
Fig. 3 is a hardware block diagram of the calculation array and auxiliary circuit in embodiment 1 of the present invention.
Fig. 4 is a schematic structural diagram of the convolutional autoencoder in embodiment 1 of the present invention.
Fig. 5 is a schematic diagram of the principle of unrolling the convolution operation into matrix multiplication in embodiment 1 of the present invention: (a) n convolution kernels convolve one feature map, producing results on n channels; (b) the region of the feature map covered by the kernel is unrolled column-wise and used as the input, the n convolution kernels are unrolled column-wise and concatenated into a kernel matrix, and the multiplication result is an n-column matrix in which each column corresponds to the result of one channel in (a).
Fig. 6 is a schematic structural diagram of a memristor in embodiment 2 of the present invention.
Fig. 7 is a schematic structural diagram of a memristor calculation array in embodiment 2 of the present invention.
Fig. 8 is a schematic diagram of the NOR Flash array structure in embodiment 3 of the present invention.
Detailed Description
The invention aims to build an acceleration system for self-supervised learning with a storage-computation integrated device array so as to obtain a smaller area and higher energy efficiency. As shown in fig. 1, the acceleration system includes a cache module, a calculation array, a weight input module, an auxiliary circuit, a control module and a parameter updating module; the cache module, the calculation array and the parameter updating module are connected in sequence; the weight input module is connected with the calculation array and is used for updating it; the control module is connected to the weight input module, the calculation array, the cache module and the parameter updating module respectively; auxiliary circuits are arranged between the cache module and the calculation array and between the calculation array and the parameter updating module, and the calculation array and the auxiliary circuit are used to complete the operations of the self-supervised neural network. The storage-computation integrated device array can compute large-scale matrix multiplication at very low cost, and the convolution operation can be expressed as matrix multiplication by unrolling.
Embodiment 1
The storage-computation integrated calculation array of this embodiment is an optoelectronic storage-computation integrated array, which comprises a light-emitting array and a calculation array; the calculation array is formed by a plurality of calculation units arranged periodically.
As shown in fig. 2, the calculation unit of this embodiment comprises a control gate serving as the carrier control region, a charge-coupling layer serving as the coupling region, and a P-type substrate serving as the photogenerated-carrier collection region and the readout region. The P-type substrate is divided into a collection region on the left and a readout region on the right; the readout region contains a shallow-trench isolation and an N-type source terminal and an N-type drain terminal formed by ion implantation. The shallow-trench isolation lies in the middle of the semiconductor substrate between the collection region and the readout region and is formed by etching and filling with silicon dioxide, so as to isolate the electrical signals of the collection region and the readout region. The N-type source terminal is located in the readout region on the side close to the bottom dielectric layer and is formed by ion-implantation doping. The N-type drain terminal is located on the opposite side from the N-type source terminal, likewise close to the bottom dielectric layer, and is also formed by ion-implantation doping.
A pulse in a negative or positive voltage range is applied to the control gate above the collection-region substrate to create, in that substrate, a depletion layer for collecting photoelectrons; the number of collected photoelectrons is read out through the readout region on the right and serves as the optical input quantity. During readout, a positive voltage applied to the control gate forms a conductive channel between the N-type source terminal and the N-type drain terminal of the readout region, and a bias pulse voltage applied between the source and the drain accelerates the electrons in the channel, forming a source-drain current. The carriers in the source-drain channel are acted on jointly by the control-gate voltage, the source-drain voltage and the number of photoelectrons collected in the collection region, and the result is output in the form of a current; the control-gate voltage and the source-drain voltage can serve as the electrical input quantity of the device, while the number of photoelectrons serves as its optical input quantity.
The charge-coupling layer of the coupling region connects the collection region and the readout region, so that once the depletion region in the collection-region substrate begins collecting photoelectrons, the surface potential of the collection-region substrate is influenced by the number of collected photoelectrons; through the charge-coupling layer, the surface potential of the semiconductor substrate in the readout region follows that of the collection region, which in turn modulates the source-drain current of the readout region, so the number of photoelectrons collected in the collection region can be read out by sensing the source-drain current of the readout region.
The control gate of the carrier control region receives the pulse voltage that creates, in the P-type semiconductor substrate, the depletion region used to collect photoelectrons, and it can also serve as the electrical input terminal that receives one bit of an operand.
In addition, a bottom dielectric layer for isolation is arranged between the P-type semiconductor substrate and the charge coupling layer; a top dielectric layer for isolation is also present between the charge coupling layer and the control gate.
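A highly simplified behavioral model of such a calculation unit treats the read current as proportional to the product of the one-bit electrical input on the control gate and the number of stored photoelectrons (the optical input quantity); the linear response and the constant k below are illustrative assumptions, not device data from the patent:

```python
def cell_current(bit_in, photoelectrons, k=1e-9):
    """Simplified behavioral model of one optoelectronic calculation unit.

    bit_in         : one-bit electrical operand applied via the control gate (0 or 1)
    photoelectrons : number of photoelectrons collected in the collection region,
                     i.e. the stored optical input quantity (the weight)
    k              : assumed current contribution per photoelectron (linear model)

    Returns the source-drain read current, modeled as proportional to the
    product of the electrical and optical input quantities.
    """
    return k * bit_in * photoelectrons

# Column currents of many cells sum on a shared line, implementing the
# multiply-accumulate of one column of the calculation array:
# I_col = sum(cell_current(b_i, n_i) for b_i, n_i in zip(bits, weights))
```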
The hardware block diagram of the calculation array and the auxiliary circuit of the self-supervised learning acceleration system built from the optoelectronic storage-computation integrated calculation array is shown in fig. 3.
Now assume that the neural network model of the system is a convolutional autoencoder, a classic unsupervised (self-supervised) learning case that can be used for applications such as image denoising. It combines the self-supervised training scheme of the traditional autoencoder with the convolution and pooling operations of a convolutional neural network, thereby performing feature extraction and realizing a deep neural network. First the convolutional autoencoder model is built with the optoelectronic storage-computation integrated calculation array, and then the acceleration system is used to train it so that it acquires the image-denoising capability.
As shown in fig. 4, the convolutional layers, upsampling layers and so on of the convolutional autoencoder are built in turn with the optoelectronic storage-computation integrated array. Assume that for a certain convolutional layer the input feature map is 4 × 4 with 64 channels, there are 64 convolution kernels of size 3 × 3 in total, and the stride is 1, so the output is a 2 × 2 feature map with 64 channels. This convolutional layer is constructed by the following steps:
1) The convolution kernels are unrolled and the feature-map patches are extracted so that the convolution becomes a matrix multiplication, following the method of fig. 5. Writing the input feature map of the n-th channel as a 4 × 4 matrix X^(n) and the m-th convolution kernel (its slice on channel n) as a 3 × 3 matrix K^(m,n), the unrolled operation of this convolutional layer can be expressed as the matrix multiplication
Y = A · W,
where W (the unrolled convolution kernels, of size (9 · 64) × 64) is stored as the weights of the optoelectronic calculation array, and A (the unrolled output of the previous layer, with one row per output position, i.e. of size 4 × (9 · 64)) is the electrical input of the array.
2) The calculation array takes one row of electrical signals at a time (4 rows in total), and each row vector is fed in bit by bit, i.e. one bit per cycle. Assuming every element has 8 bits, the input is applied in 8 passes; when the operation in the calculation array is complete, the result of each column undergoes AD conversion to give a digital signal, and a basic digital logic circuit shifts the 8 outputs according to their corresponding bit positions and accumulates them to obtain the result.
3) Following the method of step 2), the 4 row vectors of the electrical input matrix are processed in turn, giving 4 results in vector form (each vector has 64 elements), which are stacked top to bottom into a matrix. The first column of this matrix is reshaped, in the order in which values were taken from the feature map, into the feature map corresponding to the result of the first convolution kernel; the second column into the feature map of the second kernel; and so on, giving 64 feature maps (each of size 2 × 2).
4) A bias is added to the resulting 64 feature maps using basic digital circuitry and an activation function is applied, which yields the output of this convolutional layer.
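The unrolling of steps 1)-3) can be checked numerically. The sketch below builds the kernel matrix W and the electrical input matrix A for the stated dimensions (4 × 4 input with 64 channels, 64 kernels of size 3 × 3, stride 1, no padding) and verifies that A · W reproduces a direct convolution; it is a host-side model of the arithmetic only, not of the optoelectronic circuit:

```python
import numpy as np

n_ch, m_ker, k, H = 64, 64, 3, 4               # channels, kernels, kernel size, input size
p_out = H - k + 1                               # 2 -> 2x2 output, i.e. 4 positions
X = np.random.rand(n_ch, H, H)                  # input feature maps
K = np.random.rand(m_ker, n_ch, k, k)           # convolution kernels

# Step 1): unroll each kernel column-wise per channel and stack channels vertically
W = np.zeros((n_ch * k * k, m_ker))
for m in range(m_ker):
    W[:, m] = np.concatenate([K[m, c].flatten(order='F') for c in range(n_ch)])

# Sliding-window extraction: unroll each window column-wise into one row of A
A = np.zeros((p_out * p_out, n_ch * k * k))
row = 0
for i in range(p_out):
    for j in range(p_out):
        patches = [X[c, i:i + k, j:j + k].flatten(order='F') for c in range(n_ch)]
        A[row] = np.concatenate(patches)
        row += 1

Y = A @ W                                       # what the calculation array computes

# Direct convolution for comparison
Y_ref = np.zeros_like(Y)
for m in range(m_ker):
    for idx in range(p_out * p_out):
        i, j = divmod(idx, p_out)
        Y_ref[idx, m] = np.sum(X[:, i:i + k, j:j + k] * K[m])

print(np.allclose(Y, Y_ref))                    # True: A @ W equals the convolution
```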
In this way the main body of the convolutional autoencoder network can be built with the optoelectronic storage-computation integrated calculation array. Once built, the acceleration system can be used to train the convolutional autoencoder; the training comprises the following steps:
The control module controls the light-emitting array to emit light and thereby writes the randomly initialized network parameters into the optoelectronic storage-computation integrated calculation array; the upper computer sends the image data to the cache module through the interface; the control module feeds the image data into the auxiliary circuit, where Gaussian noise is added to produce a noisy image, which is quantized bit by bit and then input into the optoelectronic calculation array; the optoelectronic calculation array completes the convolution and fully-connected operations of the convolutional autoencoder network, while the other part of the auxiliary circuit completes activation, pooling and similar operations, so that the array and the auxiliary circuit together perform all operations of the convolutional autoencoder network; the parameter updating module computes updated network parameters by a gradient-descent method from the output of the convolutional autoencoder model and the original image data in the cache module, and sends the parameters to the control module; after receiving the parameters, the control module controls the optoelectronic calculation array to erase the original parameters and then controls the light-emitting array to write the updated parameters into the array, completing one iteration. The above steps form one iteration cycle; after a number of iterations, when the loss-function value computed by the parameter updating module falls below a certain threshold, the training objective is reached and the convolutional autoencoder model has acquired the image-denoising function.
Embodiment 2
The storage-computation integrated calculation array of this embodiment uses a memristor calculation array. As shown in fig. 6, a memristor is a non-linear resistor with a memory function whose resistance changes with the current that has flowed through it. After the power is turned off, even though the current stops, the resistance value is retained, and it returns to its original state only when a reverse current is passed.
The resistance of a memristor can therefore be changed by controlling the current through it; for example, the high-resistance state can be defined as 1 and the low-resistance state as 0, realizing data storage. In addition, the memristor can combine storage and computation through methods such as memristor-CMOS logic and fused logic-in-memory operation.
A memristor calculation array can be formed from a number of memristor cells, with some peripheral functional units added around the array, as shown in fig. 7. A 4 × 4 memristor array can then store a 4 × 4 weight matrix, and once a 1 × 4 vector of input voltages is applied to its rows, a matrix-vector multiplication is completed within one read delay.
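The analog operation of the crossbar can be summarized as Ohm's law per cell and Kirchhoff's current law per column. A minimal sketch, with illustrative voltage and conductance values:

```python
import numpy as np

def crossbar_matvec(v_in, G):
    """Memristor-crossbar matrix-vector multiply.

    v_in : input voltages applied to the rows, shape (4,)
    G    : memristor conductances storing the weight matrix, shape (4, 4)

    Each cell contributes I = G * V (Ohm's law); the currents of a column
    sum on its bit line (Kirchhoff's current law), so the column currents
    equal the matrix-vector product, obtained within one read delay.
    """
    return v_in @ G          # length-4 vector of column currents

# Example: a 4 x 4 weight matrix stored as conductances
G = np.random.rand(4, 4)
v = np.array([0.1, 0.2, 0.0, 0.3])
print(crossbar_matvec(v, G))
```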
As in embodiment 1, assume that the network model of the system is a convolutional autoencoder. Since the convolution operation can be unrolled into matrix-vector multiplication, the convolutional autoencoder model can be built with the memristor array; once built, the acceleration system can be used to train the convolutional autoencoder, and the training comprises the following steps:
The control module controls the weight input module so that the weight data are written in turn into the memristor array by controlling the voltages on the device ports, and the upper computer sends the image data to the cache module through the interface; the control module feeds the image data into the auxiliary circuit, where Gaussian noise is added to produce a noisy image, which is quantized bit by bit and then input into the memristor array; the memristor calculation array completes the convolution and fully-connected operations of the convolutional autoencoder network, while the other part of the auxiliary circuit completes activation, pooling and similar operations, so that the memristor array and the auxiliary circuit together perform all operations of the convolutional autoencoder network; the parameter updating module computes updated network parameters by a gradient-descent method from the output of the convolutional autoencoder model and the original image data in the cache module, and sends the parameters to the control module; after receiving the parameters, the control module first controls the memristor array to erase the original parameters and then controls the weight input module to write the updated parameters into the memristor array, completing one iteration. The above steps form one iteration cycle; when, after a number of iterations, the loss-function value computed by the parameter updating module falls below a certain threshold, the training objective is reached and the convolutional autoencoder model has acquired the image-denoising function.
Embodiment 3
The storage-computation integrated calculation array of this embodiment uses a floating-gate device (Flash) calculation array. As shown in fig. 8, in a 3 × 8-bit NOR Flash structure the basic memory cells under each bit line are connected in parallel; when a word line is selected, the corresponding word, i.e. individual bits, can be read directly, giving a high read speed.
A Flash memory cell can store a weight parameter of the neural network and also perform the multiply-and-add operation associated with that weight, so multiplication, addition and storage are integrated in a single Flash cell. The multiplication is performed by an analog circuit similar to a current mirror: the input current is converted into a voltage and coupled to the control gate of the Flash transistor, and the output current of the transistor equals the input current multiplied by the stored weight. The addition is computed in the same way as current summation in a parallel circuit.
By exploiting the analog characteristics of NOR Flash, a storage-computation integrated array based on the NOR Flash architecture can perform full-precision matrix convolution (multiply-add) operations directly inside the memory cells. The bottleneck of moving data back and forth between the logic unit and the memory is avoided, so power consumption is greatly reduced and operation efficiency is improved.
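The multiply-and-add described above can be modeled per bit line as follows; the perfectly linear current-mirror behavior is a simplifying assumption:

```python
def flash_mac(input_currents, weights):
    """Multiply-accumulate on one NOR Flash bit line (simplified model).

    input_currents : input currents, one per Flash cell on the bit line
    weights        : weights stored in the corresponding Flash cells

    Each cell mirrors its input current scaled by the stored weight
    (I_out = I_in * w); the outputs of the parallel-connected cells sum
    on the shared bit line, giving the accumulated result.
    """
    return sum(i * w for i, w in zip(input_currents, weights))

# Example: three cells on one bit line
print(flash_mac([1e-6, 2e-6, 0.5e-6], [0.3, 0.8, 1.0]))
```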
As in embodiment 1, assume that the network model of the system is a convolutional autoencoder. Since the convolution operation can be unrolled into matrix-vector multiplication, the convolutional autoencoder model can be built with the NOR Flash memory array; once built, the acceleration system can be used to train the convolutional autoencoder, and the training comprises the following steps:
The control module controls the weight input module so that the weight data are written in turn into the NOR Flash memory array by controlling the voltages on the device ports, and the upper computer sends the image data to the cache module through the interface; the control module feeds the image data into the auxiliary circuit, where Gaussian noise is added to produce a noisy image, which is quantized bit by bit and then input into the NOR Flash memory array; the NOR Flash memory array completes the convolution and fully-connected operations of the convolutional autoencoder network, while the other part of the auxiliary circuit completes activation, pooling and similar operations, so that the NOR Flash memory array and the auxiliary circuit together perform all operations of the convolutional autoencoder network; the parameter updating module computes updated network parameters by a gradient-descent method from the output of the convolutional autoencoder model and the original image data in the cache module, and sends the parameters to the control module; after receiving the parameters, the control module controls the NOR Flash memory array to erase the original parameters and then controls the weight input module to write the updated parameters into the NOR Flash memory array, completing one iteration. The above steps form one iteration cycle; when, after a number of iterations, the loss-function value computed by the parameter updating module falls below a certain threshold, the training objective is reached and the convolutional autoencoder model has acquired the image-denoising function.

Claims (6)

1. A self-supervised learning acceleration system based on a storage-computation integrated device array, characterized by comprising a cache module, a calculation array, a weight input module, an auxiliary circuit, a control module and a parameter updating module; the cache module, the calculation array and the parameter updating module are connected in sequence; the weight input module is connected with the calculation array and is used for updating the calculation array; the control module is respectively connected with the cache module, the weight input module, the calculation array and the parameter updating module; and the calculation array and the auxiliary circuit are used for completing the operations of the self-supervised neural network.
2. The system of claim 1, wherein the parameter updating module is a digital circuit or a graphics card (GPU).
3. An acceleration method using the self-supervised learning acceleration system based on a storage-computation integrated device array, characterized in that the method proceeds as follows: the control module, by controlling the weight input module, writes the initialized network parameters into the calculation array; the upper computer sends the training data to the cache module through an interface; the control module sends the training data in the cache module to the auxiliary circuit while the training data are still retained in the cache module; one part of the auxiliary circuit quantizes the training data bit by bit and then inputs them into the calculation array; the calculation array completes the convolution and fully-connected operations of the self-supervised neural network, and the other part of the auxiliary circuit completes the activation and pooling operations; the parameter updating module computes updated network parameters by a gradient-descent method from the calculation result of the neural network and the training data in the cache module, and sends the parameters to the control module; after receiving the parameters, the control module controls the calculation array to erase the original parameters and then controls the weight input module to write the updated parameters into the calculation array, completing one iteration; the iteration process is repeated to complete the self-supervised training.
4. The acceleration method according to claim 3, characterized in that the parameter updating module calculates the updated network parameters from the calculation result output by the neural network and the stored training data, using a preset loss function and the back-propagation algorithm.
5. The acceleration method according to claim 3, characterized in that the calculation array completes the convolution operations in the self-supervised neural network, and the specific computation of each convolutional layer is as follows:
(1) for the m convolution kernels of the current convolutional layer, each kernel is unrolled column-wise and concatenated into a column vector, and the m column vectors corresponding to the m kernels are concatenated into a matrix; for an input image feature map with n channels, the n such matrices are stacked vertically into one new large matrix, and a calculation array of the same size as this large matrix is used, with the non-volatile value stored at each array location set to the corresponding entry of the large matrix;
(2) the input of the current convolutional layer consists of the image feature maps of the n channels; for each feature map the following operations are performed, yielding n matrices (one per channel): a window of the same size as the convolution kernel is selected and slid over the feature map p times with the specified stride; at each position the covered values of the feature map are taken out and unrolled, in column order, into a row vector; after the sliding is finished, the p row vectors are stacked top to bottom into a matrix;
(3) the n matrices are concatenated side by side to obtain the final electrical input matrix; the rows of the electrical input matrix are fed into the calculation array one after another from top to bottom, each element of a row vector corresponding to one row of the calculation array;
(4) a row vector is input into the calculation array bit by bit, i.e. one bit at a time; when the calculation of the array is complete, the result of each column is converted by an analog-to-digital converter into a digital signal, and the digital signals are shifted according to their bit positions and accumulated to obtain a result in the form of a vector of length m; this is the result produced by one row vector of the electrical input matrix entering the calculation array, i.e., for each of the m convolution kernels, the summed result of convolving the corresponding window of the n image feature maps;
(5) following the methods of steps (3) and (4), the p row vectors of the electrical input matrix are processed in turn to obtain p result vectors, which are stacked top to bottom into a matrix; each column of this matrix is reshaped into a feature map according to the order in which values were taken from the feature map in step (2), giving the m feature maps that correspond to the result of convolution with each kernel;
(6) the auxiliary circuit adds the bias to the m feature maps and applies the activation operation, giving the final result of the current convolutional layer.
6. The acceleration method according to claim 3, characterized in that the calculation array completes the fully-connected operations in the self-supervised neural network, and the computation of each fully-connected layer is as follows:
(1) assuming the previous layer has m neurons and the current layer has n neurons, there are m × n weights in total; the m × n weights are arranged in order into a matrix, and a calculation array of the same size as the matrix is used, with the optical input quantity of the array set to the corresponding value in the matrix;
(2) the m values output by the previous layer are used as the electrical input quantity of the calculation array;
(3) the electrical input quantity is fed into the storage-computation integrated device array bit by bit, i.e. one bit at a time; when the calculation of the array is complete, the result of each column is converted by an analog-to-digital converter into a digital signal, and the digital signals are shifted according to their bit positions and accumulated to obtain a result in the form of a vector of length n;
(4) the bias is added to the length-n vector and the activation is applied, giving the final result of the current fully-connected layer.




Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant