CN114781631A - Convolution layer mapping method and device, convolution operation method and device


Info

Publication number
CN114781631A
Authority
CN
China
Prior art keywords
matrix
convolution
input
array
input data
Prior art date
Legal status
Pending
Application number
CN202210533434.2A
Other languages
Chinese (zh)
Inventor
吴华强
党琦
王虹波
张清天
高滨
唐建石
钱鹤
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202210533434.2A
Publication of CN114781631A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

A convolutional layer mapping method and mapping apparatus, and a convolution operation method and operation apparatus, are provided. The mapping method includes: obtaining the dimensions [K, H, D, N] of the convolutional layer, where N is the number of convolution kernels in the convolutional layer and K, H and D are the width, height and channel number of each convolution kernel, respectively; expanding the convolutional layer into a 0th matrix with a row height of K×H×D and a column width of N, where the N columns of the 0th matrix respectively correspond to one-dimensional vectors of length K×H×D into which the N convolution kernels are expanded; creating K-1 transformation matrices based on the 0th matrix, the K-1 transformation matrices comprising a 1st matrix to a (K-1)-th matrix, where the transformation of the m-th matrix relative to the (m-1)-th matrix is that each row number in the m-th matrix equals (the corresponding row number in the (m-1)-th matrix + K) mod (K×H×D), and m is an integer from 1 to K-1; and mapping the 0th matrix to the (K-1)-th matrix into a storage-computation-integrated array. The mapping method can effectively improve the space utilization of the array, increase the convolution computation speed, and reduce power consumption.

Description

Convolution layer mapping method and device, convolution operation method and device
Technical Field
Embodiments of the present disclosure relate to a convolutional layer mapping method and mapping apparatus, and a convolution operation method and operation apparatus.
Background
Emerging memories with a storage-computation-integrated function, typified by the memristor, can perform computation in situ on the stored data, eliminating the huge overhead of data movement. According to Kirchhoff's law and Ohm's law, a storage-computation-integrated array can complete multiply-accumulate computations in parallel. Since each column of the array is an independent multiply-accumulate unit, the maximum computing power of an array equals the number of array rows × the number of array columns × the array computation frequency. Because the array performs matrix-vector multiplication in a single step in the analog domain, it has attracted increasingly broad attention for neural network applications.
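As an illustrative aside (not part of the claimed subject matter), the following minimal sketch shows why a crossbar performs a matrix-vector product in one step; the sizes and values are hypothetical.

```python
# Minimal sketch: inputs V drive the rows, Ohm's law gives a current
# G[i, j] * V[i] at each crosspoint, and Kirchhoff's current law sums
# the currents of each column.
import numpy as np

rows, cols = 4, 3
G = np.random.rand(rows, cols)   # conductances = stored weights (hypothetical)
V = np.random.rand(rows)         # input voltages

I = V @ G                        # per-column currents: one-step multiply-accumulate
assert I.shape == (cols,)        # each column acts as an independent MAC unit
```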
Deep neural networks (DNNs), represented by convolutional neural networks (CNNs), have been widely used in fields such as image recognition and image processing. As neural networks grow deeper, the number of convolutional layers required grows larger, and the computational overhead of the convolutional layers, such as the analog-to-digital conversion overhead during convolution operations, has become the main overhead of the system. In particular, in the convolution computations of the first few layers of a neural network (the shallow network), the convolutional layers mapped onto the storage-computation-integrated array are small in scale, so the space utilization of the array is insufficient, the computing efficiency of the whole system is low, and the computation overhead is large.
Disclosure of Invention
At least one embodiment of the present disclosure provides a convolutional layer mapping method, including: obtaining the dimensions [K, H, D, N] of the convolutional layer, where N is the number of convolution kernels in the convolutional layer, K is the width of the convolution kernels, H is the height of the convolution kernels, and D is the number of channels of the convolution kernels; expanding the convolutional layer into a 0th matrix with a row height of K×H×D and a column width of N, where the N columns of the 0th matrix respectively correspond to one-dimensional vectors of length K×H×D into which the N convolution kernels are expanded; creating K-1 transformation matrices based on the 0th matrix, where the K-1 transformation matrices include a 1st matrix to a (K-1)-th matrix, the transformation of the m-th matrix relative to the (m-1)-th matrix is that each row number in the m-th matrix equals (the corresponding row number in the (m-1)-th matrix + K) mod (K×H×D), and m is an integer from 1 to K-1; and mapping the 0th matrix to the (K-1)-th matrix into a storage-computation-integrated array, where the storage-computation-integrated array includes at least one mutually independent sub-array.
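For illustration, the sketch below builds the 0th matrix and the K-1 transformation matrices following the claimed row transformation; it is the editor's illustration, not a reference implementation, and the (N, K, H, D) weight layout and function name are assumptions.

```python
# Sketch of the claimed mapping: unroll the [K, H, D, N] layer into matrix 0
# of shape (K*H*D, N), then create K-1 copies whose rows satisfy the claimed
# transformation row_m = (row_{m-1} + K) mod (K*H*D).
import numpy as np

def build_mapping_matrices(weights):
    # weights: (N, K, H, D) -- this axis order is an assumption of the sketch
    N, K, H, D = weights.shape
    # unroll D-first, then H, then K (the expansion order specified below)
    mat0 = weights.transpose(1, 2, 3, 0).reshape(K * H * D, N)
    mats, L = [mat0], K * H * D
    for m in range(1, K):
        rotated = np.empty_like(mats[-1])
        rotated[(np.arange(L) + K) % L] = mats[-1]   # row r moves to (r + K) mod L
        mats.append(rotated)
    return np.hstack(mats)   # (K*H*D, K*N) block to map onto the array

mapped = build_mapping_matrices(np.random.rand(8, 3, 3, 4))   # N=8, K=H=3, D=4
assert mapped.shape == (3 * 3 * 4, 3 * 8)
```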
For example, in the convolutional layer mapping method according to at least one embodiment of the present disclosure, the storage-computation-integrated array includes a single sub-array, and mapping the 0th matrix to the (K-1)-th matrix into the array includes: mapping the 0th matrix through the (K-1)-th matrix into the single sub-array.
For example, in the convolutional layer mapping method according to at least one embodiment of the present disclosure, mapping the 0th matrix to the (K-1)-th matrix into the storage-computation-integrated array includes: arranging the 0th matrix to the (K-1)-th matrix in sequence and mapping them onto sequentially arranged portions of the array.
For example, in the convolutional layer mapping method according to at least one embodiment of the present disclosure, the N columns of the 0th matrix respectively correspond to one-dimensional vectors of length K×H×D into which the N convolution kernels are expanded, and the mapping method includes: expanding each of the N convolution kernels into a one-dimensional vector of length K×H×D according to a specified expansion order, where the specified expansion order is: first expand along the channel number D of the convolution kernel, then along the height H of the convolution kernel, and finally along the width K of the convolution kernel.
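A quick check of this expansion order (an editorial illustration, assuming the kernel is held as a (K, H, D) array so that a C-order flatten makes D vary fastest):

```python
# Element (k, h, d) lands at flat index k*H*D + h*D + d: d varies fastest,
# then h, then k, matching the D-first, then-H, then-K expansion order.
import numpy as np

K, H, D = 3, 3, 2
kernel = np.arange(K * H * D).reshape(K, H, D)   # synthetic kernel values
vec = kernel.reshape(-1)                          # C-order flatten
assert vec[1 * H * D + 2 * D + 0] == kernel[1, 2, 0]
```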
At least one embodiment of the present disclosure provides a convolution operation method, including: mapping the convolutional layer used for the convolution operation into the storage-computation-integrated array according to the mapping method of any embodiment of the present disclosure; acquiring multiple batches of input data from an input feature map in a sliding manner, where in each slide, all input data in the input feature map corresponding to the convolutional layer are read and expanded into one dimension as the input data of the current batch, the number of channels of the input feature map is D, and the convolution stride of the convolution operation is 1; and inputting the batches of input data into the storage-computation-integrated array for computation to perform the convolution operation.
For example, in the convolution operation method according to at least one embodiment of the present disclosure, acquiring the batches of input data from the input feature map in a sliding manner includes: performing multiple horizontal slides over each row of the input feature map to acquire the batches of input data corresponding to the current row, where the sliding step of each horizontal slide is K.
For example, in the convolution operation method according to at least one embodiment of the present disclosure, acquiring the batches of input data from the input feature map in a sliding manner includes: expanding the input data read from the input feature map into a one-dimensional vector of length K×H×D in the same expansion order as the convolution kernels.
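A hedged sketch of this acquisition step, assuming an (H_in, W_in, D) feature-map layout and ignoring edge windows that a real implementation would pad or handle separately:

```python
# Slide vertically with step 1 and horizontally with step K, read the
# K x H x D patch under the window, and flatten it in the same D-first
# order as the kernels.
import numpy as np

def input_batches(ifm, K, H):
    # ifm: input feature map, assumed layout (H_in, W_in, D)
    H_in, W_in, D = ifm.shape
    for top in range(H_in - H + 1):                 # vertical slide, step 1
        for left in range(0, W_in - K + 1, K):      # horizontal slide, step K
            patch = ifm[top:top + H, left:left + K, :]
            # match the kernel expansion: K slowest, then H, channels D fastest
            yield patch.transpose(1, 0, 2).reshape(-1)

batches = list(input_batches(np.random.rand(5, 7, 2), K=3, H=3))
assert all(b.shape == (3 * 3 * 2,) for b in batches)
```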
For example, in the convolution operation method according to at least one embodiment of the present disclosure, inputting the batches of input data into the storage-computation-integrated array for computation includes: when inputting the batches of input data, dividing the input data of the current batch into K parts along the width dimension of the convolution kernel, and inputting the partial one-dimensional vector corresponding to each of the K parts into the array in a time-shared manner through a multiplexing circuit.
For example, in the convolution operation method according to at least one embodiment of the present disclosure, inputting the K corresponding partial one-dimensional vectors into the array in a time-shared manner through the multiplexing circuit includes: in each time-shared operation period, inputting the corresponding partial one-dimensional vector into the array as the input signal and performing a multiply-accumulate operation with the convolutional layer, so that K×N output results are obtained simultaneously.
For example, in the convolution operation method according to at least one embodiment of the present disclosure, the K×N output results include output valid values and possibly output invalid values; the valid values are retained, and the invalid values are discarded and not counted in the final convolution result.
For example, in the convolution operation method according to at least one embodiment of the present disclosure, the storage-computation-integrated array includes a plurality of storage-computation-integrated devices arranged in an array.
For example, in the convolution operation method according to at least one embodiment of the present disclosure, the storage-computation-integrated device includes a memristor, an SRAM cell, a DRAM cell, a PCM cell, or a Flash cell.
At least one embodiment of the present disclosure provides a convolutional layer mapping apparatus, including: a storage-computation-integrated array including at least one mutually independent sub-array; a dimension obtaining module configured to obtain the dimensions [K, H, D, N] of the convolutional layer, where N is the number of convolution kernels in the convolutional layer, K is the width of the convolution kernels, H is the height of the convolution kernels, and D is the number of channels of the convolution kernels; a convolution expansion module configured to expand the convolutional layer into a 0th matrix with a row height of K×H×D and a column width of N, where the N columns of the 0th matrix respectively correspond to one-dimensional vectors of length K×H×D into which the N convolution kernels are expanded; a matrix transformation module configured to create K-1 transformation matrices based on the 0th matrix, where the K-1 transformation matrices include a 1st matrix to a (K-1)-th matrix, the transformation of the m-th matrix relative to the (m-1)-th matrix is that each row number in the m-th matrix equals (the corresponding row number in the (m-1)-th matrix + K) mod (K×H×D), and m is an integer from 1 to K-1; and a weight mapping module configured to map the 0th matrix to the (K-1)-th matrix into the storage-computation-integrated array.
At least one embodiment of the present disclosure provides a convolution operation apparatus, including: the convolutional layer mapping apparatus according to any embodiment of the present disclosure; a data acquisition module configured to acquire multiple batches of input data from an input feature map in a sliding manner, where in each slide, all input data in the input feature map corresponding to the convolutional layer are read and expanded into one dimension as the input data of the current batch, the number of channels of the input feature map is D, and the convolution stride of the convolution operation is 1; and an input control module configured to input the batches of input data into the storage-computation-integrated array for computation to perform the convolution operation.
Drawings
To illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings of the embodiments are briefly introduced below. The drawings in the following description relate only to some embodiments of the present disclosure and do not limit the present disclosure.
FIG. 1A is a schematic diagram of a convolutional neural network;
FIG. 1B is a diagram of a convolutional layer performing a multi-channel convolution operation;
FIG. 1C is a schematic diagram of a convolution calculation process;
FIG. 2A is a schematic diagram of a convolutional neural network system architecture based on a storage-computation-integrated array;
FIG. 2B is a schematic diagram of a control gating module;
FIG. 2C is a schematic diagram of a structure of a memristor array;
FIG. 3 is a schematic diagram of convolution computation implemented on a storage-computation-integrated array;
FIG. 4A is a diagram illustrating a convolutional layer mapping method;
FIG. 4B is a diagram illustrating the convolution calculation results of the convolution layer mapping method of FIG. 4A;
FIG. 5A is a diagram illustrating another convolutional layer mapping method;
FIG. 5B is a diagram illustrating the convolution calculation results of the convolution layer mapping method of FIG. 5A;
FIG. 6 is a schematic diagram of a convolutional layer mapping method provided by an embodiment of the present disclosure;
FIG. 7 is a flowchart of a convolution operation method according to an embodiment of the disclosure;
FIG. 8A and FIG. 8B are schematic diagrams of a convolution operation process provided by an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a convolutional layer mapping apparatus provided by an embodiment of the present disclosure; and
FIG. 10 is a schematic diagram of a convolution operation apparatus provided by an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments that a person skilled in the art can derive from the described embodiments without inventive effort fall within the scope of protection of the present disclosure.
Unless defined otherwise, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and the like in this disclosure is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
Detailed descriptions of known functions and known components may be omitted in order to keep the following description of the embodiments of the present disclosure clear and concise.
A neural network is a mathematical computation model inspired by the structure of brain neurons and the principle of neural conduction; computing intelligently on the basis of such models is called brain-inspired computing. Neural networks come in various network structures, such as the back propagation (BP) neural network, the convolutional neural network (CNN), the recurrent neural network (RNN), and the long short-term memory network (LSTM); convolutional neural networks can be further subdivided into fully convolutional networks, deep convolutional networks, U-networks (U-Net), and so on.
For example, a typical convolutional neural network includes an input terminal, an output terminal, and a plurality of processing layers. The input terminal receives the data to be processed, such as an image to be processed, and the output terminal outputs the processing result, such as a processed image. The processing layers may include, but are not limited to, convolutional layers, pooling layers, fully connected layers, batch normalization (BN) layers, activation layers, and the like, and their contents and combinations vary with the structure of the convolutional neural network. After input data enter the convolutional neural network, the corresponding output is obtained through several processing layers; for example, the input data may undergo convolution, upsampling, downsampling, normalization, full connection, flattening, and other operations through the processing layers.
Convolutional layers are the core layers of a convolutional neural network; they apply several convolution kernels (also called filters) to the input data (input images or input feature maps) to perform various kinds of feature extraction. Each convolution kernel can extract one type of feature. The result obtained by applying one convolution kernel to the input data is called a feature map, and the number of feature maps equals the number of convolution kernels. The feature map output by one convolutional layer can be fed into the next convolutional layer for further processing to obtain a new feature map. A pooling layer is an intermediate layer sandwiched between successive convolutional layers; it reduces the size of the input data and, to some extent, the over-fitting phenomenon. Pooling can be implemented in many ways, including but not limited to: max-pooling, average pooling (avg-pooling), random pooling, undersampling (e.g., selecting fixed pixels), and demultiplexed output (demuxout, splitting an input image into multiple smaller images). Typically the last subsampling layer or convolutional layer is connected to one or more fully connected layers that combine all the extracted features. For example, the output of the fully connected layer is used as the final output, yielding a one-dimensional matrix, i.e., a vector.
FIG. 1A abstractly illustrates the input and output of a neuron in a convolutional neural network. As shown in fig. 1A, C1 and C2 to Cn refer to different signal channels, and for a certain local receptive field (the local receptive field includes a plurality of channels), different filters are used to convolve data on the C1 to Cn signal channels of the local receptive field, and the convolution result is input to a stimulus node, which calculates according to a corresponding function to obtain characteristic information. For example, the convolutional neural network may be a deep convolutional neural network having two or more layers.
The above description is merely an exemplary description of the convolutional neural network, and the present disclosure does not limit the structure of the convolutional neural network.
The input data of the convolution calculation may be an image or other two-dimensional matrix information, and both the input image and the convolution layer may be represented in a matrix form, so that the convolution calculation may be performed on the input image and the convolution layer to obtain an output feature image.
FIG. 1B shows the process of a convolutional layer performing a multi-channel convolution operation. As shown in FIG. 1B, the dimensions of a convolutional layer can be denoted by [K, H, D, N]. For example, N denotes the number of output channels of the convolutional layer, commonly referred to as the number of convolution kernels; that is, a convolutional layer may include one or more convolution kernels, e.g., the convolutional layer shown in FIG. 1B has N convolution kernels. The dimensions of each convolution kernel in the convolutional layer can be denoted by [K, H, D]. The height H and width K of a convolution kernel may be the same or different; in general they are the same and may be 1, 3, 5, and so on. Since the input image may include multiple channels, for example a channel number of 3 for an RGB image (corresponding to the three different primary colors R, G, and B), each convolution kernel correspondingly includes a number D of input channels. In the convolutional neural network shown in FIG. 1B, the dimensions of the input feature maps (or input images) can be denoted by [H_in, W_in, D, M], where M is the number of input feature maps, and H_in and W_in are the height and width of each input feature map, which may be equal or unequal.
Since the number of input channels of the input feature map is equal to the number of input channels of the convolution kernel, in the embodiment of the present disclosure, the number of input channels of the input feature map is also denoted by D.
As shown in FIG. 1B, N convolution kernels of size K×H with D channels are used to perform convolution operations on M input feature maps of size H_in×W_in with D channels, yielding M groups of output feature maps of size H_out×W_out with N channels, so the dimensions of the output feature maps can be denoted [H_out, W_out, N, M], where H_out and W_out are the height and width of each output feature map.
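For stride 1 and no padding, the output size follows directly from these dimensions; a one-line check (editor's arithmetic, anticipating the 5×7 example of FIG. 1C below):

```python
# With stride 1 and no padding: H_out = H_in - H + 1 and W_out = W_in - K + 1.
H_in, W_in, H, K = 5, 7, 3, 3
H_out, W_out = H_in - H + 1, W_in - K + 1
assert (H_out, W_out) == (3, 5)   # a 5 x 7 map and 3 x 3 kernel give a 3 x 5 output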
In one convolution operation, for example, the convolution kernel convolves a first region of the input feature image, and the result corresponds to a first point on the output feature image; the many convolution results obtained by convolving the kernel with the input feature image many times together constitute the output feature image. The kernel slides over the input feature image to perform these convolutions: the sliding window size is the height × width of the kernel; at each slide, the overlapping portion of the kernel and the input feature image undergoes an element-wise (matrix inner product) multiplication, and the products of all elements are accumulated into one value, which is the result of one convolution operation. The convolution stride denotes how far the kernel slides over the input feature image each time and may be, for example, 1, 2, or 3; in the sliding convolution computation it is usually 1.
The convolution operation process of a convolutional layer will be described in detail with reference to FIG. 1C.
FIG. 1C shows the convolution computation process for a single channel and a single convolution kernel. The single-channel input feature map IFM has size 5×7, and the positions of its elements are labeled a, b, c, …, u; a single-channel 3×3 convolution kernel is chosen, and the positions of its elements are labeled 1, 2, …, 9. The computed result at the n-th position of the output feature map OFM is denoted OFMn. For example, the result at the 1st position of OFM is obtained by multiplying and accumulating the kernel with each element in the first 3×3 region of the input feature map, i.e., OFM1 = 1×a + 2×b + 3×c + 4×d + 5×e + 6×f + 7×g + 8×h + 9×i. The kernel is then slid horizontally over the input feature map by a convolution stride of 1, corresponding to the second 3×3 region, and the result at the 2nd position is computed the same way: OFM2 = 1×d + 2×e + 3×f + 4×g + 5×h + 6×i + 7×j + 8×k + 9×l. When the kernel has slid to the last region of row 1 in the horizontal direction and completed its convolution, it moves down to row 2 of the input feature map and continues the horizontal sliding-window convolution until all regions of the input feature map have been scanned. For example, when the kernel slides horizontally to the last region of row 1, corresponding to the output at row 1, column 5 of the output feature map, OFM5 = 1×m + 2×n + 3×o + 4×p + 5×q + 6×r + 7×s + 8×t + 9×u; the kernel then moves down one row and convolves the 6th region of the input feature map, corresponding to the output at row 2, column 1 of the output feature map, i.e., OFM6. By analogy, all convolution results of the output feature map are finally obtained.
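This example can be reproduced numerically. In the sketch below (editor's reconstruction), the letters a..u are replaced by 1..21; the column-wise labeling of both the input band and the kernel is inferred from the OFM formulas above, not stated explicitly in the figure.

```python
# Reproduce the first output row of the FIG. 1C example with a..u -> 1..21.
import numpy as np

ifm_band = np.arange(1, 22).reshape(7, 3).T   # 3 x 7 band; column j = (3j+1, 3j+2, 3j+3)
kernel = np.arange(1, 10).reshape(3, 3).T     # columns (1,2,3), (4,5,6), (7,8,9)
row1 = [int((kernel * ifm_band[:, j:j + 3]).sum()) for j in range(5)]
# row1[0] = 1*a + 2*b + ... + 9*i with a..i = 1..9, i.e. 1*1 + 2*2 + ... + 9*9
assert row1[0] == sum(p * p for p in range(1, 10))
```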
The multiply-accumulate operations of a convolution can be conveniently implemented by a storage-computation-integrated array. For example, in a storage-computation-integrated array represented by a nonvolatile memory array such as a memristor array, the computing unit and the storage unit are merged, which effectively overcomes the memory-wall problem of the von Neumann architecture; compared with conventional processor-based computing devices, such storage-computation-integrated devices offer high computing efficiency and low power consumption, and therefore provide good hardware support for the convolution operations of convolutional neural networks. Neural network convolution based on a storage-computation-integrated array greatly reduces the system overhead caused by moving data between the computing unit and the storage unit and effectively increases the computing speed of the neural network. In embodiments of the present disclosure, the array may be composed of nonvolatile memory devices, including but not limited to memristors (for example, resistive random access memory (RRAM)), static random access memory (SRAM) cells, dynamic random access memory (DRAM) cells, phase change memory (PCM) cells, Flash memory cells, or other suitable nonvolatile memory devices.
A convolutional neural network system architecture based on a storage-computation-integrated array generally includes an input module, an output module, a computing unit, a control module, and so on. For example, FIG. 2A shows a schematic diagram of such a system architecture, which includes an input buffer module, a digital-to-analog conversion module, a gating switch module, a storage-computation-integrated array (e.g., a memristor array), an analog-to-digital conversion module, an output buffer module, and a control module. For example, the input module includes the input buffer module, the digital-to-analog conversion module, and the gating switch module; the computing unit includes the storage-computation-integrated array; and the output module includes the analog-to-digital conversion module and the output buffer module. These modules may be implemented by circuitry, such as digital circuitry and/or analog circuitry.
For example, the input buffer module is configured to receive and buffer input data of the input feature map, and output the input data of the input feature map buffered in the input buffer module to the digital-to-analog conversion module. For example, the input cache module may be an SRAM-based memory, a DRAM-based memory.
For example, the digital-to-analog conversion (DAC) module converts the input data output by the input buffer module from a digital signal to an analog signal and outputs it to the memristor array. For example, the digital-to-analog conversion module may convert the input data into an analog voltage. Alternatively, it may be implemented as a pulse-number encoder that converts the input data into a number of fixed-voltage pulses, the pulse count representing the magnitude of the input value.
For example, as shown in FIG. 2B, the control gating module may be a single-input, multiple-output block. It connects an input signal from an input channel to a selected one (or several) of the output channels 1 through N according to a gating signal. For example, different regions of the memristor array may be supplied with input signals in turn through the control gating module. For example, the control gating module distributes the input signal over different regions of the memristor array depending on the type and number of digital-to-analog conversion modules provided in the circuit. For example, the control gating module may be implemented by a multiplexing (MUX) circuit, e.g., with a plurality of CMOS transmission gates.
For example, the storage-computation-integrated array implements in-memory computing. Embodiments of the present disclosure describe the array structure taking a memristor array as an example, but the array's components are not limited to memristors and may be other memory devices such as SRAM cells, DRAM cells, PCM cells, and Flash cells.
For example, the memristor array may include a plurality of memristors arranged in an array, and each memristor array may employ the structure shown in fig. 2C, or may employ other structures capable of performing matrix multiplication calculations.
For example, the memristor array illustrated in fig. 2C is made up of a plurality of memristor cells, each including a switching element, such as a transistor or the like, and a memristor device. The plurality of memristor cells form an array of M rows and N columns, where M and N are both positive integers. In fig. 2C, WL1, WL2, …, WLM respectively represent word lines of the first, second, …, mth rows, and the control electrodes (e.g., gates of transistors) of the switching elements in the memristor cell circuits of each row are connected to the word line corresponding to the row; BL1, BL2, … and BLN respectively represent bit lines of a first column, a second column, … and an Nth column, and a memristor in the memristor unit circuit of each column is connected with the corresponding bit line of the column; SL1, SL2, … and SLM respectively represent source lines of a first row, a second row, … and an mth row, and the source electrode of the transistor in the memristor unit circuit of each row is connected with the source line corresponding to the row. For example, the word line WL controls row gating on the memristor array, the bit line BL provides an input signal, the source line SL serves as a current accumulation line, and the current accumulation is completed according to kirchhoff's law.
The memristor cells in the memristor array shown in fig. 2C may be, for example, a 1T1R structure or a 2T2R structure, where the memristor cells of the 1T1R structure include one switching transistor and one memristor (as shown in fig. 2C), and the memristor cells of the 2T2R structure include two switching transistors and two memristors. The present disclosure has no limitations on the type, structure, etc. of the memristor devices. It should be noted that the transistors used in the embodiments of the present disclosure may be thin film transistors or field effect transistors (e.g., MOS field effect transistors) or other switching devices with the same characteristics. The source and drain of the transistor used herein may be symmetrical in structure, so that there may be no difference in structure between the source and drain. Embodiments of the present disclosure do not limit the type of transistors employed.
For example, an analog-to-digital conversion (ADC) module in fig. 2A may convert an analog signal output by a memristor array to a digital signal. For example, the analog-to-digital conversion module may be implemented by analog circuits and/or digital circuits. For example, the output current signal may be converted into a voltage signal by the operational amplifier circuit, and then analog-to-digital conversion may be implemented by the ADC.
For example, the output buffer module in fig. 2A buffers the conversion result of the analog-to-digital conversion module, and outputs the calculation result to the subsequent module to be operated. For example, the output buffer module is identical in composition to the input buffer module.
For example, the control module in fig. 2A is responsible for controlling and coordinating the operations of the various modules. For example, the input and output of the input buffer module and the output buffer module are controlled, for example, the control signal is provided for controlling the gating module. For example, as shown in fig. 2A, under the control of the control module, the input buffer module buffers and transmits received digital input data to the digital-to-analog conversion module, an analog input signal obtained through digital-to-analog conversion is transmitted to the gating switch module, the gating switch module transmits the analog input signal to a specified input channel, the analog input signal is input to the memristor array through the bit line end, the memristor array completes multiply-accumulate operation of convolution operation, and an analog calculation result is processed by the analog-to-digital conversion module to obtain a digital output result and output the digital output result to the output buffer module.
FIG. 3 is a schematic diagram of convolution computation of a convolutional neural network implemented on a storage-computation-integrated array. For example, as shown in FIG. 3, at the <n>-th network layer of the convolutional neural network, the input feature map has dimensions [H_in, W_in, D] and the convolutional layer has dimensions [K, H, D, N]; that is, the convolutional layer includes N convolution kernels of size K×H with D channels. Each of the N convolution kernels is expanded into a one-dimensional vector of length K×H×D and mapped onto one column of the array; for example, the weight of each element of a kernel is mapped to the resistance of a memristor in the memristor array. With the N kernels mapped onto N parallel columns of the memristor array, parallel computation can be realized because the N kernels share the same input. The portion of the input feature map used for convolution in each slide (also of dimensions [K, H, D]) is expanded into a one-dimensional vector of length K×H×D in the same order as the convolutional layer; the one-dimensional vectors (1), (2), …, (X) successively expanded from the input feature map serve as input signals (e.g., voltage signals) of the memristor array and are multiply-accumulated on the array, and the resulting output signals (e.g., current signals) (1), (2), …, (Y) are combined into the values of the output feature map, whose size is [H_out, W_out, N].
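A compact sketch of this baseline mapping (editor's illustration; conv_via_array and its layouts are assumptions, and the array read is modeled as an exact dot product):

```python
# Baseline mapping: each of the N kernels occupies one array column; every
# sliding window is flattened the same way and applied as one input vector.
import numpy as np

def conv_via_array(ifm, kernels):
    # ifm: (H_in, W_in, D); kernels: (N, K, H, D) -- assumed layouts
    N, K, H, D = kernels.shape
    H_in, W_in, _ = ifm.shape
    cols = kernels.reshape(N, -1).T        # (K*H*D, N) weight block on the array
    out = np.zeros((H_in - H + 1, W_in - K + 1, N))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            window = ifm[r:r + H, c:c + K, :]
            v = window.transpose(1, 0, 2).reshape(-1)   # same unrolling as kernels
            out[r, c] = v @ cols           # one (ideal) array operation per window
    return out

out = conv_via_array(np.random.rand(5, 7, 2), np.random.rand(4, 3, 3, 2))
assert out.shape == (3, 5, 4)   # one output point per window, N channels
```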
In existing storage-computation-integrated array designs, the peripheral circuits, including the digital-to-analog conversion modules (DACs) and analog-to-digital conversion modules (ADCs), account for significant area and power overhead. The inventors of the present disclosure noted that in the convolution operations of the first few layers (the shallow network) of a neural network, the number of convolution kernels in a convolutional layer is generally small, and mapping these few kernels one by one onto individual columns of the memristor array leaves the array poorly utilized. For example, when a convolutional layer of dimensions [3, 3, 64, 32] is mapped onto a memristor array with 576 rows and 128 columns, each of the 32 convolution kernels is expanded into a one-dimensional vector of length 3×3×64 = 576, whose 576 elements are mapped onto the 576 rows of one column of the array. In this mapping mode, only 32 columns and 576 rows of memristors are actually used; the remaining 96 columns go unused, and to minimize the power they consume, the memristors in those 96 columns must be set to the highest resistance state (R_max) to minimize the current flowing through them.
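The utilization figure can be checked directly (editor's arithmetic):

```python
# A [3, 3, 64, 32] layer on a 576 x 128 array: 576 rows but only 32 columns used.
rows_used, cols_used = 3 * 3 * 64, 32
utilization = (rows_used * cols_used) / (576 * 128)
assert rows_used == 576 and utilization == 0.25   # 96 columns sit idle in R_max
```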
As shown in FIG. 4A, each convolution kernel has a size of 3×3, and the positions of its elements are labeled 1 to 9 (note that these position labels are not the actual element values). All N convolution kernels are expanded into one-dimensional vectors along the channel number D and mapped onto the memristor array. Each black block labeled 1 to 9 on the array consists of the elements at that position across the D channels of the N kernels; for example, the black block labeled 1 represents the matrix formed by all elements, over the D channels, at row 1, column 1 of the N kernels. Thus the row count of each black block is the kernel input channel number D, its column count is the kernel number N, and each black block represents a D×N matrix. In the mapping scheme shown in FIG. 4A, only N columns of memristors on the array are used; the blank area in the figure indicates the memristors that are not utilized, all of which remain in the high-resistance state during convolution.
Because the area and power consumption of a digital-to-analog conversion module (DAC) grow exponentially with its precision, i.e., they roughly double for every additional bit of DAC precision, two schemes are currently used to implement the convolution computation of a neural network.
One scheme uses a low-precision DAC for digital-to-analog conversion, e.g., a 2-bit DAC. As noted above, the area and power of a low-precision DAC are relatively small, so one DAC cell can be provided for each row of the memristor array. For example, to improve input parallelism, low-precision DACs replace a high-precision DAC so that every row of the array has its own DAC cell and all rows can work simultaneously. As shown in FIG. 4A, the convolution computing system includes low-precision digital-to-analog conversion modules, the memristor array, a control gating module at the output, analog-to-digital conversion modules, and a shift-accumulate module. For example, the low-precision digital-to-analog conversion modules convert the received input signals into analog signals and input them into the memristor array simultaneously.
However, when the precision of the input data exceeds that of the DAC, this low-precision parallel scheme requires the input to be split over multiple cycles. For example, an 8-bit input driven by a 2-bit DAC must be split, from the high bits to the low bits, into four 2-bit parts, each corresponding to one cycle; the input is thus fed into the memristor array four times, the result of each cycle is read out by the ADC, and the four results are correspondingly shifted and accumulated in the digital circuit to obtain the final result. As shown in FIG. 4B: the first input corresponds to bits [7:6], and its converted result is buffered and shifted left by 6 bits for accumulation; the second input corresponds to bits [5:4], and its result is shifted left by 4 bits for accumulation; the third input corresponds to bits [3:2], and its result is shifted left by 2 bits for accumulation; the fourth input corresponds to bits [1:0], and its result is added to the shifted intermediate results of the previous three cycles to obtain the final result. However, the shifting also amplifies the error present in each cycle's result; for example, the amplified error on the high bits can be large enough to overwhelm the contribution of the low bits, so processing the input in cycles through a low-precision DAC still cannot improve the output precision. Hence the low-precision DAC scheme not only needs multiple computation cycles to obtain a complete result, but may also reduce the accuracy of the result.
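A minimal model of this bit-sliced input scheme, assuming unsigned inputs and an ideal, noise-free readout (the real array adds per-cycle ADC error, which the shifts amplify as described):

```python
# Bit-sliced input: split an 8-bit input into four 2-bit slices, apply one
# slice per cycle, shift each digitized result back, and accumulate.
import numpy as np

def sliced_mac(x, w, slice_bits=2, total_bits=8):
    acc = 0
    for shift in range(total_bits - slice_bits, -1, -slice_bits):
        part = (x >> shift) & ((1 << slice_bits) - 1)   # e.g. bits [7:6] first
        acc += (part @ w) << shift                      # shift-and-accumulate
    return acc

x = np.array([200, 13, 97], dtype=np.int64)   # unsigned 8-bit inputs
w = np.array([3, 1, 2], dtype=np.int64)       # stored weights
assert sliced_mac(x, w) == x @ w              # exact only with noise-free readout
```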
The other scheme uses a high-precision DAC for digital-to-analog conversion, e.g., an 8-bit DAC. However, because a high-precision DAC is relatively large, it is difficult to give every row its own DAC; to save digital-to-analog conversion overhead, the DAC is time-multiplexed through the control gating module. As shown in FIG. 5A, the convolution computing system includes a high-precision digital-to-analog conversion module, a control gating module at the input, the memristor array, a control gating module at the output, and an analog-to-digital conversion module. For example, after the high-precision digital-to-analog conversion module converts the received input signal into an analog signal, the control gating module at the input distributes it to different output channels in a time-shared manner.
For example, in the convolution computing system of FIG. 5A, to supply data to all required rows of the memristor array, the high-precision DAC is time-multiplexed 4 times, each time serving 1/4 of the required rows and producing a corresponding intermediate result; the four intermediate results are then accumulated into the final result, and the output of the memristor array is shown in FIG. 5B. For example, with an 8-bit DAC time-multiplexed 4 times, the output of each array read is 12 bits, of which, e.g., the lower two bits are affected by error noise; adding the 4 outputs yields a 14-bit final result in which, e.g., only the lower 4 bits are affected by error noise. This scheme has high computational precision, but since the DAC must be time-multiplexed several times, a complete result takes multiple cycles and the overall computation speed of the system is low.
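A sketch of this row-group time multiplexing under the same ideal-readout assumption; splitting the input vector into four row groups and summing the intermediate results reproduces the single-shot product:

```python
# Drive the required rows in four groups and sum the four intermediate
# readouts; ideally this equals the full matrix-vector product.
import numpy as np

G = np.random.rand(16, 4)            # 16 rows of weights, 4 output columns
v = np.random.rand(16)               # full input vector
final = sum(v[g * 4:(g + 1) * 4] @ G[g * 4:(g + 1) * 4] for g in range(4))
assert np.allclose(final, v @ G)
```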
At least one embodiment of the present disclosure provides a convolutional layer mapping method, including: obtaining the dimensions [K, H, D, N] of the convolutional layer, where N is the number of convolution kernels in the convolutional layer, K is the width of the convolution kernels, H is the height of the convolution kernels, and D is the number of channels of the convolution kernels; expanding the convolutional layer into a 0th matrix with a row height of K×H×D and a column width of N, where the N columns of the 0th matrix respectively correspond to one-dimensional vectors of length K×H×D into which the N convolution kernels are expanded; creating K-1 transformation matrices based on the 0th matrix, where the K-1 transformation matrices include a 1st matrix to a (K-1)-th matrix, the transformation of the m-th matrix relative to the (m-1)-th matrix is that each row number in the m-th matrix equals (the corresponding row number in the (m-1)-th matrix + K) mod (K×H×D), and m is an integer from 1 to K-1; and mapping the 0th matrix to the (K-1)-th matrix into a storage-computation-integrated array, where the storage-computation-integrated array includes at least one mutually independent sub-array.
At least one embodiment of the present disclosure further provides a convolution operation method, including: mapping the convolutional layer used for the convolution operation into the storage-computation-integrated array according to the mapping method provided by any embodiment of the present disclosure; acquiring multiple batches of input data from an input feature map in a sliding manner, where in each slide, all input data in the input feature map corresponding to the convolutional layer are read and expanded into one dimension as the input data of the current batch, the number of channels of the input feature map is D, and the convolution stride of the convolution operation is 1; and inputting the batches of input data into the storage-computation-integrated array for computation to perform the convolution operation.
In at least one embodiment of the present disclosure, the first output point of the output feature map in the convolution computation of FIG. 1C can be written as OFM1 = Psum1 + Psum2 + Psum3, where Psum1 = 1×a + 2×b + 3×c, Psum2 = 4×d + 5×e + 6×f, and Psum3 = 7×g + 8×h + 9×i; that is, each column of the input feature map is multiplied against a column of the convolution kernel and the products are summed. Likewise, the second output point can be written as OFM2 = Psum1 + Psum2 + Psum3, where Psum1 = 1×d + 2×e + 3×f, Psum2 = 4×g + 5×h + 6×i, and Psum3 = 7×j + 8×k + 9×l. As the input data slides onward, the third output point can be written as OFM3 = Psum1 + Psum2 + Psum3, where Psum1 = 1×g + 2×h + 3×i, Psum2 = 4×j + 5×k + 6×l, and Psum3 = 7×m + 8×n + 9×o.
For example, as shown in FIG. 6, at least one embodiment of the present disclosure expands the three-dimensional convolution kernels into one-dimensional vectors arranged in the order shown in FIG. 6. In FIG. 6, the row count of each black block in the storage-computation-integrated array equals the kernel input channel number D, and its column count equals the kernel number N. In at least one embodiment of the present disclosure, the expanded N convolution kernels are copied in the horizontal direction of the array, the number of copies being K-1. For example, the mapping method provided by at least one embodiment further adjusts the order in which each copied transformation matrix is mapped onto the array, so that the input of every row can operate against three different kernel columns as shown in FIG. 6, and 3×N output results are obtained simultaneously.
For example, the convolution stride in the mapping and convolution operation methods provided by at least one embodiment of the present disclosure is 1, so most data of the input feature map can be reused K-1 times. For example, in the computation described above, the second column of input data d, e, f is used 2 times and the third column is used 3 times. At least one embodiment of the present disclosure therefore improves the reusability of the input feature map and the space utilization of the storage-computation-integrated array, which reduces the number of calls to the digital-to-analog conversion module and improves the computation speed and accuracy.
For example, the convolutional layer mapping method provided by at least one embodiment of the present disclosure may map the 0th to (K-1)-th matrices into a single sub-array of the storage-computation-integrated array, for example one whose row height is at least K×H×D and whose column width is at least K×N; alternatively, when the size of a single sub-array is smaller than the size the convolution kernels require, the 0th to (K-1)-th matrices may be mapped into multiple sub-arrays of the array as needed. For example, when the one-dimensional vector into which a kernel expands is long (K×H×D large) and the kernel number N is small, the K×H×D vector may be divided into Z parts and the K×N one-dimensional vectors mapped onto Z sub-arrays, respectively. For example, when the expanded vector is shorter than the row count of a memristor sub-array and the kernel number N is large, the 0th to m-th matrices may be mapped into one sub-array and the remaining matrices up to the (K-1)-th into another.
For example, in the convolutional layer mapping method provided by at least one embodiment of the present disclosure, the 0th to (K-1)-th matrices may be mapped in sequence onto correspondingly arranged portions of the storage-computation-integrated array. For example, they may be arranged in the order in which the transformation matrices were created (forward order), or mapped in reverse or out of order, as long as the position labels of the kernel elements mapped onto each row of the array differ among the 0th to (K-1)-th matrices.
For example, in the convolutional layer mapping method according to at least one embodiment of the present disclosure, the N convolution kernels may each be expanded into a one-dimensional vector of length K×H×D according to a specified expansion order, and the input feature map must be expanded in the same order as the kernels. For example, the specified order may be: expand the channel number D of the kernel first, then the height H, and finally the width K, as illustrated by the black blocks in FIG. 4A. Alternatively, the order may be: expand the width K first, then the height H, and finally the channel number D. Embodiments of the present disclosure do not limit the expansion order of the convolution kernels or the input feature map.
For example, in the convolution operation method provided by at least one embodiment of the present disclosure, the batches of input data of the current row are obtained by sliding horizontally over each row of the input feature map several times, with a sliding step of K per horizontal slide. For example, for a single-channel 3×3 convolution kernel, all 3×3 input data under the window are acquired at each horizontal slide and then divided into K parts by the digital-to-analog conversion module and the gating control module for input into the storage-computation-integrated array, requiring K time-shared operation periods.
For example, in the convolution operation method provided in at least one embodiment of the present disclosure, in each time-sharing operation period, one of the K corresponding partial one-dimensional vectors is input as the input signal into the storage and computation integrated array to perform the multiply-accumulate operation with the convolutional layer, so that K × N output results are obtained at the same time. For example, the K × N output results may or may not include output invalid values; the valid values are retained and the invalid values are discarded. For example, the valid output results of different time-sharing operation periods can be accumulated by the peripheral circuit to obtain the final convolution calculation result; this may be performed, for example, with the high-precision DAC module shown in FIGS. 5A and 5B.
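The K-period scheme can be modeled functionally as below; the gating switch module is approximated by zeroing all rows outside the driven part, `weights` stands for the K × H × D by K × N pattern produced by the mapping method, and the part boundaries assume the D-then-H-then-K expansion order described above. Which of the K × N results are valid in a given period is left to the periphery, as in FIGS. 8A and 8B.

```python
# Functional sketch (assumptions as stated in the text above) of one batch's
# K time-sharing operation periods on the array.
import numpy as np

def run_batch(weights, x, K, H, D):
    part = H * D                         # one kernel column's worth of rows
    results = []
    for t in range(K):                   # one time-sharing operation period each
        drive = np.zeros_like(x)
        sl = slice(t * part, (t + 1) * part)
        drive[sl] = x[sl]                # activate only the t-th input part
        results.append(drive @ weights)  # K*N output results read out at once
    return np.stack(results)             # shape (K, K*N)
```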
Convolution is computed by continuously sliding a window: typically the window slides horizontally across a row, then moves down one row and slides horizontally again, until all input data have been processed. FIG. 7 is a flowchart of a convolution operation method according to at least one embodiment of the present disclosure. As shown in FIG. 7, the convolution operation method includes the following steps:
step 10, judging whether the vertical sliding of the convolution kernel window is finished;
step 20, if the vertical sliding is not finished, reading the input feature map data corresponding to the convolution kernel;
step 30, judging whether the horizontal sliding of the convolution kernel window is finished;
step 40, if the horizontal sliding is not finished, sliding the convolution kernel by K columns in the horizontal direction of the input feature map;
step 50, if the horizontal sliding is finished, sliding the convolution kernel by 1 row in the vertical direction of the input feature map, and then returning to step 10 until the vertical sliding is finished.
For example, a specific implementation of step 20 includes:
step 21, activating the input data of the input feature map corresponding to the first column of the convolution kernel;
step 22, activating the input data of the input feature map corresponding to the second column of the convolution kernel;
…
step 23, activating the input data of the input feature map corresponding to the K-th column of the convolution kernel.
For example, steps 10 through 50 correspond, in the embodiments of the present disclosure, to obtaining multiple batches of input data from the input feature map in a sliding manner; steps 21 through 23 correspond to dividing the current batch of input data into K parts according to the width dimension K of the convolution kernel and inputting the K corresponding partial one-dimensional vectors into the storage and computation integrated array in a time-sharing manner through the multiplexing circuit.
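Steps 10 through 50 can be condensed into the following control-flow sketch; `activate_column` is a hypothetical stand-in for the gating of steps 21 through 23, and the loop bounds assume a stride-1 convolution with a horizontal batch step of K.

```python
# Control-flow sketch of the flowchart of FIG. 7 (assumed boundary handling).
def slide_windows(width, height, K, H, activate_column):
    row = 0
    while row + H <= height:                     # step 10: vertical sliding done?
        col = 0
        while col + K <= width:                  # step 30: horizontal sliding done?
            for k in range(K):                   # step 20 / steps 21-23:
                activate_column(row, col + k, k) # drive one kernel column per period
            col += K                             # step 40: slide K columns right
        row += 1                                 # step 50: slide 1 row down

# Example: a 9-column, 5-row single-channel input with a 3 x 3 kernel.
slide_windows(9, 5, 3, 3, lambda r, c, k: None)
```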
The following describes in detail a mapping method and a convolution operation method proposed in at least one embodiment of the present disclosure with reference to a specific but non-limiting example.
FIGS. 8A and 8B illustrate a convolution operation process provided in at least one embodiment of the present disclosure. During computation on the whole array, each batch of input data may be divided into K activations, each activation corresponding to one column of the convolution kernels of the convolutional layer, i.e., to one of the K corresponding partial one-dimensional vectors of the batch. For example, in the embodiment shown in FIGS. 8A and 8B, for a 3 × 3 convolution kernel each batch of input data is divided into 3 parts, and the calculation results are read out by the ADC and accumulated in the peripheral digital circuit.
As shown in FIG. 8B, in the first time-sharing operation period, the first column of the corresponding portion of the input feature map, i.e., the first input vector of the first batch of input data (the first corresponding partial one-dimensional vector), is input through the gating switch module into the input activation region shown in (a) of FIG. 8A. The first input vector of the first batch is multiplied and accumulated with the kernel weights 1, 2 and 3 of weight copy 1 in the figure to obtain the first third of the output O1, denoted O1,1 and represented as Psum1 in the figure. Meanwhile, the results produced in this first period by the kernel weights 7, 8 and 9 of weight copy 2 and the kernel weights 4, 5 and 6 of weight copy 3 corresponding to the input activation region in array (a) of FIG. 8A are invalid values; that is, Psum2 and Psum3 of weight copy 2 and weight copy 3 have no meaning, are discarded, and are not included in the final calculation result.
In the second time-sharing operation period, the second column of the corresponding portion of the input feature map, i.e., the second input vector of the first batch of input data (the second corresponding partial one-dimensional vector), is input through the gating switch module into the input activation region shown in (b) of FIG. 8A. The second input vector is multiplied and accumulated with the kernel weights 4, 5 and 6 of weight copy 1 in the figure to obtain the second third of the output O1, denoted O1,2 and represented as Psum1 in the figure, and with the kernel weights 1, 2 and 3 of weight copy 2 to obtain the first third of the output O2, denoted O2,1 and represented as Psum2 in the figure. The result of the kernel weights 7, 8 and 9 of weight copy 3 is an invalid value in this second period and is discarded.
In the third time-sharing operation period, the third column of the corresponding portion of the input feature map, i.e., the third input vector of the first batch of input data (the third corresponding partial one-dimensional vector), is input through the gating switch module into the input activation region shown in (c) of FIG. 8A. The third input vector is multiplied and accumulated with the kernel weights 7, 8 and 9 of weight copy 1 in the figure to obtain the last third of the output O1, denoted O1,3 and represented as Psum1 in the figure. At this point O1 = O1,1 + O1,2 + O1,3, yielding the first complete calculated output point of the output feature map. The same input vector is multiplied and accumulated with the kernel weights 4, 5 and 6 of weight copy 2 to obtain the second third of the output O2, denoted O2,2 and represented as Psum2 in the figure, and with the kernel weights 1, 2 and 3 of weight copy 3 to obtain the first third of the output O3, denoted O3,1 and represented as Psum3 in the figure.
Then the window slides three columns to the right in the input feature map to obtain the second batch of input. In the fourth time-sharing operation period, the first column of the corresponding portion of the input feature map, i.e., the first input vector of the second batch of input data, is input through the gating switch module into the input activation region shown in (a) of FIG. 8A. The first input vector of the second batch is multiplied and accumulated with the kernel weights 1, 2 and 3 of weight copy 1 in the figure to obtain the first third of the output O4, denoted O4,1 and represented as Psum1 in the figure; with the kernel weights 7, 8 and 9 of weight copy 2 corresponding to the input activation region in array (a) of FIG. 8A to obtain the last third of the output O2, denoted O2,3 and represented as Psum2 in the figure; and with the kernel weights 4, 5 and 6 of weight copy 3 to obtain the second third of the output O3, denoted O3,2 and represented as Psum3 in the figure. In this fourth period, O2 = O2,1 + O2,2 + O2,3, yielding the second complete calculated output point of the output feature map.
In the fifth time-sharing operation period, the second column of the corresponding portion of the input feature map, i.e., the second input vector of the second batch of input data, is input through the gating switch module into the input activation region shown in (b) of FIG. 8A. The second input vector of the second batch is multiplied and accumulated with the kernel weights 4, 5 and 6 of weight copy 1 in the figure to obtain the second third of the output O4, denoted O4,2 and represented as Psum1 in the figure; with the kernel weights 1, 2 and 3 of weight copy 2 corresponding to the input activation region in array (b) of FIG. 8A to obtain the first third of the output O5, denoted O5,1 and represented as Psum2 in the figure; and with the kernel weights 7, 8 and 9 of weight copy 3 to obtain the last third of the output O3, denoted O3,3 and represented as Psum3 in the figure. In this fifth period, O3 = O3,1 + O3,2 + O3,3, yielding the third complete calculated output point of the output feature map.
In the sixth time-sharing operation period, the third column of the corresponding portion of the input feature map, i.e., the third input vector of the second batch of input data, is input through the gating switch module into the input activation region shown in (c) of FIG. 8A. The third input vector of the second batch is multiplied and accumulated with the kernel weights 7, 8 and 9 of weight copy 1 in the figure to obtain the last third of the output O4, denoted O4,3 and represented as Psum1 in the figure; with the kernel weights 4, 5 and 6 of weight copy 2 corresponding to the input activation region in array (c) of FIG. 8A to obtain the second third of the output O5, denoted O5,2 and represented as Psum2 in the figure; and with the kernel weights 1, 2 and 3 of weight copy 3 to obtain the first third of the output O6, denoted O6,1 and represented as Psum3 in the figure. In this sixth period, O4 = O4,1 + O4,2 + O4,3, yielding the fourth complete calculated output point of the output feature map.
Thereafter, the window slides three columns to the right in the input feature map each time; in the three periods following each slide, the three input vectors of the corresponding batch of input data are input in turn, one third of each of three output points is obtained per period, and a complete output point is assembled per period by the peripheral digital circuit. In the three periods corresponding to the last batch of input data, weight copy 1 produces two invalid values and weight copy 2 produces one invalid value, all of which are discarded.
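The bookkeeping of this pipeline, where each period contributes one column's partial product to K different output points and one complete output point emerges per period once the pipeline fills, can be checked against an ordinary convolution with a short sketch. The per-column scalar weights and the 9-column input row below are stand-ins for the H × D dot products of the example above, and the variable names are illustrative.

```python
# Sketch: accumulate O(i,j) partial sums across periods and compare with a
# direct 1-D convolution (K = 3, stride 1). Stand-in scalars, not the circuit.
import numpy as np

K = 3
w = np.array([1.0, 2.0, 3.0])        # per-width-index column contribution
x = np.arange(1.0, 10.0)             # one input row, 9 columns -> outputs O1..O7

partial = {}                          # window start -> accumulated partial sum
for c, xc in enumerate(x):            # period c drives input column c
    for copy in range(K):             # weight copy `copy` serves window start c-copy
        start = c - copy
        if 0 <= start <= len(x) - K:  # otherwise an invalid value, discarded
            partial[start] = partial.get(start, 0.0) + xc * w[c - start]

direct = np.convolve(x, w[::-1], mode="valid")   # ordinary sliding-window result
assert np.allclose([partial[i] for i in range(len(direct))], direct)
```

As in the description above, window 0 (O1) completes in the third period, window 1 (O2) in the fourth period, and so on, while starts outside the valid range reproduce the discarded invalid values at the row boundaries.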
After the sliding within one row of the input feature map is finished, the operation proceeds to the next row for sliding, until all rows to be processed are finished; the resulting calculated output points are then assembled according to the corresponding dimensions to obtain the output feature map.
In summary, in at least one embodiment of the present disclosure, the gating switch module controls different input activation regions of the input feature map, divides each batch of input data of the input feature map into K parts, and inputs them into the storage and computation integrated array in a time-sharing manner; these are convolved with the convolution kernels that have been copied K-1 times and mapped into the array, yielding partial sums of the output points, and the peripheral circuit adds the output results of different periods to obtain the final output feature map calculation result. In the horizontal direction, all input data corresponding to the convolution kernel size is acquired in each slide, i.e., each batch of input data has size K × H × D and the sliding step is K, which greatly reduces the read power consumption.
Accordingly, the embodiments of the present disclosure improve the area utilization of the storage and computation integrated array, increase the reuse of input data and the precision of the convolution calculation, reduce DAC consumption, and, without lowering the operation speed, alleviate the problems of high power consumption and insufficient ADC range caused by excessive accumulated current.
FIG. 9 is a schematic diagram of a convolutional layer mapping apparatus according to an embodiment of the present disclosure. The mapping apparatus 100 includes: a storage and computation integrated array 110, a dimension acquisition module 120, a convolution expansion module 130, a matrix transformation module 140, and a weight mapping module 150.
For example, the storage and computation integrated array 110 is used to perform the multiply-accumulate operations of the convolution operation and includes at least one mutually independent sub-array; for example, the memory cells of the array may be memristors, SRAM cells, DRAM cells, PCM cells, Flash cells, or other memory devices.
For example, the dimension obtaining module 120 is configured to obtain dimensions [ K, H, D, N ] of the convolutional layer, where N is the number of convolutional kernels in the convolutional layer, K is the width of the convolutional kernels, H is the height of the convolutional kernels, and D is the number of channels of the convolutional kernels.
For example, the convolution expansion module 130 is configured to expand the convolutional layer into a 0th matrix with a row height of K × H × D and a column width of N, where the N columns of the 0th matrix respectively correspond to the one-dimensional vectors of length K × H × D into which the N convolution kernels are respectively expanded.
For example, the matrix transformation module 140 is configured to create K-1 transformation matrices based on the 0th matrix, where the K-1 transformation matrices include the 1st through (K-1)-th matrices, the transformation of the m-th matrix with respect to the (m-1)-th matrix satisfies: row number in the m-th matrix = (row number in the (m-1)-th matrix + K) mod (K × H × D), and m is an integer from 1 to K-1.
For example, the weight mapping module 150 is configured to map the 0th through (K-1)-th matrices into the storage and computation integrated array.
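Taken together, modules 120 through 150 amount to the following sketch; the [N, K, H, D] kernel layout and the NumPy roll realization of the stated row relabeling are assumptions consistent with the formula above, not a verbatim implementation of the apparatus.

```python
# Sketch of the mapping pipeline: unroll to the 0th matrix, create the K-1
# transformation matrices by row relabeling, and concatenate for the array.
import numpy as np

def map_conv_layer(kernels):                      # kernels: [N, K, H, D] (assumed)
    N, K, H, D = kernels.shape
    # 0th matrix: each kernel unrolled (D first, then H, then K) into one column.
    m0 = kernels.reshape(N, K * H * D).T          # row height K*H*D, column width N
    matrices = [m0]
    for m in range(1, K):
        # row in m-th matrix = (row in (m-1)-th matrix + K) mod (K*H*D)
        matrices.append(np.roll(matrices[-1], K, axis=0))
    return np.concatenate(matrices, axis=1)       # K*H*D rows, K*N columns

weights = map_conv_layer(np.random.rand(8, 3, 3, 16))   # N=8, K=3, H=3, D=16
assert weights.shape == (3 * 3 * 16, 3 * 8)
```

The K column blocks of the returned pattern correspond to weight copy 1 through weight copy K in FIGS. 8A and 8B, written side by side into the array.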
FIG. 10 is a schematic diagram of a convolution operation apparatus 200 according to an embodiment of the present disclosure. The convolution operation apparatus 200 includes the mapping apparatus 100 according to at least one embodiment of the present disclosure, a data acquisition module 210, and an input control module 220.
For example, the data acquisition module 210 is configured to obtain a plurality of batches of input data from the input feature map in a sliding manner, wherein in each slide, all input data corresponding to the convolutional layer in the input feature map is read and expanded one-dimensionally as the current batch of input data.
For example, the input control module 220 is configured to input the batches of input data into the storage and computation integrated array for computation to perform the convolution operation. For example, the input control module includes a switch gating module, a digital-to-analog conversion module, and the like; the switch gating module may be, for example, a multiplexer.
For example, the convolution operation apparatus provided in at least one embodiment of the present disclosure may further include an output module, a control module, and the like. The embodiment of the present disclosure does not limit the module composition of the convolution operation device.
Although the present disclosure has been described in detail above through general descriptions and specific embodiments, it will be apparent to those skilled in the art that modifications and improvements can be made on the basis of the embodiments of the present disclosure. Such modifications and improvements are therefore intended to fall within the protection scope claimed by the present disclosure.
For the present disclosure, the following points are further explained:
(1) The drawings of the embodiments of the present disclosure relate only to the structures involved in these embodiments; for other structures, reference may be made to common designs.
(2) In the drawings used to describe the embodiments of the present disclosure, the thickness of layers or regions is exaggerated or reduced for clarity; that is, the drawings are not drawn to actual scale.
(3) Without conflict, the embodiments of the present disclosure and the features of the embodiments may be combined with one another to arrive at new embodiments.
The above description covers only specific embodiments of the present disclosure, but the protection scope of the present disclosure is not limited thereto; the protection scope of the present disclosure shall be subject to the scope of the claims.

Claims (14)

1. A method of mapping convolutional layers, comprising:
obtaining dimensions [ K, H, D, N ] of the convolutional layer, wherein N is the number of convolution kernels in the convolutional layer, K is the width of the convolution kernels, H is the height of the convolution kernels, and D is the number of channels of the convolution kernels;
expanding the convolutional layer into a 0th matrix with a row height of K × H × D and a column width of N, wherein the N columns of the 0th matrix respectively correspond to the one-dimensional vectors of length K × H × D into which the N convolution kernels are respectively expanded;
creating K-1 transformation matrices based on the 0th matrix, wherein the K-1 transformation matrices comprise a 1st matrix to a (K-1)-th matrix, the transformation of the m-th matrix with respect to the (m-1)-th matrix comprises: row number in the m-th matrix = (row number in the (m-1)-th matrix + K) mod (K × H × D), and m is an integer between 1 and K-1;
mapping the 0th matrix to the (K-1)-th matrix into a storage and computation integrated array, wherein the storage and computation integrated array comprises at least one mutually independent sub-array.
2. The convolutional layer mapping method of claim 1, wherein the storage and computation integrated array comprises a single sub-array, and
the mapping the 0th matrix through the (K-1)-th matrix into the storage and computation integrated array comprises:
mapping the 0th through (K-1)-th matrices into the single sub-array.
3. The convolutional layer mapping method of claim 1, wherein the mapping the 0th matrix through the (K-1)-th matrix into the storage and computation integrated array comprises:
arranging the 0th through (K-1)-th matrices in sequence and mapping them to sequentially arranged portions of the storage and computation integrated array.
4. The method of claim 1, wherein expanding the N convolution kernels into the one-dimensional vectors of length K × H × D respectively corresponding to the N columns of the 0th matrix comprises:
expanding each of the N convolution kernels into a one-dimensional vector of length K × H × D according to a specified expansion mode, wherein
the specified expansion mode comprises: first expanding along the number of channels D of the convolution kernel, then along the height H of the convolution kernel, and finally along the width K of the convolution kernel.
5. A convolution operation method, comprising:
mapping the convolutional layer for the convolution operation into the storage and computation integrated array according to the mapping method of any one of claims 1 to 4;
acquiring a plurality of batches of input data from an input feature map in a sliding manner, wherein in each sliding process, all input data corresponding to the convolutional layer in the input feature map are read and one-dimensional expansion is carried out to serve as the input data of the current batch, the number of channels of the input feature map is D, and the convolution step length of the convolution operation is 1;
and inputting the plurality of batches of input data into the storage and computation integrated array for computation to perform the convolution operation.
6. The convolution operation method of claim 5, wherein the obtaining the plurality of batches of input data from the input feature map in a sliding manner comprises:
performing a plurality of horizontal slides on each row of the input feature map to acquire the input data corresponding to the current row among the plurality of batches, wherein the sliding step of each horizontal slide is K.
7. The convolution operation method of claim 5, wherein the obtaining the plurality of batches of input data from the input feature map in a sliding manner comprises:
expanding the read input data of the input feature map into a one-dimensional vector of length K × H × D according to the same expansion mode as the convolution kernels.
8. The convolution operation method according to claim 5, wherein the inputting the plurality of batches of input data into the storage and computation integrated array for computation comprises:
when the input data of the plurality of batches are input, the input data of the current batch are divided into K parts according to the width dimension of the convolution kernel, and each corresponding partial one-dimensional vector of the K parts is input into the storage and computation integrated array in a time-sharing mode through a multiplexing circuit.
9. The convolution operation method according to claim 8, wherein the inputting the K parts of corresponding partial one-dimensional vectors into the storage and computation integrated array in a time-sharing manner through the multiplexing circuit comprises:
in each time-sharing operation period, inputting a corresponding one of the K partial one-dimensional vectors as the input signal into the storage and computation integrated array to perform a multiply-accumulate operation with the convolutional layer, so as to obtain K × N output results at the same time.
10. The convolution operation method of claim 9, wherein the K × N output results comprise output valid values or output invalid values,
the output valid values are retained, and
the output invalid values are discarded and not included in the final convolution calculation result.
11. The convolution operation method according to any one of claims 5 to 10, wherein the storage and computation integrated array comprises a plurality of storage and computation devices arranged in an array.
12. The convolution operation method of claim 11, wherein each storage and computation device comprises a memristor, a static random access memory cell, a dynamic random access memory cell, a phase change memory cell, or a flash memory cell.
13. A convolutional layer mapping device, comprising:
a storage and computation integrated array comprising at least one mutually independent sub-array;
a dimension obtaining module configured to obtain a dimension [ K, H, D, N ] of the convolutional layer, where N is the number of convolutional kernels in the convolutional layer, K is the width of the convolutional kernels, H is the height of the convolutional kernels, and D is the number of channels of the convolutional kernels;
a convolution expansion module configured to expand the convolutional layer into a 0th matrix with a row height of K × H × D and a column width of N, wherein the N columns of the 0th matrix respectively correspond to the one-dimensional vectors of length K × H × D into which the N convolution kernels are respectively expanded;
a matrix transformation module configured to create K-1 transformation matrices based on the 0th matrix, wherein the K-1 transformation matrices comprise a 1st matrix to a (K-1)-th matrix, the transformation of the m-th matrix with respect to the (m-1)-th matrix comprises: row number in the m-th matrix = (row number in the (m-1)-th matrix + K) mod (K × H × D), and m is an integer between 1 and K-1; and
a weight mapping module configured to map the 0th matrix through the (K-1)-th matrix into the storage and computation integrated array.
14. A convolution operation apparatus comprising:
the convolutional layer mapping device of claim 13;
a data acquisition module configured to obtain a plurality of batches of input data from an input feature map in a sliding manner, wherein in each slide, all input data corresponding to the convolutional layer in the input feature map is read and expanded one-dimensionally as the current batch of input data, the number of channels of the input feature map is D, and the convolution step of the convolution operation is 1; and
an input control module configured to input the plurality of batches of input data into the storage and computation integrated array for computation to perform the convolution operation.
CN202210533434.2A 2022-05-13 2022-05-13 Convolution layer mapping method and device, convolution operation method and device Pending CN114781631A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210533434.2A CN114781631A (en) 2022-05-13 2022-05-13 Convolution layer mapping method and device, convolution operation method and device


Publications (1)

Publication Number Publication Date
CN114781631A (en)

Family

ID=82437868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210533434.2A Pending CN114781631A (en) 2022-05-13 2022-05-13 Convolution layer mapping method and device, convolution operation method and device

Country Status (1)

Country Link
CN (1) CN114781631A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115204380A (en) * 2022-09-15 2022-10-18 之江实验室 Data storage and array mapping method and device of storage-computation integrated convolutional neural network
CN115204380B (en) * 2022-09-15 2022-12-27 之江实验室 Data storage and array mapping method and device of storage and calculation integrated convolutional neural network
CN117574136A (en) * 2024-01-16 2024-02-20 浙江大学海南研究院 Convolutional neural network calculation method based on multi-element Gaussian function space transformation
CN117574136B (en) * 2024-01-16 2024-05-10 浙江大学海南研究院 Convolutional neural network calculation method based on multi-element Gaussian function space transformation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination