CN114003201A - Matrix transformation method and device and convolutional neural network accelerator - Google Patents

Matrix transformation method and device and convolutional neural network accelerator

Info

Publication number
CN114003201A
CN114003201A CN202111277687.XA
Authority
CN
China
Prior art keywords
feature
sub
determining
matrix
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111277687.XA
Other languages
Chinese (zh)
Inventor
陈世达
张宏
李永配
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202111277687.XA priority Critical patent/CN114003201A/en
Publication of CN114003201A publication Critical patent/CN114003201A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/76 Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data
    • G06F 7/78 Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data for changing the order of data flow, e.g. matrix transposition or LIFO buffers; Overflow or underflow handling therefor
    • G06F 7/785 Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data for changing the order of data flow, e.g. matrix transposition or LIFO buffers; Overflow or underflow handling therefor having a sequence of storage locations each being individually accessible for both enqueue and dequeue operations, e.g. using a RAM
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The embodiment of the invention provides a matrix transformation method and device, a convolutional neural network accelerator, a storage medium and an electronic device, wherein the method comprises the following steps: caching a plurality of sub-feature maps included in the input feature map in sequence according to a preset sequence; performing matrix transformation on the sub-feature map cached each time to obtain a plurality of target sub-feature matrixes; and determining a target output feature map corresponding to the input feature map based on the plurality of target sub-feature matrixes and the weight parameters of the neural network model. The invention solves the problem of low efficiency of the matrix transformation of the input feature map in the matrix operations involved in the operation of the accelerator, and reduces memory-access and bandwidth redundancy, thereby improving the operation speed.

Description

Matrix transformation method and device and convolutional neural network accelerator
Technical Field
The embodiment of the invention relates to the field of communication, in particular to a matrix transformation method and device, a convolutional neural network accelerator, a storage medium and an electronic device.
Background
At present, deep learning has attracted wide attention in industry. Among its methods, the Convolutional Neural Network (CNN) has become a research hotspot in fields such as image classification, target detection and semantic segmentation, where it achieves good results. The most computationally intensive parts of a CNN are the convolutional layers and the fully-connected layers, both of which can be implemented at the bottom layer by matrix multiplication. With the continuous growth of the computation scale and complexity of CNN models, the traditional CPU platform can no longer meet practical requirements. Accelerators implemented on computing platforms such as the GPU and the FPGA have therefore received wide attention; compared with the GPU, the FPGA offers high energy efficiency, easy reconfiguration, fast iterative update and convenient mobile/edge deployment, and is better suited to the rapid development of deep learning algorithms.
In the related art, FPGA-based CNN accelerators are mainly implemented in two ways: loop-unrolled parallel computation and systolic-array computation. The first method achieves computation acceleration by increasing parallelism, but faces a high fan-in/fan-out problem, which limits the final speed and gives the computation mode poor versatility; the second method converts the convolution operations and fully-connected operations in the CNN into matrix multiplication, but the implementation of the matrix transformation (img2col) step therein has a large impact on the overall performance.
It can thus be seen that the related art suffers from low efficiency of the input-feature-map matrix transformation in the matrix operations involved in accelerator operation.
In view of the above problems in the related art, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the invention provides a matrix transformation method and device, a convolutional neural network accelerator, a storage medium and an electronic device, which are used to at least solve the problem in the related art of low efficiency of the matrix transformation of the input feature map in the matrix operations involved in accelerator operation.
According to an embodiment of the present invention, there is provided a matrix transformation method including: caching a plurality of sub-feature maps included in the input feature map in sequence according to a preset sequence; performing matrix transformation on the sub-feature map cached each time to obtain a plurality of target sub-feature matrixes; and determining a target output feature map corresponding to the input feature map based on the plurality of target sub-feature matrixes and the weight parameters of the neural network model.
According to another embodiment of the present invention, there is provided a matrix transformation apparatus including: a buffer module, used for sequentially buffering a plurality of sub-feature maps included in the input feature map according to a preset sequence; a transformation module, used for performing matrix transformation on the sub-feature map cached each time to obtain a plurality of target sub-feature matrixes; and a determining module, used for determining a target output feature map corresponding to the input feature map based on the plurality of target sub-feature matrixes and the weight parameters of the neural network model.
According to another embodiment of the present invention, there is provided a convolutional neural network accelerator including the apparatus in the above embodiment.
According to yet another embodiment of the invention, there is also provided a computer-readable storage medium having a computer program stored therein, wherein the computer program, when executed by a processor, implements the steps of the method as set forth in any of the above.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
According to the invention, a plurality of sub-feature maps included in the input feature map are sequentially cached according to a preset sequence, the sub-feature map cached at each time is subjected to matrix transformation to obtain a plurality of target sub-feature matrixes, and the target output feature map corresponding to the input feature map is determined according to the plurality of target sub-feature matrixes and the weight parameters of the neural network model. Because only one sub-feature map of the input feature map is cached at a time and matrix transformation is performed on that sub-feature map, the number of memory accesses and data reads during calculation is reduced and the problem of redundant data access is avoided.
Drawings
Fig. 1 is a block diagram of a hardware structure of a mobile terminal according to a matrix transformation method of an embodiment of the present invention;
FIG. 2 is a flow chart of a method of matrix transformation according to an embodiment of the invention;
FIG. 3 is a schematic diagram of a convolutional neural network accelerator architecture, according to an exemplary embodiment of the present invention;
FIG. 4 is a diagram illustrating a calculation process of convolution in the related art;
FIG. 5 is a graphical illustration of splitting input features according to input channel parallelism, line width, and number of lines, according to an exemplary embodiment of the invention;
FIG. 6 is a schematic diagram of a block overlap condition after segmenting an input feature map according to an exemplary embodiment of the present invention;
FIG. 7 is a diagram illustrating a conversion of convolution calculation into matrix operation in the related art;
FIG. 8 is a diagram illustrating a conversion of convolution calculation into matrix operation in the related art;
FIG. 9 is a diagrammatic illustration of sub-features of a cache in accordance with an exemplary embodiment of the present invention;
FIG. 10 is a line cache sliding window process diagram in accordance with an illustrative embodiment of the present invention;
FIG. 11 is a diagram of a systolic array unit structure according to an exemplary embodiment of the present invention;
fig. 12 is a block diagram of a matrix transformation apparatus according to an embodiment of the present invention.
Detailed Description
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings in conjunction with the embodiments.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The method embodiments provided in the embodiments of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking the example of the operation on a mobile terminal, fig. 1 is a hardware structure block diagram of the mobile terminal of a matrix transformation method according to an embodiment of the present invention. As shown in fig. 1, the mobile terminal may include one or more (only one shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), and a memory 104 for storing data, wherein the mobile terminal may further include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration, and does not limit the structure of the mobile terminal. For example, the mobile terminal may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store computer programs, for example, software programs and modules of application software, such as computer programs corresponding to the matrix transformation method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer programs stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal. In one example, the transmission device 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
In the present embodiment, a matrix transformation method is provided, and fig. 2 is a flowchart of matrix transformation according to an embodiment of the present invention, as shown in fig. 2, the flowchart includes the following steps:
step S202, caching a plurality of sub-feature maps included in the input feature map in sequence according to a preset sequence;
step S204, performing matrix transformation on the sub-feature map cached each time to obtain a plurality of target sub-feature matrixes;
step S206, determining a target output feature map corresponding to the input feature map based on the plurality of target sub-feature matrixes and the weight parameters of the neural network model.
In the above embodiment, when the convolutional neural network performs a convolution operation, the input feature map of the originally input picture may be determined, the sub-feature maps in the input feature map are cached, matrix transformation is performed on the sub-feature map cached each time to obtain a plurality of target sub-feature matrixes, and then the target output feature map corresponding to the input feature map is determined according to the plurality of target sub-feature matrixes and the weight parameters of the neural network model. Here, the input feature map may be a feature map oriented to systolic-array matrix computation.
Optionally, the main body of the above steps may be an FPGA, a convolutional neural network accelerator, a background processor, or other devices with similar processing capabilities, and may also be a machine integrated with at least a data processing device, where the data processing device may include a terminal such as a computer, a mobile phone, and the like, but is not limited thereto.
In the above embodiment, when the execution body of the above steps is an FPGA or a convolutional neural network accelerator, a schematic structural diagram of the convolutional neural network accelerator may refer to fig. 3. As shown in fig. 3, the convolution operation may be performed by using the original input image as the input feature map of the first convolutional layer together with the corresponding weight parameters. The original input feature map and the weight parameters of each convolutional or fully-connected layer in the network are stored in an external memory (generally DDR). Before the calculation is executed, the software side (i.e., the processing system) can configure the hardware modules on the programmable logic side. For example, when convolution is to be computed, the processing system configures the working registers of the direct memory access controller (DMA) through the bus, so that the DMA transfers data (input feature maps and weight parameters) from the external DDR to the FPGA on-chip logic. Because the on-chip resources of the FPGA are limited, an input feature map cache unit and a weight cache unit may be added to respectively store part of the input feature map and part of the parameter data; after these two parts of data have been used for several rounds of operation, the direct memory access controller performs the next data read and cache. The input feature map cache unit can implement storage using a channel-priority strategy, and the matrix transformation unit can implement the convolution-to-matrix transformation, i.e., the hardware implementation of the img2col operation, using a line-cache sliding-window method. The systolic array unit is the main calculation unit; it performs the matrix operation on the input feature map and the weight parameters, and the result is a partial sum of the convolution operation. The accumulator therefore buffers the results of the next batch and accumulates them until the input feature map corresponding to the current weight parameters has been fully calculated, and then the final result is output. The bias module is responsible for adding the bias parameters along the output-channel direction, and the result is passed through the nonlinear processing of an activation function (such as ReLU) module and output to the next-stage module, thereby completing the calculation of one convolutional layer.
In the above embodiment, the next layer of the convolutional layer may be a pooling layer or an element level operation layer (for example, a concatenation Concat layer), and therefore, the pooling/element level operation processing unit is responsible for pooling operations and element level operations, and the output result is written back to the external memory through DMA, and as the input feature map of the next convolutional layer, a new round of processing of convolutional layer-bias-activation function-pooling/element level operation is executed until all the layers of the network are completely calculated.
In the above embodiment, the inputs of the convolution calculation include the weight parameters weight (Co, Ci, Ky, Kx) and the input feature map ifmp (N, Ci, Hi, Wi), and the final calculation result is the output feature map ofmp (N, Co, Ho, Wo). Here N represents the batch size, which may be set to 1 (this value is merely an exemplary illustration, and the invention is not limited thereto); Co and Ci represent the numbers of output and input channels; Kx and Ky represent the width and height of the convolution kernel respectively; Hi, Wi and Ho, Wo represent the height and width of the input and output feature maps respectively; and a bias (Co, 1) is added to the final result, i.e., one bias for each output channel. The convolution layer is calculated as shown in equation (1), where S represents the step size of the sliding window:

ofmp(n, co, ho, wo) = bias(co) + Σ_{ci=0..Ci-1} Σ_{ky=0..Ky-1} Σ_{kx=0..Kx-1} ifmp(n, ci, ho·S + ky, wo·S + kx) × weight(co, ci, ky, kx)    (1)

where Ho = ((Hi - Ky + 2 × P) / S) + 1, Wo = ((Wi - Kx + 2 × P) / S) + 1, and P denotes the number of zero-padding rows. Taking the first convolutional layer as an example, if the input feature map ifmp is (1, 3, 416, 416), weight is (16, 3, 3, 3), S is 1 and P is 1, then the output feature map ofmp of the first layer is (1, 16, 416, 416).
Meanwhile, the fully-connected layer can be regarded as a vector-matrix multiplication, that is, ifmp is (Ci, 1), weight is (Co, Ci), bias is (Co, 1), and ofmp (Co, 1) = weight × ifmp + bias.
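For illustration only, the following NumPy sketch reproduces equation (1) in software; it is a reference model of the convolution described above (the function name and loop structure are chosen here for readability and are not part of the embodiments, and it does not represent the hardware implementation):

```python
import numpy as np

def conv_layer_reference(ifmp, weight, bias, S=1, P=1):
    """Software reference for equation (1): ofmp = conv(ifmp, weight) + bias.

    ifmp:   (N, Ci, Hi, Wi) input feature map
    weight: (Co, Ci, Ky, Kx) convolution kernels
    bias:   (Co,) one bias per output channel
    """
    N, Ci, Hi, Wi = ifmp.shape
    Co, _, Ky, Kx = weight.shape
    Ho = (Hi - Ky + 2 * P) // S + 1
    Wo = (Wi - Kx + 2 * P) // S + 1

    padded = np.pad(ifmp, ((0, 0), (0, 0), (P, P), (P, P)))  # zero padding
    ofmp = np.zeros((N, Co, Ho, Wo))
    for n in range(N):
        for co in range(Co):
            for ho in range(Ho):
                for wo in range(Wo):
                    # data covered by the sliding window at (ho, wo)
                    window = padded[n, :, ho * S:ho * S + Ky, wo * S:wo * S + Kx]
                    ofmp[n, co, ho, wo] = np.sum(window * weight[co]) + bias[co]
    return ofmp
```

With ifmp of shape (1, 3, 416, 416), weight of shape (16, 3, 3, 3), S = 1 and P = 1, the result has shape (1, 16, 416, 416), matching the first-layer example above.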
According to the invention, a plurality of sub-feature maps included in the input feature map are sequentially cached according to a preset sequence, the sub-feature map cached at each time is subjected to matrix transformation to obtain a plurality of target sub-feature matrixes, and the target output feature map corresponding to the input feature map is determined based on the plurality of target sub-feature matrixes and the weight parameters of the neural network model. Because only one sub-feature map of the input feature map is cached at a time and matrix transformation is performed on that sub-feature map, the number of memory accesses and data reads during calculation is reduced and the problem of redundant data access is avoided.
In an exemplary embodiment, sequentially caching a plurality of sub feature maps included in the input feature map in a predetermined order includes: determining the input channel parallelism and the line width of the input feature map; determining the number of lines of each cache; dividing the input feature map according to the parallelism of the input channels, the line width and the line number to obtain a plurality of sub-feature maps; and sequentially caching the sub-feature graphs according to the preset sequence. In this embodiment, when the sub feature maps included in the input feature map are cached according to a predetermined sequence, the input channel parallelism of the input feature map and the line width of the input feature map may be determined, the number of lines cached at each time is determined, the input feature map is divided according to the input channel parallelism, the line width, and the number of lines to obtain a plurality of sub feature maps, and after the plurality of sub feature maps are obtained, the plurality of sub feature maps are cached in sequence according to the predetermined sequence. Wherein the predetermined sequence may be a top-to-bottom sequence.
In the related art, a schematic diagram of the convolution calculation process can be seen in fig. 4. As shown in fig. 4, each filter weight slides over the input feature map ifmp from left to right and from top to bottom; at each position the two overlapping regions are multiplied element-wise and accumulated to produce one feature point of the output feature map ofmp. When the sliding window has traversed the entire ifmp, the output feature map of one channel has been generated, and the Co weights generate the final ofmp of Co channels. The convolution process stores and reuses a large amount of input feature map and parameter data, while the on-chip resources of the FPGA are limited. Therefore, only part of the data can be cached on the chip for operation at a time, such as the gray part in the figure. The number of gray cubes in the Ci direction is PC, called the input channel parallelism, and the number of gray cubes in the Co direction is PF, called the output channel parallelism. It can be seen that the accelerator only calculates the convolution results of the entire input feature map with a subset of the weight parameters (dashed portion in fig. 4) at a time. In order to read the corresponding calculation data from the external memory efficiently, the related art partitions the input feature map into a plurality of blocks in the transverse and longitudinal directions, and only the input feature map of one block is read and calculated at a time.
In the above embodiment, the input feature map is divided according to the input channel parallelism, the line width and the number of lines, as shown in fig. 5. Referring to fig. 5, all data along the row direction W and the input channel direction Ci of the input feature map are stored, while only part of the rows along the column direction H are stored, for example 32 rows (this value is configurable), i.e., W × Ci × 32 data in total; the resource addresses stored on the chip are continuous, and the row size of each layer is rounded up to a power of 2, for example 416 → 512, 208 → 256, and so on. In addition, it should be noted that the input channel parallelism PC may be 8 (this value is only an exemplary illustration, and the present invention is not limited to this; it may also be set to 16, 32, etc.). PC means that the input data are externally stored, cached on chip and computed on chip in groups of 8 along the input channel direction. Step 1 in the figure indicates that, during calculation, the first group of 8 channels is computed in the row → column direction first, and step 2 indicates that the next group of 8 channels along the input channel direction is computed, until the input channel direction is exhausted. Adopting a storage strategy with input-channel-direction priority can effectively alleviate the problem of limited on-chip storage resources; at the same time, after the output feature map is obtained by calculation, the output feature maps corresponding to the respective blocks do not need to be spliced and restored. In addition, the overlapping part introduced by the regular blocking can reduce the access bandwidth through reuse.
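As a purely illustrative software model of this channel-priority order (the block size, the PC value and the function name below are assumptions made for the sketch, not fixed by the embodiment), the compute passes over the buffered rows could be enumerated as follows: the full row width is kept, only a limited number of rows is resident at a time, and within one row block the PC-channel groups are visited one after another (step 1 row → column, step 2 next channel group):

```python
def iterate_onchip_passes(ifmp, pc=8, block_rows=32):
    """Enumerate compute passes in channel-priority order (illustrative sketch).

    ifmp: NumPy array of shape (Ci, Hi, Wi) for one image. The on-chip buffer
    holds all Ci channels of `block_rows` rows (the W x Ci x 32 cube above);
    each yielded slice is one compute pass over a group of `pc` input channels.
    """
    Ci, Hi, Wi = ifmp.shape
    for row0 in range(0, Hi, block_rows):      # next block of buffered rows
        for ci0 in range(0, Ci, pc):           # step 2: next PC-channel group
            # step 1: this slice is processed in row -> column order
            yield ifmp[ci0:ci0 + pc, row0:row0 + block_rows, :]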
In one exemplary embodiment, determining the number of lines per cache comprises: determining a sliding step length of a sliding window; determining the overlap number of rows overlapped in two adjacent slides based on the sliding step length; acquiring the pre-cache number of pre-determined pre-cache lines; and determining the difference between the pre-cache number and the overlap number as the number of lines. In this embodiment, when determining the number of lines cached each time, the sliding step of the sliding window may be determined first, the overlap number of rows overlapped between two adjacent slides is determined according to the sliding step, the pre-cache number of pre-determined pre-cache lines is acquired, and the difference between the pre-cache number and the overlap number is determined as the number of lines.
In the above embodiment, after the input feature map is divided, n blocks are obtained; only one block, Block n (n = 1, 2, 3, 4, …), is stored on chip at a time, the current block is calculated, and then the next block is cached. Taking the first channel as an example, when Ky is 3, S is 1 and P is 1, two vertically adjacent blocks overlap; when the pre-cache number is 32, a schematic diagram of the block overlap after the input feature map is divided can be seen in fig. 6. As shown in fig. 6, Block1 and Block2 overlap by two lines. Therefore, the data fetched into the on-chip cache can be reduced by 2 lines each time; the final results of the blocks need no extra restoring operation and are written back to the external memory directly at continuous addresses; and each block contains the input feature map over all input channel dimensions, so that when all filters have completed the traversal calculation of the current block, the final output feature map result can be generated directly, without caching intermediate partial sums of the convolution calculation on chip. Therefore, the number of frequent accesses to the external memory is reduced, and the bandwidth requirement is lowered.
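The relationship just described can be written as a one-line arithmetic sketch (hedged: the formula overlap = Ky - S covers the common case illustrated here, and the helper name is invented for this example):

```python
def new_rows_per_block(pre_cache_rows, Ky, S):
    """Rows actually fetched per block: pre-cached rows minus the rows shared
    with the previous block (for Ky = 3, S = 1 the overlap is Ky - S = 2)."""
    overlap = max(Ky - S, 0)
    return pre_cache_rows - overlap

# Example from the text: 32 pre-cached rows, 3x3 kernel, stride 1 -> 30 new rows per fetch
assert new_rows_per_block(32, Ky=3, S=1) == 30
```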
In an exemplary embodiment, performing matrix transformation on the sub-feature map buffered each time to obtain a plurality of target sub-feature matrices includes: for each time period, performing the following: determining the line data included in the sub-feature map acquired in the time period; and performing matrix transformation on the line data to obtain the target sub-feature matrix. In this embodiment, matrix transformation may be performed on the line data acquired in each time period to obtain a target sub-feature matrix, and convolution calculation is performed on the target sub-feature matrix to obtain an output feature map; after the calculation is completed, the line data of the next time period is acquired and matrix-transformed to obtain the next target sub-feature matrix, and the output feature map is obtained through such sequential calculation. This effectively avoids redundant data access and allows a pipelined hardware design, which is of great significance for the storage space, access bandwidth and computing resources of the FPGA. The time period may be N clock beats, such as 3 clock beats, and may be determined based on the sliding window and the number of lines per cache.
In the related art, the convolutional layers and fully-connected layers in CNN calculation may be converted into matrix multiplication, i.e., implemented by using a systolic array; the fully-connected layer is a matrix-vector multiplication and is not described here again. What remains is the matrix transformation of the convolution calculation, i.e., the hardware implementation of img2col. As shown in fig. 7, the img2col process converts the input feature map into the matrix format corresponding to a filter, and the result contains a large amount of redundant data because of the overlap of data between adjacent sliding windows.
The convolution operation of fig. 4 can be converted by img2col into a matrix multiplication, and the corresponding calculation can be performed by using a systolic array as shown in fig. 8. A relatively intuitive method is to implement img2col in a high-level language on the software side and then transfer the transformed data to the FPGA for calculation; however, the large amount of redundant data, together with the fact that every layer requires this matrix transformation, makes the time overhead of such a processing mode huge. As can be seen from fig. 7 and fig. 8, img2col converts the original convolution input feature map into matrix form by taking, at each sliding position of a convolution kernel on the input feature map ifmp, the data covered by the kernel and unfolding it into a 1-dimensional vector; these vectors form the corresponding img2col result.
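The software-side img2col just described can be sketched as follows (a minimal single-image NumPy version; the names and the row ordering are assumptions of this sketch). Each sliding-window patch becomes one column, so overlapping windows copy the same ifmp data several times, which is the redundancy referred to above:

```python
import numpy as np

def img2col(ifmp, Ky, Kx, S=1, P=1):
    """Unfold a (Ci, Hi, Wi) input feature map into a (Ci*Ky*Kx, Ho*Wo) matrix."""
    Ci, Hi, Wi = ifmp.shape
    Ho = (Hi - Ky + 2 * P) // S + 1
    Wo = (Wi - Kx + 2 * P) // S + 1
    padded = np.pad(ifmp, ((0, 0), (P, P), (P, P)))
    cols = np.zeros((Ci * Ky * Kx, Ho * Wo), dtype=padded.dtype)
    for ho in range(Ho):
        for wo in range(Wo):
            patch = padded[:, ho * S:ho * S + Ky, wo * S:wo * S + Kx]
            cols[:, ho * Wo + wo] = patch.ravel()   # one window -> one column
    return cols

# Convolution then reduces to a matrix product:
# ofmp_mat (Co, Ho*Wo) = weight.reshape(Co, -1) @ img2col(ifmp, Ky, Kx, S, P)
```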
In the above embodiment, it is assumed that the data already stored on the chip form a cube as shown in fig. 9, i.e., PC = 8 and Ci × W × 32 data in total. Assume that the convolution kernel of the current layer is 3 × 3 × Ci; its parallelism remains in correspondence with PC and ifmp. The position of the first window is window 1, the second is window 2, followed in turn by window 3, window 4 and window 5, until the sliding over one row is finished. Each window covers PKKC = 3 × 3 × PC = 72 data, corresponding to one 1-dimensional vector of img2col. As shown in fig. 10, assuming that the bit width of the input feature map data is DW, each beat of data is PC × DW; for a 3 × 3 convolution kernel, a 2-line cache may be adopted, where each line cache may be implemented with a FIFO whose depth depends on the maximum row width of the input feature map, for example 512. It is easy to see that, once 2 rows of data have been cached and the 3rd row of data flows in, the data of the first 2 rows and of the 3rd row flow into the register group simultaneously in sequence; each clock beat, 3 groups of PC × DW values flow out, and after 3 clock beats the 9 groups of data corresponding to the 3 × 3 window in the register group are all valid and are output once as the longitudinal input to the systolic array. The next beat corresponds to the 9 groups of data of the window after it has slid by 1, and so on, so that the input feature map after the img2col matrix transformation is obtained.
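A behavioral software model of this line-cache sliding-window scheme is sketched below (an illustration only: the FIFO modelling, the absence of zero padding and all names are simplifications of this sketch, not the register-transfer design). Each streamed element stands for one PC × DW group; once the two line buffers and the 3 × 3 register window are primed, one 9-group img2col column is emitted per beat:

```python
from collections import deque

def line_buffer_windows(pixel_stream, W, K=3):
    """Yield K x K windows over a raster-ordered stream of row width W.

    `pixel_stream` yields one PC-wide data group per clock beat. A FIFO of
    depth (K-1)*W models the K-1 line buffers; `window_cols` models the
    K x K register group. After K-1 rows are buffered and K columns of the
    current row have arrived, one window (one img2col column of 9 groups
    for K = 3) is emitted per incoming group. Zero padding is omitted here.
    """
    line_buf = deque(maxlen=(K - 1) * W)   # the last K-1 rows, flattened
    window_cols = deque(maxlen=K)          # the last K vertical columns
    for i, px in enumerate(pixel_stream):
        x = i % W                          # horizontal position within the row
        if len(line_buf) == (K - 1) * W:   # line buffers primed
            # vertical column aligned at x: K-1 buffered values plus the new group
            col = [line_buf[j * W] for j in range(K - 1)] + [px]
            window_cols.append(col)
            if len(window_cols) == K and x >= K - 1:
                # emit the 3x3 window in row-major order (one img2col column)
                yield [window_cols[c][r] for r in range(K) for c in range(K)]
        line_buf.append(px)
```

For a 4 × 4 single-channel example, list(line_buffer_windows(range(16), W=4)) yields [0, 1, 2, 4, 5, 6, 8, 9, 10] as its first window, i.e. the 3 × 3 patch at the top-left corner.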
In the embodiment, the matrix transformation operation of the img2col is realized by adopting a line cache sliding window, so that the redundant data access problem of the img2col operation is effectively avoided, and meanwhile, the method is realized on the basis of hardware, so that the requirements of on-chip storage space and external memory access bandwidth are effectively reduced.
In an exemplary embodiment, determining the target output feature map corresponding to the input feature map based on the plurality of target sub-feature matrices and the weight parameter of the neural network model includes: determining an output feature map corresponding to each sub-feature matrix based on each sub-feature matrix and the weight parameters; caching the output characteristic diagram corresponding to each sub-characteristic matrix to obtain a plurality of output characteristic diagrams; determining the target output feature map based on a plurality of the output feature maps. In this embodiment, when determining the target output feature map, the output feature map corresponding to each sub-feature matrix is determined according to each sub-feature matrix and the weight parameter, the output feature map corresponding to each sub-feature matrix is cached, that is, each output feature map obtained from the sub-feature map is cached, and the target output feature map is determined by performing corresponding calculation according to the plurality of output feature maps.
In an exemplary embodiment, determining the output feature map corresponding to each of the sub-feature matrices based on each of the sub-feature matrices and the weight parameter includes: determining a sub-weight parameter of a target input channel corresponding to the sub-feature matrix based on the weight parameter; determining a current sub-feature matrix obtained in a current time period; determining a first product of the current sub-feature matrix and the sub-weight parameter; determining a second product of a previous feature matrix in a previous time period of the current time period and the sub-weight parameter; and determining a sum of the first product and the second product as the output feature map. In this embodiment, when determining the output feature map, for each target input channel, the sub-weight parameter of the target input channel may be determined, the current sub-feature matrix obtained in the current time period is determined, the first product of the current sub-feature matrix and the sub-weight parameter is determined, the second product of the previous feature matrix in the previous time period and the sub-weight parameter is determined, and the sum of the first product and the second product is determined as the output feature map.
In the above embodiment, the output feature map may be determined using a systolic array unit whose structure is shown in fig. 11; as shown in fig. 11, the systolic array unit may include a two-dimensional PE matrix. The weight parameters of the target input channels may be determined first, and the weight parameter used by each input channel may be fixed. The input feature map data and the output feature map data are computed in a directional-propagation manner: the weight parameters may be pre-stored in the longitudinal direction, the input feature map data then enters the systolic array unit beat by beat in the transverse direction and flows to the PE on its right, and the calculation result of each PE propagates in the longitudinal direction. All data flow once per clock beat. Each PE is a multiply-accumulate unit: when Wi and Xi meet in a PE, they are multiplied, and the product is accumulated with the result flowing in from the PE above and then flows to the PE below. After the partial input feature map of the current batch has been calculated, partial-sum results are obtained; the final output feature map result is the sum of all the data products along the input channel direction. The partial sums corresponding to the next batch are temporarily stored and accumulated until the final calculation result, i.e., the output feature map of the current convolutional layer, is output.
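The arithmetic performed by the systolic array together with the accumulator can be modelled as below (a numerical sketch only: the NumPy matrix product stands in for the PE array, so the dataflow, pipelining and PE interconnect are not represented, and the row ordering per channel group is an assumption of this sketch):

```python
import numpy as np

def systolic_accumulate(ifmp_cols, weight_mat, group_rows):
    """Accumulate partial sums over input-channel groups (behavioral model).

    ifmp_cols:  (Ci*Ky*Kx, Ho*Wo) transformed feature matrix of one block,
                rows ordered so that each consecutive `group_rows` = Ky*Kx*PC
                rows belong to one PC-channel group.
    weight_mat: (Co, Ci*Ky*Kx) weights with the matching row order.
    Each loop iteration models one batch through the PE array; the
    accumulator adds its partial sum until the channel direction is done.
    """
    Co, n_pix = weight_mat.shape[0], ifmp_cols.shape[1]
    acc = np.zeros((Co, n_pix))                            # accumulator buffer
    for r0 in range(0, ifmp_cols.shape[0], group_rows):    # next channel group
        acc += weight_mat[:, r0:r0 + group_rows] @ ifmp_cols[r0:r0 + group_rows, :]
    return acc                                             # block ofmp, (Co, Ho*Wo)
```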
In one exemplary embodiment, determining the current sub-feature matrix obtained in the current time period includes: determining a target sub-feature matrix as the current sub-feature matrix in the case that no input channel precedes the target input channel; and determining a first product of the adjacent input channel preceding the target input channel as the current sub-feature matrix in the case that the target input channel is preceded by an input channel. In this embodiment, when no input channel precedes the target input channel, the target sub-feature matrix is determined as the current sub-feature matrix; when an input channel precedes the target input channel, the first product from the adjacent preceding input channel is determined as the current sub-feature matrix. As shown in fig. 11, when the target input channel is the first column of channels, the target sub-feature matrix is determined as the current sub-feature matrix; when the target input channel is a later channel, for example the second column of channels, the first product of the first column of channels is determined as the current sub-feature matrix.
In the above embodiment, based on the systolic array method each PE is connected only to its adjacent PEs, and the fan-in and fan-out are 1; the regular rectangular physical structure of the systolic array maps efficiently onto the FPGA, because the physical space of the DSP resource distribution on the FPGA is likewise a rectangular area, so the accelerator can achieve a higher clock frequency.
In the foregoing embodiment, the storage strategy with input-channel-direction priority effectively alleviates the problem of limited on-chip storage resources, and compared with other blocking strategies, after the output feature map is obtained by calculation, the output feature maps corresponding to the respective blocks do not need to be spliced and restored. In addition, the overlapping part introduced by the regular blocking can reduce the access bandwidth through reuse. The systolic array scheme keeps the fan-in/fan-out at 1, and its regular rectangular physical structure eases FPGA placement and routing, so a higher clock frequency and better overall accelerator performance can be achieved. In addition, the FPGA-based accelerator has the characteristics of flexible design, reconfigurability and low power consumption.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
In this embodiment, a device for determining an output feature map is further provided, where the device is used to implement the foregoing embodiments and preferred embodiments, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 12 is a block diagram of a matrix transformation apparatus according to an embodiment of the present invention, and as shown in fig. 12, the apparatus includes:
a caching module 1202, configured to sequentially cache a plurality of sub feature maps included in the input feature map according to a predetermined order;
a transformation module 1204, configured to perform matrix transformation on the sub-feature maps cached each time to obtain a plurality of target sub-feature matrices;
a determining module 1206, configured to determine a target output feature map corresponding to the input feature map based on the plurality of target sub-feature matrices and the weight parameter of the neural network model.
Wherein the buffer module 1202 corresponds to the input feature map buffer unit in fig. 3, the transformation module 1204 corresponds to the matrix transformation (line buffer window sliding) unit in fig. 3, and the determination module 1206 corresponds to the systolic array and the accumulator buffer in fig. 3.
In an exemplary embodiment, the caching module 1202 may sequentially cache the plurality of sub feature maps included in the input feature map in a predetermined order by: determining the input channel parallelism and the line width of the input feature map; determining the number of lines of each cache; dividing the input feature map according to the parallelism of the input channels, the line width and the line number to obtain a plurality of sub-feature maps; and sequentially caching the sub-feature graphs according to the preset sequence.
In an exemplary embodiment, the cache module 1202 may determine the number of lines cached at a time by: determining a sliding step length of a sliding window; determining the overlapping quantity of the overlapped rows in the two adjacent sliding processes based on the sliding step length; acquiring the pre-caching number of pre-determined pre-caching lines; determining a difference between the pre-cache number and the overlap number as the number of lines.
In an exemplary embodiment, the transformation module 1204 may perform matrix transformation on the sub-feature map buffered each time to obtain a plurality of target sub-feature matrices by: for each time period, performing the following: determining the line data included in the sub-feature map acquired in the time period; and performing matrix transformation on the row data to obtain the target sub-feature matrix.
In an exemplary embodiment, the determining module 1206 may determine the target output feature map corresponding to the input feature map based on a plurality of the target sub-feature matrices and the weight parameters of the neural network model by: determining an output feature map corresponding to each sub-feature matrix based on each sub-feature matrix and the weight parameters; caching the output characteristic diagram corresponding to each sub-characteristic matrix to obtain a plurality of output characteristic diagrams; determining the target output feature map based on a plurality of the output feature maps.
In an exemplary embodiment, the determining module 1206 may determine the output feature map corresponding to each of the sub-feature matrices based on each of the sub-feature matrices and the weight parameter by: determining a sub-weight parameter of a target input channel corresponding to the sub-feature matrix based on the weight parameter; determining a current sub-feature matrix obtained in a current time period; determining a first product of the current sub-feature matrix and the sub-weight parameter; determining a second product of a previous feature matrix in a previous time period of the current time period and the sub-weight parameter; and determining a sum of the first product and the second product as the output feature map.
In an exemplary embodiment, the determining module 1206 may determine the current sub-feature matrix obtained in the current time period by: determining a target sub-feature matrix as the current sub-feature matrix under the condition that an input channel does not exist before the target input channel; determining a first product of adjacent input channels included in the input channel adjacent to the target input channel as the current sub-feature matrix in a case where the target input channel is preceded by an input channel.
The embodiment also provides a convolutional neural network accelerator, which comprises the device in the above embodiment and can run the method in the above method embodiments. Its structure diagram can be seen in fig. 3.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Embodiments of the present invention also provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method as set forth in any of the above.
In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
In an exemplary embodiment, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
For specific examples in this embodiment, reference may be made to the examples described in the above embodiments and exemplary embodiments, and details of this embodiment are not repeated herein.
It will be apparent to those skilled in the art that the various modules or steps of the invention described above may be implemented using a general purpose computing device, they may be centralized on a single computing device or distributed across a network of computing devices, and they may be implemented using program code executable by the computing devices, such that they may be stored in a memory device and executed by the computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into various integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (11)

1. A method of matrix transformation, comprising:
caching a plurality of sub-feature maps included in the input feature map in sequence according to a preset sequence;
performing matrix transformation on the sub-feature map cached each time to obtain a plurality of target sub-feature matrixes;
and determining a target output feature map corresponding to the input feature map based on the plurality of target sub-feature matrixes and the weight parameters of the neural network model.
2. The method of claim 1, wherein sequentially buffering the plurality of sub feature maps included in the input feature map in a predetermined order comprises:
determining the input channel parallelism and the line width of the input feature map;
determining the number of lines of each cache;
dividing the input feature map according to the parallelism of the input channels, the line width and the line number to obtain a plurality of sub-feature maps;
and sequentially caching the sub-feature maps according to the preset sequence.
3. The method of claim 2, wherein determining the number of lines per cache comprises:
determining a sliding step length of a sliding window;
determining the overlapping quantity of the overlapped rows in the two adjacent sliding processes based on the sliding step length;
acquiring the pre-caching number of pre-determined pre-caching lines;
determining a difference between the pre-cache number and the overlap number as the number of lines.
4. The method of claim 1, wherein performing a matrix transformation on the sub-feature maps buffered at each time to obtain a plurality of target sub-feature matrices comprises:
for each time period, performing the following:
determining the line data included in the sub-feature map acquired in the time period;
and performing matrix transformation on the row data to obtain the target sub-feature matrix.
5. The method of claim 1, wherein determining the target output feature map corresponding to the input feature map based on the plurality of target sub-feature matrices and weight parameters of a neural network model comprises:
determining an output feature map corresponding to each sub-feature matrix based on each sub-feature matrix and the weight parameters;
caching the output characteristic diagram corresponding to each sub-characteristic matrix to obtain a plurality of output characteristic diagrams;
determining the target output feature map based on a plurality of the output feature maps.
6. The method of claim 5, wherein determining the output feature map corresponding to each of the sub-feature matrices based on each of the sub-feature matrices and the weight parameters comprises:
determining a sub-weight parameter of a target input channel corresponding to the sub-feature matrix based on the weight parameter;
determining a current sub-feature matrix obtained in a current time period;
determining a first product of the current sub-feature matrix and the sub-weight parameter;
determining a second product of a previous feature matrix in a previous time period of the current time period and the sub-weight parameter;
determining a sum of the first product and the second product as the output feature map.
7. The method of claim 6, wherein determining the current sub-feature matrix obtained in the current time period comprises:
determining a target sub-feature matrix as the current sub-feature matrix under the condition that an input channel does not exist before the target input channel;
determining a first product of adjacent input channels included in the input channel adjacent to the target input channel as the current sub-feature matrix in a case where the target input channel is preceded by an input channel.
8. An apparatus for determining an output feature map, comprising:
the buffer module is used for sequentially buffering the sub-feature maps included in the input feature map according to a preset sequence;
the transformation module is used for carrying out matrix transformation on the sub-feature graph cached each time to obtain a plurality of target sub-feature matrixes;
and the determining module is used for determining a target output feature map corresponding to the input feature map based on the plurality of target sub-feature matrixes and the weight parameters of the neural network model.
9. A convolutional neural network accelerator, comprising the apparatus of claim 8.
10. A computer-readable storage medium, in which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
11. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 7.
CN202111277687.XA 2021-10-29 2021-10-29 Matrix transformation method and device and convolutional neural network accelerator Pending CN114003201A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111277687.XA CN114003201A (en) 2021-10-29 2021-10-29 Matrix transformation method and device and convolutional neural network accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111277687.XA CN114003201A (en) 2021-10-29 2021-10-29 Matrix transformation method and device and convolutional neural network accelerator

Publications (1)

Publication Number Publication Date
CN114003201A true CN114003201A (en) 2022-02-01

Family

ID=79925706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111277687.XA Pending CN114003201A (en) 2021-10-29 2021-10-29 Matrix transformation method and device and convolutional neural network accelerator

Country Status (1)

Country Link
CN (1) CN114003201A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114648107A (en) * 2022-03-10 2022-06-21 北京宏景智驾科技有限公司 Method and circuit for improving efficiency of calculation of neural network input image point cloud convolution layer
CN114758209A (en) * 2022-06-14 2022-07-15 深圳思谋信息科技有限公司 Convolution result obtaining method and device, computer equipment and storage medium
CN114758209B (en) * 2022-06-14 2022-09-02 深圳思谋信息科技有限公司 Convolution result obtaining method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108108809B (en) Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
CN107993186B (en) 3D CNN acceleration method and system based on Winograd algorithm
CN111459877B (en) Winograd YOLOv2 target detection model method based on FPGA acceleration
CN111667051A (en) Neural network accelerator suitable for edge equipment and neural network acceleration calculation method
US20210019594A1 (en) Convolutional neural network accelerating device and method
WO2020073211A1 (en) Operation accelerator, processing method, and related device
CN114003201A (en) Matrix transformation method and device and convolutional neural network accelerator
CN111898733A (en) Deep separable convolutional neural network accelerator architecture
CN112668708B (en) Convolution operation device for improving data utilization rate
CN111414994A (en) FPGA-based Yolov3 network computing acceleration system and acceleration method thereof
CN110543939A (en) hardware acceleration implementation framework for convolutional neural network backward training based on FPGA
CN113361695B (en) Convolutional neural network accelerator
CN114742225A (en) Neural network reasoning acceleration method based on heterogeneous platform
CN111768458A (en) Sparse image processing method based on convolutional neural network
CN111582465B (en) Convolutional neural network acceleration processing system and method based on FPGA and terminal
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
Niu et al. SPEC2: Spectral sparse CNN accelerator on FPGAs
CN115238863A (en) Hardware acceleration method, system and application of convolutional neural network convolutional layer
CN113743587A (en) Convolutional neural network pooling calculation method, system and storage medium
CN110766136B (en) Compression method of sparse matrix and vector
KR20230081697A (en) Method and apparatus for accelerating dilatational convolution calculation
CN114764615A (en) Convolution operation implementation method, data processing method and device
US11921667B2 (en) Reconfigurable computing chip
CN113128688B (en) General AI parallel reasoning acceleration structure and reasoning equipment
CN108804974B (en) Method and system for estimating and configuring resources of hardware architecture of target detection algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination