CN108764182B - Optimized acceleration method and device for artificial intelligence


Info

Publication number
CN108764182B
Authority
CN
China
Prior art keywords
convolution
image data
data
convolution unit
template
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810553278.XA
Other languages
Chinese (zh)
Other versions
CN108764182A (en)
Inventor
肖东晋 (Xiao Dongjin)
张立群 (Zhang Liqun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alva Beijing Technology Co ltd
Original Assignee
Alva Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alva Beijing Technology Co ltd filed Critical Alva Beijing Technology Co ltd
Priority to CN201810553278.XA
Publication of CN108764182A
Application granted
Publication of CN108764182B
Legal status: Active (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/94 Hardware or software architectures specially adapted for image or video understanding
    • G06V10/955 Hardware or software architectures specially adapted for image or video understanding using specific electronic processors

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Complex Calculations (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an acceleration method comprising the following steps: determining the template size, the image size, and the number M of templates for the convolution calculation; determining the number N of convolution units in the acceleration chain and the shortest data-cycle length based on the image size, the template size, the number of templates, and/or the computing power of the acceleration device; loading template coefficients into the convolution units of the acceleration chain; loading several lines of image data into FIFO buffers; loading multiple columns of image data from the FIFO buffers into one or more of the convolution units in a single pass, then starting the data flow, in which each column of image data pushes the preceding column forward by one position and computation proceeds while the data flows; and storing the result of each calculation in a designated storage location of a result memory.

Description

Optimized acceleration method and device for artificial intelligence
Technical Field
The invention relates to the field of computers, in particular to an optimized acceleration method and device for artificial intelligence.
Background
A Convolutional Neural Network (CNN) is a feedforward neural network. Compared with a traditional BP neural network, a CNN offers higher recognition efficiency and better invariance to rotation and scaling, and it has been widely applied in fields such as digital image processing and face recognition.
The traditional convolutional neural network model is applied as follows. First, a convolutional neural network template architecture is designed according to the attributes of the image to be input; the architecture is a multilayer structure consisting of one input layer, followed by several convolutional layers and downsampling layers arranged in various orders, and finally an output layer. The input layer receives the original image. Each convolutional layer contains several feature maps of equal size, and each pixel of a feature map corresponds to a set of pixels at the corresponding window positions of specified feature maps in the previous layer. Each downsampling layer likewise contains several feature maps of equal size; each feature map of a downsampling layer corresponds to one feature map of the preceding convolutional layer, and its pixels correspond to sampling regions of that feature map. The nodes of a given layer are connected by edges to the nodes of the previous and next layers.
After a convolutional neural network template with a specific network architecture is built, it must be trained before it can recognize a given picture. Training proceeds as follows: the parameters of the template, including the edge weights and the convolution kernel values, are initialized to random values; training samples are then fed into the template repeatedly, and the edge weights, kernel values, and other parameters are adjusted continuously until a template capable of recognizing the picture is obtained. In subsequent applications, classification and intelligent recognition are achieved simply by feeding the picture or other sample to be analyzed into the trained template.
To separate and identify individual objects in a complex scene, a large number of templates must be used to perform traversal convolution calculations over the image. The amount of computation is large and the computation time is long, so such calculations are usually carried out by a dedicated acceleration unit.
In the prior art, an artificial intelligence (AI) computing system includes a main processor and an accelerator; the accelerator typically contains up to tens of thousands of multipliers and adopts the two-dimensional systolic-array acceleration principle. The transfer and marshalling of the calculation data during acceleration must be performed by the main processor, which consumes a great deal of main-processor time and introduces waiting time. The AI accelerator therefore has a high idle rate, that is, a high waste rate and low energy efficiency; meanwhile, the main processor's data organization is complex and the chip design is difficult, mainly because of the global clock.
Accordingly, there is a need in the art for a novel acceleration method and system for artificial intelligence that at least partially addresses the problems with the prior art.
Disclosure of Invention
In view of the problems in the prior art, the invention provides an acceleration device for artificial intelligence, comprising:
a template memory storing template coefficients to be calculated;
an input data memory including a plurality of first-in first-out (FIFO) buffers, each FIFO buffer storing a line of image data;
the image processing device comprises an acceleration chain, a first convolution unit, a second convolution unit, a first image data input port, a second image data input port, a first image data output port and a convolution and output port, wherein the acceleration chain comprises the first convolution unit to the Cth convolution unit, C is an integer greater than or equal to 1, and each of the first convolution unit to the Cth convolution unit comprises a template data input port, a first image data input port, a second image data input port, a first image data output port and a convolution; each template data input port is connected with the template memory; a first image data input port of the first convolution unit is connected to an input data memory; a first image data output port of the first convolution unit is connected to a first image data input port of the second convolution unit; a first image data output port of the second convolution unit is connected to a first image data input port of the third convolution unit; … the first image data output port of the C-1 th convolution unit is connected to the first image data input port of the C-th convolution unit; a first image data output port of the C convolution unit is connected to a first image data input port of the first convolution unit; second image data input ports of the first convolution unit to the C convolution unit are connected with the FIFO cache, and when data circulation starts, multiple columns of image data of the FIFO cache are synchronously loaded to one or more of the first convolution unit to the C convolution unit; the convolution and output port of each of the first convolution unit to the C convolution unit are respectively connected with the result memory; and
and the convolution and output port of each of the first convolution unit to the Cth convolution unit are respectively connected with the result memory.
In one embodiment of the present invention, each of the first to the C-th convolution units further includes a second image data output port, and the second image data output port of each of the first to the C-th convolution units is connected to the input data memory.
In one embodiment of the present invention, the acceleration device further comprises:
an accumulator that cyclically accumulates the convolution calculation results;
a pooling unit that pools the output of the accumulator;
a nonlinear unit that applies nonlinear processing to the calculation results; and/or
a storage unit that stores the nonlinearly processed data.
In one embodiment of the present invention, the storage unit is connected to the input data memory, and the data stored in the storage unit is used as the input data of the next stage of convolution calculation.
Another embodiment of the present invention provides an acceleration method for the above acceleration apparatus, including:
determining the template size, the image size, and the number F of templates for the convolution calculation;
determining the number C of convolution units in the acceleration chain and the shortest data-cycle length Dn based on the image size, the template size, the number of templates, and/or the computing power of the acceleration device, where Dn is the integer obtained by dividing the number of image data columns by the number of template columns per template;
loading several lines of image data into the FIFO buffers;
loading templates onto the acceleration chain according to the number C of convolution units, the number F of templates, and the shortest data-cycle length Dn, starting the data flow, and computing while the data flows;
the results of each calculation are stored in a designated storage location of a results memory.
In another embodiment of the present invention, loading templates onto the acceleration chain according to the number C of convolution units, the number F of templates, and the shortest data-cycle length Dn, starting the data flow, and computing while the data flows comprises one or more of six operation modes:
in a first operation mode, Dn ≥ C ≥ F: the template coefficients of the F templates are loaded into F convolution units of the acceleration chain, multiple columns of data in the FIFO buffers are loaded synchronously into the F convolution units, with the first column of image data loaded to the rightmost end of the F-th convolution unit; the data flow is started, and the remaining columns of data in the FIFO buffers follow on from the first convolution unit;
in a second operation mode, Dn ≥ F ≥ C: the template coefficients of C of the F templates are loaded into the C convolution units of the acceleration chain, multiple columns of data in the FIFO buffers are loaded synchronously into the C convolution units, with the first column of image data loaded to the rightmost end of the C-th convolution unit and the remaining columns following on from the first convolution unit; an image data column returns to the FIFO buffers after leaving the C-th convolution unit; when the first column of image data appears again at the right end of the C-th convolution unit, C is subtracted from F and the value of F is updated to F - C; F is then compared with C: if F > C, the steps of the second operation mode are repeated with the updated F templates, and if F ≤ C, the steps of the first operation mode are performed with the updated F templates;
in a third operation mode, C ≥ Dn ≥ F: the template coefficients of the F templates are loaded into F convolution units of the acceleration chain, multiple columns of data in the FIFO buffers are loaded synchronously into the F convolution units, with the first column of image data loaded to the rightmost end of the F-th convolution unit; the data flow is started, and the remaining columns of data in the FIFO buffers follow on from the first convolution unit;
in a fourth operation mode, C ≥ F ≥ Dn: the template coefficients of the F templates are loaded into F convolution units of the acceleration chain, multiple columns of data in the FIFO buffers are loaded synchronously into Dn of the F convolution units, with the first column of image data loaded to the rightmost end of the Dn-th convolution unit, and the data flow is started;
in a fifth operation mode, F ≥ Dn ≥ C: the steps of the fifth operation mode are the same as those of the second operation mode;
in a sixth operation mode, F ≥ C ≥ Dn: the template coefficients of C of the F templates are loaded into the C convolution units of the acceleration chain, multiple columns of data in the FIFO buffers are loaded synchronously into Dn convolution units, with the first column of image data loaded to the rightmost end of the Dn-th convolution unit; the data flow is started and enters an inner-loop mode, in which an image data column returns to the first convolution unit after leaving the C-th convolution unit; when the first column of image data appears again at the right end of the Dn-th convolution unit, C is subtracted from F and the value of F is updated to F - C; the updated F is compared with C: if the updated F ≥ C, the steps of the sixth operation mode are repeated with the updated F templates; if the updated F < C, the template coefficients of the F templates are loaded into the C convolution units of the acceleration chain, the inner loop continues, and an image data column returns to the FIFO buffers after leaving the F-th convolution unit.
In another embodiment of the present invention, in the first operation mode and/or the third operation mode and/or the fourth operation mode, the (F+1)-th through C-th convolution units may be omitted from the data cycle; an image data column returns to the FIFO buffers after leaving the F-th convolution unit, and one data cycle ends when the first column of image data appears again at the right end of the F-th convolution unit.
In another embodiment of the present invention, in the sixth operation mode, when the updated F < C, one data cycle ends when the first column of image data appears at the right end of the Dn-th convolution unit again.
In another embodiment of the present invention, the acceleration method further comprises, after completing one data cycle:
determining whether there are one or more lines of image data that have not been calculated;
and if so, returning to the step of loading lines of image data into the FIFO buffers, updating the FIFO buffers with the new line of image data, and performing the data cycle again.
In another embodiment of the present invention, the acceleration method further comprises post-processing the calculation results of the convolution units, the post-processing comprising one or more of accumulation, pooling, and nonlinear processing.
In another embodiment of the present invention, the acceleration method further includes using the processed result as input image data for the next-stage acceleration calculation.
In another embodiment of the invention, a new line of image data is loaded into the FIFO buffer while the data is being cycled.
The acceleration device and acceleration method for artificial intelligence (AI) computation load data while operating, which reduces the bandwidth requirement and removes the need to prepare data separately for each convolution calculation unit. The input data memory of the accelerator uses FIFO buffers and needs no external read/write address lines, so it is very simple to use; the data organization is simple, the accelerator architecture is simple, the chip design is simple, the power consumption is low, and the efficiency is high.
Drawings
To further clarify the above and other advantages and features of embodiments of the present invention, a more particular description of embodiments of the invention will be rendered by reference to the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. In the drawings, the same or corresponding parts will be denoted by the same or similar reference numerals for clarity.
Fig. 1 shows a schematic view of an image to be recognized and a template.
Fig. 2 shows a schematic illustration of an acceleration device 200 of an artificial intelligence AI according to an embodiment of the invention.
Fig. 3 shows a schematic block diagram of a post-processing device 300 for post-processing of the calculation results output in fig. 2 according to an embodiment of the invention.
Fig. 4 shows a flow chart of an acceleration process of an artificial intelligence AI according to an embodiment of the invention.
Detailed Description
In the following description, the invention is described with reference to various embodiments. One skilled in the relevant art will recognize, however, that the embodiments may be practiced without one or more of the specific details, or with other alternative and/or additional methods, materials, or components. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of embodiments of the invention. Similarly, for purposes of explanation, specific numbers, materials and configurations are set forth in order to provide a thorough understanding of the embodiments of the invention. However, the invention may be practiced without specific details. Further, it should be understood that the embodiments shown in the figures are illustrative representations and are not necessarily drawn to scale.
Reference in the specification to "one embodiment" or "the embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment.
First, the related concepts used in processing an image using a template are introduced:
the template refers to a matrix block, the mathematical meaning of which is a convolution calculation.
And (3) convolution calculation: this can be seen as a weighted summation process, using each pixel in the image region to be multiplied by each element of the convolution kernel (i.e., the weight matrix), and the sum of all products as the new value of the region center pixel.
And (3) convolution kernel: the weights used in the convolution are represented by a matrix which has the same size as the used image area, and the matrix is a weight matrix with odd rows and columns.
Convolution calculation example: convolving a 3 × 3 pixel region R with a convolution kernel G. Let R be the 3 × 3 pixel region and G the kernel:

R = | R1 R2 R3 |    G = | G1 G2 G3 |
    | R4 R5 R6 |        | G4 G5 G6 |
    | R7 R8 R9 |        | G7 G8 G9 |

convolution sum = R1·G1 + R2·G2 + R3·G3 + R4·G4 + R5·G5 + R6·G6 + R7·G7 + R8·G8 + R9·G9
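To make this concrete, here is a minimal Python sketch of the weighted summation above (the function name and the sample values are illustrative, not part of the patent):

```python
def convolve_region(region, kernel):
    """Multiply each pixel of a region by the matching kernel weight and sum.

    `region` and `kernel` are equally sized 2-D lists (e.g. 3x3); the sum
    of all products becomes the new value of the region's center pixel.
    """
    return sum(r * g
               for row_r, row_g in zip(region, kernel)
               for r, g in zip(row_r, row_g))

# convolution sum = R1*G1 + R2*G2 + ... + R9*G9
R = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
G = [[0, 1, 0],
     [1, -4, 1],
     [0, 1, 0]]
print(convolve_region(R, G))  # 2 + 4 - 20 + 6 + 8 = 0
```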
The invention proposes to calculate a category score of an image using a template and to detect, based on the category score, whether the image contains the object to be identified. A specific process of calculating the category score of an image is described below with reference to figs. 1 and 2.
Fig. 1 shows a schematic view of an image to be recognized and a template. As shown in fig. 1, the rectangular box 110 is an image composed of a plurality of pixels, with a specific width and height. The shaded box 120 is a template. The template 120 is convolved with the image of the area it covers: the value of each point of the template is multiplied by the corresponding value of the covered image area, the products are summed, and the final sum is taken as the category score of that image area. The category score represents the response strength between the area and the template: the higher the response strength, the higher the score.
In the process of identifying an image, the template must traverse the whole image, starting the convolution calculation from the image's start position. For example, let the coordinates of the start position be (0,0). With (0,0) as the starting point, an image region of the same size as the template is taken along the x-axis and y-axis directions. This region is convolved with the template: the pixel values of the region are multiplied by the corresponding template values and summed, giving the category score of this image region for the template. Next, the starting x-coordinate is incremented by 1, and an image region of the same size as the template is taken with (1,0) as the starting point and convolved with the template to obtain its category score. The starting point keeps advancing along the x-axis, with a convolution at each step, until the region taken would exceed the range of the image; then the x value of the starting coordinate is reset to that of the start position and the y value is incremented by 1. From this new starting point an image region of the same size as the template is again taken along the x-axis and y-axis directions and convolved with the template, and the region again advances pixel by pixel along the x-axis until it would exceed the image, whereupon x is reset and y incremented by 1. These steps repeat until the region would exceed the image along the y-axis direction, at which point the convolution of the whole image is complete. A sketch of this traversal follows.
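A compact Python sketch of this traversal, under the assumption of a single-channel image stored as a 2-D list (names are illustrative; the small helper repeats the convolution sum above so the sketch runs standalone):

```python
def convolve_region(region, kernel):
    """Weighted sum of a region with an equally sized kernel (as above)."""
    return sum(r * g for rr, rg in zip(region, kernel) for r, g in zip(rr, rg))

def traverse_convolve(image, template):
    """Slide the template over the image: step one pixel in x until the
    window would leave the image, then reset x and step one pixel in y.

    Returns a 2-D map of category scores, one per window position.
    """
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    scores = []
    for y in range(ih - th + 1):          # stop before the window exceeds the image
        row = []
        for x in range(iw - tw + 1):
            region = [r[x:x + tw] for r in image[y:y + th]]
            row.append(convolve_region(region, template))
        scores.append(row)
    return scores
```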
In the convolution calculation process, input image data cannot be applied directly to the convolutional neural network; the corresponding two-dimensional data must be extracted according to the convolution size and delivered to the convolution calculation unit. This requires the main processor to configure data and clocks for the convolution calculation unit; when the amount of image data and the number of convolution units are large, a great deal of main-processor time and waiting time are consumed, the main processor's data organization is complex, and the chip design is difficult. To address this problem, the present invention provides an acceleration method and device for artificial intelligence (AI) that load data while operating, reducing bandwidth requirements and removing the need for the main processor to prepare data for each convolution calculation unit.
Fig. 2 shows a schematic illustration of an acceleration device 200 of an artificial intelligence AI according to an embodiment of the invention. As shown in fig. 2, the acceleration apparatus 200 includes a template memory 210, an input data memory 220, an acceleration chain 230, and a result memory 240.
The template memory 210 in the present invention may be Random Access Memory (RAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), non-volatile memory such as flash memory, or any other type of memory.
The input data memory 220 may include a plurality of first-in-first-out (FIFO) data buffers 221-1 through 221-N, in which data is written sequentially and read sequentially, the data address being advanced by internal read and write pointers that automatically increment by 1. Hereinafter, for convenience of description, the FIFO data buffers 221-1 through 221-N are collectively referred to as FIFO buffers 221. Each FIFO buffer 221 stores one line of image data and reads a new line of image data from the external memory while sequentially supplying columns of image data to the acceleration chain 230; the number of FIFO buffers in the input data memory 220 should therefore be greater than the number of lines of image data required for one calculation by a convolution unit.
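As a toy software model (not the hardware), one such line buffer can be pictured as follows; the class name is illustrative:

```python
from collections import deque

class LineFifo:
    """Toy model of one FIFO line buffer.

    No external address lines are needed: writes append at one end and
    reads pop from the other, so the 'address' is implied by the internal
    read/write pointers (here, the two ends of a deque).
    """

    def __init__(self):
        self._line = deque()

    def write(self, pixel):
        self._line.append(pixel)      # write pointer auto-increments

    def read(self):
        return self._line.popleft()   # read pointer auto-increments

    def __len__(self):
        return len(self._line)
```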
The acceleration chain 230 may include first through nth convolution units 231-1 through 231-N. The functions of the first to nth convolution units 231-1 to 231-N are substantially the same. In other words, each of the first to nth convolution units 231-1 to 231-N may perform convolution calculation based on the template data and the image data to obtain a convolution sum. The templates used by the first through Nth convolution units 231-1 through 231-N are the same size. In one embodiment of the present invention, the first through Nth convolution units 231-1 through 231-N may be 3 × 3 convolution units, 5 × 5 convolution units, 8 × 8 convolution units, or the like.
Each of the first through N-th convolution units 231-1 through 231-N includes a template data input port, a first image data input port, a second image data input port, a first image data output port, a second image data output port, and a convolution-sum output port. Each template data input port is coupled to the template memory 210 to receive the template data used by that convolution unit.
The first image data input port of the first convolution unit 231-1 receives image data from the input data memory 220; it is connected to the plurality of FIFO buffers 221 so as to receive one column of data at a time from them. The first image data output port of the first convolution unit 231-1 is connected to the first image data input port of the second convolution unit 231-2; the first image data output port of the second convolution unit 231-2 is connected to the first image data input port of the third convolution unit 231-3; …; the first image data output port of the (N-1)-th convolution unit 231-(N-1) is connected to the first image data input port of the N-th convolution unit 231-N; and the first image data output port of the N-th convolution unit 231-N is connected back to the first image data input port of the first convolution unit 231-1. The first through N-th convolution units thus form an inner loop for the data. Hereinafter, for convenience of description, we speak of the left and right ends of a convolution unit: the left end is the column of calculation units closest to the first image data input port, and the right end is the column of calculation units closest to the image data output port.
The second image data input ports of the first to nth convolution units 231-1 to 231-N are connected to the FIFO buffer 221 through the bus 222, and multiple columns of image data of the FIFO buffer 221 are loaded to one or more of the first to nth convolution units 231-1 to 231-N at a time at the beginning of a data cycle.
The second image data output port of each of the first through nth convolution units 231-1 through 231-N is connected to the input data memory 220 through a data bus 232. The input data memory 220 and one or more of the first through nth convolution units 231-1 through 231-N form a data outer loop.
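Taken together, the first image data ports form an inner ring and the second image data output ports form an outer path back to the input data memory. A toy Python sketch of one shift step around the inner ring (0-based indices and names are illustrative; in the device, data shifts column by column through each unit):

```python
def shift_inner_loop(columns_at_unit):
    """One step of the inner data loop over N chained convolution units.

    `columns_at_unit[i]` is the id of the image-data column currently at
    unit i; unit i feeds unit i+1, and the last unit feeds unit 0.
    """
    n = len(columns_at_unit)
    return [columns_at_unit[(i - 1) % n] for i in range(n)]

# Example with N = 4 units holding columns A..D:
state = ["A", "B", "C", "D"]
state = shift_inner_loop(state)   # -> ["D", "A", "B", "C"]
```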
The convolution-sum output port of each of the first through N-th convolution units 231-1 through 231-N is connected to the result memory 240. The result memory 240 in the present invention may be Random Access Memory (RAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), non-volatile memory such as flash memory, or any other type of memory.
The acceleration device 200 may further comprise a data sorting unit. At the start of a data cycle, multiple columns of image data from the FIFO buffers 221 are loaded at once into one or more of the first through N-th convolution units, and the data cycle then begins: each column of image data shifts forward by one position per step, and the convolution units can start calculating immediately, which raises their utilization, but the output results are then not in the order of the original image. Therefore, when the result data output by the convolution units is to be sent to the next stage of convolution calculation as input data, the data sorting unit must restore the original order.
The operation of the acceleration device 200 is described with a specific example. In one embodiment of the present invention, the first through N-th convolution units 231-1 through 231-N are 5 × 5 convolution units, the total number M of templates to be calculated equals the number N of convolution units, the templates are loaded into the N convolution units of the acceleration chain, the input data memory 220 includes at least six FIFO buffers, and the first image data input port of the first convolution unit 231-1 is connected to five FIFO buffers 221 in the input data memory 220. First, the five lines of data stored in the five FIFO buffers 221 are loaded at once, in sequence, into the first through N-th convolution units over the bus 222: the first through fifth columns of the five lines are loaded into the N-th convolution unit, the sixth through tenth columns into the (N-1)-th convolution unit, and so on. The data flow is then started: each column of image data pushes the preceding column forward by one position, and data from the FIFO buffers enters the first convolution unit 231-1 through its first image data input port. The first column of data is the first to emerge from the N-th convolution unit; the data is calculated as it flows, and the result of each calculation is stored in the designated storage location of the result memory 240. After emerging from the N-th convolution unit 231-N, the first column of data enters either the FIFO buffers 221 or the first convolution unit 231-1.
When the first column of image data again appears at the right end of the nth convolution unit 231-N, one data cycle ends.
Considering that the external memory and the internal logic operate asynchronously, and that the read/write speed of the external memory may be far lower than the operating speed of the internal logic, five lines of image data are pre-stored. While the first through N-th convolution units compute one loop, the sixth line of image data is loaded into the sixth FIFO buffer 221; the sixth line then becomes the fifth, the fifth becomes the fourth, the fourth becomes the third, the third becomes the second, and the second becomes the first. Once the update is complete, the next data cycle is performed. The operations of updating the FIFO buffers and cycling the data are repeated until all the data has flowed through the N convolution units, as sketched below.
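A small Python sketch of this line-window update, assuming a five-line working set (names are illustrative):

```python
def update_line_window(window, prefetched_line):
    """Shift the working lines up by one and append the pre-fetched line.

    Mirrors the update above: line 2 becomes line 1, ..., line 5 becomes
    line 4, and the sixth (pre-fetched) line becomes line 5. The prefetch
    overlaps with one data cycle because the external memory is slow.
    """
    return window[1:] + [prefetched_line]

window = ["line1", "line2", "line3", "line4", "line5"]
window = update_line_window(window, "line6")
# window is now ["line2", "line3", "line4", "line5", "line6"]
```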
In an embodiment of the present invention, when the total number M of templates to be calculated is greater than the number N of convolution units, after the image data has circulated once through the first through N-th convolution units 231-1 through 231-N, the templates in the convolution units are updated and the image data circulates once more; and so on, until all M templates have been calculated, at which point the FIFO buffers in the input data memory 220 are updated.
In an embodiment of the present invention, when the number L of templates to be calculated is less than the number N of convolution units, only the first L convolution units among 231-1 through 231-N operate effectively, so the data loop may omit the (L+1)-th through N-th convolution units. At the start of the data loop, the image data is loaded at once, in sequence, into the first through L-th convolution units over the bus 222: the first through fifth columns are loaded into the L-th convolution unit, the sixth through tenth columns into the (L-1)-th convolution unit, and so on. The data flow is then started, each column of image data pushing the preceding column forward by one position. Image data emerging from the L-th convolution unit 231-L enters either the FIFO buffers 221 or the first convolution unit 231-1.
The result memory 240 may include a plurality of storage units 241-1 to 241-N, each corresponding to one convolution unit, storing the calculation result of the corresponding convolution unit.
Fig. 3 shows a schematic block diagram of a post-processing device 300 for post-processing of the calculation results output in fig. 2 according to an embodiment of the invention.
Optionally, the post-processing device 300 may include an accumulator 310 and a pooling unit 320. Accumulator 310 cyclically accumulates the results for each memory cell. Pooling unit 320 pools the output of accumulator 310.
The post-processing means 300 comprises a non-linear unit 330 and a result memory 340. When the post-processing device 300 does not include the accumulator 310 and/or the pooling unit 320, the data may be directly entered into the non-linear unit 330, non-linearly processed and stored in the result memory 340.
The data from the results memory 340 may be input as input image data to the input data memory 220 for a second level of convolution calculations.
For example, in one specific embodiment, the input image of the first-stage convolution calculation is 32 × 32, the acceleration chain 230 includes seven 5 × 5 convolution units, and the template coefficients of the first through sixth templates are loaded into the acceleration chain 230. First, the five lines of data stored in the five FIFO buffers 221 are loaded at once, in sequence, into the first through seventh convolution units 231-1 through 231-7 over the bus 222: the first through fifth columns of the five lines are loaded into the seventh convolution unit, the sixth through tenth columns into the sixth convolution unit, the eleventh through fifteenth columns into the fifth convolution unit, the sixteenth through twentieth columns into the fourth convolution unit, the twenty-first through twenty-fifth columns into the third convolution unit, the twenty-sixth through thirtieth columns into the second convolution unit, and the thirty-first and thirty-second columns into the first convolution unit. The data flow is started; the first column of image data enters the first convolution unit after emerging from the seventh convolution unit. After a data cycle completes, one line of data in the FIFO buffers is updated and the data cycle is performed again. Because the number of templates in the first-stage convolution calculation is less than 7, each convolution unit needs to calculate only one template, and the calculation result for that one template is buffered, then passes through the accumulator 310, down-sampling, and nonlinear processing before entering the input data memory 220. The second-stage convolution calculation is then performed: the input image is 14 × 14 × 6 and there are 16 templates, so the coefficients of the first 7 templates are loaded into the convolution units, the data flow is started, and the data cycle described above is performed; after it completes, the 7 convolution units are reloaded in turn with the coefficients of the 8th through 14th templates; after that data cycle completes, the convolution units are updated with the coefficients of the 15th and 16th templates; and after that data cycle completes, the data enters the FIFO buffers.
With the post-processing device 300 shown in fig. 3, the calculation result of one convolution stage can be used directly as the input data of the next convolution stage without the main processor reorganizing the data. This saves a great deal of main-processor computation and waiting time, lowers power consumption, raises energy efficiency, and allows the accelerator architecture and the accelerator chip design to be simplified.
Fig. 4 shows a flow chart of an acceleration process of an artificial intelligence AI according to an embodiment of the invention.
First, in step 410, the template size, image size, and number of templates F of the convolution calculation are determined.
In step 420, the number C of convolution units in the acceleration chain and the shortest data-cycle length Dn are determined based on the image size, the template size, the number of templates, and/or the computing power of the acceleration device. The shortest data-cycle length Dn is the number of image data columns divided by the number of template columns per template, taken as an integer, as restated below.
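Restated as code (the floor division is an assumption; the text says only that the quotient is taken as an integer):

```python
def shortest_cycle_length(image_cols: int, template_cols: int) -> int:
    """Dn: number of image data columns divided by template columns."""
    return image_cols // template_cols

# For the 32-column image and 5x5 templates of the earlier example:
assert shortest_cycle_length(32, 5) == 6
```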
At step 430, lines of image data are loaded/updated in the FIFO buffer.
In step 440, templates are loaded onto the acceleration chain according to the number C of convolution units, the number F of templates, and the shortest data-cycle length Dn, the data flow is started, and computation proceeds while the data flows.
In one embodiment of the present invention, the acceleration chain has six operation modes according to the ordering of the number C of convolution units, the number F of templates, and the shortest data-cycle length Dn.
In a first operation mode, Dn ≥ C ≥ F: the template coefficients of the F templates are loaded into F convolution units of the acceleration chain, multiple columns of data in the FIFO buffers are loaded synchronously into the F convolution units, with the first column of image data loaded to the rightmost end of the F-th convolution unit; the data flow is started, and the remaining columns of data in the FIFO buffers follow on from the first convolution unit. In one embodiment of the invention, the data cycle may omit the (F+1)-th through C-th convolution units, and an image data column returns to the FIFO buffers after leaving the F-th convolution unit. When the first column of image data appears again at the right end of the F-th convolution unit, one data cycle ends.
In a second operation mode, Dn ≥ F ≥ C: the template coefficients of C of the F templates are loaded into the C convolution units of the acceleration chain, multiple columns of data in the FIFO buffers are loaded synchronously into the C convolution units, with the first column of image data loaded to the rightmost end of the C-th convolution unit and the remaining columns following on from the first convolution unit. An image data column returns to the FIFO buffers after leaving the C-th convolution unit. When the first column of image data appears again at the right end of the C-th convolution unit, C is subtracted from F and the value of F is updated to F - C; F is then compared with C: if F > C, the steps of the second operation mode are repeated with the updated F templates, and if F ≤ C, the steps of the first operation mode are performed with the updated F templates.
In a third operation mode, C ≥ Dn ≥ F: the template coefficients of the F templates are loaded into F convolution units of the acceleration chain, multiple columns of data in the FIFO buffers are loaded synchronously into the F convolution units, with the first column of image data loaded to the rightmost end of the F-th convolution unit; the data flow is started, and the remaining columns of data in the FIFO buffers follow on from the first convolution unit. In one embodiment of the invention, the data cycle may omit the (F+1)-th through C-th convolution units, and an image data column returns to the FIFO buffers after leaving the F-th convolution unit. When the first column of image data appears again at the right end of the F-th convolution unit, one data cycle ends.
In a fourth operation mode, C ≥ F ≥ Dn: the template coefficients of the F templates are loaded into F convolution units of the acceleration chain, multiple columns of data in the FIFO buffers are loaded synchronously into Dn of the F convolution units, with the first column of image data loaded to the rightmost end of the Dn-th convolution unit, and the data flow is started. In one embodiment of the invention, the data loop may omit the (F+1)-th through C-th convolution units, and an image data column returns to the FIFO buffers after leaving the F-th convolution unit. When the first column of image data appears again at the right end of the Dn-th convolution unit, one data cycle ends.
In a fifth operation mode, F ≥ Dn ≥ C: the steps of the fifth operation mode are the same as those of the second operation mode. For brevity, a detailed description is omitted.
In a sixth operation mode, F ≥ C ≥ Dn: the template coefficients of C of the F templates are loaded into the C convolution units of the acceleration chain, multiple columns of data in the FIFO buffers are loaded synchronously into Dn convolution units, with the first column of image data loaded to the rightmost end of the Dn-th convolution unit. The data flow is started and enters an inner-loop mode, in which an image data column returns to the first convolution unit after leaving the C-th convolution unit. When the first column of image data appears again at the right end of the Dn-th convolution unit, C is subtracted from F and the value of F is updated to F - C; the updated F is compared with C: if the updated F ≥ C, the steps of the sixth operation mode are repeated with the updated F templates; if the updated F < C, the template coefficients of the F templates are loaded into the C convolution units of the acceleration chain, the inner loop continues, and an image data column returns to the FIFO buffers after leaving the C-th convolution unit. When the first column of image data appears again at the right end of the Dn-th convolution unit, one data cycle ends. The mode dispatch is sketched below.
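The choice among the six modes follows directly from the ordering of Dn, C, and F, as this sketch shows (how ties between equal values are broken is an assumption; the text does not specify it):

```python
def select_operation_mode(Dn: int, C: int, F: int) -> int:
    """Return the operation mode (1-6) for a given ordering of Dn, C, F."""
    if Dn >= C >= F:
        return 1
    if Dn >= F >= C:
        return 2
    if C >= Dn >= F:
        return 3
    if C >= F >= Dn:
        return 4
    if F >= Dn >= C:
        return 5
    if F >= C >= Dn:
        return 6
    raise AssertionError("unreachable: some ordering always holds")

# First-stage example above: 32 columns / 5-wide templates -> Dn = 6,
# C = 7 units, F = 6 templates:
print(select_operation_mode(Dn=6, C=7, F=6))  # 3, i.e. C >= Dn >= F
```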
At step 450, the calculation results for each convolution element are stored at a specified location.
After the data cycle completes, step 460 determines whether one or more lines of image data remain uncalculated. If so, the process returns to step 430 to update the FIFO buffers with a new line of image data; if not, the calculation over the image data is finished.
When the FIFO buffers are updated, a line of image data must be read from the external memory. The external memory and the internal logic operate asynchronously, and the read/write speed of the external memory may be far lower than the operating speed of the internal logic, so several lines of image data are pre-stored: while the first through N-th convolution units compute one loop, a new line of image data is loaded into the FIFO buffers, squeezing the oldest line out.
Next, optionally, in some embodiments, the calculation results of each convolution unit are cyclically accumulated by an accumulator, the accumulated results are pooled, and nonlinear processing is applied before the results are stored in a result memory. In other embodiments of the present invention, the accumulation and/or pooling may be skipped, and the data is nonlinearly processed directly and then stored in the result memory 340.
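A sketch of this optional chain; the element-wise accumulation, the 2-to-1 mean pooling, and the ReLU nonlinearity are illustrative assumptions, since the text names only accumulation, pooling, and nonlinear processing:

```python
def post_process(per_template_results):
    """Accumulate -> pool -> nonlinearity, mirroring the chain of Fig. 3.

    `per_template_results` is a list of equal-length result rows, one per
    template; they are accumulated element-wise before pooling.
    """
    accumulated = [sum(vals) for vals in zip(*per_template_results)]
    pooled = [(accumulated[i] + accumulated[i + 1]) / 2     # assumed 2-to-1 mean pool
              for i in range(0, len(accumulated) - 1, 2)]
    return [max(0.0, v) for v in pooled]                    # assumed ReLU nonlinearity

print(post_process([[1, -2, 3, 4], [0, 1, -1, 2]]))  # [0.0, 4.0]
```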
The data of the result memory may be input as input image data to the input data memory for the next stage of convolution calculation.
The acceleration device and acceleration method for artificial intelligence (AI) computation load data while operating, which reduces the bandwidth requirement and removes the need to prepare data separately for each convolution calculation unit. The input data memory of the accelerator uses FIFO buffers, so no external read/write address lines are needed: data is written sequentially and read sequentially, and the data address is maintained by internal read/write pointers that automatically increment by 1. The accelerator is therefore very simple to use, with simple data organization, a simple architecture, a simple chip design, low power consumption, and high energy efficiency.
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be apparent to persons skilled in the relevant art that various combinations, modifications, and changes can be made thereto without departing from the spirit and scope of the invention. Thus, the breadth and scope of the present invention disclosed herein should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Claims (11)

1. An acceleration apparatus for artificial intelligence, comprising:
a template memory storing template coefficients to be calculated;
an input data memory including a plurality of first-in first-out (FIFO) buffers, each FIFO buffer storing a line of image data;
the image processing device comprises an acceleration chain, a first convolution unit, a second convolution unit, a first image data input port, a second image data input port, a first image data output port and a convolution and output port, wherein the acceleration chain comprises the first convolution unit to the Cth convolution unit, C is an integer greater than or equal to 1, and each of the first convolution unit to the Cth convolution unit comprises a template data input port, a first image data input port, a second image data input port, a first image data output port and a convolution; each template data input port is respectively connected with the template memory and used for receiving the template data used by the convolution unit; a first image data input port of the first convolution unit is connected to an input data memory; a first image data output port of the first convolution unit is connected to a first image data input port of the second convolution unit; a first image data output port of the second convolution unit is connected to a first image data input port of the third convolution unit; … the first image data output port of the C-1 th convolution unit is connected to the first image data input port of the C-th convolution unit; a first image data output port of the C convolution unit is connected to a first image data input port of the first convolution unit; second image data input ports of the first convolution unit to the C convolution unit are connected with the FIFO cache, and when data circulation starts, multiple columns of image data of the FIFO cache are synchronously loaded to one or more of the first convolution unit to the C convolution unit; the convolution and output port of each of the first convolution unit to the C convolution unit are respectively connected with the result memory; and
a result memory to which the convolution and output ports of each of the first convolution unit through the Cth convolution unit are respectively connected,
wherein each of the first through C convolution units further includes a second image data output port, the second image data output port of each of the first through C convolution units being connected to an input data memory.
2. The accelerating apparatus of claim 1, further comprising:
an accumulator that cyclically accumulates the convolution calculation results;
a pooling unit that pools the output of the accumulator;
a nonlinear unit that applies nonlinear processing to the calculation results; and/or
a storage unit that stores the nonlinearly processed data.
3. The acceleration device according to claim 2, wherein the storage unit is connected to the input data memory, and the data stored in the storage unit is used as input data for the next stage of convolution calculation.
4. An acceleration method for an acceleration device of any one of claims 1 to 3, comprising:
determining the template size, the image size, and the number F of templates for the convolution calculation;
determining the number C of convolution units in the acceleration chain and the shortest data-cycle length Dn based on the image size, the template size, the number of templates, and/or the computing power of the acceleration device, where Dn is the integer obtained by dividing the number of image data columns by the number of template columns per template;
loading several lines of image data into the FIFO buffers;
loading templates onto the acceleration chain according to the number C of convolution units, the number F of templates, and the shortest data-cycle length Dn, starting the data flow, and computing while the data flows;
the results of each calculation are stored in a designated storage location of a results memory.
5. The acceleration method according to claim 4, characterized in that loading templates onto the acceleration chain according to the number C of convolution units, the number F of templates, and the shortest data-cycle length Dn, starting the data flow, and computing while the data flows comprises one or more of six operation modes:
in a first operation mode, Dn ≥ C ≥ F: the template coefficients of the F templates are loaded into F convolution units of the acceleration chain, multiple columns of data in the FIFO buffers are loaded synchronously into the F convolution units, with the first column of image data loaded to the rightmost end of the F-th convolution unit; the data flow is started, and the remaining columns of data in the FIFO buffers follow on from the first convolution unit;
in a second operation mode, Dn ≥ F ≥ C: the template coefficients of C of the F templates are loaded into the C convolution units of the acceleration chain, multiple columns of data in the FIFO buffers are loaded synchronously into the C convolution units, with the first column of image data loaded to the rightmost end of the C-th convolution unit and the remaining columns following on from the first convolution unit; an image data column returns to the FIFO buffers after leaving the C-th convolution unit; when the first column of image data appears again at the right end of the C-th convolution unit, C is subtracted from F and the value of F is updated to F - C; F is then compared with C: if F > C, the steps of the second operation mode are repeated with the updated F templates, and if F ≤ C, the steps of the first operation mode are performed with the updated F templates;
in a third operation mode, C ≥ Dn ≥ F: the template coefficients of the F templates are loaded into F convolution units of the acceleration chain, multiple columns of data in the FIFO buffers are loaded synchronously into the F convolution units, with the first column of image data loaded to the rightmost end of the F-th convolution unit; the data flow is started, and the remaining columns of data in the FIFO buffers follow on from the first convolution unit;
in a fourth operation mode, C ≥ F ≥ Dn: the template coefficients of the F templates are loaded into F convolution units of the acceleration chain, multiple columns of data in the FIFO buffers are loaded synchronously into Dn of the F convolution units, with the first column of image data loaded to the rightmost end of the Dn-th convolution unit, and the data flow is started;
in a fifth operation mode, F ≥ Dn ≥ C: the steps of the fifth operation mode are the same as those of the second operation mode;
in a sixth operation mode, F ≥ C ≥ Dn: the template coefficients of C of the F templates are loaded into the C convolution units of the acceleration chain, multiple columns of data in the FIFO buffers are loaded synchronously into Dn convolution units, with the first column of image data loaded to the rightmost end of the Dn-th convolution unit; the data flow is started and enters an inner-loop mode, in which an image data column returns to the first convolution unit after leaving the C-th convolution unit; when the first column of image data appears again at the right end of the Dn-th convolution unit, C is subtracted from F and the value of F is updated to F - C; the updated F is compared with C: if the updated F ≥ C, the steps of the sixth operation mode are repeated with the updated F templates; if the updated F < C, the template coefficients of the F templates are loaded into the C convolution units of the acceleration chain, the inner loop continues, and an image data column returns to the FIFO buffers after leaving the C-th convolution unit.
6. An acceleration method according to claim 5, characterized in that, in the first operation mode and/or the third operation mode and/or the fourth operation mode, the data cycle skips the (F+1)-th through C-th convolution units: each column of image data returns to the FIFO buffer after leaving the F-th convolution unit, and one data cycle ends when the first column of image data appears again at the right end of the F-th convolution unit.
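As a toy software model of this cycle-end condition (assumptions: each shift moves every column one position, and the skipped units F+1..C are simply absent from the loop):

```python
from collections import deque

def data_cycle_length(num_columns: int) -> int:
    """Model the shortened loop of claim 6: a column leaving the F-th
    convolution unit re-enters the FIFO directly, so one data cycle is
    exactly one full rotation of the columns in flight."""
    loop = deque(range(num_columns))  # columns circulating through chain + FIFO
    first = loop[0]                   # the column at the right end of unit F
    shifts = 0
    while True:
        loop.rotate(-1)               # head leaves unit F and rejoins the FIFO
        shifts += 1
        if loop[0] == first:          # first column back at unit F's right end
            return shifts             # == num_columns
```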
7. An acceleration method according to claim 5, characterized in that, in the sixth operation mode, when the updated F < C, one data cycle ends when the first column of image data appears again at the right end of the Dn-th convolution unit.
8. An acceleration method according to claim 4, characterized in that, after one data cycle is completed, the method further comprises:
determining whether one or more lines of image data remain uncalculated; and
if one or more uncalculated lines of image data exist, returning to the step of loading lines of image data into the FIFO buffer, updating the FIFO buffer with the new line of image data, and performing the data cycle again.
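A minimal sketch of this outer loop, under the assumption that the FIFO holds a sliding window of rows and `run_data_cycle` stands in for one full data cycle (both names are illustrative):

```python
def process_all_rows(image_rows, window_height: int, run_data_cycle) -> None:
    """Claim 8's outer loop: run a data cycle, then slide one new row into
    the FIFO window and cycle again until every row has been computed."""
    window = list(image_rows[:window_height])  # initial FIFO contents
    run_data_cycle(window)
    for new_row in image_rows[window_height:]:
        window = window[1:] + [new_row]        # update the FIFO with a new row
        run_data_cycle(window)                 # perform the data cycle again
```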
9. An acceleration method according to claim 4, characterized in that it further comprises post-processing the results calculated by the convolution units, the post-processing comprising one or more of accumulation, pooling, and a non-linear operation.
10. An acceleration method according to claim 9, characterized in that it further comprises using the post-processed result as the input image data for the next level of the acceleration calculation.
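One possible post-processing chain, sketched with NumPy; ReLU and 2×2 max pooling are assumed stand-ins, since claims 9-10 name only accumulation, pooling, and a non-linearity:

```python
import numpy as np

def post_process(partial_results: np.ndarray) -> np.ndarray:
    """partial_results: (num_partials, H, W) convolution outputs.
    Returns a pooled activation map, which claim 10 would feed to the
    next level of the acceleration calculation as input image data."""
    acc = partial_results.sum(axis=0)          # accumulation of partial sums
    act = np.maximum(acc, 0.0)                 # non-linearity (ReLU assumed)
    h, w = act.shape
    act = act[: h - h % 2, : w - w % 2]        # crop to even size for pooling
    return act.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))  # 2x2 max pool
```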
11. An acceleration method according to claim 4, characterized in that a new line of image data is loaded into the FIFO buffer while the data is circulating.
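Claim 11 overlaps loading with computation. In hardware this would be a second FIFO write port or bank; as a purely illustrative software analogue, a producer thread can refill the queue while the compute loop circulates data:

```python
import threading
from queue import Queue

def start_row_loader(fifo: "Queue[list]", row_source) -> threading.Thread:
    """Load new lines of image data into the FIFO while previously loaded
    data keeps circulating through the convolution units (claim 11)."""
    def producer() -> None:
        for row in row_source:
            fifo.put(row)  # a new line enters the FIFO during the data cycle
    t = threading.Thread(target=producer, daemon=True)
    t.start()
    return t
```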
CN201810553278.XA 2018-06-01 2018-06-01 Optimized acceleration method and device for artificial intelligence Active CN108764182B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810553278.XA CN108764182B (en) 2018-06-01 2018-06-01 Optimized acceleration method and device for artificial intelligence

Publications (2)

Publication Number Publication Date
CN108764182A (en) 2018-11-06
CN108764182B (en) 2020-12-08

Family

ID=64001641

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810553278.XA Active CN108764182B (en) 2018-06-01 2018-06-01 Optimized acceleration method and device for artificial intelligence

Country Status (1)

Country Link
CN (1) CN108764182B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800867B (en) * 2018-12-17 2020-09-29 北京理工大学 Data calling method based on FPGA off-chip memory
CN109816093B (en) * 2018-12-17 2020-12-04 北京理工大学 Single-path convolution implementation method
CN111832713A * 2019-04-19 2020-10-27 北京灵汐科技有限公司 Parallel computing method and computing device based on line buffer (Linebuffer)

Citations (2)

Publication number Priority date Publication date Assignee Title
CN106250103A * 2016-08-04 2016-12-21 东南大学 System for data reuse in cyclic-convolution computation of convolutional neural networks
CN107918794A (en) * 2017-11-15 2018-04-17 中国科学院计算技术研究所 Neural network processor based on computing array

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN105611114B * 2015-11-02 2018-08-31 天津大学 Digital multi-kernel convolution processing chip for AER image sensors
CN106228240B * 2016-07-30 2020-09-01 复旦大学 FPGA-based implementation method for deep convolutional neural networks
US10733505B2 (en) * 2016-11-10 2020-08-04 Google Llc Performing kernel striding in hardware
CN106779060B * 2017-02-09 2019-03-08 武汉魅瞳科技有限公司 Computation method for deep convolutional neural networks suitable for hardware implementation

Also Published As

Publication number Publication date
CN108764182A (en) 2018-11-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant