CN111832713A - Parallel computing method and computing device based on line buffer Linebuffer - Google Patents

Parallel computing method and computing device based on line buffer Linebuffer

Info

Publication number
CN111832713A
CN111832713A (application CN201910317455.9A)
Authority
CN
China
Prior art keywords: template, linebuffer, computing, data, calculation
Prior art date: 2019-04-19
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN201910317455.9A
Other languages: Chinese (zh)
Inventors: 张伟豪, 李涵, 王封, 丁瑞强
Current Assignee: Beijing Lynxi Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original Assignee: Beijing Lynxi Technology Co Ltd
Priority date / Filing date: 2019-04-19
Publication date: 2020-10-27
Application filed by Beijing Lynxi Technology Co Ltd
Priority to CN201910317455.9A (publication CN111832713A)
Priority to PCT/CN2020/082960 (publication WO2020211654A1)
Publication of CN111832713A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means

Abstract

The invention provides a parallel computing method and a computing device based on a line buffer (Linebuffer), applied to a template computing structure. The method comprises the following steps: determining a template calculation object; constructing a preset template of the line buffer Linebuffer according to the template parameters of the template calculation object and the number of calculation units; and transmitting the template data to the plurality of computing units simultaneously through the preset template of the line buffer Linebuffer, with each computing unit processing its own computing task in parallel. With this method, the preset template of the line buffer Linebuffer acquires at the same time the template data that the multiple computing units need for their computations, so that the multiple computing units execute their computations synchronously.

Description

Parallel computing method and computing device based on line buffer Linebuffer
Technical Field
The invention relates to the field of convolutional neural networks, and in particular to a parallel computing method and computing device based on the line buffer Linebuffer.
Background
In recent years, with the development of artificial intelligence, convolutional neural networks have been applied more and more widely, and accelerator architectures designed for convolutional neural networks keep emerging.
At present, a convolutional neural network must supply each calculation unit with the data it needs every time a calculation is performed. A traditional method either stores all input data in the on-chip memory or constantly accesses the off-chip memory to obtain the input data: the first approach increases the on-chip storage pressure, and the second increases the I/O access pressure. Buffering of on-chip intermediate data is therefore generally implemented with a Linebuffer structure, but the traditional Linebuffer does not support parallel, synchronous execution by its consumers.
Disclosure of Invention
In view of the above problems, the present invention provides a parallel computing method and a computing device based on line buffer Linebuffer, which overcome or at least partially solve the above problems.
According to one aspect of the invention, a parallel computing method based on line buffer Linebuffer is provided, which is applied to a template computing structure, and the method comprises the following steps:
determining a template calculation object;
establishing a preset template of a line buffer Linebuffer according to the template parameters of the template calculation object and the number of the calculation units;
and simultaneously transmitting the template data to the plurality of computing units through the preset template of the line buffer Linebuffer, and processing respective computing tasks in parallel by each computing unit. The preset template of the Linebuffer is established according to the template parameters of the template calculation object and the number of the calculation units, so that the template data required by the calculation units for executing the calculation can be simultaneously acquired, and then the calculation is synchronously executed by the calculation units.
Optionally, when the template calculation object is a convolutional neural network, the method includes:
determining a network layer needing parallel processing;
distributing a plurality of computing units for the network layer;
constructing a preset template of a line buffer Linebuffer according to the template parameters of the network layer and the number of the computing units;
and simultaneously transmitting template data to the plurality of computing units through a preset template of the line buffer Linebuffer, and processing tasks of the network layer in parallel by each computing unit, wherein the template data is original template data limited by the template parameters.
When selecting the network layers that need parallel processing from a convolutional neural network, one or more network layers can be chosen from all the network layers according to the computation amount of each layer. A preset template of the line buffer Linebuffer is then constructed from the template parameters of the selected network layer and the number of computing units, the data are transmitted to the plurality of computing units simultaneously based on the preset template, and each computing unit processes the tasks of the network layer in parallel.
Optionally, the preset template of the Linebuffer consists of a plurality of original templates with specified sizes, and the number of the original templates is equal to the number of the computing units;
the original templates are connected in sequence within the preset template and at least partially overlap. Combining several original templates into one enlarged preset template allows the data required by every computing unit to be acquired at the same time, so that the plurality of computing units can compute in parallel.
Optionally, the transmitting, by the preset template of the line buffer Linebuffer, the template data to the plurality of computing units at the same time, and processing, by each computing unit, the task of the network layer in parallel includes:
and simultaneously transmitting the template data of each original template to the plurality of computing units through the plurality of original templates of the line buffer Linebuffer, and processing the tasks of the network layer in parallel by each computing unit.
Optionally, the simultaneously transmitting the template data of each original template to the plurality of computing units through the plurality of original templates of the line buffer Linebuffer, where each computing unit processes the tasks of the network layer in parallel, includes:
averagely dividing an input feature map of the convolutional neural network into a plurality of data image blocks in advance;
simultaneously acquiring template data required by each computing unit to execute convolution operation based on the data image blocks by using the original templates, and transmitting the acquired template data to the corresponding computing unit;
and continuously moving the original templates by preset step length according to the specified direction, simultaneously acquiring new template data required by each calculation unit for currently executing convolution operation after the original templates move each time, and transmitting the new template data to the corresponding calculation unit until all the data image blocks are read.
Optionally, the method further comprises:
when template data which are currently required by the plurality of computing units are obtained, obtaining new template data which are required by the plurality of computing units to execute the next template computing based on the input feature map;
and storing the new template data into a preset data buffer. While the templates in the preset template are providing each computing unit with the data needed for its convolution computation, the Linebuffer buffer keeps reading the data generated by the previous layer; multiple template data can therefore be sent simultaneously, multiple consumers can compute in parallel at the same time, the time spent acquiring template data is reduced, and the computation efficiency is improved.
Optionally, when the data buffer is full, the preset template is moved by a preset step length.
Optionally, the preset step size is p × stride_x, where p represents the number of computing units and stride_x represents the horizontal step size of the original template.
Optionally, the Linebuffer is implemented by a group of registers;
each original template in the preset template comprises a plurality of registers which, based on the data image blocks in the input feature map, read the template data required for each template calculation and provide it to the calculation unit.
According to another aspect of the present invention, there is also provided a computing device comprising:
a processor for executing the line buffer Linebuffer-based parallel computing method according to any one of the above.
Optionally, the computing device further comprises:
a storage device for storing a computer program that is loaded and executed by the processor when running in the computing device.
The invention provides a more efficient synchronous calculation method based on the line buffer Linebuffer: after the network layer that needs to perform parallel computation is determined, a plurality of computing units are allocated to it, a preset template of the line buffer Linebuffer is constructed according to the template parameters of the network layer and the number of computing units, the template data are transmitted to the plurality of computing units simultaneously through the preset template, and the plurality of computing units then execute the computation synchronously.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
The above and other objects, advantages and features of the present invention will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 shows a schematic diagram of the working principle of Linebuffer;
FIG. 2 shows a schematic diagram of a Linebuffer-based convolution calculation;
FIG. 3 is a schematic diagram showing Linebuffer saving intermediate computation results between layers of a convolutional neural network;
FIG. 4 shows a Linebuffer implementation schematic using shift registers;
FIG. 5 shows a line feed schematic of the Linebuffer shown in FIG. 4;
FIG. 6 illustrates a schematic diagram of splitting a neural network layer;
FIG. 7 is a schematic diagram of the template calculations assigned to each of the calculation units shown in FIG. 6;
FIG. 8 is a diagram illustrating conventional computing unit computation times;
FIG. 9 is a flow chart diagram illustrating a Linebuffer-based parallel computing method according to an embodiment of the present invention;
FIG. 10 shows a schematic diagram of the original template of the Linebuffer;
FIG. 11 shows a schematic diagram of composing a preset template based on a plurality of original templates;
FIG. 12 is a diagram showing a buffer setting in accordance with the first embodiment;
FIG. 13 is a diagram illustrating the synchronous computation time of the computation units according to the first embodiment;
FIG. 14 is a diagram showing a buffer setting in the second embodiment;
FIG. 15 shows a synchronous Linebuffer working diagram;
FIG. 16 is a diagram illustrating a synchronous Linebuffer move job; and
FIG. 17 shows a schematic diagram of the synchronous Linebuffer line wrapping operation.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Linebuffer, also called line buffer, is a technique widely used in template (stencil) computation, and fields such as image processing and artificial intelligence make heavy use of template computation. Generally speaking, a Linebuffer can reduce both the number of memory accesses and the amount of on-chip storage, and it is a common structure in pipelined template computation. The convolution operation in a convolutional neural network is also a template computation, so the Linebuffer technique is frequently used in convolution accelerator architectures, which has made it widely applied again in recent years.
FIG. 1 shows a schematic diagram of the working principle of the Linebuffer. In the figure, the size of the input feature map is 5 × 5, and a template (sliding window) slides continuously over the input feature map. For each sliding position, a template calculation is performed on the data contained in the template. In FIG. 1, the non-white parts (01-21) represent the data stored in the Linebuffer, and the dark gray parts (01, 10, 11, 12, 21) form the template of the current template calculation, i.e. the input data involved in the current calculation. For every template calculation, the Linebuffer must provide the required data to the calculation unit. After one template calculation is completed, the Linebuffer must be updated: new data is read in and data that can no longer be reused is discarded. In this example, after the first calculation is completed, the template moves in the horizontal direction with a step size of 1, and the Linebuffer discards data 01 and reads in data 22. Without a Linebuffer, either all input data would be stored in the on-chip memory, increasing the on-chip storage pressure, or the off-chip memory would be accessed continuously to obtain the input data, increasing the I/O access pressure. By using a Linebuffer, the on-chip storage pressure or the external access pressure is greatly reduced. FIG. 1 shows an example in which the template is cross-shaped; in practical applications, the template may have any shape. In a typical convolutional neural network, the template is preferably rectangular.
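As a concrete illustration of this keep-discard-read-in behavior, the following Python sketch emits sliding windows from a Linebuffer-sized buffer (a minimal sketch only: the generator form, the flat deque layout and all names are illustrative assumptions rather than the structure claimed by the invention, and the cross-shaped template of FIG. 1 is simplified to a rectangular one).

    from collections import deque

    def linebuffer_windows(stream, width, kh=3, kw=3):
        """Yield kh x kw windows from a row-major pixel stream of a width-wide image.

        Only the most recent (kh - 1) * width + kw elements are kept; after each
        emitted window one stale element is discarded while one new element is
        read in, mirroring the update step described for FIG. 1.
        """
        capacity = (kh - 1) * width + kw
        buf = deque(maxlen=capacity)                  # the Linebuffer storage
        for idx, value in enumerate(stream):
            buf.append(value)                         # read in one new element; the oldest drops out
            if len(buf) < capacity:
                continue                              # not enough data buffered yet
            # The window's top-left element is the oldest element in the buffer.
            top_left = idx - capacity + 1
            if top_left % width > width - kw:
                continue                              # window would wrap across a row boundary
            yield [[buf[r * width + c] for c in range(kw)] for r in range(kh)]

For the 5 × 5 input of FIG. 1 this keeps at most (3 - 1) × 5 + 3 = 13 elements instead of all 25, which is exactly the kind of storage saving the Linebuffer is meant to provide.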
FIG. 2 is a diagram illustrating a conventional Linebuffer-based convolution calculation, in which the template size is 3 × 3 and the step size is 1. In pipelined template computation, such as convolutional neural network computation, the Linebuffer often acts as a buffer between layers to preserve intermediate results at minimal storage cost. In the pipeline, adjacent layers usually adopt a producer-consumer mode: as soon as the front layer has computed all the data the rear layer needs for one calculation, the rear layer immediately starts that calculation. Accordingly, the Linebuffer sends the template data to the subsequent network layer as soon as it has received all the data required for one calculation, and the subsequent network layer starts computing. In the embodiment shown in FIG. 3, the Linebuffer mainly implements data transfer between Layer 0 (network Layer 0) and Layer 1 (network Layer 1), between Layer 1 and Layer 2 (network Layer 2), and between Layer 2 and Layer 3 (network Layer 3).
In terms of hardware, a Linebuffer may be implemented with a section of memory or with a group of registers; FIG. 4 shows a schematic diagram of building a Linebuffer with shift registers. Taking the Linebuffer of FIG. 2 as an example, for each template calculation within a row the registers shift left once (the horizontal step size of the template): referring to FIG. 4, register R00 discards one data element and register R22 reads in a new one. Registers R00-R22 output the data contained in the template. On each line change, the registers shift left by 3 positions (the width of the template in the horizontal direction) and three new data elements are read in, as shown in FIG. 5.
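The shift behavior of FIG. 4 and FIG. 5 can be mimicked in a few lines of Python (a toy model that follows only the textual description above; where the newly read values come from is left out, and the function names are illustrative assumptions).

    # Registers R00..R22 of FIG. 4, flattened row by row: index 0 is R00, index 8 is R22.
    registers = [None] * 9

    def move_within_row(new_value):
        """One template step inside a row: shift left by the horizontal stride (1).
        R00 discards one data element and R22 reads in the new one."""
        registers.pop(0)
        registers.append(new_value)

    def line_feed(new_values):
        """Line change as in FIG. 5: shift left by the template width (3)
        and read in three new data elements."""
        assert len(new_values) == 3
        del registers[:3]
        registers.extend(new_values)

    # After either operation, registers[0:9] hold the data of the current template.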
In pipelined template computation such as a convolutional neural network, the computation amounts of different layers may differ greatly, so a layer whose computation is slow often has to wait for the computation of the previous layer, and a bottleneck for the computation of the whole network is formed.
In this case, the layers that are computationally slow may be parallelized. Taking Layer0 and Layer1 in fig. 3 as an example, assuming that Layer0 is not parallel and is calculated on one calculation unit, Layer1 is split into three parts, namely calculation unit 1, calculation unit 2 and calculation unit 3, and is calculated on three calculation units in parallel, as shown in fig. 6.
In this case, the three calculation units share the convolution calculations of Layer1 equally. Assume the calculation assignment of the three calculation units is as shown in FIG. 7, i.e. Layer1 needs to perform 9 template calculations in total and each calculation unit is responsible for three of them. Denote each template calculation as stencil[i][j] and the data it requires as data[i][j].
After Layer1 is split, the Linebuffer needs to provide data for each compute unit. Take the first row as an example. When Layer0 has computed data 00 to data 22, the Linebuffer sends data 00-22 computed by Layer0, i.e. data[0][0], to calculation unit 1, which starts computing stencil[0][0]. However, calculation units 2 and 3 cannot start yet, because data 23 and 24 have not been computed by Layer0. When Layer0 completes one more template calculation, the Linebuffer obtains data 23, updates once, and sends data[0][1] to calculation unit 2, which starts computing stencil[0][1]. Similarly, after the Linebuffer obtains data 24, it sends data[0][2] to calculation unit 3, which starts computing stencil[0][2]. It can be seen that the three calculation units cannot start computing at the same time, i.e. they cannot be synchronized. Assume the time for Layer0 to compute one template (producing one value) is S_0 and the time for Layer1 to compute one template is S_1, and ignore the time the Linebuffer spends reading, updating, and sending data. This process is illustrated in FIG. 8.
It can be seen that calculation unit 2 must wait S_0 before it can start computing, and calculation unit 3 must wait 2S_0. The computations of the 3 calculation units are not synchronized and will not become synchronized in later rounds. If the underlying hardware architecture is a strongly synchronous architecture, such asynchronous operation greatly complicates algorithm scheduling, and the hardware architecture may not support such operation at all.
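Generalizing the timing of FIG. 8 (a straightforward consequence of the above description, stated here only for clarity, and assuming, as in the figure, a horizontal stride of 1 so that each additional unit needs exactly one more Layer0 output): with parallelism p, calculation unit i can begin its first template calculation only after waiting (i - 1) × S_0, for i = 1, ..., p, and this offset of S_0 between neighboring units persists in every later round unless extra synchronization logic is added.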
Alternatively, the problem could be solved by having the three computing units negotiate synchronization among themselves, but this undoubtedly increases the communication cost between the computing units and also complicates the synchronization logic.
The embodiment of the invention provides a parallel computing method based on line buffer Linebuffer, which can be applied to template computing, so that the Linebuffer has the synchronous adjustment capability, and further, consumers of the Linebuffer can perform synchronous computing. Optionally, the specific method may include: firstly, determining a template calculation object; secondly, establishing a preset template of a line buffer Linebuffer according to the template parameters of the template calculation object and the number of the calculation units; and finally, simultaneously transmitting the template data to the plurality of computing units through the preset template of the line buffer Linebuffer, and processing respective computing tasks in parallel by each computing unit. Taking a convolutional neural network as a template calculation object as an example, as can be seen from fig. 9, the Linebuffer-based parallel calculation method provided by the embodiment of the present invention may include:
step S901, determining a network layer to be processed in parallel;
step S902, allocating a plurality of computing units to the network layer. Taking the example shown in fig. 6, three calculation units, namely calculation unit 1, calculation unit 2 and calculation unit 3, are allocated to Layer 1. When a network layer requiring parallel processing is selected in the convolutional neural network, one or more network layers may be selected from all network layers according to the calculation amount of each network layer, which is not limited in the present invention.
And step S903, constructing a preset template of the line buffer Linebuffer according to the template parameters of the network layer and the number of the computing units.
Step S904, the template data is transmitted to the plurality of computing units through the preset template of the line buffer Linebuffer at the same time, and each computing unit processes the tasks of the network layer in parallel, where the template data is the original template data defined by the template parameters.
FIG. 10 is a schematic diagram of the conventional Linebuffer technique, which adds the input feature map to the scenario of FIG. 6. Analyzing FIG. 10 together with FIG. 6, the conventional scheme would use a 3 × 3 template with a step size of 1 for each computing unit.
Optionally, the preset template in the embodiment of the present invention is composed of a plurality of original templates with specified sizes, and the template data required by the computation unit to perform the convolution operation is located in the original templates. The number of the original templates is equal to that of the calculating units, and the original templates are connected in sequence in the preset template and at least partially overlapped. The original templates may be the same size or different sizes, which is not limited in the present invention.
That is to say, in the conventional scheme, one original template is adopted to sequentially acquire data required by each computing unit to execute convolution computation, and in the embodiment of the present invention, a plurality of original templates are combined together to form an enlarged preset template, so that each computing unit simultaneously acquires data to be processed, thereby implementing parallel computation of a plurality of computing units. In practical application, the calculation template of Linebuffer is preferentially expanded in the horizontal direction.
Assume the parallelism of the Linebuffer consumers is p; then p templates that are consecutive in the horizontal direction form a new template, called the large template (i.e. the preset template in the above embodiment). The horizontal step size of the large template is p × stride_x. FIG. 11 illustrates this process (taking p = 3 as an example). Referring to FIG. 11, the original template is rectangular with a size of 3 × 3, and a large template is formed by three consecutive original templates extending in the horizontal direction, where the three consecutive original templates may have overlapping portions. The step S902 may further include: simultaneously transmitting the template data of each original template to the plurality of computing units through the plurality of original templates of the line buffer Linebuffer, with each computing unit processing the tasks of the network layer in parallel. Optionally, the method specifically includes:
s902-1, averagely dividing an input feature map of the convolutional neural network into a plurality of data segments, such as the 8 × 6 data segments shown in fig. 11;
s902-2, simultaneously acquiring template data required by each calculation unit to execute convolution operation by using a plurality of original templates, and transmitting the acquired template data to the corresponding calculation unit to execute calculation.
In the working process, Linebuffer can obtain p original template data according to the data contained in the large template, and simultaneously send the p original template data to p computing units for template computing.
After the step S902-2, the method may further include: and S902-3, continuously moving the plurality of original templates by preset step length according to the designated direction, simultaneously acquiring new template data required by each calculation unit for currently executing convolution operation based on the plurality of data image blocks after the plurality of original templates move each time, and transmitting the new template data to the corresponding calculation unit until the plurality of data image blocks are completely read.
In an optional embodiment of the present invention, while the template data currently required by the plurality of computing units is being acquired, the new template data required by the plurality of computing units for the next template calculation may also be acquired based on the input feature map and stored into a preset data buffer; when the data buffer is full, the preset template moves by the preset step size. The hatched portions (data blocks 25, 26, and 27) in FIG. 12 form the buffer.
That is, in the embodiment of the present invention, a buffer of p × stride_x storage locations may be added at the end of the Linebuffer. While the templates in the preset template provide each calculation unit with the data required for its convolution calculation, the Linebuffer buffer continuously reads the data generated by the previous layer, which reduces the time the templates spend acquiring data and thus improves the calculation efficiency. When the Linebuffer buffer is full, the Linebuffer moves the preset template (comprising the plurality of original templates) once. With the added buffer, the Linebuffer can send a plurality of template data simultaneously, so that a plurality of consumers can compute in parallel at the same time. As shown in FIG. 13, when the Linebuffer has obtained all the data of the first large template, three template data are sent to the three computing units at the same time, and the three computing units can start computing synchronously. Meanwhile, the Linebuffer continuously receives the data generated by Layer0 and stores it into the buffer. When the buffer is full, the Linebuffer sends the next 3 template data, and each calculation unit starts the next round of calculation immediately after receiving its template data.
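The behavior of the preset (large) template can be sketched functionally as follows (a minimal illustration only: it assumes the whole Layer0 output is already materialized as a 2D list, whereas in the real design it arrives element by element and the p × stride_x buffer absorbs it while the consumers compute; all names are illustrative assumptions).

    def synchronized_template_groups(fmap, kh=3, kw=3, stride=1, p=3):
        """Yield groups of p template windows that are dispatched simultaneously.

        The p windows are horizontally adjacent (offset by `stride`), forming the
        large template of FIG. 11; the large template advances by p * stride in
        the horizontal direction and drops down by `stride` rows on a line feed.
        """
        h, w = len(fmap), len(fmap[0])
        for top in range(0, h - kh + 1, stride):            # line feed
            left = 0
            while left + (p - 1) * stride + kw <= w:        # large-template move
                group = []
                for unit in range(p):                       # one window per computing unit
                    col = left + unit * stride
                    group.append([row[col:col + kw] for row in fmap[top:top + kh]])
                yield group                                 # all p consumers start together
                left += p * stride                          # preset step size p * stride_x

For an 8-column feature map with p = 3, this yields two groups per row, and each group lets the three computing units start their template calculations at exactly the same time.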
In the embodiment shown in FIG. 13, 3S_0 > S_1, so each computing unit waits for a period of time after finishing a round of computation. If the parallelism happens to equal the multiple by which the front layer's computation speed exceeds the back layer's computation speed (i.e. p × S_0 = S_1), the back-layer computing units never need to wait and can start each computation directly. In that case, ignoring the overhead of the Linebuffer and of communication and control, the computation utilization of both layers is 100%, and all parallel computations start synchronously. In summary, at the cost of a little extra storage, the synchronous Linebuffer makes the parallel computation synchronized.
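Put quantitatively (a direct consequence of the description rather than something it states explicitly): the front layer fills the buffer in p × S_0 while the p consumers finish one round in S_1. When p × S_0 > S_1, each consumer idles for p × S_0 - S_1 per round, giving a back-layer utilization of S_1 / (p × S_0); when p × S_0 < S_1, the producer side stalls instead; only when p × S_0 = S_1 do both layers reach the 100% utilization mentioned above.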
The Linebuffer-based parallel computing method of the above embodiment suits situations where the control granularity is fine. In practical applications, the underlying hardware can sometimes only provide coarser-grained control, for example control in units of lines, with the Linebuffer adopting a line-pipelining technique. When the control granularity is multiple lines, i.e. multi-line pipelining, the Linebuffer still supports this parallel approach; in that case the buffer of the Linebuffer becomes stride_y lines, i.e. it buffers stride_y lines of data, as shown in FIG. 14.
Optionally, in the embodiment of the present invention, when performing Linebuffer-based calculation, the Linebuffer is implemented by a group of registers; each original template in the preset template comprises a plurality of registers which, based on the data image blocks in the input feature map, read the template data required for each template calculation and provide it to the calculation unit. Optionally, one register may correspond to the data of one data tile.
As shown in FIG. 15, the registers R00 to R24 hold 3 × 3 templates similar to those shown in FIG. 4 and FIG. 5, which are sent to the 3 calculation units respectively. While the calculation units compute their templates, the synchronous Linebuffer continuously acquires new template data through read-in 2 so that each calculation unit can keep executing template calculations. Through read-in 1, the synchronous Linebuffer continuously reads new data (data 25, 26, 27 in FIG. 12) and stores it in a buffer composed of the three shift registers B00, B01 and B02. The write controller constantly directs the data arriving from read-in 1 to be written cyclically to B00, B01, B02, B00, B01, B02, and so on.
When the buffer is full, the Linebuffer performs one large-template shift: all shift registers in the Linebuffer (including the buffer) shift 3 positions to the left, and the state of the Linebuffer changes as shown in FIG. 16. Registers R00 to R24 then send new templates to the compute units, and the buffers B00, B01, B02 wait to read in new data 30, 31, 32.
Eventually the Linebuffer reaches a position where a line change is needed. The line-feed operation is then performed: all registers move 3 positions to the left and 3 new data elements are read in. The Linebuffer then reaches the state shown in FIG. 17, and the buffers B00, B01, B02 wait to read in new data 35, 36, 37.
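The register-level behavior of FIG. 15 to FIG. 17 can likewise be modeled with a short sketch (register names follow the figures as described above; treating R00-R24 and B00-B02 as one flat shift register, ignoring read-in 2, and the Python modeling itself are illustrative assumptions).

    # Synchronous Linebuffer for p = 3 consumers and 3 x 3 templates (FIG. 15).
    R = [None] * 15          # R00..R24: 3 rows x 5 columns, row-major
    B = [None] * 3           # B00, B01, B02: the added buffer
    write_sel = 0            # write controller: cycles B00 -> B01 -> B02 -> B00 ...

    def current_templates():
        """The three overlapping 3 x 3 templates (columns 0-2, 1-3, 2-4 of the
        3 x 5 register block) that are sent to the three computing units at once."""
        rows = [R[i * 5:(i + 1) * 5] for i in range(3)]
        return [[row[c:c + 3] for row in rows] for c in range(3)]

    def read_in_1(value):
        """A new Layer0 output arrives on read-in 1 and is written cyclically
        into B00, B01, B02 by the write controller."""
        global write_sel
        B[write_sel] = value
        write_sel = (write_sel + 1) % 3

    def large_template_shift():
        """When the buffer is full, every shift register (buffer included) shifts
        3 positions to the left (FIG. 16); B00-B02 are then empty and wait to
        read in the next three data elements."""
        global R, B, write_sel
        combined = R + B
        combined = combined[3:] + [None, None, None]
        R, B = combined[:15], combined[15:]
        write_sel = 0

In this simplified model the line feed of FIG. 17 is the same shift-by-3 operation, differing only in which new data subsequently arrives through read-in 1.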
Based on the same inventive concept, an embodiment of the present invention further provides a computing device, including: the processor is used for executing the parallel computing method based on the line buffer Linebuffer according to any embodiment. Additionally, the computing device may further include: a storage device for storing a computer program that is loaded and executed by the processor when running in the computing device.
The embodiment of the invention provides a more efficient synchronous calculation method based on the line buffer Linebuffer. For a neural network, the network layer that needs to perform parallel computation is determined first, a plurality of computing units are then allocated to that layer, a preset template of the line buffer Linebuffer is constructed according to the template parameters of the network layer and the number of computing units, the template data are transmitted to the plurality of computing units simultaneously through the preset template, and the plurality of computing units then execute the computation in parallel. The method provided by the embodiment of the invention can be implemented on most general storage architectures, such as a register set or a RAM. This Linebuffer-based synchronous computing method solves the problem that algorithms such as neural network algorithms and multi-step image processing algorithms lose synchronization after being split for parallel execution, so the synchronous Linebuffer can be widely applied to hardware architectures such as many-core neural network accelerator architectures and many-core image processor architectures, and is particularly suitable for hardware architectures that require strong synchronization.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering; these words may be interpreted as names.
Thus, it should be appreciated by those skilled in the art that while a number of exemplary embodiments of the invention have been illustrated and described in detail herein, many other variations or modifications consistent with the principles of the invention may be directly determined or derived from the disclosure of the present invention without departing from the spirit and scope of the invention. Accordingly, the scope of the invention should be understood and interpreted to cover all such other variations or modifications.

Claims (10)

1. A parallel computing method based on line buffer Linebuffer is applied to a template computing structure, and comprises the following steps:
determining a template calculation object;
constructing a preset template of a line buffer Linebuffer according to the template parameters of the template calculation object and the number of the calculation units;
and simultaneously transmitting the template data to the plurality of computing units through the preset template of the line buffer Linebuffer, and processing respective computing tasks in parallel by each computing unit.
2. The method of claim 1, when the template computing structure is a convolutional neural network, the method comprising:
determining a network layer needing parallel processing;
distributing a plurality of computing units for the network layer;
constructing a preset template of a line buffer Linebuffer according to the template parameters of the network layer and the number of the computing units;
and simultaneously transmitting template data to the plurality of computing units through a preset template of the line buffer Linebuffer, and processing tasks of the network layer in parallel by each computing unit, wherein the template data is original template data limited by the template parameters.
3. The method according to claim 2, wherein the preset template of Linebuffer is composed of a plurality of original templates with specified sizes, and the number of the original templates is equal to the number of the computing units;
the original templates are sequentially connected in the preset template, and at least partially overlapped.
4. The method according to claim 2 or 3, wherein the transmitting the template data to the plurality of computing units simultaneously through the preset template of the line buffer Linebuffer, and the processing of the tasks of the network layer by each computing unit in parallel comprises:
and simultaneously transmitting the template data of the original templates to the computing units through the original templates of the line buffer Linebuffer, and processing the tasks of the network layer in parallel by each computing unit.
5. The method according to any one of claims 2 to 4, wherein the simultaneously transmitting the template data of the plurality of original templates to the plurality of computing units through the plurality of original templates of the line buffer Linebuffer, and the parallel processing of the tasks of the network layer by each computing unit comprises:
averagely dividing an input feature map of the convolutional neural network into a plurality of data image blocks in advance;
simultaneously acquiring template data required by each computing unit to execute convolution operation based on the data image blocks by using the original templates, and transmitting the acquired template data to the corresponding computing unit;
and continuously moving the original templates by preset step length according to the specified direction, simultaneously acquiring new template data required by each calculation unit for currently executing convolution operation after the original templates move each time, and transmitting the new template data to the corresponding calculation unit until all the data image blocks are read.
6. The method of claim 5, wherein the method further comprises:
when template data which are currently required by the plurality of computing units are obtained, obtaining new template data which are required by the plurality of computing units to execute the next template computing based on the input feature map;
and storing the new template data into a preset data buffer area.
7. The method of claim 6, wherein the preset template is moved by a preset step size when the data buffer is full.
8. The method of any of claims 1-7, wherein the Linebuffer is implemented by a set of registers;
each original template in the preset template comprises a plurality of registers which, based on the data image blocks in the input feature map, read the template data required for each template calculation and provide it to the calculation unit.
9. A computing device, comprising:
a processor for performing the line buffer Linebuffer-based parallel computing method as recited in any one of claims 1 to 8.
10. The computing device of claim 9, wherein the computing device further comprises:
a storage device for storing a computer program that is loaded and executed by the processor when running in the computing device.
CN201910317455.9A 2019-04-19 2019-04-19 Parallel computing method and computing device based on line buffer Linebuffer Pending CN111832713A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910317455.9A CN111832713A (en) 2019-04-19 2019-04-19 Parallel computing method and computing device based on line buffer Linebuffer
PCT/CN2020/082960 WO2020211654A1 (en) 2019-04-19 2020-04-02 Linebuffer-based parallel computing method and computing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910317455.9A CN111832713A (en) 2019-04-19 2019-04-19 Parallel computing method and computing device based on line buffer Linebuffer

Publications (1)

Publication Number Publication Date
CN111832713A true CN111832713A (en) 2020-10-27

Family

ID=72838012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910317455.9A Pending CN111832713A (en) 2019-04-19 2019-04-19 Parallel computing method and computing device based on line buffer Linebuffer

Country Status (2)

Country Link
CN (1) CN111832713A (en)
WO (1) WO2020211654A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150036920A1 (en) * 2013-07-31 2015-02-05 Fujitsu Limited Convolutional-neural-network-based classifier and classifying method and training methods for the same
US20180137407A1 (en) * 2016-11-14 2018-05-17 Kneron, Inc. Convolution operation device and convolution operation method
CN108182471A (en) * 2018-01-24 2018-06-19 上海岳芯电子科技有限公司 A kind of convolutional neural networks reasoning accelerator and method
CN108764182A (en) * 2018-06-01 2018-11-06 阿依瓦(北京)技术有限公司 A kind of acceleration method and device for artificial intelligence of optimization
CN109165728A (en) * 2018-08-06 2019-01-08 济南浪潮高新科技投资发展有限公司 A kind of basic computational ele- ment and calculation method of convolutional neural networks

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102842048A (en) * 2011-06-20 2012-12-26 苏州科雷芯电子科技有限公司 Hardware implementation method of related parallel computation of groups in image recognition
KR102642853B1 (en) * 2017-01-05 2024-03-05 한국전자통신연구원 Convolution circuit, application processor having the same, and operating methoe thereof
CN108229645B (en) * 2017-04-28 2021-08-06 北京市商汤科技开发有限公司 Convolution acceleration and calculation processing method and device, electronic equipment and storage medium
CN107862650B (en) * 2017-11-29 2021-07-06 中科亿海微电子科技(苏州)有限公司 Method for accelerating calculation of CNN convolution of two-dimensional image
CN108388537B (en) * 2018-03-06 2020-06-16 上海熠知电子科技有限公司 Convolutional neural network acceleration device and method

Also Published As

Publication number Publication date
WO2020211654A1 (en) 2020-10-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination