CN110716751B - High-parallelism computing platform, system and computing implementation method

High-parallelism computing platform, system and computing implementation method

Info

Publication number
CN110716751B
Authority
CN
China
Prior art keywords
data
input
computing
platform
system clock
Prior art date
Legal status
Active
Application number
CN201810765894.1A
Other languages
Chinese (zh)
Other versions
CN110716751A (en)
Inventor
王俊斌
王汐
方绍峡
于谦
隋凌志
单羿
Current Assignee
Xilinx Inc
Original Assignee
Xilinx Inc
Priority date
Filing date
Publication date
Application filed by Xilinx Inc filed Critical Xilinx Inc
Priority to CN201810765894.1A
Publication of CN110716751A
Application granted
Publication of CN110716751B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3802Instruction prefetching
    • G06F9/3814Implementation provisions of instruction buffers, e.g. prefetch buffer; banks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • G06F9/3889Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • G06F9/3893Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a high-parallelism computing platform, a high-parallelism computing system and a related implementation method. The computing platform includes: an input buffer for buffering first and second data read from an external memory; a read controller module for reading, from the input buffer, the first data and second data required by the parallel computing module to perform a single operation, wherein the amount of second data read is N times that of the first data, and N is an integer greater than or equal to 2; a parallel computing module, operating at N times the system clock frequency, for performing a specific computation on the first and second data in a single operation, wherein each copy of the first data is reused against N copies of the second data to complete the computation within one system clock cycle; and a result cache for caching the computation results output by the parallel computing module. By combining computation-logic frequency multiplication with data multiplexing, the operating-frequency limitation of the data handling modules can thus be overcome and the overall computing efficiency of the computing platform improved.

Description

High-parallelism computing platform, system and computing implementation method
Technical Field
The invention relates to the field of hardware architecture, in particular to a high-parallelism computing platform, a high-parallelism computing system and a computing implementation method.
Background
In recent years, methods based on Artificial Neural Networks (ANN), and in particular Convolutional Neural Networks (CNN), have been highly successful in many applications. In the field of computer vision, particularly for image classification, the introduction of CNNs has greatly improved classification accuracy.
Although artificial-neural-network-based approaches deliver state-of-the-art performance, they require far more computational and memory resources than traditional approaches. In particular, as neural networks evolve, large networks contain ever more layers and data, and traditional CPU platforms can no longer meet their practical requirements. Designing neural network accelerators on high-parallelism heterogeneous computing platforms such as FPGAs, GPUs and ASICs has therefore become a new research hotspot. Among these, FPGAs and ASICs have good market prospects thanks to their high customizability, high energy-efficiency ratio, and low latency.
When a high-parallelism computing platform such as an FPGA or an ASIC is used to execute computation, improving computational utilization on top of the existing hardware capability becomes an important consideration, particularly for a neural network computing platform that involves a large number of convolutional layer operations.
Therefore, there remains a need for schemes that can further optimize high-parallelism computation.
Disclosure of Invention
In order to solve at least one of the above problems, the present invention provides a computation architecture in which the computing logic runs at a multiplied clock frequency while the other logic remains at the base frequency. This fully exploits the frequency headroom of existing computing logic and multiplies computational efficiency with a relatively simple hardware configuration and a relatively low power-consumption cost, thereby significantly improving the capability of a computing platform for large-volume computation, especially highly multiplexed parallel computation.
According to an aspect of the present invention, there is provided a high-parallelism computing platform, comprising: an input buffer, operating at a system clock frequency, for buffering first data and second data read from an external memory for high-parallelism computation; a read controller module, operating at the system clock frequency, for reading from the input buffer the first data and second data required by the parallel computing module to perform a single operation, wherein the amount of second data read is N times that of the first data, and N is an integer greater than or equal to 2; a parallel computing module, operating at N times the system clock frequency, for performing a specific computation on the first data and the second data in a single operation, wherein each copy of the first data is reused against N copies of the second data to complete the computation within one system clock cycle; and a result cache, operating at the system clock frequency, for caching the computation results output by the parallel computing module. By combining computation-logic frequency multiplication with data multiplexing, the frequency limitation of the data handling modules can thus be overcome and the overall computing efficiency of the computing platform improved. Here, the value of N may be determined based at least on the hardware configuration of the platform and the parallelism policy of the high-parallelism computation.
Preferably, the parallel computing module may comprise a plurality of multipliers operating at N times the system clock frequency to perform the multiplication of the first data and the second data, wherein, for each multiplier, the input at the first input terminal is the first data and the input at the second input terminal is the second data selected via an N-bit data selector. The read controller module may include a plurality of read controllers for reading the first data and the second data from the input buffer to the first input terminal of the corresponding multiplier and to the N inputs of the N-bit data selector, respectively. Data multiplexing under a multiplied computation clock is thus achieved with a relatively simple hardware design.
Preferably, the outputs of M multipliers are connected to the input of an adder to form a multiply-add unit, where M is an integer greater than or equal to 2, making the computing platform suitable for more complex parallel computation. Each multiply-add unit may further comprise N registers connected to the output of the adder, each registering one of the N multiply-add results produced within one system clock cycle, with the N multiply-add results output to the result cache once per system clock cycle. Alternatively, each multiply-add unit may further comprise N accumulators connected to the output of the adder, each registering one of N multiply-add accumulation results over a predetermined number of system clock cycles, the N accumulation results being buffered into the result cache together at predetermined intervals of system clock cycles. The manner of registering and outputting results can thus be chosen flexibly according to actual needs.
The control signals for the N-bit data selector and for the N registers or accumulators are issued at N times the system clock frequency and indicate the index of the second data currently being computed, so that the correct second data is conveniently selected by the control signals.
The high-parallelism computing platform may be a neural network computing platform, and is particularly suitable for performing convolutional neural network computation. The first data may be multiplexed feature map data and the second data weight data, or vice versa.
Preferably, M is equal to 3, and the multiply-add unit includes: three multipliers; an N-bit data selector connected to the second input of each multiplier; an adder whose inputs are coupled to the outputs of the multipliers; and N accumulators connected in parallel to the output of the adder. Such a unit is suited to parallel computation over the three RGB channels of an input feature map.
The parallel processing module may be implemented at least in part by an FPGA, GPU or ASIC. In addition, the input cache and the result cache may also be implemented by dynamically configurable on-chip caches.
According to another aspect of the invention, a computing-platform-implemented method for a neural network is provided, comprising: reading feature map data and weight data from the external memory into the input buffer using the computing platform of any one of the above; the read controller module reading the feature map data and the weight data required for a single parallel computing operation, wherein the amount of feature map data read at a time is N times that of the weight data, or the amount of weight data read at a time is N times that of the feature map data, N being an integer greater than or equal to 2; and the parallel computing module, at N times the system clock frequency, performing N multiplication operations that reuse the weight data or the feature map data within a single operation, thereby completing the computation within one system clock cycle.
Preferably, the method further comprises: the result buffer receives the multiplication and addition result of the parallel computing module in a single operation at the system clock frequency, or receives the accumulated multiplication and addition result of the parallel computing module in a predetermined number of operations at a predetermined interval of system clock cycles.
Reading the feature map data and the weight data from the external memory into the input buffer using the computing platform comprises: reading the feature map data and the weight data from the external memory into the input buffer in amounts such that one is N times the other (N times as much feature map data as weight data, or vice versa).
The input buffer may read new feature map data and weight data from the external memory after each predetermined number of single parallel computing operations by the parallel computing module.
Preferably, the read controller module reading the feature map data and the weight data required for a single parallel computing operation may include: feeding the feature map data to the first input terminals of the multipliers of the multiply-add units according to the input-channel parallelism; and feeding weight data belonging to N different weights to the N inputs of the N-bit data selector whose output is connected to the second input terminal of each multiplier, according to the output-channel parallelism.
According to still another aspect of the present invention, there is provided a highly parallel computing system including: the computing platform of any of the preceding claims; a mass storage memory located external to the computing platform; and a processor coupled to the computing platform and the memory, for performing the implementation method of any of the preceding claims.
In one embodiment, the parallel processing module is implemented at least in part by an FPGA, GPU, or ASIC.
By multiplying the frequency of the computing logic, the computing platform provided by the invention is better suited to high-parallelism computation, in particular convolutional neural network computation with a high degree of data reuse. For data handling modules whose operating frequency cannot be raised further owing to hardware complexity, the invention raises only the operating frequency of the computing module and adds a relatively simple read controller and data selector circuit, thereby fully exploiting the frequency headroom of the computing module and increasing the overall computing efficiency of the platform. The computing platform is simple in design, requires no high-complexity asynchronous circuitry, is easy to scale, and allows the frequency design to follow actual requirements.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
FIG. 1 shows a series of ordered running layers for a typical CNN.
Fig. 2 shows a typical operation example of one convolutional layer in a neural network.
Fig. 3 shows an example of a convolution operation.
FIG. 4 shows a schematic diagram of a computing platform, according to one embodiment of the invention.
Fig. 5 shows an example of prior-art same-frequency computation.
Fig. 6 shows an example of computing-logic frequency doubling according to the present invention.
Fig. 7 shows a waveform timing diagram of the structure of fig. 6.
FIG. 8 illustrates a flow diagram of a computing platform implemented method according to one embodiment of the invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Existing general-purpose processors (CPUs) must be highly versatile to handle a variety of data types, and their logical decisions introduce a large amount of branching and interrupt handling. These factors make the internal structure of a CPU exceptionally complex and ill-suited to operations on large-scale data of highly uniform type with no mutual dependencies. Designing high-parallelism computing platforms, especially neural network accelerators, on high-parallelism heterogeneous computing platforms such as FPGAs, GPUs and ASICs has therefore become a new research hotspot. Compared with GPU platforms, FPGAs and ASICs can achieve a higher energy-efficiency ratio for computation, while their fast iteration and flexible reconfiguration or design make them well suited to rapid algorithm development.
The invention provides a novel high-parallelism computing platform which is suitable for parallel operation of large-scale data with relatively uniform types and no dependence on each other, and is particularly suitable for a convolutional neural network capable of improving the computation parallelism by multiplexing input feature maps or weight parameter data. The parallel computing modules or the whole of the computing platform are preferably implemented by FPGAs or ASICs.
Although the computing platform solution of the present invention will be described below mainly in conjunction with parallel computing for convolutional neural networks, it should be understood by those skilled in the art that the computing solution of the present invention is applicable to various high-parallelism computing scenarios such as scientific computing, weather simulation, biological simulation, molecular mechanics model, aircraft manufacturing, and military simulation, and is particularly applicable to application scenarios with high data reuse rate.
CNN basic concept
Artificial intelligence has developed rapidly in recent years and has been applied with good results in image classification, detection, video and speech processing, among other fields, and it still holds great promise. Neural networks are at the core of artificial intelligence applications, and deep-learning neural network algorithms are among the most common neural network models. The workload of a neural network is both compute- and data-intensive. The multiply-add operations required for neural network computation are typically on the order of giga-operations; for example, the object-detection network SSD requires about 120 G operations. The parameters required for computation typically range from megabytes to hundreds of megabytes; for example, the parameters of the classification network VGG amount to 480 MBytes.
Common Artificial Neural Networks (ANN) include Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), and Convolutional Neural Networks (CNN). CNN is used as an example in the background description below.
As shown in fig. 1, a typical CNN consists of a series of layers that run in sequence.
A CNN is composed of an input layer, an output layer, and a number of hidden layers connected in series. The first layer of the CNN reads an input value, such as an input image, and outputs a series of activation values (also referred to as a feature map). Each subsequent layer reads the activation values generated by the previous layer and outputs new activation values. A final classifier outputs the probability of each class to which the input image may belong.
These layers can be roughly divided into weighted layers (e.g., CONV layers, fully connected layers, batch normalization layers, etc.) and unweighted layers (e.g., pooling layers, ReLU layers, Softmax layers, etc.). Convolutional layers take a series of feature maps as input and convolve them with convolution kernels to obtain output activation values. A pooling layer is typically connected to a CONV layer and outputs the maximum or average value of each sub-area in each feature map, thereby reducing the computational effort through sub-sampling while maintaining a degree of invariance to displacement, scale and deformation. A CNN may contain multiple alternations of convolutional and pooling layers, gradually reducing the spatial resolution and increasing the number of feature maps. CONV layers can also be connected directly without an intervening pooling layer. The network can then be connected to at least one fully connected layer (FC), which applies a linear transformation to the input feature vector and produces a one-dimensional vector output containing a plurality of feature values.
In general, the operation of a weighted layer can be represented as:
Y=WX+b,
where W is the weight value, b is the bias, X is the input activation value, and Y is the output activation value.
The operation of the unweighted layer can be represented as:
Y=f(X),
wherein f (X) is a non-linear function.
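As a purely illustrative sketch (not part of the patent text), the two layer types above can be expressed as follows; the array shapes and the choice of ReLU for f are assumptions made only for this example.

```python
import numpy as np

def weighted_layer(X, W, b):
    # Y = W X + b: linear transform using learned weights and bias
    return W @ X + b

def unweighted_layer(X):
    # Y = f(X): f is assumed here to be ReLU, a common non-linear choice
    return np.maximum(X, 0)

X = np.random.rand(128)       # input activation vector (size chosen for illustration)
W = np.random.rand(64, 128)   # weight matrix learned during training
b = np.random.rand(64)        # bias
Y = unweighted_layer(weighted_layer(X, W, b))
```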
Here, "weights" (weights) refer to parameters in the hidden layer. In a CNN network, the weights may be considered as convolution kernels that may vary in size for each convolution layer, and may also vary in value for each channel of each convolution layer. It is to be understood in a broad sense that the weights may also include biases and are values learned through the training process and remain unchanged at the time of inference. In addition, the CNN may also include parameters for performing other operations, such as parameters required by the unweighted layers for various types of operations. The activation value refers to a value, also referred to as a feature value, transferred between layers, starting from an input layer, and an output of each layer is obtained by an operation of the input value and a weight value. Unlike the parameter values, the distribution of activation values may vary dynamically depending on the input data sample.
As shown, each layer from the input feature map (input image) has a plurality of channels (channels) to characterize different features of the input image before the feature values are fed into the FC layer. When the color image is input, the initial input feature map usually has three channels of RGB, the feature values and convolution kernels with the same size but different values in different channels in the same Layer are respectively subjected to convolution calculation to generate the output feature value of the Layer, and then the feature value is sent to the next CONV Layer (Layer 1) with the number of channels and the size of the convolution kernels being different for further feature extraction. The above process is repeated until the output of Layer 7 is fed into the FC Layer. As shown, W, H, and C in the input feature map refer to the three dimensions width, height, and channel, respectively. The above arrows may refer to a specific order of computation or degree of computational parallelism (especially if the computation is performed on a high-parallelism computing platform).
The first FC layer may be a fully connected layer that gathers the features of the individual channels into a one-dimensional feature vector. The second FC layer may then be a classifier for classification.
Operation of the convolutional layer
A typical neural network model, particularly for computer vision applications, whether DNN, RNN or CNN, comprises a number of CONV layers as shown in Fig. 1. Each CONV layer extracts higher-level abstract data from the input feature-map data in order to preserve the important and unique information in the input. Modern DNNs achieve excellent visual performance by using deep hierarchies (e.g., hundreds of convolutional layers).
Fig. 2 shows a typical operation example of one convolutional layer in a neural network; the same applies to fully connected layers such as the FC layer shown in Fig. 1. The three-dimensional input to each convolutional layer is a two-dimensional feature map (W × H) with a plurality of channels (C). The first input to a neural network performing visual processing is typically a two-dimensional image with the three RGB color channels. A set of three-dimensional filters (M filters of dimensions R × S × C, also referred to as convolution kernels) is then convolved with the input feature map, and each filter generates one channel of the output three-dimensional feature map (a two-dimensional E × F feature map with M channels). The same set of M filters can be applied to a batch (B) of N input feature maps, so that N input feature maps yield N output feature maps (the batch B can also be regarded as a fourth dimension of the input). In addition, a one-dimensional bias (not shown in Fig. 2) may be applied to the filtered results.
Fig. 3 shows an example of a convolution operation. This operation can be regarded as the convolution of a two-dimensional filter (R × S) with a two-dimensional feature map (W × H) on one channel C. As shown in Fig. 3, a 5 × 5 (W × H) feature map is convolved with a 3 × 3 (R × S) convolution kernel at stride 1. The left side of the figure shows the first convolution calculation, the middle the second, and so on. By the definition of convolution, each individual convolution calculation can be decomposed into multiple multiply-add calculations. After nine convolution calculations, the convolved 3 × 3 feature map on the right of Fig. 3 is obtained. There are no dependencies among these nine convolution calculations, so on a high-parallelism computing platform they can be completed in a single operation (the parallelism M can typically reach the order of thousands). Fig. 3 can be regarded as the convolution operation on one channel C among the several channels of a CONV layer; only after the convolutions over all channels C and the subsequent additions are complete is the feature map of one of the M channels of the output three-dimensional feature map obtained. Furthermore, that output three-dimensional feature map (a two-dimensional E × F feature map with M channels) is only one of the N output feature maps in the batch.
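The dimensions introduced above (batch N, input channels C, W × H input, M filters of R × S × C, E × F output) can be laid out as a naive loop nest; the sketch below is illustrative reference code only (stride 1 and no padding are assumed) and does not represent the accelerator's actual dataflow.

```python
import numpy as np

def conv_layer(inputs, filters, bias):
    # inputs: (N, C, H, W); filters: (M, C, R, S); bias: (M,)
    N, C, H, W = inputs.shape
    M, _, R, S = filters.shape
    E, F = H - R + 1, W - S + 1              # output size for stride 1, no padding
    out = np.zeros((N, M, E, F))
    for n in range(N):                        # batch
        for m in range(M):                    # output channels, one per filter
            for e in range(E):                # output rows
                for f in range(F):            # output columns
                    # each (m, e, f) multiply-add below is independent of the others,
                    # which is what makes the layer amenable to high parallelism
                    window = inputs[n, :, e:e + R, f:f + S]
                    out[n, m, e, f] = (window * filters[m]).sum() + bias[m]
    return out
```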
CNN computation thus involves a large number of convolution operations with no dependencies among them, which makes it particularly suitable for implementation on a high-parallelism computing platform.
Basic architecture of the computing platform of the present invention
To handle such highly parallel computation, the invention provides a new computing platform. FIG. 4 shows a schematic diagram of a computing platform according to one embodiment of the invention. As depicted in FIG. 4, high-parallelism computing platform 400 comprises input buffer 410, read controller module 420, parallel computing module 430, and result cache 440. The parallel computing module 430 operates at a multiple of the system clock frequency f, thereby improving the overall computing efficiency of the platform.
Input buffer 410 preferably operates at the system clock frequency f and may be used to buffer computational data read from an external memory (e.g., a mass storage unit as shown). The read calculation data includes first data that needs to be multiplexed, and second data used for calculation with the multiplexed first data. In one embodiment, when the computing platform is used to perform CNN calculations, the read data may include feature map data and weight data, and whether to use the feature map data or the weight data as the multiplexed first data is selected according to a multiplexing rule.
The read controller module 420 also preferably runs at the system clock frequency f and may be used to read the first data and the second data from the input buffer 420 that are needed by the parallel computing module 430 to perform a single operation. Here, if it is assumed that the parallelism of the parallel computing block 430 is M, the "single operation" of the parallel computing block 430 above refers to an operation of the parallelism M performed once by the parallel computing block 430. This "single operation" can be completed within one system clock cycle 1/f and involves M operations of the computing units that make up the parallel computing module 430.
The parallel computing module 430 operates at a multiple of the system clock frequency f, i.e., at Nf, where N is an integer greater than or equal to 2. Accordingly, of the first data and the second data required for the parallel computing module 430 read by the read controller module 420 at the frequency f to perform a single operation, the number of copies of the second data may be N times the number of copies of the first data. Thus, for example, for each compute unit in parallel compute module 430, a multiplexing of N copies of the second data to one copy of the first data is achieved for one system clock cycle 1/f.
Subsequently, result cache 440, which also runs at system clock frequency f, may cache the computation results output from the parallel computation modules. Here, the calculation result sent to the result cache 440 may be a final calculation result that needs to be directly stored back to the external memory, or may be an intermediate calculation result that is directly subjected to the next on-chip operation without being stored back externally, which is not limited in the present invention.
In the prior art, the data handling logic that implements data storage and reading in a computing platform requires a relatively complex hardware structure; its delays are large, its critical paths are long, and it is generally difficult to run at high frequency. In contrast, a parallel computing module is usually built from computing units of simpler structure connected in parallel, so its hardware structure can support high-frequency operation. Taking this limitation of existing hardware fully into account, the inventors provide a new computing architecture in which the computing logic is frequency-multiplied while the other logic stays at the base frequency. By supplying the parallel computing module with the appropriate multiplexed and non-multiplexed data for each operation, the computational utilization of the module is easily multiplied, and the overall computing efficiency of the platform is significantly improved.
Here, the parallel computing module 430 may consist of a plurality of computing units that perform various specific computations. In a single operation of the parallel computing module 430, each computing unit may perform a computation on, for example, one copy of the first data and N copies of the second data. In other words, within a single operation of the parallel computing module 430 (at frequency f), the computing unit performs N operations (i.e., at frequency Nf), each directed at one of the N copies of the second data and the same single copy of the first data. "One copy" of data here refers to the amount of first or second data required for each operation of the computing unit; depending on the operation, the amounts of first and second data may be the same or different. In one embodiment, the first and/or second data may each be a collection of several types of data. When the "one copy" amounts of the first and second data are the same, the ratio of first data to second data read from the external memory at a time by the input buffer 410 may be 1:N. When a more complex multiplexing relationship exists between the first and second data, this read ratio may be varied accordingly.
In one embodiment, the plurality of calculation units may be a plurality of multipliers for performing multiplication operations of the first data and the second data. Here, the data amount of each of the first data and the second data may be the same, for example, both are 8-bit operands. For each multiplier, the input to the first input may be a multiplexed first operand and the input to the second input may be a selected one of the N second operands. The selection can be implemented, for example, via an N-bit data selector. The N-bit data selector operates at the Nf frequency, each input terminal receives one copy of the second data, and outputs the selected second operand to the second input terminal of the multiplier at a cycle of 1/Nf according to the input control signal. Thus, the single-operation parallelism M of the parallel computing module 430 can be achieved with M/N computing units operating at Nf frequency.
In order to achieve correct computation of the input first data (one copy) and second data (N copies) by each computing unit, the read controller module 420 may include a plurality of read controllers for reading the first data and the second data from the input buffer 410 to the first input terminal of the corresponding computing unit of the parallel computing module 430 and the N input terminals of the N-bit data selector, respectively. For example, in the case where the parallel computing block 430 has M/N multipliers operating at Nf frequency to achieve a single degree of parallelism M of operation, the read controller block 420 may have M/N read controllers for feeding first data to the first input terminal of each multiplier, and M read controllers for feeding second data to the N input terminals of each N-bit data selector.
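As a sanity check on the counts just described, the relationship between the target parallelism, the frequency multiple and the required hardware can be sketched as follows; this is an illustrative calculation under the assumption that N divides M evenly, and the names are not taken from the patent.

```python
def resource_counts(parallelism_M, freq_multiple_N):
    # Running the computation logic at N times the system clock means a
    # single-operation parallelism of M needs only M / N physical multipliers.
    multipliers = parallelism_M // freq_multiple_N
    # One read controller feeds first data to each multiplier's first input ...
    first_data_controllers = multipliers
    # ... and M read controllers feed second data to the N selector inputs.
    second_data_controllers = parallelism_M
    return multipliers, first_data_controllers, second_data_controllers

# e.g. a target parallelism of 1024 with N = 2: 512 multipliers, 512 + 1024 read controllers
print(resource_counts(1024, 2))
```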
Several computing units may also be combined with other computing units to form functional units making up the parallel computing module 430. In one embodiment, the outputs of P multipliers may be connected to the input of an adder to form a multiply-add unit, where P is an integer greater than or equal to 2. For example, of the M/N multipliers of the parallel computing module 430, every P multipliers may be connected to an adder to form M/(N·P) multiply-add units, thereby implementing the multiply-add operations common in convolutional neural network computation.
At the output of each such functional unit (e.g., the output of a multiply-add unit), or at the output of each computing unit (e.g., the output of a multiplier), a register may be added for each of the N outputs. In one embodiment, each multiply-add unit may include N registers connected to the output of the adder, each registering one of the N multiply-add results produced within one system clock cycle, with the N multiply-add results output to result cache 440 once per system clock cycle. If the outputs of the multiply-add units need to be accumulated (as in a convolution operation), the registers may be replaced with accumulators. Thus, in another embodiment, each multiply-add unit may include N accumulators connected to the outputs of the adder, each registering one of N multiply-add accumulation results over a predetermined number of system clock cycles, with the N accumulation results buffered simultaneously into result cache 440 at predetermined intervals of system clock cycles. For example, an accumulator may accumulate the results of 10 beats of computation before outputting to result cache 440.
Like the N-bit data selector, the registers or accumulators connected at the output also operate under a control signal at frequency Nf. For example, if the i-th of the N copies of second data is to be computed, the control signal selects the i-th input of the N-bit data selector as the second data, which is then combined with the multiplexed first data in a computation such as a multiplication. In a multiply-add unit, the N-bit data selectors of the P multipliers receive the control signal simultaneously, each multiplier multiplies its i-th second data with its first data, the adder sums the P products, and in the same 1/Nf cycle the control signal routes the multiply-add sum for the i-th data to the i-th of the N registers or accumulators. This constitutes one beat (at N times the system frequency) of the multiply-add unit; repeating it N times processes every copy of the second data within one system clock cycle.
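The beat-by-beat behavior just described can be summarized in a small behavioral model; this is a software sketch for illustration only (the function and variable names are invented here), not a description of the actual circuit.

```python
def multiply_add_unit_cycle(first, second, accumulators):
    """Model one system clock cycle of a multiply-add unit.

    first:        P values of the multiplexed first data, one per multiplier
    second:       P lists of N values each, i.e. the N candidates presented to
                  each multiplier's N-way data selector
    accumulators: N running sums, one per output (e.g. per output channel)
    """
    P = len(first)
    N = len(accumulators)
    for i in range(N):                        # N beats at N times the system clock
        # the control signal selects the i-th second-data input of every selector
        products = [first[p] * second[p][i] for p in range(P)]
        accumulators[i] += sum(products)      # adder output goes to the i-th accumulator
    return accumulators

# Example with P = 3 multipliers (RGB inputs) and N = 2 output channels
acc = multiply_add_unit_cycle(
    first=[0.1, 0.2, 0.3],                         # R, G, B feature values
    second=[[1.0, -1.0], [0.5, 0.5], [2.0, 0.0]],  # weight_0 / weight_1 per input channel
    accumulators=[0.0, 0.0],
)
```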
Neural network computation involves a large number of convolution operations, which can be realized as a large number of multiply-add operations under various parallelism strategies, and, where computing capacity permits, the feature maps and/or weight data can conveniently be multiplexed. The high-parallelism computing platform of the invention is therefore particularly suitable for neural network computation and can significantly improve the convolution computing capability of a neural network computing platform.
In one embodiment, the high-parallelism computing platform of the present invention can be a neural network computing platform, e.g., a dedicated neural network processor that performs neural network inference. The computational logic frequency multiplication principle of the present invention will be described in more detail below in conjunction with the examples of fig. 5 and 6.
Fig. 5 shows an example of prior-art fully same-frequency computation. All of the blocks shown in Fig. 5 operate at the system clock frequency f. The input buffer 510 reads feature map data and weight data for convolution computation from an external memory and buffers them in the feature map buffer and the weight buffer, respectively. Assuming the feature map data read in is input feature map data with the three RGB channels, the R, G and B values of the same pixel position are fed, via the read controller module 520 (the internal structure of the read controllers is not shown), to one input of each of the three multipliers 532 in a multiply-add unit of the parallel computing module 530, while the corresponding weight values are fed to the other inputs by the read controller 520; that is, each multiplier input receives one copy of data (1 × data), for example an 8-bit operand. The three multipliers then perform their multiplications, the products are summed by adder 533 and fed into accumulator 534, completing one beat of computation. The accumulator accumulates the results of a predetermined number of beats and feeds the accumulated result into result buffer 540. For a 1 × 1 convolution kernel, the result output to result buffer 540 may be regarded as a final result; for a larger kernel, the results output to result buffer 540 may be treated as intermediate results and used for further computation.
Consequently, the hardware architecture of Fig. 5 can only achieve a parallelism equal to the number of multipliers, obtaining in one beat a number of multiply-add results equal to the number of multiply-add units.
In contrast, Fig. 6 shows an example of computing-logic frequency doubling according to the present invention. The data handling modules shown in Fig. 6 operate at the system clock frequency f, and only the parallel computing module that performs the computation operates at 2f (i.e., N = 2 in this example). The input buffer 610 reads feature map data and weight data for convolution computation from an external memory and buffers them in the feature map buffer and the weight buffer, respectively. Again assuming that the feature map data read in is input feature map data with the three RGB channels, the R, G and B values of the same pixel position are fed, via the read controller 620 (the internal structure of the read controllers is not shown), to one input of each of the three multipliers 632 in a multiply-add unit of the parallel computing module 630, while the corresponding weight values from two different filters are fed by the read controller 620 to the two inputs of the 2-bit data selector 635.
Fig. 7 shows a waveform timing diagram for the structure of Fig. 6. As shown in Fig. 7, the computation clock runs at twice the system clock frequency, i.e., at 2f, while the data clock (used for data storage, registering, read-in and read-out) runs at the system clock frequency f. The feature map data, weight_0 and weight_1 are presented, at the first rising edge of the data clock (corresponding to the rising edge of computation clock cycle #2 in the figure), to the first input of multiplier 632 and to the two inputs of the 2-bit data selector 635 connected to the second input of multiplier 632. During computation clock cycle #2, the computing unit completes the multiply-add operation on the input feature map data and weight_0, and the multiply-add result is latched into register 634_0 at the rising edge of computation clock cycle #3. At that same rising edge, the feature map data, weight_0 and weight_1 are still present at the first input of multiplier 632 and at the two inputs of the 2-bit data selector 635, so the selection signal can route weight_1 to multiplier 632 to be multiplied with the feature map data; this multiply-add operation completes within computation clock cycle #3 and its result is latched into register 634_1 at the rising edge of computation clock cycle #4. Thus, by doubling the computation clock, the parallel computations for the two output channels (weight_0 and weight_1) are completed within one system clock cycle. Accumulator 634 may accumulate several multiply-add results over predetermined clock cycles and, depending on the control signals, feed the corresponding accumulated result_0 and accumulated result_1 into result buffer 640 as intermediate or final results.
In this way, by doubling the clock of the computation logic (with only minor corresponding modifications to the data access logic), the hardware architecture of Fig. 6 achieves a computational parallelism of twice the number of multipliers and obtains, within one system clock cycle, twice as many multiply-add results as there are multiply-add units.
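The gain of Fig. 6 over Fig. 5 can be made concrete with a simple count of multiply-accumulate operations per system clock cycle; the numbers below are illustrative only.

```python
def macs_per_system_cycle(num_multipliers, freq_multiple_N=1):
    # Each multiplier completes one multiply-accumulate per computation clock beat,
    # so clocking the computation logic at N times the system clock yields
    # N MACs per multiplier within a single system clock cycle.
    return num_multipliers * freq_multiple_N

same_frequency = macs_per_system_cycle(num_multipliers=3, freq_multiple_N=1)     # Fig. 5: 3
frequency_doubled = macs_per_system_cycle(num_multipliers=3, freq_multiple_N=2)  # Fig. 6: 6
```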
In practical applications, the parallel policy of the computing platform may be determined by many factors, so the first data to be multiplexed may be either feature map data or weight data. In one embodiment, the same feature map data may be multiplexed across different filters (weights) to achieve parallelism over the output channels, as shown in Fig. 6 above. In other embodiments, the same weight data may be multiplexed across different parts of the feature map to achieve multiplexing along the length and width of the feature map.
Similarly, the actual values of N and M may be determined by many factors in practice and realized in corresponding hardware. In a computing platform for neural network computation, M may be taken equal to 3, for example as shown in Fig. 6 above, whereby one multiply-add unit may comprise: three multipliers; an N-bit data selector coupled to the second input of each multiplier; an adder whose inputs are coupled to the outputs of the multipliers; and N accumulators connected in parallel to the output of the adder. This arrangement is particularly suitable for multiplexed computation on an input feature map with the three RGB channels, and also yields high computational efficiency for the subsequent multi-channel feature maps.
The value of N may be determined based at least on the hardware configuration of the platform and the parallelism policy of the high-parallelism computation. Theoretically, the larger N is, the greater the parallelism; in practice, however, the specific hardware implementation of the computing module sets an upper bound on N, and implementation complexity, power consumption, data-access capability and so on must also be weighed, so that N is set reasonably and realized in corresponding hardware.
A computing platform implemented method for neural networks in accordance with the present invention will be described below in conjunction with fig. 8. FIG. 8 illustrates a flow diagram of a computing platform implemented method according to one embodiment of the invention. The method may be implemented using the aforementioned computing platform and preferred embodiments thereof, such as computing platform 400 shown in fig. 4 or the specific architecture example shown in fig. 6.
In step S810, the feature map data and the weight data are read from the external memory into the input buffer. The reading may be performed by a computing platform as shown in Fig. 4 or Fig. 6; for example, input buffer 410 or 610 reads the feature map data and weight data directly from the mass storage unit. In the case of simple multiplexing, the feature map data and the weight data are read from the external memory into the input buffer in amounts such that one is N times the other (N times as much feature map data as weight data, or vice versa). When a more complex multiplexing relationship exists, the read ratio need not be N. In general, the data from a single read into the input buffer is sufficient for the parallel computing module to perform multiple parallel operations, so the input buffer reads new feature map data and weight data from the external memory only after every predetermined number of single parallel computing operations of the parallel computing module.
Subsequently, in step S820, the read controller module reads the feature map data and the weight data required for a single parallel computing operation, where the amount of feature map data read at a time is N times that of the weight data, or the amount of weight data read at a time is N times that of the feature map data, N being an integer greater than or equal to 2. Here, the read controllers making up the read controller module may respectively supply the data to be multiplexed (feature map or weight data) to the first inputs of the multipliers, and supply the data to be computed in parallel (weight or feature map data) to the inputs of the N-bit data selectors.
In step S830, the parallel computing module, running at N times the system clock frequency, performs within a single operation N multiplication operations that reuse the weight data or the feature map data, thereby completing the computation within one system clock cycle.
In one embodiment, the method further comprises: the result buffer receives the multiplication and addition result of the parallel computation module in a single operation at the system clock frequency, or the result buffer receives the accumulated multiplication and addition result of the parallel computation module in a preset number of operations at a preset interval of system clock cycles.
In one embodiment, the read controller module reading the feature map data and the weight data required for a single parallel computing operation includes: feeding the feature map data to the first inputs of the multipliers of the multiply-add unit according to the input-channel parallelism; and feeding weight data belonging to N different weights to the N inputs of the N-bit data selector whose output is connected to the second input of each multiplier, according to the output-channel parallelism. Parallel computation over N output channels is thereby realized.
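Continuing the earlier behavioral sketch, one way to stage the data for a single parallel operation in this N-output-channel arrangement is shown below; the buffer layout and indexing are assumptions made for illustration, not the patent's read-controller design.

```python
def stage_single_operation(feature_pixel, weights):
    """Arrange one read-controller transaction for a multiply-add unit.

    feature_pixel: [R, G, B] values fed to the multipliers' first inputs
                   (input-channel parallelism of 3)
    weights:       {filter_index: [w_R, w_G, w_B]} for N different filters,
                   fed to the N inputs of each multiplier's data selector
                   (output-channel parallelism of N)
    """
    first = list(feature_pixel)
    filter_ids = sorted(weights)
    second = [[weights[k][c] for k in filter_ids] for c in range(len(first))]
    return first, second

first, second = stage_single_operation(
    feature_pixel=[0.1, 0.2, 0.3],
    weights={0: [1.0, 0.5, 2.0], 1: [-1.0, 0.5, 0.0]},  # two output channels (N = 2)
)
```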
The computing platform of the present invention may be implemented as a neural network processor. In contrast to a generic computing platform (i.e., a host- or CPU-only platform), the present invention is directed to a neural network dedicated processor specifically designed to perform neural network computation. It will be understood by those skilled in the art that the term "neural network dedicated processor" as used in this application may also be referred to simply as "neural network processor" or "NN processor". Since deep learning is currently one of the most popular branches of neural network technology, the neural network dedicated processor may be implemented as a deep-learning dedicated processor or deep-learning processor. However, those skilled in the art will appreciate that neural networks have various technical branches, such as Deep Neural Networks (DNN) and CNN, and thus the neural network dedicated processor may also be implemented as a deep neural network dedicated processor (DNN processor) or a convolutional neural network dedicated processor (CNN processor). That is, neural network computing implementation techniques involving "deep learning processors", "deep neural network processors" or "convolutional neural network processors" in heterogeneous computing platforms are also within the scope of the present invention.
The DPU (Deep-learning Processing Unit) is a general acceleration platform for a Neural Network algorithm in artificial intelligence, and realizes reasoning based on a Convolutional Neural Network (CNN) by utilizing the characteristics of high parallelism and low power consumption of an FPGA. Herein, a DPU may be considered as one specific implementation of the above "deep learning processor" or "deep neural network processor" or "convolutional neural network processor" or "neural network processor". The description herein is primarily based on DPUs implemented via FPGAs using CNN architecture, but it will be understood by those skilled in the art that the principles of the present invention are equally applicable to neural network processors that reason for other neural networks through hardware architectures such as GPU.
The computing platform of the present invention may be implemented in a highly parallel computing system in which some or all of the functions for performing highly parallel computations, such as neural network computations, may be implemented by digital circuitry. In one embodiment, the computing system of the present invention may be implemented in a system on chip (SoC) that includes a general purpose processor, mass storage memory, and digital circuitry.
In one embodiment, the highly parallel computing platform required by the present system, such as a computing platform for convolutional neural networks, may be implemented by a digital circuit portion (e.g., FPGA, GPU, or ASIC) on the SoC. The computing platform or the parallel computing modules therein may be a hardware device implemented based on FPGA or GPU or ASIC or the like. Because CNNs perform parallel computations, implementing convolutional neural network computation functions via logic hardware, particularly FPGAs, has natural computational advantages and can achieve lower power consumption than software implementations.
In one embodiment, all the parameters related to CNN obtained by previous training and the feature map required to be classified, for example, may be stored in an external memory, and when the neural network computation is subsequently performed, the method as described above in conjunction with fig. 8 may be executed by a general-purpose processor to achieve high-performance parallel computation on a computing platform.
The high parallelism computing platform, system and computing implementation method according to the present invention have been described in detail above with reference to the accompanying drawings.
Exploiting the high data parallelism and reusability found in neural network computation, the invention designs a computation-logic frequency multiplication structure and, by taking advantage of the short delay of the computation logic, greatly improves the computational efficiency of the convolution computation module. The computing platform is simple in design, and the computational efficiency of neural network processing can be multiplied without complicated hardware arrangements or a significant increase in power consumption.
It is emphasized that although the present invention has been shown with an optimized hardware architecture primarily in connection with the computation of a convolutional neural network, it will be understood by those skilled in the art that the computing platform of the present invention is also applicable to other high-parallelism computations, and is particularly applicable to a parallel computing application scenario in which the input data type is relatively single and the data multiplexing is high.
In addition, although the "first data" and the "second data" are used in the present invention, the "first" and the "second" are only for distinguishing the data types, and do not impose any limitation or suggestion on the input order, importance, and the like thereof.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the above-mentioned steps defined in the above-mentioned method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill in the art would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or improvements to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
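As an informal aid to reading the claims that follow, which recite a multiply-add unit built from M multipliers feeding one adder whose output is shared by N accumulators, the following minimal Python sketch models that unit's behavior under assumed names and data layout; it is an analogy, not the hardware implementation.

```python
# Behavioral sketch of a multiply-add unit (assumed names and layout;
# an illustration of the scheme, not the hardware itself).

def multiply_add_unit(feature_groups, weight_groups, n_outputs):
    """Accumulate N output channels with M multipliers and one adder.

    feature_groups: list of M-element feature map vectors, one per system
                    clock cycle (M = number of multipliers).
    weight_groups:  per-cycle weight tables; weight_groups[t][k] is the
                    M-element weight vector of output channel k.
    n_outputs:      N, the number of output channels time-shared per cycle.
    """
    accumulators = [0] * n_outputs                 # N accumulators after the adder
    for features, weights in zip(feature_groups, weight_groups):
        # One system clock cycle: N micro-cycles at N times the clock.
        for k in range(n_outputs):                 # index set by the N-bit selector
            products = [f * w for f, w in zip(features, weights[k])]  # M multipliers
            accumulators[k] += sum(products)       # adder feeding accumulator k
    return accumulators                            # later flushed to the result buffer


# Example with M = 3 multipliers and N = 2 output channels over 2 cycles.
features = [[1, 2, 3], [4, 5, 6]]
weights = [
    [[1, 1, 1], [2, 0, 1]],   # cycle 0: weights of output channels 0 and 1
    [[0, 1, 0], [1, 1, 1]],   # cycle 1
]
print(multiply_add_unit(features, weights, n_outputs=2))  # [11, 20]
```

Over successive system clock cycles the N accumulators hold partial sums for N output channels, which are then written to the result buffer together, corresponding to the accumulator alternative recited below.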

Claims (19)

1. A high-parallelism computing platform, comprising:
an input buffer operating at a system clock frequency for buffering first data and second data read from an external memory for high-parallelism computation;
a read controller module operating at the system clock frequency for reading, from the input buffer, the first data and the second data required by the parallel computation module to perform a single operation, wherein the number of pieces of second data read is N times the number of pieces of first data, and N is an integer greater than or equal to 2;
a parallel computation module operating at N times the system clock frequency for performing a specific computation on the first data and the second data in a single operation, wherein each piece of the first data is multiplexed with N pieces of the second data to perform the computation within one system clock cycle; and
a result buffer operating at the system clock frequency for buffering the computation results output by the parallel computation module.
2. The platform of claim 1, wherein the parallel computation module comprises a plurality of multipliers operating at N times the system clock frequency to perform the multiplication of the first data and the second data, wherein, for each multiplier, the input to its first input is the first data and the input to its second input is the second data selected via an N-bit data selector.
3. The platform of claim 2, wherein the read controller module comprises a plurality of read controllers to read the first data and the second data from the input buffer into the first input of the corresponding multiplier and into the N inputs of the N-bit data selector, respectively.
4. The platform of claim 2, wherein M of said multipliers have outputs coupled to inputs of an adder to form a multiply-add unit, wherein M is an integer greater than or equal to 2.
5. The platform of claim 4, wherein each multiply-add unit further comprises N registers, each coupled to the output of the adder, for respectively registering one of the N multiply-add results within one system clock cycle and outputting the N multiply-add results of that system clock cycle to the result buffer, or
each multiply-add unit further comprises N accumulators, each connected to the output of the adder, for respectively accumulating one of N multiply-accumulate results over a predetermined number of system clock cycles and buffering the N multiply-accumulate results into the result buffer together at a predetermined system clock period.
6. The platform of claim 5, wherein the control signals for the N-bit data selector and for the N registers or accumulators are supplied at N times the system clock frequency according to which of the N pieces of second data is currently to be computed.
7. The platform of claim 5, wherein the high-parallelism computing platform is a neural network computing platform.
8. The platform of claim 7, wherein the first data is multiplexed feature map data and the second data is weight data, or
the first data is multiplexed weight data and the second data is feature map data.
9. The platform of claim 7, wherein M is equal to 3, and the multiply-add unit comprises:
three multipliers;
an N-bit data selector connected to the second input of each multiplier;
an adder having inputs coupled to the outputs of the multipliers; and
N accumulators connected in parallel to the output of the adder.
10. The platform of claim 1, wherein the parallel computing module is implemented at least in part by an FPGA, a GPU, or an ASIC.
11. The platform of claim 1, wherein the input buffer and the result buffer are implemented by a dynamically configurable on-chip cache.
12. The platform of claim 1, wherein a value of N is determined based at least on a hardware configuration of the platform and a parallelism policy of the high-parallelism computation.
13. A computing platform-implemented method for a neural network, comprising:
reading feature map data and weight data from the external memory into the input buffer using the computing platform of any one of claims 1-12;
the read controller module reading the feature map data and the weight data required for a single parallel computing operation, wherein the amount of feature map data read at a time is N times that of the weight data, or the amount of weight data read at a time is N times that of the feature map data, and N is an integer greater than or equal to 2; and
the parallel computation module performing, at N times the system clock frequency, multiplication operations in which the weight data or the feature map data is multiplexed N times within a single operation, thereby completing the computation within one system clock cycle.
14. The method of claim 13, further comprising:
the result buffer receiving, at the system clock frequency, the multiply-add results of a single operation of the parallel computation module, or
the result buffer receiving, at predetermined intervals of a predetermined system clock period, the multiply-add results accumulated by the parallel computation module over a predetermined number of operations.
15. The method of claim 13, wherein reading feature map data and weight data from the external memory into the input buffer using the computing platform of any one of claims 1-12 comprises:
reading the feature map data and the weight data from the external memory into the input buffer according to data amounts of N times the feature map data or N times the weight data.
16. The method of claim 15, wherein the input buffer reads new feature map data and weight data from the external memory after every predetermined number of single parallel computing operations by the parallel computation module.
17. A computing platform-implemented method for a neural network, comprising:
reading feature map data and weight data from the external memory into the input buffer using the computing platform of any one of claims 4-9;
the read controller module reading the feature map data and the weight data required for a single parallel computing operation, wherein the amount of feature map data read at a time is N times that of the weight data, or the amount of weight data read at a time is N times that of the feature map data, and N is an integer greater than or equal to 2; and
the parallel computation module performing, at N times the system clock frequency, multiplication operations in which the weight data or the feature map data is multiplexed N times within a single operation so as to complete the computation within one system clock cycle,
wherein the read controller module reading the feature map data and the weight data required for the single parallel computing operation comprises:
inputting the feature map data into the first inputs of the multipliers of the multiply-add unit according to the input channel parallelism; and
sending the weight data belonging to N different weights into the N inputs of the N-bit data selector whose output is connected to the second input of each multiplier, according to the output channel parallelism.
18. A highly parallel computing system comprising:
the computing platform of any one of claims 1-12;
a mass storage memory external to the computing platform; and
a processor coupled to the computing platform and the memory, for performing the method of any of claims 13-17.
19. The system of claim 18, wherein the system is implemented at least in part by an FPGA, a GPU, or an ASIC.
CN201810765894.1A 2018-07-12 2018-07-12 High-parallelism computing platform, system and computing implementation method Active CN110716751B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810765894.1A CN110716751B (en) 2018-07-12 2018-07-12 High-parallelism computing platform, system and computing implementation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810765894.1A CN110716751B (en) 2018-07-12 2018-07-12 High-parallelism computing platform, system and computing implementation method

Publications (2)

Publication Number Publication Date
CN110716751A CN110716751A (en) 2020-01-21
CN110716751B true CN110716751B (en) 2022-10-18

Family

ID=69209152

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810765894.1A Active CN110716751B (en) 2018-07-12 2018-07-12 High-parallelism computing platform, system and computing implementation method

Country Status (1)

Country Link
CN (1) CN110716751B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111552652B (en) * 2020-07-13 2020-11-17 深圳鲲云信息科技有限公司 Data processing method and device based on artificial intelligence chip and storage medium
CN114757328A (en) * 2021-01-08 2022-07-15 中国科学院微电子研究所 Convolution operation method and device of convolutional neural network

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9239984B2 (en) * 2012-12-21 2016-01-19 International Business Machines Corporation Time-division multiplexed neurosynaptic module with implicit memory addressing for implementing a neural network
CN104375806B (en) * 2014-11-19 2015-12-09 北京应用物理与计算数学研究所 A kind of parallel computation component, method and corresponding parallel software development method and system
CN106203621B (en) * 2016-07-11 2019-04-30 北京深鉴智能科技有限公司 The processor calculated for convolutional neural networks
CN106875011B (en) * 2017-01-12 2020-04-17 南京风兴科技有限公司 Hardware architecture of binary weight convolution neural network accelerator and calculation flow thereof
CN107085562B (en) * 2017-03-23 2020-11-03 中国科学院计算技术研究所 Neural network processor based on efficient multiplexing data stream and design method
CN108182471B (en) * 2018-01-24 2022-02-15 上海岳芯电子科技有限公司 Convolutional neural network reasoning accelerator and method
CN108241890B (en) * 2018-01-29 2021-11-23 清华大学 Reconfigurable neural network acceleration method and architecture

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103001652A (en) * 2011-09-13 2013-03-27 株式会社日立制作所 Multiplexing transmission system, receiver apparatus and module, transmitter apparatus for multiplexing transmission
CN105320495A (en) * 2014-07-22 2016-02-10 英特尔公司 Weight-shifting mechanism for convolutional neural network
CN107544616A (en) * 2016-06-28 2018-01-05 阿尔特拉公司 The method and apparatus that 2X frequency clocks for phase alignment generate

Also Published As

Publication number Publication date
CN110716751A (en) 2020-01-21

Similar Documents

Publication Publication Date Title
CN110050267B (en) System and method for data management
US11734006B2 (en) Deep vision processor
Liang et al. FP-BNN: Binarized neural network on FPGA
KR102557589B1 (en) Accelerated mathematical engine
US10698657B2 (en) Hardware accelerator for compressed RNN on FPGA
US10394929B2 (en) Adaptive execution engine for convolution computing systems
US20210357735A1 (en) Split accumulator for convolutional neural network accelerator
Liu et al. An FPGA-based processor for training convolutional neural networks
CN110766127B (en) Neural network computing special circuit and related computing platform and implementation method thereof
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
Que et al. Optimizing reconfigurable recurrent neural networks
Fan et al. Reconfigurable acceleration of 3D-CNNs for human action recognition with block floating-point representation
CN110716751B (en) High-parallelism computing platform, system and computing implementation method
Al Maashri et al. A hardware architecture for accelerating neuromorphic vision algorithms
CN110659014B (en) Multiplier and neural network computing platform
Hwang et al. An efficient FPGA-based architecture for convolutional neural networks
CN112836793B (en) Floating point separable convolution calculation accelerating device, system and image processing method
Chidambaram et al. Accelerating the inference phase in ternary convolutional neural networks using configurable processors
CN110765413A (en) Matrix summation structure and neural network computing platform
Fischer et al. BinArray: A scalable hardware accelerator for binary approximated CNNs
Özkilbaç et al. Real-Time Fixed-Point Hardware Accelerator of Convolutional Neural Network on FPGA Based
Wisayataksin et al. A Programmable Artificial Neural Network Coprocessor for Handwritten Digit Recognition
Sharma et al. Energy Efficient Hardware Implementation of 2-D Convolution for Convolutional Neural Network
Shen Accelerating CNN on FPGA: An Implementation of MobileNet on FPGA
Doifode et al. A survey paper on acceleration of convolutional neural network using field programmable gate arrays

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant