CN110321997B - High-parallelism computing platform, system and computing implementation method


Info

Publication number
CN110321997B
Authority
CN
China
Prior art keywords
data
read
cache
convolution
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810277338.XA
Other languages
Chinese (zh)
Other versions
CN110321997A (en)
Inventor
王俊斌
隋凌志
方绍峡
于谦
单羿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xilinx Inc
Original Assignee
Xilinx Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xilinx Inc filed Critical Xilinx Inc
Priority to CN201810277338.XA
Publication of CN110321997A
Application granted
Publication of CN110321997B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0893Caches characterised by their organisation or structure
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0893Caches characterised by their organisation or structure
    • G06F12/0897Caches characterised by their organisation or structure with two or more cache hierarchy levels
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a high-parallelism computing platform, a high-parallelism computing system and a related implementation method. The computing platform includes: a first-level cache for caching calculation data read from an external memory; a plurality of read controllers, each of which reads, from any position of the first-level cache, the calculation data (or a part thereof) required by a single operation of the parallel computing module; and a parallel computing module for performing high-parallelism computing operations on the calculation data read by the plurality of read controllers. Data can thereby be sufficiently multiplexed to improve processing efficiency. The invention further improves overall data-processing efficiency by introducing a multi-level cache, and further improves processing speed through data filtering.

Description

High-parallelism computing platform, system and computing implementation method
Technical Field
The invention relates to the field of hardware architecture, and in particular to a high-parallelism computing platform, a high-parallelism computing system and a computing implementation method.
Background
Neural networks have become a research hotspot in the field of image recognition in recent years. Trained neural network models can be used for image classification, object recognition, saliency detection and other tasks. In recent years, neural network models have grown steadily in computational scale and complexity, and traditional CPU platforms can no longer meet practical requirements. Designing neural network accelerators on high-parallelism heterogeneous computing platforms such as FPGAs, GPUs and ASICs has therefore become a new research hotspot. Compared with a GPU platform, an FPGA can achieve a higher computational energy-efficiency ratio; at the same time, an FPGA allows rapid iteration, and its hardware reconfigurability is better suited to the needs of fast-moving algorithm development.
When computations are performed on a highly parallel computing platform such as an FPGA or a GPU, the execution time of the parallel computation itself is short compared with the time needed for data access to the external memory, so the memory bandwidth becomes the bottleneck that limits processing speed. How to fully exploit the available bandwidth to increase parallelism is therefore an important issue for any high-parallelism computing platform.
Therefore, there is still a need for a scheme that can optimize high-parallelism computation.
Disclosure of Invention
To address at least one of the problems described above, the present invention proposes a new computing platform architecture. The parallel processing capability of the platform is improved by providing a cache module whose contents can be read in parallel from arbitrary positions, together with read controllers matched to that cache module. By further arranging a multi-level cache, the bandwidth can be fully utilized to reach the maximum data processing speed. By introducing a filtering mechanism, meaningless null values in sparse data are removed, further improving the computing efficiency of the parallel computing module. Overall, the parallel processing capability of the computing platform is thus greatly improved.
According to one aspect of the present invention, there is provided a high-parallelism computing platform, comprising: a first-level cache for caching calculation data read from an external memory; a plurality of read controllers, each of which reads, from any position of the first-level cache, the calculation data (or a part thereof) required by a single operation of the parallel computing module; and a parallel computing module for simultaneously performing high-parallelism computing operations on the calculation data read by the plurality of read controllers. Providing a cache module that supports simultaneous parallel reads from arbitrary positions, together with matched read controllers, further improves the parallel processing capability of the platform. The first-level cache may be implemented by a register file.
Preferably, the computing platform may further comprise a filtering module connected between the plurality of read controllers and the parallel computing module, for filtering out null values in the calculation data read by the read controllers and/or non-null data whose corresponding calculation result would be null. Squeezing out such "bubbles" (data that do not affect the computation result) before the calculation data, especially sparse data, are fed into the parallel computing module improves the effective computing efficiency of the module itself.
Preferably, the computing platform may further comprise a second-level cache that reads calculation data directly from the external memory and stores it sequentially, and that, based at least on a read-controller state parameter indicating the data-consumption status in the first-level cache, performs at least one of: writing calculation data into the first-level cache; and reading and buffering new calculation data from the external memory. By thus arranging a multi-level cache, the bandwidth can be fully utilized to reach the maximum data processing speed. Preferably, the computing platform may further comprise a read-controller status register for storing the read-controller state parameter, and a cache update controller that generates cache update instructions based on that parameter, so that the multiple cache levels cooperate well at the structural level.
Preferably, the computing platform may further comprise a calculation-result cache module connected to the parallel computing module, for caching the result data of the computing operations performed by the parallel computing module and storing the result data back to the external memory.
According to another aspect of the present invention, there is provided a high-parallelism computing platform for convolutional neural networks, comprising: a first-level cache for caching feature map data and weight data read from an external memory; a plurality of read controllers, each of which reads, from any position of the first-level cache, the feature map data and/or weight data required by a single convolution calculation operation; and a parallel computing module for simultaneously performing high-parallelism convolution operations on the feature map data and weight data read by the plurality of read controllers. The first-level cache may be implemented by a register file.
Preferably, the first-level cache may include a feature map data pool for caching the feature map data and a weight data pool for caching the weight data, and the plurality of read controllers includes 2M read controllers, where a first group of M read controllers is configured to read M feature map convolution windows simultaneously, and a second group of M read controllers is configured to read M weight convolution windows simultaneously, so as to implement a feature map convolution operation with a degree of parallelism M in the parallel computation module, where M is an integer greater than 2.
Preferably, the computing platform may further comprise a filtering module connected between the plurality of read controllers and the parallel computing module, for filtering out null values in the feature map data and/or weight data read by the read controllers, and/or non-null values whose corresponding convolution result would be null. Preferably, the filtering module retains only those values for which the same position in a feature map convolution window and its corresponding weight convolution window are both non-zero, and filters out all other values.
Specifically, the filtering module comprises a level-0 cache and M asynchronous first-in first-out (FIFO) queues. The level-0 cache comprises, or can be divided into, 2M convolution-window cache pools. The 2M read controllers store M feature map convolution windows and M weight convolution windows into these 2M cache pools; for each feature map convolution window and its corresponding weight convolution window, every position at which at least one of the two values is zero is filtered out, and the retained values are stored into the M asynchronous FIFO queues, respectively. The parallelism M is determined based at least on the bandwidth for reading data from the external memory and the size of the convolution window.
Preferably, the computing platform may further comprise a second-level cache, connected to the first-level cache, that reads the feature map data and weight data directly from the external memory. The capacity of the second-level cache is larger than that of the first-level cache; the second-level cache stores the read feature map data and weight data separately and, based on a read-controller state parameter indicating the data-consumption status in the first-level cache, performs at least one of: writing feature map data and weight data into the first-level cache; and reading new feature map data and weight data from the external memory. The data-consumption parameter may be the position coordinate, on the feature map, of the feature map data currently being read by the read controllers.
Preferably, the computing platform may further comprise: a read controller status register for storing the read controller status parameter; and a cache update controller that generates a cache update instruction based on the read controller state parameter.
Preferably, the computing platform may further comprise a calculation-result cache module connected to the parallel computing module, for caching the convolution result data produced by the parallel computing module and storing the result data back to the external memory.
According to yet another aspect of the present invention, there is provided a computing-platform implementation method for convolutional neural network inference, comprising: using a computing platform as described in any one of the above, reading feature map data and weight data from the external memory into the first-level cache; the plurality of read controllers respectively reading the feature map data and/or weight data required by a plurality of single convolution calculation operations; and the parallel computing module simultaneously performing high-parallelism convolution operations on the groups of feature map data and weight data read by the plurality of read controllers.
The feature map data and the weight data read by the plurality of read controllers have a high degree of data multiplexing, the data multiplexing including at least one of: multiplexing of feature map data caused by the length and width of the convolution window being greater than the step size; and multiplexing of weight data resulting from performing convolution simultaneously on multiple locations in the feature map. By sufficiently multiplexing data, the influence of the bandwidth limitation of the external storage on the processing speed can be further reduced.
Preferably, the implementation method may further include: a second-level cache connected to the first-level cache directly reading the feature map data and weight data from the external memory, and, based on the position coordinate on the feature map of the feature map data being read by the read controllers, writing feature map data and weight data into the first-level cache and reading new feature map data and weight data from the external memory, wherein the time interval at which the second-level cache writes data into the first-level cache is less than or equal to the time interval at which the second-level cache reads data from the external memory. Both bandwidth and processing speed can thereby be fully utilized.
Reading, by the plurality of read controllers, the feature map data and/or weight data required by the plurality of single convolution calculation operations comprises: reading the corresponding feature map convolution windows and weight convolution windows according to the size of the convolution kernel to be used in the convolution operation. Accordingly, the method may further comprise: filtering out, from the feature map convolution windows and the weight convolution windows, the values at positions that would contribute zero to the convolution result, and sending the filtered feature map and weight convolution windows to the parallel computing module.
The time interval at which the plurality of read controllers read the feature map data and weight data is equal to the time the parallel computing module needs to perform one pass of M groups of computation, and is smaller than the time interval at which the second-level cache writes data into the first-level cache. The overall computing efficiency of the platform can thereby be further improved.
According to still another aspect of the present invention, there is provided a highly parallel computing system including: a computing platform as described in any of the above; a mass memory located outside the computing platform; and a processor coupled to the computing platform and the memory, for performing the implementation method described in any of the above.
In one embodiment, the parallel processing module is implemented at least in part by an FPGA, GPU, or ASIC.
Through these improvements to the hardware architecture, the computing platform provided by the invention is better suited to high-parallelism computation, especially convolutional neural network computation, and the overall processing efficiency of the platform is improved by fully utilizing the bandwidth and increasing the computing parallelism. A multi-level cache structure is designed around the deterministic data-dependency pattern of convolutional neural networks, and the efficiency of supplying data to the convolution computation module is greatly improved by exploiting bandwidth, multiple clock frequencies and data multiplexing. In addition, the invention exploits the sparsity of the convolutional neural network's weights and feature maps, greatly improving operational efficiency by filtering out null values. The computing platform is simple in design and can obtain the maximum computing-efficiency benefit from the sparsity of neural network parameters without the cost of complex encoding/decoding or of storing sparse-data bookkeeping information.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
FIG. 1 shows the series of sequentially executed layers of a typical CNN.
FIG. 2 shows a schematic diagram of a computing platform, according to one embodiment of the invention.
Fig. 3 shows an example of a convolution operation.
FIG. 4 shows a schematic diagram of a computing platform, according to one embodiment of the invention.
Fig. 5 shows an example of filtering data for convolution calculations in CNN.
FIG. 6 shows a schematic diagram of a computing platform, according to one embodiment of the invention.
Fig. 7 shows a detailed operation diagram of a high parallelism computing platform for CNN according to an embodiment of the present invention.
FIG. 8 illustrates a flow diagram of a computing platform implemented method according to one embodiment of the invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Highly parallel computing has long been widely applied in scientific computing, weather simulation, biological simulation, molecular mechanics modelling, aircraft manufacturing, military simulation and other fields. In recent years, with the continuing surge of interest in deep learning, high-parallelism computing implementations for neural networks, especially Convolutional Neural Networks (CNNs), have attracted increasing attention.
Existing general-purpose processors (CPUs) must be highly versatile to handle a variety of data types, and their logical decision-making introduces a large amount of branching and interrupt handling. This makes the internal structure of a CPU exceptionally complex and ill-suited to operations on large-scale data of highly uniform type with no mutual dependencies. Designing high-parallelism computing platforms, especially neural network accelerators, on heterogeneous platforms such as FPGAs, GPUs and ASICs has therefore become a new research hotspot. Compared with a GPU platform, an FPGA can achieve a higher computational energy-efficiency ratio; at the same time, an FPGA allows rapid iteration, and its hardware reconfigurability is better suited to the needs of fast-moving algorithm development.
The invention provides a high-parallelism computing platform with a novel hardware architecture, suitable for parallel operations on large-scale data of relatively uniform type with no mutual dependencies, and particularly suitable for convolutional neural networks whose input feature maps and weight parameters are sparse. The parallel computing module of the platform, or at least parts of it, is preferably implemented on an FPGA.
Although the computing platform solution of the present invention will be described below mainly in conjunction with parallel computing for convolutional neural networks, it should be understood by those skilled in the art that the hardware architecture of the present invention is suitable for various high-parallelism computing scenarios, and is particularly suitable for application scenarios with high data reuse rate and sparsity.
[ CNN basic concept ]
CNNs achieve state-of-the-art performance in a wide range of vision-related tasks. To aid in understanding the CNN-based computing operations analyzed in this application, the basic concepts of CNNs are first introduced with reference to existing CNN models.
As shown in fig. 1, a typical CNN consists of a series of layers that run in order.
The parameters of a CNN model are called "weights". The first layer of a CNN reads the input image and outputs a series of feature maps. Each subsequent layer reads the feature maps generated by the previous layer and outputs new feature maps. A final classifier outputs the probability that the input image belongs to each class. The CONV layer (convolutional layer) and the FC layer (fully connected layer) are the two basic layer types in a CNN; a CONV layer is usually followed by a pooling layer.
In the present application, for one CNN layer, $f_j^{in}$ denotes the $j$-th input feature map, $f_i^{out}$ denotes the $i$-th output feature map, and $b_i$ denotes the bias term of the $i$-th output feature map.
For a CONV layer, $n_{in}$ and $n_{out}$ represent the number of input and output feature maps, respectively.
For an FC layer, $n_{in}$ and $n_{out}$ represent the lengths of the input and output feature vectors, respectively.
Definition of CONV layers (Convolutional layers): the CONV layer takes a series of feature maps as input and obtains an output feature map by convolution kernel convolution.
A non-linear layer, i.e. a non-linear excitation function, usually connected to the CONV layer, is applied to each element in the output signature. The excitation function used is typically a ReLU function, which layer is also commonly referred to as the ReLU layer.
The CONV layer can be represented by expression (1):

$f_i^{out} = \sum_{j=1}^{n_{in}} f_j^{in} \otimes g_{i,j} + b_i$   (1)

where $g_{i,j}$ is the convolution kernel applied to the $j$-th input feature map and the $i$-th output feature map.

Definition of FC layers (fully connected layers): the FC layer applies a linear transformation to its input feature vector:

$f^{out} = W f^{in} + b$   (2)

where $W$ is an $n_{out} \times n_{in}$ transform matrix and $b$ is the bias term. Note that the input to an FC layer is not a combination of several two-dimensional feature maps but a single feature vector; consequently, in expression (2) the parameters $n_{in}$ and $n_{out}$ correspond to the lengths of the input and output feature vectors.
Pooling layer: usually connected to the CONV layer, it outputs the maximum or average value of each sub-area of each feature map. Max pooling can be represented by expression (3):

$f_{i,j}^{out} = \max_{0 \le u,v < p} f_{i \cdot p + u,\; j \cdot p + v}^{in}$   (3)

where $p$ is the size of the pooling kernel. This non-linear "down-sampling" not only reduces the feature map size and the amount of computation for the next layer, but also provides a degree of translation invariance.
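As a purely illustrative reference for expressions (1)-(3), the following NumPy sketch evaluates a CONV layer, an FC layer and max pooling directly from the definitions above; the function names, array shapes and example sizes are assumptions made here for clarity and are not part of the claimed hardware.

import numpy as np

def conv_layer(fin, g, b):
    # Expression (1): fout[i] = sum_j fin[j] (convolved with) g[i, j] + b[i].
    # fin: (n_in, H, W) input feature maps; g: (n_out, n_in, K, K) kernels; b: (n_out,) biases.
    n_out, n_in, K, _ = g.shape
    H, W = fin.shape[1:]
    fout = np.zeros((n_out, H - K + 1, W - K + 1))   # stride 1, no padding
    for i in range(n_out):
        for j in range(n_in):
            for y in range(H - K + 1):
                for x in range(W - K + 1):
                    fout[i, y, x] += np.sum(fin[j, y:y + K, x:x + K] * g[i, j])
        fout[i] += b[i]
    return fout

def fc_layer(fin, W, b):
    # Expression (2): fout = W @ fin + b, with W of shape (n_out, n_in).
    return W @ fin + b

def max_pool(fmap, p):
    # Expression (3): maximum over each non-overlapping p x p sub-area of one feature map.
    H, W = fmap.shape
    return fmap[:H - H % p, :W - W % p].reshape(H // p, p, W // p, p).max(axis=(1, 3))

x = np.random.rand(3, 8, 8)          # n_in = 3 feature maps of size 8x8 (assumed sizes)
k = np.random.rand(4, 3, 3, 3)       # n_out = 4, n_in = 3, 3x3 kernels
y = conv_layer(x, k, np.zeros(4))    # y.shape == (4, 6, 6)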
In the forward-inference process a CNN can be used, for example, for image classification. CNN computation involves a large number of convolution operations that have no dependencies on one another, which makes it particularly suitable for implementation on a high-parallelism computing platform.
[ basic Structure of computing platform of the invention ]
To handle computation with high parallelism, the invention provides a brand-new hardware architecture. FIG. 2 shows a schematic diagram of a computing platform according to one embodiment of the invention. As shown in fig. 2, the high-parallelism computing platform 200 includes a first-level cache 210, a plurality of read controllers 220, and a parallel computing module 230.
The first-level cache 210 caches the calculation data read from the external memory and has the property that its contents at arbitrary locations can be read simultaneously. The external memory in the figure is implemented as a mass storage unit such as DDR. In one embodiment, the first-level cache 210 is composed of a register file to provide this arbitrary-location simultaneous-read capability. When the computing platform is used to perform CNN calculations, the data read may include feature map data and weight data.
Each of the read controllers 220 is capable of reading the computational data, or portions thereof, required for a single operation of the parallel computation module from a corresponding location of the level one cache 210. The parallel computing module 230 is configured to perform a high-parallelism computing operation on the computing data read by the plurality of read controllers.
Here, the parallelism of the parallel computing module 230 is assumed to be M. A "single operation of the parallel computing module" refers to one of the operations the module performs within a pass of parallelism M; that is, each pass comprises M single operations.
The number of read controllers, N, may be related to M in a way that depends on the particular implementation. In one embodiment, each read controller reads from the first-level cache 210 all the calculation data required by the parallel computing module 230 to perform one single operation of the M-parallelism pass, in which case the number of read controllers N equals M. In another embodiment, each read controller reads from the first-level cache 210 only part of the calculation data required for one single operation. For example, for CNN operation, 2M read controllers (i.e., N = 2M) may be provided, where M read controllers read M feature map convolution windows and the other M read controllers simultaneously read M weight convolution windows, so as to implement a feature map convolution operation with parallelism M in the parallel computing module 230. In this case, the parallel computing module 230 may be implemented as M mutually independent multiply-accumulate units. In other embodiments, each read controller may instead read both the feature map convolution window and the weight convolution window required for one convolution operation.
The hardware architecture shown in fig. 2 is particularly suitable for parallel computation with a high data-multiplexing rate, such as the convolution computation that dominates CNNs. For ease of understanding, fig. 3 shows an example of a convolution operation. As shown in fig. 3, a 3x3 convolution kernel is applied to a 5x5 input feature map with a stride of 1. The left side of the figure shows the first convolution calculation, the middle shows the second, and so on. After 9 convolution calculations, the convolved feature map on the right side of fig. 3 is obtained.
Since there is no dependency among these 9 convolution calculations, they can all be completed in one pass of the parallel computing module 230 (the parallelism M can in general reach the order of thousands). In one embodiment of the present invention, to complete the 9 convolution calculations simultaneously, 18 read controllers may be used for data reading. The first group of 9 read controllers reads the feature map convolution windows, i.e. the 9 3x3 windows obtained by sliding one position at a time from the top left to the bottom right of the 5x5 input feature map. Since the stride is 1, 6 of the 9 values in each pair of adjacent convolution windows are multiplexed, as shown in the left and middle parts of fig. 3. The second group of 9 read controllers reads the weight convolution windows, i.e. 9 identical 3x3 windows (since a single convolution kernel is used). Thus, only 34 (5x5 + 3x3) values need to be stored in the first-level cache 210, and the read control of the read controllers realizes the parallel computation in the computing module. It should be understood that the 5x5 input feature map is used here only for illustration; in practice, input feature maps are typically much larger.
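The data reuse in this 5x5 / 3x3 / stride-1 example can be checked with the short sketch below. It is a software illustration only; the variable names and the use of NumPy are assumptions of this illustration, not of the patent. It shows that the 9 windows read by the first group of read controllers are all slices of the same 25 buffered values and that 9 independent multiply-accumulates reproduce an ordinary convolution.

import numpy as np

np.random.seed(0)
fmap = np.random.randint(0, 4, size=(5, 5))     # 25 values held in the first-level cache
kernel = np.random.randint(-2, 3, size=(3, 3))  # 9 more values held there (34 in total)

# First group of M = 9 "read controllers": each reads one 3x3 feature-map window.
# Adjacent windows share 6 of their 9 values because the stride (1) is smaller than
# the window size (3), so no value has to be fetched from external memory twice.
feature_windows = [fmap[y:y + 3, x:x + 3] for y in range(3) for x in range(3)]

# Second group of M = 9 "read controllers": each reads the (identical) weight window.
weight_windows = [kernel] * 9

# Parallel computing module: M independent multiply-accumulate units.
parallel_out = np.array([np.sum(f * w) for f, w in zip(feature_windows, weight_windows)]).reshape(3, 3)

# Reference: ordinary sliding-window convolution over the same data.
reference = np.array([[np.sum(fmap[y:y + 3, x:x + 3] * kernel) for x in range(3)] for y in range(3)])
assert np.array_equal(parallel_out, reference)
print(parallel_out)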
In addition, for the highly sparse data frequently encountered in highly parallel computing, the computing efficiency of the parallel computing module can be improved by adding a filtering module. FIG. 4 shows a schematic diagram of a computing platform according to one embodiment of the invention. In addition to the first-level cache 410, the plurality of read controllers 420 and the parallel computing module 430, the high-parallelism computing platform 400 shown in fig. 4 includes a filtering module 440 connected between the read controllers 420 and the parallel computing module 430 and configured to filter out null values in the calculation data read by the read controllers and/or non-null data whose corresponding calculation result would be null. In other words, the filtering module 440 filters out values that are meaningless to the subsequent computation in the parallel computing module 430.
In the case of CNN calculation, the filtering module 440 may filter out null values in the feature map data and/or weight data read by the read controllers, and/or non-null values whose corresponding convolution result would be null. Fig. 5 shows an example of filtering data for convolution calculation in a CNN. As shown, by the definition of convolution, a pair of values affects the output only when the values at the same position of the feature map convolution window and the weight convolution window are both non-zero. Thus, the filtering module 440 retains only the values that are non-zero at the same position in a feature map convolution window and its corresponding weight convolution window, and filters out all other values.
In one embodiment, the filtering module 440 may include a level-0 cache and a plurality of asynchronous first-in-first-out queues (FIFOs). In a preferred embodiment, the level-0 cache comprises, or can be divided into, 2M convolution-window cache pools, and the asynchronous FIFOs are implemented as M FIFOs. The aforementioned 2M read controllers may be configured to store the M feature map convolution windows and the M weight convolution windows into the 2M convolution-window cache pools of the level-0 cache, to filter out, for each feature map convolution window and its corresponding weight convolution window, every position at which at least one of the two values is zero, and to input the resulting densely packed, sparsity-removed data (as shown in fig. 5) into the M asynchronous FIFOs at a high frequency. The sparsity-processed data is then fed from the FIFOs to the parallel computing device. In an application to convolution calculation, the parallel computing device may be implemented as a convolution computation kernel comprising M mutually independent multiply-accumulate modules. Clearly, with the filtering module introduced, the convolution computation kernel can execute the complete convolution calculation (including M simultaneous convolution calculations) more quickly.
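A minimal software sketch of this filtering step is given below, with a Python deque standing in for one asynchronous FIFO. The function and variable names are illustrative assumptions; the sketch only demonstrates that discarding positions where either operand is zero leaves the accumulated result unchanged.

import numpy as np
from collections import deque

def squeeze_bubbles(feature_window, weight_window):
    # Keep only the (feature, weight) pairs that are both non-zero at the same
    # position; every discarded pair would contribute 0 to the accumulation.
    return [(f, w) for f, w in zip(feature_window.ravel(), weight_window.ravel())
            if f != 0 and w != 0]

# One feature-map convolution window and its weight window, as held in the level-0 cache.
feature_window = np.array([[0, 2, 0], [1, 0, 3], [0, 0, 4]])
weight_window  = np.array([[5, 0, 1], [2, 6, 0], [0, 0, 7]])

kept = squeeze_bubbles(feature_window, weight_window)
fifo = deque(kept)                   # stand-in for one asynchronous FIFO

acc = 0                              # one multiply-accumulate unit consumes the packed pairs
while fifo:
    f, w = fifo.popleft()
    acc += f * w

assert acc == int(np.sum(feature_window * weight_window))
print(f"{len(kept)} of 9 positions kept, accumulated result {acc}")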
As described above, the first-level cache must allow its contents at arbitrary positions to be read by multiple read controllers at the same time; it may, for example, be built from a register file to provide this function. However, implementing this capability incurs considerable power consumption and logic overhead, so the first-level cache should not be too large. Its capacity should be the smallest that meets the required computational parallelism; in a neural network hardware accelerator, for example, a first-level cache of tens or hundreds of KB already satisfies a computational parallelism of thousands or tens of thousands. Because its capacity and access mode are limited, the first-level cache cannot by itself read data from the external memory at the maximum rate and thus cannot fully utilize the available bandwidth, while enlarging it would incur excessive overhead. In one embodiment of the present invention, a large-capacity second-level cache is therefore arranged between the external memory and the first-level cache to solve this problem.
FIG. 6 shows a schematic diagram of a computing platform according to one embodiment of the invention. In addition to the first-level cache 610, the plurality of read controllers 620, the filtering module 640 and the parallel computing module 630, the high-parallelism computing platform 600 shown in fig. 6 includes a second-level cache 650 connected to the first-level cache for directly reading the required calculation data from the external memory. The second-level cache 650 is preferably implemented as a conventional cache, capable of sequentially buffering the calculation data read from outside and of delivering data to the first-level cache 610 as needed. The capacity of the second-level cache 650 (e.g., several MB) is typically larger than that of the first-level cache 610 (e.g., several hundred KB), and it serves as a transfer medium between the external memory (again implemented as a mass storage unit such as DDR) and the first-level cache 610. By introducing the second-level cache, the memory bandwidth can be utilized to the maximum while the capacity and power-consumption requirements on the first-level cache are reduced.
The level two cache 650 may write data to the level one cache 610 and/or read data from an external memory according to the data consumption status in the level one cache. In one embodiment, the data consumption status in the level one cache may be indicated by a read controller status parameter.
In an implementation for CNN, the second-level cache 650 may separately cache the read feature map data and weight data and, based at least on the data-consumption status in the first-level cache, may perform at least one of the following operations: writing feature map data and weight data into the first-level cache; and reading new feature map data and weight data from the external memory. Preferably, the data-consumption status can be characterized by the position coordinate, on the feature map, of the feature map data currently being read by the read controllers.
The introduction of the multi-level cache structure can improve the system efficiency and reduce the system overhead. Further, the data update frequency among the multiple levels of cache can be designed based on the computation speed of the parallel computation module and the storage bandwidth of the computing platform and the external memory, so that the system performance is maximally improved.
Ideally, the time interval between successive reads of the feature map data and weight data by the plurality of read controllers is equal to the time the parallel computing module needs for a single pass of M computations. In other words, the update from the first-level cache to the level-0 cache is a high-frequency expansion of the data, whose period can equal the time required for, e.g., the convolution computation kernel to perform one pass (comprising M parallel convolution-window computations), so that the computing power of the parallel computing module is utilized to the maximum.
The frequency of data updates in the first-level cache is markedly lower than that in the level-0 cache; that is, the interval at which the second-level cache writes data into the first-level cache is greater than the interval at which the first-level cache writes data into the level-0 cache. When, for example, the position coordinate on the feature map of the feature map data being read by the read controllers indicates that the data in the first-level cache is about to be exhausted, the second-level cache writes new data into the first-level cache.
The rate of data updates in the second-level cache, in turn, can be set flexibly as needed. The interval at which the second-level cache reads data from the external memory may be equal to or greater than the interval at which it writes data into the first-level cache. In one embodiment, whenever the data in the first-level cache is exhausted, the second-level cache refills the first-level cache and simultaneously reads new data from the external memory; for example, when a 500 KB first-level cache is exhausted, a 5 MB second-level cache writes 500 KB of data into it and at the same time reads a new 500 KB of data from the external memory. In other embodiments the second-level cache may replenish itself less frequently, for example fetching a corresponding amount of new data only after the first-level cache has been updated two, three or more times; the invention is not limited in this respect.
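The refill policy just described can be pictured with the following sketch. The class name, the margin parameter and the "refill every N-th time" ratio are illustrative assumptions of this sketch; in real hardware this logic would reside in the cache update controller rather than in Python.

class CacheUpdateController:
    # Illustrative refill policy: the second-level cache pushes a new block into the
    # first-level cache when the read-controller position coordinate shows that the
    # buffered feature-map region is nearly consumed, and (possibly less often)
    # fetches the next block from the external memory.

    def __init__(self, refill_margin=1, dram_fetch_every=1):
        self.refill_margin = refill_margin        # rows left in L1 when a refill is triggered
        self.dram_fetch_every = dram_fetch_every  # L1 refills per external-memory read
        self.refills = 0

    def on_read_controller_state(self, current_row, l1_last_row):
        # current_row: feature-map row the read controllers are currently consuming.
        # l1_last_row: last feature-map row currently present in the first-level cache.
        actions = []
        if l1_last_row - current_row <= self.refill_margin:
            actions.append("L2 -> L1: write next block of feature map and weight data")
            self.refills += 1
            if self.refills % self.dram_fetch_every == 0:
                actions.append("DRAM -> L2: read and buffer new calculation data")
        return actions

ctrl = CacheUpdateController(refill_margin=1, dram_fetch_every=2)
print(ctrl.on_read_controller_state(current_row=3, l1_last_row=7))    # [] - plenty of rows left
print(ctrl.on_read_controller_state(current_row=6, l1_last_row=7))    # first refill: L2 -> L1 only
print(ctrl.on_read_controller_state(current_row=14, l1_last_row=15))  # second refill: L2 -> L1 and DRAM -> L2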
The hardware architecture of the present invention and its preferred embodiments have been described above in connection with fig. 2-6. One specific application of the present invention and its operation will be described below with reference to fig. 7.
Fig. 7 shows a detailed operation diagram of a high-parallelism computing platform for CNN according to an embodiment of the present invention. As shown in fig. 7, the computing platform 700 includes a second-level cache (L2 cache) 750, a first-level cache (L1 cache) 710, a plurality of read controllers 720, a level-0 cache (L0 cache) 741 and asynchronous FIFOs 742 that together function as the filtering module, and a parallel computing module 730 implemented, as mentioned above, as a convolution computation module.
Computing platform 700 may also include an optional calculation-result cache module 760 coupled to the convolution computation module 730. This module caches the convolution result data produced by the parallel computing module 730 and stores the result data back to the external mass storage unit. The calculation-result cache module 760 may, for example, write to the external memory once a certain amount of data has accumulated.
In one embodiment, computing platform 700 further includes an optional read controller status register 770 to store the read controller status parameter, and a cache update controller 780 to generate a cache update instruction based on the read controller status parameter. The read controller status register 770 and the cache update controller 780 may be used to update the level one and level two caches based on the status of data consumption in the level one cache 710.
A computing platform implementation method for a convolutional neural network in accordance with the present invention will be described below in conjunction with fig. 8. FIG. 8 illustrates a flow diagram of a computing platform implemented method according to one embodiment of the invention. The method may be implemented using a computing platform as previously described and preferred embodiments thereof, such as computing platform 700 shown in FIG. 7.
In step S810, the feature map data and the weight data are read from the external memory into the primary cache.
In one embodiment, the read may be performed via the second-level cache 750 shown in FIG. 7. The second-level cache 750 connected to the first-level cache 710 reads the feature map data and weight data directly from the mass storage unit, and, based on the position coordinate on the feature map of the feature map data being read by the read controllers, writes feature map data and weight data into the first-level cache and reads new feature map data and weight data from the external memory, wherein the time interval at which the second-level cache writes data into the first-level cache is less than or equal to the time interval at which the second-level cache reads data from the external memory.
Subsequently, in step S820, the plurality of read controllers respectively read the feature map data and/or weight data required by a plurality of single convolution calculation operations. Here, the read controllers may read the corresponding feature map convolution windows and weight convolution windows according to the size of the convolution kernel, in order to perform the convolution operation shown in fig. 3. In this case there is a high degree of data multiplexing among the feature map data and weight data read by the plurality of read controllers, the multiplexing comprising at least one of the following: multiplexing of feature map data caused by the length and width of the convolution window being greater than the stride; and multiplexing of weight data resulting from convolution being performed simultaneously at multiple locations of the feature map.
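As a quick numerical illustration of these two multiplexing sources, the snippet below estimates, for assumed sizes, how many times each feature-map value and each weight value is reused; the formulae are rough approximations introduced here for illustration, not figures from the patent.

def reuse_factors(window=3, stride=1, parallel_positions=9):
    # Approximate multiplexing factors for one convolution layer:
    # - a feature-map value falls into about (window / stride) ** 2 overlapping
    #   convolution windows when the window is larger than the stride;
    # - a weight value is reused once per output position computed in parallel.
    feature_reuse = (window / stride) ** 2
    weight_reuse = parallel_positions
    return feature_reuse, weight_reuse

print(reuse_factors())                                             # (9.0, 9) for the 3x3 / stride-1 / M=9 case of fig. 3
print(reuse_factors(window=5, stride=2, parallel_positions=1024))  # (6.25, 1024) for assumed larger sizes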
In step S830, the parallel computation module performs a convolution computation operation with high parallelism on the sets of feature map data and weight data read by the plurality of read controllers at the same time. The sets of feature map data and weight data may also undergo a filtering operation before being fed into the convolution computation kernel. Therefore, the implementation method of the present invention may also preferably include step S825, filtering out values in the feature map convolution window and the weight data convolution window at positions corresponding to zero values in the convolution operation result, and sending the filtered feature map convolution window and the filtered weight convolution window to the parallel computation module.
To maximize the computing power of the convolution computation kernel, the time interval at which the plurality of read controllers read the feature map data and weight data is equal to the time the convolution computation kernel needs to perform one pass of M groups of computation, and is smaller than the time interval at which the second-level cache writes data into the first-level cache.
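The interval ordering just described (the read-controller interval equals one M-group pass of the convolution kernel, which is shorter than the second-level-to-first-level refill interval, which in turn is no longer than the external-memory read interval) can be sanity-checked with the rough model below. Every number in it (clock rate, capacity, consumption per pass, refill ratio) is an assumption chosen only for illustration.

# Rough interval model for the cache hierarchy; all numbers are illustrative assumptions.
M = 1024                        # parallelism of the convolution computation kernel
clock_hz = 200e6                # one M-group multiply-accumulate pass per kernel cycle
values_per_pass = M * 3 * 3     # upper bound on fresh values consumed per pass
                                # (actual consumption is lower thanks to data multiplexing)
l1_capacity_values = 256 * 1024     # values held by the first-level cache
l1_refills_per_dram_read = 2        # the second-level cache refills L1 twice per DRAM read

t_read_controllers = 1.0 / clock_hz                                   # = one M-group pass
t_l1_refill = (l1_capacity_values / values_per_pass) * t_read_controllers
t_dram_read = l1_refills_per_dram_read * t_l1_refill

assert t_read_controllers < t_l1_refill <= t_dram_read
print(f"read controllers / L0: every {t_read_controllers * 1e9:.0f} ns, "
      f"L2 -> L1: every {t_l1_refill * 1e9:.0f} ns, "
      f"DRAM -> L2: every {t_dram_read * 1e9:.0f} ns")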
An application example of the calculation implementation method of the present invention will be further described below with reference to fig. 7.
First, the cache update controller 780 of the second-level cache 750 obtains data from the mass storage unit according to the current state of the read controllers 720 of the first-level cache 710, and stores the feature map data and the weight data separately. Here, the state of the read controllers 720 of the first-level cache 710 may be the position coordinate, on a feature map, of the neural network's convolution windows; from this position coordinate the consumption of data in the data pools of the first-level cache 710 can be estimated, so that consumed data can be replaced with new data.
The data is then stored into the data pools of the first-level cache 710 in data-dependency order, where "data dependency" refers to the order in which the data stream is arranged and consumed. The data pools may include data pool #0 for caching the feature map and data pool #1 for caching the weights.
The feature map and the weights are then updated from the first-level cache 710 into the level-0 cache 741 (the convolution-window cache pools) by the two groups of read controllers, respectively. The feature map and weights are stored in the level-0 cache 741 still in their sparse form; a filtering operation then removes every pair of corresponding values for which either the feature map convolution window or the weight convolution window holds a 0 at that position, yielding densely packed parameters with the sparse entries removed, which are input into the asynchronous FIFOs 742 at a high frequency. The sparsity-processed data is then fed from the FIFOs to the convolution computation kernel 730 (M mutually independent multiply-accumulate modules), and after the computation is completed the results are stored in the calculation-result cache module 760. Once enough data has accumulated, the calculation-result cache module 760 stores the results back into the mass storage module.
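Putting these stages together, the sketch below walks one input channel through the pipeline of fig. 7: first-level cache contents expanded into convolution windows, zero pairs filtered out, independent multiply-accumulates, and results staged in a result buffer that is flushed to external memory once enough data has accumulated. It is a behavioural illustration only; the function name, the flush threshold and the single-channel simplification are assumptions of this sketch.

import numpy as np

def pipeline_once(fmap, kernel, result_buffer, ddr_out, flush_at=64):
    # One pass through the pipeline of fig. 7 for a single input channel
    # (behavioural illustration only): the first-level cache already holds
    # fmap and kernel; the read controllers expand them into convolution
    # windows; the filter drops zero pairs; the multiply-accumulate units
    # produce the outputs, which are staged in the result buffer and flushed
    # to external memory once enough data has accumulated.
    K = kernel.shape[0]
    H, W = fmap.shape
    windows = [(fmap[y:y + K, x:x + K], kernel)        # read controllers + level-0 cache
               for y in range(H - K + 1) for x in range(W - K + 1)]
    for fwin, wwin in windows:                         # M independent MAC units
        pairs = [(f, w) for f, w in zip(fwin.ravel(), wwin.ravel()) if f and w]
        result_buffer.append(sum(f * w for f, w in pairs))   # "bubble-free" accumulation
    if len(result_buffer) >= flush_at:
        ddr_out.append(list(result_buffer))            # calculation-result cache writes back
        result_buffer.clear()

result_buffer, ddr_out = [], []
fmap = np.random.randint(0, 3, size=(10, 10))
kernel = np.random.randint(-1, 2, size=(3, 3))
pipeline_once(fmap, kernel, result_buffer, ddr_out)
print(len(result_buffer), "results still staged,", len(ddr_out), "block(s) written back")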
It is noted that the multi-level cache of the present invention is not a conventional processor cache but is designed specifically for parallel computing, in particular for the requirements of CNN convolution computation. The data transferred from the mass storage unit (e.g., DDR) to the second-level cache, and from the second-level cache to the first-level cache, are complete feature maps and weights that have not yet been sparsity-processed. The level-0 cache is updated at high frequency (even at the same rate as the convolution kernel), while the second-level and first-level caches can be updated at lower frequencies. Moreover, the update from the first-level cache to the level-0 cache is a high-frequency expansion of the data: during the expansion one set of weights can be shared by multiple pieces of feature map data, and the feature map itself has a high degree of data reuse whenever the convolution kernel is larger than the stride. Updates from the second-level cache to the first-level cache may take place at the same frequency as the second-level cache's reads from the external memory. Based on this multi-level cache hardware architecture, the processing speed of the convolution computation kernel can therefore be exploited to the maximum, and the memory bandwidth no longer limits the processing efficiency of the computing platform.
In addition, although fig. 7 shows feature map and weight cache modules and data pools of equal size, it should be understood that in practice the proportions of feature map data and weight data in the first-level and second-level caches may be adjusted to suit the specific parallel computing scheme. In general, the feature map cache module and data pool are much larger than the weight cache module and data pool, because convolution kernels have a higher reuse rate.
The computing platform of the present invention may be implemented as a neural network processor. In contrast to a generic computing platform (i.e., one consisting only of a host or CPU), the present invention is directed to a neural network dedicated processor that is specifically designed to perform neural network computations. It will be understood by those skilled in the art that the term "neural network dedicated processor" used in the present application may also be referred to simply as a "neural network processor" or "NN processor". Since deep learning is currently one of the most popular technology classes within neural network technology, the neural network dedicated processor may be implemented as a deep-learning dedicated processor or deep-learning processor. Those skilled in the art will also appreciate that neural networks have various technical branches, such as Deep Neural Networks (DNNs) and CNNs, so the neural network dedicated processor may likewise be implemented as a deep neural network dedicated processor (DNN processor) or a convolutional neural network dedicated processor (CNN processor). That is, neural network computing implementation techniques involving a "deep-learning processor", "deep neural network processor" or "convolutional neural network processor" in heterogeneous computing platforms are also within the scope of the present invention.
The DPU (Deep-learning Processing Unit) is a general acceleration platform for neural network algorithms in artificial intelligence; it implements inference based on Convolutional Neural Networks (CNNs) by exploiting the high parallelism and low power consumption of FPGAs. Herein, a DPU may be regarded as one specific implementation of the above "deep-learning processor", "deep neural network processor", "convolutional neural network processor" or "neural network processor". The description herein is mainly based on a DPU implemented on an FPGA using a CNN architecture, but it will be understood by those skilled in the art that the principles of the present invention apply equally to neural network processors that perform inference for other neural networks on hardware architectures such as GPUs.
The computing platform of the present invention may be implemented in a highly parallel computing system in which some or all of the functions for performing highly parallel computations, such as neural network computations, may be implemented by digital circuitry. In one embodiment, the computing system of the present invention may be implemented in a system on a chip (SoC) that includes a general purpose processor, a mass memory, and digital circuitry.
In one embodiment, the highly parallel computing platform required by the present system, such as a computing platform for convolutional neural networks, may be implemented by a digital circuit portion (e.g., FPGA, GPU, or ASIC) on the SoC. The computing platform or the parallel computing modules therein may be a hardware device implemented based on FPGA or GPU or ASIC or the like. Because CNNs perform parallel computations, implementing convolutional neural network computation functions via logic hardware, particularly FPGAs, has natural computational advantages and can achieve lower power consumption than software implementations.
In one embodiment, all parameters related to CNN obtained by previous training and the feature map to be classified, for example, may be stored in an external memory, and when neural network computation is performed subsequently, the method as described above in connection with fig. 8 may be executed by a general-purpose processor to achieve high-performance parallel computation on a computing platform.
The high parallelism computing platform, system and computing implementation method according to the present invention have been described in detail above with reference to the accompanying drawings.
A multi-level cache structure is designed around the deterministic data-dependency pattern of convolutional neural networks, and the efficiency of supplying data to the convolution computation module is greatly improved by exploiting bandwidth, multiple clock frequencies and data multiplexing. In addition, the invention exploits the sparsity of the convolutional neural network's weights and feature maps, greatly improving operational efficiency by filtering out null values. The computing platform is simple in design and can obtain the maximum computing-efficiency benefit from the sparsity of neural network parameters without the cost of complex encoding/decoding or of storing sparse-data bookkeeping information.
It is emphasized that, although the present invention has been primarily illustrated in connection with the computation of convolutional neural networks as an optimized hardware architecture, it will be understood by those skilled in the art that the computing platform of the present invention is also applicable to other high-parallelism computations, and is particularly applicable to parallel computing application scenarios in which the input data type is relatively single and/or sparsity is high.
In addition, although the names "second-level cache", "first-level cache" and "level-0 cache" are used in the present invention, "second-level", "first-level" and "level-0" serve here only to describe the relationship between the caches in an embodiment that includes more than two caches, and do not impose any limitation on the caches themselves.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the above-mentioned steps defined in the above-mentioned method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (23)

1. A high parallelism computing platform, comprising:
a first-level cache for caching calculation data read from an external memory;
a plurality of read controllers, each read controller being configured to read, from any position in the first-level cache, the calculation data, or a part thereof, required for a single operation in a parallel computing module;
the parallel computing module, for performing high-parallelism computing operations on the calculation data read by the plurality of read controllers; and
a second-level cache for directly reading the calculation data from the external memory, wherein the second-level cache sequentially caches the read calculation data, and at least one of the following operations is performed based on the data consumption status in the first-level cache:
writing calculation data into the first-level cache; and
reading and caching the calculation data from the external memory.
2. The computing platform of claim 1, further comprising:
a filtering module, connected between the plurality of read controllers and the parallel computing module, for filtering out null values in the calculation data read by the read controllers and/or non-null values in the calculation data whose corresponding calculation results would be null.
3. The computing platform of claim 1, wherein the data consumption status in the first-level cache is indicated by a read controller status parameter, the computing platform further comprising:
a read controller status register for storing the read controller status parameter; and
a cache update controller for generating a cache update instruction based on the read controller status parameter.
4. The computing platform of claim 1, further comprising:
a calculation result cache module, connected to the parallel computing module, for caching calculation result data of the computing operations performed by the parallel computing module and storing the calculation result data back to the external memory.
5. The computing platform of claim 1, wherein the first-level cache is implemented as a register file.
6. A high parallelism computing platform for convolutional neural networks, comprising:
a first-level cache for caching feature map data and weight data read from an external memory;
a plurality of read controllers, each read controller being configured to read, from any position in the first-level cache, the feature map data and/or weight data required for a single convolution calculation operation;
a parallel computing module for performing high-parallelism convolution computing operations on the feature map data and weight data read by the plurality of read controllers; and
a second-level cache, connected to the first-level cache, for directly reading the feature map data and the weight data from the external memory, wherein the cache capacity of the second-level cache is larger than that of the first-level cache, the second-level cache caches the read feature map data and weight data separately, and at least one of the following operations is performed based on the data consumption status in the first-level cache:
writing feature map data and weight data into the first-level cache; and
reading new feature map data and weight data from the external memory.
7. The computing platform of claim 6, wherein the first-level cache comprises a feature map data pool for caching the feature map data and a weight data pool for caching the weight data, and the plurality of read controllers comprises 2M read controllers, wherein a first group of M read controllers is configured to read M feature map convolution windows simultaneously and a second group of M read controllers is configured to read M weight convolution windows simultaneously, so as to implement a feature map convolution operation with a degree of parallelism of M in the parallel computing module, wherein M is an integer greater than 2.
8. The computing platform of claim 7, further comprising:
a filtering module, connected between the plurality of read controllers and the parallel computing module, for filtering out null values in the feature map data and/or weight data read by the read controllers, and/or non-null values in the feature map data and weight data whose corresponding convolution calculation results would be null.
9. The computing platform of claim 8, wherein the filtering module retains values for which the same position in both the feature map convolution window and its corresponding weight convolution window is non-zero and filters out other values.
10. The computing platform of claim 9, wherein the filtering module comprises a level 0 cache and M asynchronous FIFO queues, the level 0 cache comprising, or being divisible into, 2M convolution window cache pools; the 2M read controllers store the M feature map convolution windows and the M weight convolution windows into the 2M convolution window cache pools of the level 0 cache; values for which at least one of the values at the same location in a feature map convolution window and its corresponding weight convolution window is zero are filtered out, with the feature map convolution windows and weight convolution windows taken in one-to-one correspondence; and the retained values are stored into the M asynchronous FIFO queues, respectively.
11. The computing platform of claim 6, wherein the data consumption status is indicated by the position coordinates, on the feature map, of the feature map data read by the read controllers.
12. The computing platform of claim 6, further comprising:
a read controller status register for storing the read controller status parameter; and
a cache update controller for generating a cache update instruction based on the read controller status parameter.
13. The computing platform of claim 7, wherein M is determined based at least on the bandwidth for reading data from the external memory and the size of the convolution window.
14. The computing platform of claim 6, further comprising:
a calculation result cache module, connected to the parallel computing module, for caching convolution calculation result data of the computing operations performed by the parallel computing module and storing the calculation result data back to the external memory.
15. The computing platform of claim 6, wherein the first-level cache is implemented as a register file.
16. A computing implementation method for a convolutional neural network, comprising:
reading, using the computing platform of any one of claims 1-15, feature map data and weight data from the external memory into the first-level cache;
reading, by the plurality of read controllers respectively, the feature map data and/or weight data required for a plurality of single convolution calculation operations; and
performing, by the parallel computing module, high-parallelism convolution computing operations simultaneously on the multiple groups of feature map data and weight data read by the plurality of read controllers.
17. The method of claim 16, wherein there is a high degree of data multiplexing in the feature map data and weight data read by the plurality of read controllers, the data multiplexing including at least one of:
multiplexing of feature map data caused by the length and width of the convolution window being greater than the step size; and
multiplexing of weight data resulting from performing convolution simultaneously on multiple locations in the feature map.
18. The method of claim 16, further comprising:
reading, by a second-level cache connected to the first-level cache, the feature map data and the weight data directly from the external memory, and, based on the position coordinates, on the feature map, of the feature map data read by the read controllers, writing the feature map data and weight data into the first-level cache and reading new feature map data and weight data from the external memory, wherein the time interval at which the second-level cache writes data into the first-level cache is less than or equal to the time interval at which the second-level cache reads data from the external memory.
19. The method of claim 18, wherein the reading, by the plurality of read controllers respectively, of the feature map data and/or the weight data required for a plurality of single convolution calculation operations comprises:
reading the corresponding feature map convolution window and the corresponding weight convolution window according to the size of the convolution kernel on which the convolution operation is to be performed.
20. The method of claim 18, further comprising:
filtering out, from the feature map convolution window and the weight convolution window, the values at positions whose corresponding products in the convolution operation result are zero, and sending the filtered feature map convolution window and the filtered weight convolution window to the parallel computing module.
21. The method of claim 18, wherein the time interval at which the plurality of read controllers read the feature map data and the weight data each time is the same as the time period required for the parallel computing module to perform a single computation on M groups of data, and is less than the time interval at which the second-level cache writes data into the first-level cache each time.
22. A highly parallel computing system, comprising:
the computing platform of any one of claims 1-15;
a mass storage memory located external to the computing platform;
a processor coupled to the computing platform and the memory, for performing the method of any of claims 16-21.
23. The system of claim 22, wherein the parallel computing module is implemented at least in part by an FPGA, a GPU, or an ASIC.
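As a behavioural illustration, not RTL and not part of the claims, of the cache-refill behaviour recited in claims 1, 6, 11 and 18 above, the following Python model tracks consumption in the first-level cache and lets the second-level cache both replenish it and prefetch from external memory. The row-granular policy, the capacities and all names are assumptions made only for this sketch.

    from collections import deque

    class TwoLevelCacheModel:
        def __init__(self, external_rows, l1_capacity=4, l2_capacity=16):
            self.external = deque(external_rows)  # feature-map rows still in external memory
            self.l2 = deque()                     # second-level cache: sequential buffering
            self.l1 = deque()                     # first-level cache: read from any position
            self.l1_capacity = l1_capacity
            self.l2_capacity = l2_capacity
            self._prefetch()

        def _prefetch(self):
            # Read and cache new data from external memory while L2 has room.
            while self.external and len(self.l2) < self.l2_capacity:
                self.l2.append(self.external.popleft())

        def on_read_controller_status(self, rows_consumed):
            # The read controllers report how far they have advanced on the feature map;
            # fully consumed rows are retired and replaced from the second-level cache.
            for _ in range(rows_consumed):
                if self.l1:
                    self.l1.popleft()
            while self.l2 and len(self.l1) < self.l1_capacity:
                self.l1.append(self.l2.popleft())
            self._prefetch()

    # Example: stream 32 feature-map rows through the hierarchy, one row consumed per step.
    model = TwoLevelCacheModel([f"row{i}" for i in range(32)])
    for _ in range(32):
        model.on_read_controller_status(rows_consumed=1)
    assert not model.external and not model.l2  # all rows have flowed into the first-level cache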
CN201810277338.XA 2018-03-31 2018-03-31 High-parallelism computing platform, system and computing implementation method Active CN110321997B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810277338.XA CN110321997B (en) 2018-03-31 2018-03-31 High-parallelism computing platform, system and computing implementation method

Publications (2)

Publication Number Publication Date
CN110321997A CN110321997A (en) 2019-10-11
CN110321997B (en) 2021-10-19

Family

ID=68111837

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810277338.XA Active CN110321997B (en) 2018-03-31 2018-03-31 High-parallelism computing platform, system and computing implementation method

Country Status (1)

Country Link
CN (1) CN110321997B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112749782A (en) * 2019-10-31 2021-05-04 上海商汤智能科技有限公司 Data processing method and related product
CN113111995A (en) * 2020-01-09 2021-07-13 北京君正集成电路股份有限公司 Method for shortening model reasoning and model post-processing operation time
CN111275194B (en) * 2020-02-16 2022-06-21 苏州浪潮智能科技有限公司 NLP reasoning acceleration system based on FPGA
CN113807998A (en) * 2020-06-12 2021-12-17 深圳市中兴微电子技术有限公司 Image processing method, target detection device, machine vision equipment and storage medium
CN111984189B (en) * 2020-07-22 2022-05-17 深圳云天励飞技术股份有限公司 Neural network computing device, data reading method, data storage method and related equipment
CN112990157B (en) * 2021-05-13 2021-08-20 南京广捷智能科技有限公司 Image target identification acceleration system based on FPGA
CN113642724B (en) * 2021-08-11 2023-08-01 西安微电子技术研究所 CNN accelerator for high bandwidth storage

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101034375A (en) * 2007-02-12 2007-09-12 忆正存储技术(深圳)有限公司 Computer memory system
JP2011150684A (en) * 2009-12-21 2011-08-04 Sony Corp Cache memory and cache memory control device
CN105119768A (en) * 2015-06-26 2015-12-02 华为技术有限公司 Field-programmable gate array FPGA and data storage method
CN106649143B (en) * 2015-10-29 2020-06-26 阿里巴巴集团控股有限公司 Cache access method and device and electronic equipment
CN105611234B (en) * 2015-12-21 2018-09-28 中国科学院长春光学精密机械与物理研究所 The arbitrary frame-rate digital image simulation display methods of embedded system
KR20180034853A (en) * 2016-09-28 2018-04-05 에스케이하이닉스 주식회사 Apparatus and method test operating of convolutional neural network

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101499875A (en) * 2008-02-02 2009-08-05 三星电子株式会社 Variant processing rate supporting apparatus for LTE rate de-matching and de-interleaving
WO2010064728A1 (en) * 2008-12-04 2010-06-10 Canon Kabushiki Kaisha Convolution operation circuit and object recognition apparatus
US8442927B2 (en) * 2009-07-30 2013-05-14 Nec Laboratories America, Inc. Dynamically configurable, multi-ported co-processor for convolutional neural networks
CN103729449A (en) * 2013-12-31 2014-04-16 上海富瀚微电子有限公司 Reference data access management method and device
CN106875011A (en) * 2017-01-12 2017-06-20 南京大学 The hardware structure and its calculation process of two-value weight convolutional neural networks accelerator
CN107633297A (en) * 2017-03-10 2018-01-26 南京大学 A kind of convolutional neural networks hardware accelerator based on parallel quick FIR filter algorithm
CN107451659A (en) * 2017-07-27 2017-12-08 清华大学 Neutral net accelerator and its implementation for bit wide subregion
CN107392309A (en) * 2017-09-11 2017-11-24 东南大学—无锡集成电路技术研究所 A kind of general fixed-point number neutral net convolution accelerator hardware structure based on FPGA

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Angel-Eye: A Complete Design Flow for Mapping CNN onto Customized Hardware; Kaiyuan Guo et al.; ISVLSI; 2016-12-08; pp. 24-29 *
Multi-GPU Parallel Framework for Deep Convolutional Neural Networks; Yang Ning; Computer and Modernization; 2016-11-30 (No. 11); pp. 95-98 *

Also Published As

Publication number Publication date
CN110321997A (en) 2019-10-11

Similar Documents

Publication Publication Date Title
CN110321997B (en) High-parallelism computing platform, system and computing implementation method
US11568258B2 (en) Operation method
CN107506828B (en) Artificial neural network computing device and method for sparse connection
US10936941B2 (en) Efficient data access control device for neural network hardware acceleration system
CN110050267B (en) System and method for data management
CN106875013B (en) System and method for multi-core optimized recurrent neural networks
CN111414994B (en) FPGA-based Yolov3 network computing acceleration system and acceleration method thereof
CN110895715A (en) Storage efficient neural network
CN110543939B (en) Hardware acceleration realization device for convolutional neural network backward training based on FPGA
US20180330235A1 (en) Apparatus and Method of Using Dual Indexing in Input Neurons and Corresponding Weights of Sparse Neural Network
US11775832B2 (en) Device and method for artificial neural network operation
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
CN114503125A (en) Structured pruning method, system and computer readable medium
CN110580519B (en) Convolution operation device and method thereof
CN111582465B (en) Convolutional neural network acceleration processing system and method based on FPGA and terminal
CN110377874B (en) Convolution operation method and system
Dey et al. Accelerating training of deep neural networks via sparse edge processing
CN113392963B (en) FPGA-based CNN hardware acceleration system design method
CN110716751A (en) High-parallelism computing platform, system and computing implementation method
Li et al. Fpga-based object detection acceleration architecture design
CN110765413A (en) Matrix summation structure and neural network computing platform
Shahan et al. FPGA based convolution and memory architecture for Convolutional Neural Network
Sreehari et al. A hardware accelerator based on quantized weights for deep neural networks
Liu et al. Research on Electronic Hardware Scheme Design for Performance Improvement of Convolutional Neural Network
CN114429553A (en) Image recognition convolutional layer structure based on random calculation and sparse calculation

Legal Events

Date Code Title Description
PB01 Publication
TA01 Transfer of patent application right
  Effective date of registration: 20190925
  Address after: 2100 San Jose Rojack Avenue, California, USA
  Applicant after: XILINX INC
  Address before: 100083, 17 floor, four building four, 1 Wang Zhuang Road, Haidian District, Beijing.
  Applicant before: Beijing Shenjian Intelligent Technology Co., Ltd.
SE01 Entry into force of request for substantive examination
GR01 Patent grant