WO2019085378A1 - Hardware implementation device and method for high-speed full-connection calculation - Google Patents

Hardware implementation device and method for high-speed full-connection calculation

Info

Publication number
WO2019085378A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
weight
calculation
module
result
Prior art date
Application number
PCT/CN2018/080600
Other languages
French (fr)
Chinese (zh)
Inventor
康君龙
张玉
谢东亮
Original Assignee
北京深鉴智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京深鉴智能科技有限公司
Publication of WO2019085378A1 publication Critical patent/WO2019085378A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 99/00 Subject matter not provided for in other groups of this subclass


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

Provided are a hardware implementation device and method for high-speed full-connection calculation. According to the present invention, the hardware implementation device (200) for high-speed full-connection calculation comprises: a weight storage module (210) for storing the weight data used in the calculation, where m sets of weight data are stored at a time until the weight calculation for all output channels is completed; a vector storage module (220) for storing n input vector data; an output register module (230) implementing an output buffer for the calculation results; and a core calculation module (240) for multiplying the m sets of weight data supplied by the weight storage module with the n input vector data supplied by the vector storage module, where each multiplication result is added to the previous valid result, the corresponding offset value is added to the multiply-accumulate result, and the final calculation result is output to the output register module.

Description

Hardware implementation device and method for high-speed full-connection calculation
Technical Field
The present invention relates to artificial neural networks, and more particularly to a hardware implementation device and method for high-speed full-connection calculation.
Background Art
The concept of deep learning originates from research on artificial neural networks (ANNs); it is a family of machine learning methods based on learning representations of data. A multilayer perceptron with multiple hidden layers is one deep learning structure. Deep learning combines low-level features into more abstract high-level representations of attribute categories or features, in order to discover distributed feature representations of the data.
Deep learning is a relatively new field of machine learning research. Its motivation is to build neural networks that simulate the way the human brain analyzes and learns, mimicking the brain's mechanisms to interpret data such as images, sound, and text.
In AlexNet, a classic deep learning network, the network model consists of convolution layers (conv), pooling layers (pooling), fully connected layers (fc), and a softmax layer; the fully connected layer maps the learned distributed feature representation into the sample label space.
Every node of a fully connected layer is connected to all nodes of the previous layer, combining the features extracted by the earlier layers. Figure 1 is a schematic diagram of a simple artificial neural network. As shown in Figure 1, the forward computation is a linear weighted summation: each output of the fully connected layer is obtained by multiplying every node of the previous layer by a weight coefficient W and finally adding an offset value b. In matrix form this can be expressed as:
$$\mathbf{y} = W\mathbf{x} + \mathbf{b}$$
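For illustration only (this sketch is not part of the original disclosure, and the function and variable names are invented), the forward computation above can be written in a few lines of NumPy:

```python
import numpy as np

def fully_connected(x, W, b):
    """Forward pass of one fully connected layer: y = W x + b."""
    # W: (output_channels, input_channels), x: (input_channels,), b: (output_channels,)
    return W @ x + b

# Dimensions borrowed from the first preferred embodiment: 2048 inputs, 30 outputs.
rng = np.random.default_rng(0)
W = rng.standard_normal((30, 2048)).astype(np.float32)
x = rng.standard_normal(2048).astype(np.float32)
b = rng.standard_normal(30).astype(np.float32)
y = fully_connected(x, W, b)  # y has shape (30,)
```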
In the fully connected (FC) layer operation of a neural network, every input must be multiplied and accumulated with all of the weights, so the amount of data involved in the computation is large and the hardware bandwidth requirement is high. A design that exploits this characteristic can reduce the hardware bandwidth required for the data and improve computational efficiency.
Summary of the Invention
As described above, the present invention is designed around the above characteristics of the fully connected layer operation of a neural network, so as to reduce the hardware bandwidth requirement of the data and thereby improve computational efficiency.
The invention provides a dedicated circuit for implementing the neural network full-connection operation. An object of the present invention is to provide a device implementing an FC accelerator with high reuse of FC data, low interface requirements, high computing power, and high performance.
To achieve the above object, and in view of the large amount of computation in FC layers, the present invention provides a hardware implementation device and method for high-speed FC calculation.
According to a first aspect of the present invention, there is provided a hardware implementation device for high-speed full-connection calculation, which may include: a weight storage module for storing the weight data used in the calculation, storing m sets of weight data at a time until the weight calculation for all output channels is completed; a vector storage module for storing n input vector data; an output register module implementing an output buffer for the calculation results; and a core calculation module for multiplying the m sets of weight data supplied by the weight storage module with the n input vector data supplied by the vector storage module, adding each multiplication result to the previous valid result, adding the corresponding offset value to the multiply-accumulate result, and outputting the final calculation result to the output register module.
In the hardware implementation device according to the first aspect of the present invention, ping-pong buffering may be used for the weight data storage in the weight storage module, the input vector data storage in the vector storage module, and the intermediate calculation result storage in the core calculation module.
In the device according to the first aspect of the present invention, the core calculation module may include m*n compute cores, so that the multiplications of the m sets of weight data with the n input vectors can be performed simultaneously.
In the device according to the first aspect of the present invention, the values of m and n may be one of the following: m=4, n=4; m=8, n=4; or m=4, n=8.
According to a second aspect of the present invention, there is provided a hardware implementation method for high-speed full-connection calculation, which may include: (1) loading m sets of weight data into a weight storage module; (2) requesting input vector data and storing the n received input vector data in a vector storage module; (3) when both the weight storage module and the vector storage module hold data ready for calculation, reading the m sets of weight data and the n input vector data from the two modules and sending them to a core calculation module; (4) the core calculation module multiplying the received weight data and input vector data, adding each multiplication result to the previous valid result, and completing the multiply-accumulate over the input channels in a pipelined fashion; (5) adding the multiply-accumulate result of step (4) to the corresponding offset data, completing all full-connection operations for the input channels of the current calculation, and outputting the result to an output register module; (6) the output register module outputting the result data to a target interface; (7) repeating steps (1) to (6) until all full-connection operations are completed.
In the method according to the second aspect of the present invention, ping-pong buffering may be used for the weight data storage in the weight storage module, the input vector data storage in the vector storage module, and the intermediate calculation result storage in the core calculation module.
In the method according to the second aspect of the present invention, the core calculation module may include m*n compute cores, so that in step (4) the multiplications of the m sets of weight data with the n input vectors can be performed simultaneously.
In the method according to the second aspect of the present invention, the values of m and n may be one of the following: m=4, n=4; m=8, n=4; or m=4, n=8.
According to a third aspect of the present invention, there is provided a computer-readable medium for recording instructions executable by a processor, the instructions, when executed by the processor, causing the processor to perform a hardware implementation method of high-speed full-connection calculation comprising the following operations: (1) loading m sets of weight data into a weight storage module; (2) requesting input vector data and storing the n received input vector data in a vector storage module; (3) when both the weight storage module and the vector storage module hold data ready for calculation, reading the m sets of weight data and the n input vector data from the two modules and sending them to a core calculation module; (4) the core calculation module multiplying the received weight data and input vector data, adding each multiplication result to the previous valid result, and completing the multiply-accumulate over the input channels in a pipelined fashion; (5) adding the multiply-accumulate result of step (4) to the corresponding offset data, completing all full-connection operations for the input channels of the current calculation, and outputting the result to an output register module; (6) the output register module outputting the result data to a target interface; (7) repeating steps (1) to (6) until all full-connection operations are completed.
In the computer-readable medium according to the third aspect of the present invention, the values of m and n may be one of the following: m=4, n=4; m=8, n=4; or m=4, n=8.
Brief Description of the Drawings
The invention is described below with reference to the accompanying drawings in conjunction with embodiments. In the drawings:
Figure 1 is a schematic diagram of a simple artificial neural network;
Figure 2 is a schematic diagram of the hardware implementation device for high-speed full-connection calculation according to the present invention;
Figure 3 is a flowchart of the hardware implementation method of high-speed full-connection calculation according to the present invention;
Figure 4 is a schematic diagram of the hardware implementation device according to a first preferred embodiment of the present invention;
Figure 5 is a schematic diagram of the hardware implementation device according to a second preferred embodiment of the present invention.
Detailed Description
The drawings are for illustration only and are not to be construed as limiting the invention. The technical solution of the present invention is further described below with reference to the drawings and embodiments.
To achieve the object of the present invention, and given that full-connection (FC) layers are computationally intensive, the present invention provides a high-speed FC computing device that includes, but is not limited to, a weight storage module, a vector storage module, an output register module, and a core calculation module.
Figure 2 is a schematic diagram of the hardware implementation device for high-speed full-connection calculation according to the present invention.
As shown in Figure 2, the composition of the hardware implementation device 200 for high-speed full-connection calculation according to the present invention is described below.
Weight storage module 210: this module stores the weight data used in the calculation. The design implements the FC function by weight sharing: each time a portion of the weights has been combined with all of the inputs, the weight data is updated, until all input vectors have been processed. Preferably, the design caches 4 sets of weights at a time in a ping-pong buffer, until the weight calculation for all output channels is completed.
Vector storage module 220: because the design uses weight sharing, the demand on vector storage is low. Computation can start as soon as, for example, 4 input data are valid, and then proceeds in a pipeline, so only a small number of registers need to be added to this module to hold a small amount of data. The design uses two groups of registers, each holding 4 data, operated ping-pong. When the bandwidth of the input data interface is insufficient, the buffer can be enlarged appropriately so that computational efficiency is not affected.
Output register module 230: its design is similar to that of the input storage modules and implements an output buffer for the calculation results. The output buffer size can be adjusted according to the interface bandwidth, preventing back-pressure on the FC operation caused by results that have not yet been sent out, which would reduce computational efficiency.
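As an illustrative software model only (not from the original disclosure; the class and method names are invented), the ping-pong buffering shared by the storage modules can be sketched as a double buffer:

```python
class PingPongBuffer:
    """Double buffer: the compute side reads one bank while the load side fills the other."""

    def __init__(self, depth):
        self.banks = [[0] * depth, [0] * depth]
        self.write_bank = 0  # index of the bank currently being filled

    def write(self, data):
        """Fill the write bank with the next tile of data."""
        self.banks[self.write_bank][:len(data)] = data

    def read(self):
        """Read the bank opposite the one being filled."""
        return self.banks[self.write_bank ^ 1]

    def swap(self):
        """Flip banks once the write bank is full and the read bank is drained."""
        self.write_bank ^= 1
```

Because loads always target the bank that the compute side is not reading, the load and output steps of the method described below can run concurrently with the multiply-accumulate step.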
Core calculation module 240: this module performs the multiply-accumulate of the inputs and weights and completes the result by adding the corresponding bias. To achieve higher computing power, different numbers of compute cores can be used as the interface bandwidth allows. Preferably, the design uses 16 compute cores, processing 4 weight data and 4 input vectors simultaneously, i.e., 4*4=16.
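One cycle of such an m*n core array can be sketched as follows. This reading, in which the n vector data are same-channel values from n different vectors and the m weight data are same-channel values from m output rows, is our interpretation for illustration; the patent does not spell out the wiring:

```python
def mac_tile(w, x, acc):
    """One pipeline cycle of an m*n multiply-accumulate core array.

    w:   m weight values for the current input channel (one per output row)
    x:   n input values for the current input channel (one per vector)
    acc: m x n partial sums, carried across input channels
    """
    for i in range(len(w)):      # each row of cores shares one weight value
        for j in range(len(x)):  # each column of cores shares one input value
            acc[i][j] += w[i] * x[j]
    return acc
```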
A concrete calculation example is given in the preferred embodiments below.
From the above description, the following can be summarized. The hardware implementation device for high-speed full-connection calculation according to the present invention may include: a weight storage module for storing the weight data used in the calculation, storing m sets of weight data at a time until the weight calculation for all output channels is completed; a vector storage module for storing n input vector data; an output register module implementing an output buffer for the calculation results; and a core calculation module for multiplying the m sets of weight data supplied by the weight storage module with the n input vector data supplied by the vector storage module, adding each multiplication result to the previous valid result, adding the corresponding offset value to the multiply-accumulate result, and outputting the final calculation result to the output register module.
In general, the core calculation module may include m*n compute cores, so that the multiplications of the m sets of weight data with the n input vectors can be performed simultaneously.
Note that the values of m and n are usually determined by the actual computing hardware. Those skilled in the art will understand how to set m and n reasonably so as to obtain the desired computing power from the available hardware resources. As mentioned above, in a preferred embodiment the values may be m=4 and n=4; other values are possible in practice, as in the preferred embodiments discussed further below.
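As a back-of-the-envelope aid (ours, not from the patent), the interface bandwidth and peak throughput implied by a choice of m and n can be estimated, assuming one new 32-bit word per vector lane per cycle and one multiply-accumulate (2 operations) per core per cycle:

```python
def fc_config(m, n, clock_hz=300e6, bits=32):
    """Estimate vector-input bandwidth (GB/s) and peak throughput (Gops)
    for an m*n compute-core array."""
    bandwidth_gb = n * (bits // 8) * clock_hz / 1e9  # input words streamed per second
    gops = m * n * 2 * clock_hz / 1e9                # multiply + add per core per cycle
    return bandwidth_gb, gops

print(fc_config(4, 4))  # (4.8, 9.6)  matches the first preferred embodiment
print(fc_config(8, 4))  # (4.8, 19.2) larger weight cache, same interface rate
print(fc_config(4, 8))  # (9.6, 19.2) doubled interface data rate
```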
Furthermore, ping-pong buffering may be used for the weight data storage in the weight storage module, the input vector data storage in the vector storage module, and the intermediate calculation result storage in the core calculation module. This may also be taken as a preferred embodiment of the invention.
The present invention also provides a method for high-speed FC calculation, with the following specific steps:
Step 1: load weight data into the weight storage module;
Step 2: request vector data and store the received data in the vector storage module;
Step 3: when both the weight storage module and the vector storage module hold data ready for calculation, read 4 data from each of the above modules and send them to the core calculation module;
Step 4: the core calculation module multiplies the received data pairwise and adds each result to the previous valid result, completing the multiply-accumulate over the input channels in a pipelined fashion;
Step 5: add the multiply-accumulate result of step 4 to the corresponding offset data, completing all FC operations for the input channels of the current calculation, and output the result to the output register module;
Step 6: the output register module outputs the result data to the target interface;
Step 7: repeat steps 1 to 6 until all FC operations are completed.
For high computational efficiency, the buffer units are each given a ping-pong design, including the weight storage, the vector storage, and the calculation result storage. In this way, steps 1, 2, and 6 can be started in parallel while step 4 is running. This fully exploits the performance of the core calculation module, achieves completely pipelined computation, and attains high computing power and high performance.
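Purely as a software model of this control flow (the actual device is a circuit; the names and sequential loop structure are illustrative assumptions), steps 1 through 7 can be sketched as a loop over weight tiles and vector tiles:

```python
def fc_accelerate(weights, vectors, bias, m=4, n=4):
    """Software model of steps 1-7. In hardware the ping-pong buffers let the
    load steps (1, 2) and the output step (6) overlap with step 4; here
    everything runs sequentially for clarity."""
    results = {}
    for wi in range(0, len(weights), m):          # step 1: load m weight rows
        w_tile = weights[wi:wi + m]
        for vi in range(0, len(vectors), n):      # step 2: request n vectors
            v_tile = vectors[vi:vi + n]
            acc = [[0.0] * len(v_tile) for _ in w_tile]
            for c in range(len(v_tile[0])):       # steps 3-4: pipelined MACs over channels
                for i, w_row in enumerate(w_tile):
                    for j, vec in enumerate(v_tile):
                        acc[i][j] += w_row[c] * vec[c]
            for i in range(len(w_tile)):          # step 5: add the per-output bias
                for j in range(len(v_tile)):
                    # step 6: emit the (output row, vector) result
                    results[(wi + i, vi + j)] = acc[i][j] + bias[wi + i]
    return results                                # step 7: the loops cover all tiles
```

Note how the weight tile stays resident while all vectors stream past it; this is the weight sharing described above.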
More generally, in accordance with the above description, reference may be made to Figure 3, which is a flowchart of the hardware implementation method of high-speed full-connection calculation according to the present invention.
As shown in Figure 3, the hardware implementation method 300 of high-speed full-connection calculation according to the present invention may begin at step S310, in which m sets of weight data are loaded into the weight storage module.
At the same time, in step S320, input vector data is requested and the n received input vector data are stored in the vector storage module.
As noted above, the values of m and n are usually determined by the actual computing hardware, and those skilled in the art will understand how to set them reasonably so as to obtain the desired computing power from the available hardware resources. For example, the values may be m=4, n=4.
When both the weight storage module and the vector storage module hold data ready for calculation, in step S330 the m sets of weight data and the n input vector data are read from the two modules and sent to the core calculation module.
Next, in step S340, the core calculation module multiplies the received weight data and input vector data, adds each multiplication result to the previous valid result, and completes the multiply-accumulate over the input channels in a pipelined fashion.
In method 300, the core calculation module may include m*n compute cores, so that in step S340 the multiplications of the m sets of weight data with the n input vectors can be performed simultaneously.
In step S350, the multiply-accumulate result of step S340 is added to the corresponding offset data, all full-connection operations for the input channels of the current calculation are completed, and the result is output to the output register module.
As described above, ping-pong buffering may be used for the weight data storage in the weight storage module, the input vector data storage in the vector storage module, and the intermediate calculation result storage in the core calculation module.
In step S360, the output register module outputs the result data to the target interface.
Then, in step S370, it is determined whether all full-connection operations have been completed. If the determination in step S370 is negative, i.e., some full-connection operations remain, method 300 returns to step S310 and repeats steps S310 through S360. If, on the other hand, the determination in step S370 is affirmative, i.e., all full-connection operations have been completed, method 300 may end.
Two preferred embodiments are described below.
Figure 4 is a schematic diagram of the hardware implementation device according to a first preferred embodiment of the present invention.
As shown in Figure 4, in the first preferred embodiment, the number of input channels participating in the operation is 2048, the batch size is 210, the number of output channels is 30, and each datum is 32 bits wide. First, 4*2048*32 bits of weight data are loaded into the weight storage module; after part of the weight data has been loaded, vector data is requested and loaded into the vector storage module, 4*32 bits at a time. When both the weight data and the vector data are ready, computation starts, using 16 (4*4) DSPs. The computation is fully pipelined: the weight data is shared while the vector data streams in and is processed on the fly. With an interface clock frequency of 300 MHz, the data rate of the vector input interface is 4.8 GB/s, and the computation achieves 9.6 Gops, i.e., high computational efficiency.
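The quoted figures are consistent with simple arithmetic (our own check, not text from the patent):

```python
clock_hz = 300e6    # interface clock frequency
lanes = 4           # vector words loaded per cycle
word_bytes = 4      # 32-bit data
dsps = 16           # 4*4 compute cores

print(lanes * word_bytes * clock_hz / 1e9)  # 4.8 GB/s vector input rate
print(dsps * 2 * clock_hz / 1e9)            # 9.6 Gops (multiply + add per DSP per cycle)
```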
Figure 5 is a schematic diagram of the hardware implementation device according to a second preferred embodiment of the present invention.
As shown in Figure 5, in the second preferred embodiment, if the interface data rate can be doubled, each calculation can use 4 weight data and 8 vector data; alternatively, the weight data buffer can be enlarged while the interface rate stays unchanged, so that 8 weight data and 4 vector data participate in the operation. With 32 (4*8 or 8*4) DSPs participating in the operation simultaneously, the computing power can reach 19.2 Gops (32 DSPs * 300 MHz * 2 operations per multiply-accumulate). The device therefore adapts well to different interface conditions and can realize high-performance computation to a large extent.
In other words, according to the second preferred embodiment, in the device 200 and method 300 of the present invention, the values of m and n may also be m=8, n=4 or m=4, n=8.
Combining the first and second preferred embodiments, the values of m and n may be one of the following: m=4, n=4; m=8, n=4; or m=4, n=8.
Those of ordinary skill in the art will recognize that the method of the present invention can be implemented as a computer program. As described above in connection with Figure 3, the method according to the above embodiments may be carried out by one or more programs, including instructions that cause a computer or processor to perform the algorithms described with reference to the drawings. These programs can be stored using various types of non-transitory computer-readable media and provided to a computer or processor. Non-transitory computer-readable media include various types of tangible storage media. Examples of non-transitory computer-readable media include magnetic recording media (such as floppy disks, magnetic tapes, and hard disk drives), magneto-optical recording media (such as magneto-optical disks), CD-ROM (compact disc read-only memory), CD-R, CD-R/W, and semiconductor memories (such as ROM, PROM (programmable ROM), EPROM (erasable PROM), flash ROM, and RAM (random access memory)). Further, these programs can be provided to a computer using various types of transitory computer-readable media. Examples of transitory computer-readable media include electrical signals, optical signals, and electromagnetic waves. A transitory computer-readable medium can provide a program to a computer via a wired communication path, such as electrical wires and optical fibers, or via a wireless communication path.
Therefore, according to the present invention, a computer program or a computer-readable medium can also be provided for recording instructions executable by a processor, the instructions, when executed by the processor, causing the processor to perform a hardware implementation method of high-speed full-connection calculation comprising the following operations: (1) loading m sets of weight data into the weight storage module; (2) requesting input vector data and storing the n received input vector data in the vector storage module; (3) when both the weight storage module and the vector storage module hold data ready for calculation, reading the m sets of weight data and the n input vector data from the two modules and sending them to the core calculation module; (4) the core calculation module multiplying the received weight data and input vector data, adding each multiplication result to the previous valid result, and completing the multiply-accumulate over the input channels in a pipelined fashion; (5) adding the multiply-accumulate result of step (4) to the corresponding offset data, completing all full-connection operations for the input channels of the current calculation, and outputting the result to the output register module; (6) the output register module outputting the result data to the target interface; (7) repeating steps (1) to (6) until all full-connection operations are completed.
According to a preferred embodiment of the present invention, in the computer-readable medium described above, the values of m and n may be one of the following: m=4, n=4; m=8, n=4; or m=4, n=8.
Various embodiments and implementations of the present invention have been described above, but the spirit and scope of the present invention are not limited thereto. Those skilled in the art will be able to make further applications according to the teachings of the present invention, and such applications are all within the scope of the present invention.
That is, the above embodiments of the present invention are merely examples given to illustrate the present invention clearly, not limitations on its implementation. Those of ordinary skill in the art may make other changes or modifications in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all implementations here. Any modification, substitution, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (10)

  1. A hardware implementation device for high-speed full-connection calculation, comprising:
    a weight storage module for storing the weight data used in the calculation, storing m sets of weight data at a time until the weight calculation for all output channels is completed;
    a vector storage module for storing n input vector data;
    an output register module implementing an output buffer for the calculation results; and
    a core calculation module for multiplying the m sets of weight data supplied by the weight storage module with the n input vector data supplied by the vector storage module, adding each multiplication result to the previous valid result, adding the corresponding offset value to the multiply-accumulate result, and outputting the final calculation result to the output register module.
  2. The device according to claim 1, wherein ping-pong buffering is used for the weight data storage in the weight storage module, the input vector data storage in the vector storage module, and the intermediate calculation result storage in the core calculation module.
  3. The device according to claim 1, wherein the core calculation module includes m*n compute cores, so that the multiplications of the m sets of weight data with the n input vectors are performed simultaneously.
  4. The device according to claim 1, wherein the values of m and n are one of the following:
    m=4, n=4;
    m=8, n=4; or
    m=4, n=8.
  5. A hardware implementation method for high-speed full-connection calculation, comprising:
    (1) loading m sets of weight data into the weight storage module for storage;
    (2) requesting input vector data and storing the n received input vector data in the vector storage module;
    (3) when both the weight storage module and the vector storage module hold data ready for calculation, reading the m sets of weight data and the n input vector data from the two modules respectively and sending them to the core calculation module;
    (4) multiplying, by the core calculation module, the received weight data by the received input vector data, adding each multiplication result to the previous valid result, and completing the multiply-accumulate operations over the input channels in a pipelined manner;
    (5) adding the corresponding bias data to the multiply-accumulate result of step (4), thereby completing all the full-connection operations for the input channels of the current calculation, and outputting the result to the output register module;
    (6) outputting, by the output register module, the result data to the target interface;
    (7) repeating steps (1) to (6) until all full-connection operations are completed.
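Read as software, steps (1) to (7) of claim 5 amount to the driver loop sketched below, built on the hypothetical fc_tile function from the sketch under claim 1; weight_tiles, vector_batches and emit are names introduced here for illustration, not terms of the claims.

def run_full_connection(weight_tiles, vector_batches, emit):
    # weight_tiles:   sequence of (m sets of weight data, m bias values)  -- step (1)
    # vector_batches: sequence of groups of n input vectors               -- step (2)
    # emit:           callback standing in for the target interface       -- step (6)
    for w_tile, biases in weight_tiles:
        for vectors in vector_batches:
            # steps (3)-(5): read from both stores, multiply-accumulate over the
            # input channels in a pipelined manner, then add the bias values
            results = fc_tile(w_tile, vectors, biases)
            emit(results)
    # step (7): the nested loops repeat until all full-connection operations are done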
  6. The method according to claim 5, wherein ping-pong buffering is used for the weight data storage in the weight storage module, the input vector data storage in the vector storage module, and the intermediate calculation result storage in the core calculation module.
  7. The method according to claim 5, wherein the core calculation module comprises m*n calculation cores, so that the multiplication of the m sets of weight data by the n input vectors in step (4) is performed simultaneously.
  8. The method according to claim 5, wherein the values of m and n are one of the following:
    m=4, n=4;
    m=8, n=4; or
    m=4, n=8.
  9. A computer-readable medium recording instructions executable by a processor, the instructions, when executed by the processor, causing the processor to perform a hardware implementation method for high-speed full-connection calculation, comprising the following operations:
    (1) loading m sets of weight data into the weight storage module for storage;
    (2) requesting input vector data and storing the n received input vector data in the vector storage module;
    (3) when both the weight storage module and the vector storage module hold data ready for calculation, reading the m sets of weight data and the n input vector data from the two modules respectively and sending them to the core calculation module;
    (4) multiplying, by the core calculation module, the received weight data by the received input vector data, adding each multiplication result to the previous valid result, and completing the multiply-accumulate operations over the input channels in a pipelined manner;
    (5) adding the corresponding bias data to the multiply-accumulate result of step (4), thereby completing all the full-connection operations for the input channels of the current calculation, and outputting the result to the output register module;
    (6) outputting, by the output register module, the result data to the target interface;
    (7) repeating steps (1) to (6) until all full-connection operations are completed.
  10. The computer-readable medium according to claim 9, wherein the values of m and n are one of the following:
    m=4, n=4;
    m=8, n=4; or
    m=4, n=8.
PCT/CN2018/080600 (published as WO2019085378A1): Hardware implementation device and method for high-speed full-connection calculation; priority date 2017-10-30, filing date 2018-03-27.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711035020.2 2017-10-30
CN201711035020.2A CN109740749A (en) 2017-10-30 2017-10-30 Hardware implementation device and method for high-speed full-connection calculation

Publications (1)

Publication Number Publication Date
WO2019085378A1 true WO2019085378A1 (en) 2019-05-09

Family

ID=66332784

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/080600 WO2019085378A1 (en) 2017-10-30 2018-03-27 Hardware implementation device and method for high-speed full-connection calculation

Country Status (2)

Country Link
CN (1) CN109740749A (en)
WO (1) WO2019085378A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112860320A (en) * 2021-02-09 2021-05-28 山东英信计算机技术有限公司 Method, system, device and medium for data processing based on RISC-V instruction set

Citations (5)

Publication number Priority date Publication date Assignee Title
CN106066783A * 2016-06-02 2016-11-02 华为技术有限公司 Neural network forward operation hardware structure based on power weight quantization
US20170193368A1 (en) * 2015-12-30 2017-07-06 Amazon Technologies, Inc. Conditional parallel processing in fully-connected neural networks
CN106940815A * 2017-02-13 2017-07-11 西安交通大学 Programmable convolutional neural network coprocessor IP core
CN107239824A * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for implementing a sparse convolutional neural network accelerator
CN107273969A * 2017-05-11 2017-10-20 西安交通大学 Parameterizable and scalable multilayer interconnection structure for neural network fully-connected layers

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CN103559019A * 2013-11-08 2014-02-05 上海航天测控通信研究所 Universal floating-point fully-pipelined FFT (Fast Fourier Transform) IP core

Also Published As

Publication number Publication date
CN109740749A (en) 2019-05-10

Legal Events

Code  Description
121   Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18874513; Country of ref document: EP; Kind code of ref document: A1)
NENP  Non-entry into the national phase (Ref country code: DE)
122   Ep: pct application non-entry in european phase (Ref document number: 18874513; Country of ref document: EP; Kind code of ref document: A1)