Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
In the prior art, a computing architecture based on a systolic array is usually designed for a neural network by using a tool such as Vivado. However, the arrays divided for the network layers in the resulting computing architecture frequently have to compute serially, so the critical path is long, the computing efficiency is low, and the utilization of the systolic array architecture is further reduced.
As shown in fig. 1, the neural network includes 5 network layers. In a computing architecture designed for this neural network by an existing design tool, as shown in fig. 2, an array a on the FPGA is divided for the network layer 1 and the network layer 2, an array b is divided for the network layer 4 and the network layer 5, and an array c is divided separately for the network layer 3. Since the output of the network layer 1 serves as the input of the network layer 2, the network layer 3, and the network layer 4, respectively, and the input of the network layer 5 includes the outputs of the network layer 2, the network layer 3, and the network layer 4, the network layer 1 and the network layer 2 must compute serially in the array a, and the network layer 4 and the network layer 5 must compute serially in the array b. This results in a long critical path. In addition, the array a needs to transfer data with both the array b and the array c, which results in a large amount of data exchange between different arrays. It can thus be inferred that this computing architecture is computationally inefficient.
In order to solve this technical problem, the invention provides an FPGA (Field Programmable Gate Array)-based neural network calculator generation scheme. The dependency relationship of each network layer in the neural network is determined; the network layers are grouped such that the network layers in each group have the same dependency relationship; the size of the array block required by each group is determined; a binary file is generated according to the dependency relationship of the network layers in each group and the array block size required by each group; and the binary file is deployed into the FPGA to obtain the calculator of the neural network.
Based on the above description, by analyzing the dependency relationships of the neural network, network layers that do not depend on one another (i.e., whose dependency relationships are all the same) are divided into one group, so that parallel computation can be realized in the array block allocated to the group; the critical path is thereby effectively shortened and the computing efficiency improved. In addition, the positions of the array blocks required by the groups are arranged on the FPGA according to the dependency relationships, so that data exchange between arrays is reduced and the computing efficiency is further improved.
The FPGA-based neural network calculator generation scheme of the present invention will be described in detail below with reference to specific embodiments.
Fig. 3A is a flowchart illustrating an embodiment of a method for generating an FPGA-based neural network calculator, according to an exemplary embodiment of the present invention, where the method for generating an FPGA-based neural network calculator can be applied to an electronic device (e.g., a terminal, a server, etc.). As shown in fig. 3A, the method for generating the neural network calculator based on the FPGA includes the following steps:
step 301: and determining the dependency relationship of each network layer in the neural network.
In one embodiment, the dependency relationship of each network layer may be determined according to the data transfer relationship between each network layer in the neural network.
In an exemplary scenario, for the neural network shown in fig. 1 described above, the data transfer relationships between the network layers are: the output of the network layer 1 serves as the input of the network layer 2, the network layer 3, and the network layer 4, respectively, and the input of the network layer 5 includes the outputs of the network layer 2, the network layer 3, and the network layer 4. Accordingly, the dependency relationship of the network layer 1 is: output to the network layer 2, the network layer 3, and the network layer 4; the dependency relationship of the network layer 2 is: input from the network layer 1 and output to the network layer 5; the dependency relationship of the network layer 3 is: input from the network layer 1 and output to the network layer 5; the dependency relationship of the network layer 4 is: input from the network layer 1 and output to the network layer 5; and the dependency relationship of the network layer 5 is: input from the network layer 2, the network layer 3, and the network layer 4.
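The derivation of each layer's dependency relationship from the data-transfer relationships can be sketched as follows. This is an illustrative Python sketch; the function name and the edge-list representation are assumptions, not part of the patent.

```python
def layer_dependencies(edges, layers):
    """Derive each layer's dependency relationship, represented as the pair
    (set of input layers, set of output layers), from data-transfer edges."""
    inputs = {l: set() for l in layers}
    outputs = {l: set() for l in layers}
    for src, dst in edges:          # src's output feeds dst's input
        outputs[src].add(dst)
        inputs[dst].add(src)
    return {l: (frozenset(inputs[l]), frozenset(outputs[l])) for l in layers}

# Data-transfer edges of the 5-layer network of fig. 1
edges = [(1, 2), (1, 3), (1, 4), (2, 5), (3, 5), (4, 5)]
deps = layer_dependencies(edges, layers=[1, 2, 3, 4, 5])
# Network layers 2, 3 and 4 end up with the same dependency relationship
assert deps[2] == deps[3] == deps[4] == (frozenset({1}), frozenset({5}))
```

Representing the relationship as an (inputs, outputs) pair makes "same dependency relationship" directly testable by equality, which is what the grouping of step 302 relies on.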
Step 302: and grouping the network layers, wherein the network layers in each group have the same dependency relationship, and determining the size of the array block required by each group.
In the invention, network layers that are independent of one another (namely, network layers with the same dependency relationship) are taken as one group, and network layers with different dependency relationships are placed in different groups, so that the network layers within a group can be computed in parallel by the array block allocated to that group in the FPGA, effectively shortening the critical path.
Based on the exemplary scenario of step 301, since the network layer 2, the network layer 3, and the network layer 4 have the same dependency relationship and are independent of one another, they may be regarded as one packet (packet 1); the network layer 1 and the network layer 5 each have a different dependency relationship, so the network layer 1 may be regarded as packet 2 and the network layer 5 as packet 3. The network layers of the neural network shown in fig. 1 are thus divided into three packets.
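The grouping step can be sketched by bucketing layers on their dependency signature (a hypothetical illustration; the dictionary representation matches the sketch of step 301 and is an assumption):

```python
from collections import defaultdict

def group_layers(deps):
    """Place layers whose dependency relationship (inputs, outputs) is
    identical into the same group; every distinct relationship forms a group."""
    groups = defaultdict(list)
    for layer, relationship in deps.items():
        groups[relationship].append(layer)
    return list(groups.values())

# Dependency relationships of the network of fig. 1 (from step 301)
deps = {
    1: (frozenset(), frozenset({2, 3, 4})),
    2: (frozenset({1}), frozenset({5})),
    3: (frozenset({1}), frozenset({5})),
    4: (frozenset({1}), frozenset({5})),
    5: (frozenset({2, 3, 4}), frozenset()),
}
# Three packets result: {1}, {2, 3, 4} and {5}
assert sorted(sorted(g) for g in group_layers(deps)) == [[1], [2, 3, 4], [5]]
```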
In an embodiment, since the computation of each network layer consumes a certain amount of computing resources, the size of the array block required by each packet needs to be determined to ensure that an array block satisfying the computing resources required by its network layers is allocated in the FPGA.
Based on this, to determine the array block size required by each packet, the amount of computing resources required by the network layers in the packet may be determined for each packet, and the required array block size may then be determined according to that amount.
In addition, the operation of a network layer involves a large number of multiply and add operations, so the amount of computing resources required by a network layer may be taken as the sum of the number of multiply operations and the number of add operations it requires.
In the systolic array, the minimum processing unit is the PE (Processing Element), and each PE is a multiply-accumulate unit (MAC), i.e., each PE represents one multiply operation or one add operation. The array block size required by each packet therefore refers to the number of PEs that the array block needs to contain, and the amount of computing resources that an array block can provide likewise refers to the number of PEs it contains.
Since the PEs can share peripherals and a front end, each PE can compute in parallel with the others. When a packet contains multiple independent network layers, those network layers can be computed in parallel in the array block allocated to the packet, effectively shortening the critical path length.
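The shortening of the critical path can be illustrated with a toy timing model (an assumption for illustration only; the patent does not specify per-layer timings): stages execute serially, while the layers inside a stage execute in parallel, so a stage costs the maximum of its layers' costs.

```python
def path_length(stages, cost):
    """Total latency of a pipeline of stages: serial across stages,
    parallel (max of member costs) within a stage."""
    return sum(max(cost[layer] for layer in stage) for stage in stages)

cost = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1}        # one time unit per layer (assumed)
serial  = [[1], [2], [3], [4], [5]]          # every layer computed serially
grouped = [[1], [2, 3, 4], [5]]              # layers 2-4 run in parallel
assert path_length(serial, cost) == 5
assert path_length(grouped, cost) == 3       # critical path shortened
```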
In one example, to determine the amount of computing resources required by the network layers in a packet, the network parameters included in each network layer of the packet may be obtained from the neural network, and the required amount of computing resources may then be determined according to the network parameters included in each network layer.
For network layers with different functions, the network parameters they include differ, and hence the required computing resources differ; therefore, the amount of computing resources required by each network layer needs to be determined separately.
Illustratively, the network parameter involved in a convolutional layer is the convolution kernel size; the network parameters involved in a pooling layer are the pooling size and the stride; the network parameter involved in a fully-connected layer is the number of output channels; and so on.
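A per-layer resource count consistent with the description above can be sketched as follows. The patent only names which parameters are relevant per layer type; the counting formulas below are common textbook estimates and are assumptions, not taken from the source.

```python
def layer_resources(layer):
    """Estimate multiplies + adds (i.e. the PE count) needed by one layer."""
    if layer["type"] == "fc":
        muls = layer["in_features"] * layer["out_channels"]
        adds = layer["in_features"] * layer["out_channels"]   # accumulations
        return muls + adds
    if layer["type"] == "conv":
        per_output = layer["kernel_size"] ** 2                # muls per output value
        outputs = layer["out_h"] * layer["out_w"] * layer["out_channels"]
        return 2 * per_output * outputs                       # muls + adds
    if layer["type"] == "pool":
        window = layer["size"] ** 2                           # adds per window
        return window * layer["out_h"] * layer["out_w"]
    raise ValueError(f"unknown layer type: {layer['type']}")

def packet_resources(layers):
    """Array block size (required PE count) for a packet of layers."""
    return sum(layer_resources(l) for l in layers)

fc = {"type": "fc", "in_features": 8, "out_channels": 4}
assert layer_resources(fc) == 64   # 32 multiplies + 32 adds
```

Since each PE represents one multiply or one add, `packet_resources` directly yields the number of PEs the packet's array block must contain.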
Step 303: and deploying the FPGA according to the dependency relationship of the network layer in each group and the size of the array block required by each group to obtain the calculator of the neural network.
In an embodiment, when the FPGA is deployed, a binary file may be generated according to a dependency relationship of a network layer in each packet and an array block size required by each packet, and then the binary file is loaded into the FPGA to implement the deployment.
For example, continuing the exemplary scenario of steps 301 and 302 above:
the dependency relationship of the network layers in the packet 1 is: input from the network layer 1 and output to the network layer 5; the dependency relationship of the network layer in the packet 2 is: output to the network layer 2, the network layer 3, and the network layer 4; and the dependency relationship of the network layer in the packet 3 is: input from the network layer 2, the network layer 3, and the network layer 4. Therefore, when the FPGA is deployed, the array block divided for the packet 1 has a connection relationship with the array block divided for the packet 2, and the array block of the packet 2 has a connection relationship with the array block divided for the packet 3; accordingly, the array block of the packet 1 is placed adjacent to the array block of the packet 2, and the array block of the packet 2 is placed adjacent to the array block of the packet 3.
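One way to realize this adjacency requirement is to order the array blocks along a topological order of the connection edges, so that consecutive (and hence adjacent) blocks carry the data transfers. This 1-D placement is an illustrative assumption; the patent only requires connected blocks to be positionally adjacent.

```python
from collections import deque

def place_blocks(groups, connections):
    """Order array blocks by a topological sort of the connection edges
    (Kahn's algorithm); consecutive blocks in the order are laid out adjacently."""
    indegree = {g: 0 for g in groups}
    successors = {g: [] for g in groups}
    for src, dst in connections:
        successors[src].append(dst)
        indegree[dst] += 1
    order = []
    ready = deque(g for g in groups if indegree[g] == 0)
    while ready:
        g = ready.popleft()
        order.append(g)
        for nxt in successors[g]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return order

# Packet 2 (layer 1) feeds packet 1 (layers 2-4), which feeds packet 3 (layer 5)
layout = place_blocks(["packet1", "packet2", "packet3"],
                      [("packet2", "packet1"), ("packet1", "packet3")])
assert layout == ["packet2", "packet1", "packet3"]   # connected blocks adjacent
```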
As shown in fig. 3B, in the structure of the calculator of the neural network shown in fig. 1, the array block of group 1 is connected to the array block of group 2, and the array block of group 2 is further connected to the array block of group 3.
Comparing fig. 3B with the conventional computing architecture shown in fig. 2, the computing architecture obtained by the present invention has fewer inter-array connections and less inter-array data exchange, so its computing efficiency is higher.
It should be noted that, during deployment, the network parameters included in the network layers of each packet need to be loaded into the array block allocated to that packet according to the operation rules.
In this embodiment, the calculator of the neural network is obtained by determining the dependency relationship of each network layer in the neural network, grouping the network layers, determining the size of an array block required by each group, and deploying the FPGA according to the dependency relationship of the network layers in each group and the size of the array block required by each group.
Based on the above description, by analyzing the dependency relationships of the neural network, network layers that do not depend on one another (i.e., whose dependency relationships are all the same) are divided into one group, so that parallel computation can be realized in the array block allocated to the group; the critical path is thereby effectively shortened and the computing efficiency improved. In addition, the positions of the array blocks required by the groups are arranged on the FPGA according to the dependency relationships, so that data exchange between arrays is reduced and the computing efficiency is further improved.
Fig. 4 is a hardware block diagram of an electronic device according to an exemplary embodiment of the present invention. The electronic device includes: a communication interface 401, a processor 402, a machine-readable storage medium 403, and a bus 404, wherein the communication interface 401, the processor 402, and the machine-readable storage medium 403 communicate with one another via the bus 404. The processor 402 can execute the FPGA-based neural network calculator generation method described above by reading and executing, from the machine-readable storage medium 403, machine-executable instructions corresponding to the control logic of that method; the details of the method are described in the above embodiments and are not repeated here.
The machine-readable storage medium 403 referred to in this disclosure may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions and data. For example, the machine-readable storage medium may be volatile memory, non-volatile memory, or a similar storage medium. In particular, the machine-readable storage medium 403 may be a RAM (Random Access Memory), a flash memory, a storage drive (e.g., a hard disk drive), any type of storage disk (e.g., an optical disk or a DVD), a similar storage medium, or a combination thereof.
Corresponding to the embodiment of the neural network calculator generating method based on the FPGA, the invention also provides an embodiment of a neural network calculator generating device based on the FPGA.
Fig. 5 is a block diagram illustrating an embodiment of an FPGA-based neural network calculator generating apparatus according to an exemplary embodiment of the present invention, which can be applied to an electronic device. As shown in fig. 5, the FPGA-based neural network calculator generating apparatus includes:
a first determining module 510, configured to determine a dependency relationship of each network layer in the neural network;
a second determining module 520, configured to group network layers, where dependency relationships of the network layers in each group are the same, and determine a size of an array block required by each group;
a generating module 530, configured to deploy an FPGA to obtain a calculator of the neural network according to the dependency relationship of the network layer in each packet and the size of the array block required by each packet.
In an optional implementation manner, the first determining module 510 is specifically configured to determine the dependency relationship of each network layer according to a data transfer relationship between each network layer in the neural network.
In an optional implementation manner, the second determining module 520 is specifically configured to, in the process of determining the size of the array block required by each packet, determine, for each packet, the number of computing resources required by the network layer in the packet; the required array block size is determined according to the amount of computational resources required for the packet.
In an optional implementation manner, the second determining module 520 is specifically configured to, in a process of determining the number of computing resources required by the network layer in the packet, obtain network parameters included in each network layer in the packet from the neural network; and determining the required amount of computing resources according to the network parameters contained in each network layer.
In an alternative implementation, the array block size required for each packet refers to the number of processing element PEs that the array block needs to contain.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.