Background
Convolutional layers are the core computational modules of convolutional neural networks, and their computation usually accounts for more than 90% of the total computation of the network. Fig. 1 shows the process by which a convolution computes an output feature map: each input feature map corresponds to a convolution kernel, the dotted boxes of different colors in the input maps correspond to different outputs, and each output value is obtained by summing the products of corresponding positions in the different input maps and their convolution kernels. Each output value is the result of processing local input information and reflects local feature information; the same input feature map is processed with the same convolution kernel, which is the weight-sharing mechanism of a convolutional network. The convolution layer extracts local features of the input feature map and processes the entire input feature map by sliding the convolution kernel across it.
The calculation formula of the convolutional layer is given as formula (1):

$$\mathrm{out}(f_0, x, y) = \sum_{c=0}^{C-1} \sum_{i=0}^{K-1} \sum_{j=0}^{K-1} \mathrm{in}(c,\ x \cdot S + i,\ y \cdot S + j) \cdot W(f_0, c, i, j) + b \qquad (1)$$

wherein out(f_0, x, y) denotes the value at position (x, y) of the f_0-th output feature map, W is the convolution kernel weight matrix, in denotes the input feature map, b denotes the bias of the convolution layer, C is the number of input channels, K is the convolution kernel size, and S is the sliding stride of the convolution kernel.
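For illustration, a minimal C sketch of formula (1) is given below. The flattened tensor layouts, the loop order, and the absence of padding are assumptions of this example and are not specified by the embodiment.

```c
#include <stddef.h>

/* Direct convolution per formula (1): out(f0, x, y) is the sum over
 * input channels c and kernel taps (i, j) of
 * in(c, y*S + i, x*S + j) * W(f0, c, i, j), plus the bias b[f0].
 * Assumed layouts: in[C][H][Wd], w[F][C][K][K], out[F][OH][OW], flattened. */
void conv2d(const float *in, const float *w, const float *b, float *out,
            int C, int H, int Wd,      /* input channels, height, width   */
            int F, int K, int S)       /* output channels, kernel, stride */
{
    int OH = (H - K) / S + 1;          /* output height (no padding)      */
    int OW = (Wd - K) / S + 1;         /* output width                    */
    for (int f0 = 0; f0 < F; f0++)
        for (int y = 0; y < OH; y++)
            for (int x = 0; x < OW; x++) {
                float acc = b[f0];
                for (int c = 0; c < C; c++)
                    for (int i = 0; i < K; i++)
                        for (int j = 0; j < K; j++)
                            acc += in[(c * H + y * S + i) * Wd + x * S + j]
                                 * w[((f0 * C + c) * K + i) * K + j];
                out[(f0 * OH + y) * OW + x] = acc;
            }
}
```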
As can be seen from formula (1), convolution involves a large number of multiply-accumulate operations and is very computationally intensive; implementing it purely in software is very inefficient, so a neural network acceleration algorithm and an external hardware processor cooperating through AI instructions need to be developed to accelerate the algorithm.
Disclosure of Invention
The embodiments of the invention provide a neural network acceleration coprocessor, a processing system and a processing method, which address the currently very low efficiency of performing convolutional layer computation purely in software.
The embodiment of one aspect of the invention provides a neural network acceleration coprocessor, which comprises a control module, an address generation module, a multiply-accumulate module and an output saturation module;
the address generation module is used for matching storage addresses for input data and corresponding output data;
the multiply-accumulate module is used for carrying out neural network convolution operation;
the output saturation module is used for limiting the range of output data and outputting an operation result;
the control module is used for receiving an extended instruction sent by the main processor, controlling the address generation module to match addresses for the input data and the corresponding output data according to the extended instruction, reading data from the memory according to the matched addresses, controlling the multiply-accumulate module to perform convolution calculation on the read data, controlling the output saturation module to output the calculation result, and storing the output result into the memory according to the matched output data address.
Preferably, the extended instruction comprises a configuration instruction for initializing convolution parameters and an operation instruction for executing convolution operation;
the configuration instruction is a single-cycle instruction and is used for configuring addresses and parameters of input and output data and parameters of a convolution kernel;
the operation instruction is a variable multi-cycle instruction, and its number of execution cycles is determined by the convolution-operation parameters set by the preceding configuration instructions.
Preferably, in any one of the above embodiments, the configuration instruction includes the following first to sixth instructions;
the first instruction is used for setting the number of input and output tensor channels;
the second instruction is used for setting the sizes of input and output tensors;
the third instruction is used for setting the size and the step size of the convolution kernel;
the fourth instruction is used for setting the padding size and the filter weight data start address;
the fifth instruction is used for setting the initial addresses of input and output data;
the sixth instruction is used to set the bias value and the bias data start address.
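As a hedged illustration of the parameter set that these six configuration instructions carry, the following C structure groups the same fields on the host side; the field names and widths are hypothetical and chosen for this sketch only.

```c
#include <stdint.h>

/* Hypothetical mirror of the state configured by the six
 * configuration instructions (field names are illustrative). */
typedef struct {
    uint32_t in_channels, out_channels;   /* 1st: tensor channel counts    */
    uint32_t in_h, in_w, out_h, out_w;    /* 2nd: tensor sizes             */
    uint32_t kernel_size, stride;         /* 3rd: kernel size and stride   */
    uint32_t pad, weight_base;            /* 4th: padding, weight address  */
    uint32_t in_base, out_base;           /* 5th: input/output addresses   */
    uint32_t bias_value, bias_base;       /* 6th: bias, bias data address  */
} conv_cfg_t;
```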
In any of the foregoing embodiments, it is preferable that the control module further decodes the extended instruction, reads an operand according to the decoded extended instruction, and writes the read operand into a register; the operand is used for transmitting the storage address of the input data in the memory, and the input data is read from the memory according to the storage address pointed to by the operand.
The invention also provides a neural network accelerated processing system, which comprises the above coprocessor, a main processor and a memory; wherein
the memory is used for storing data;
the main processor is used for sending extended instructions;
the coprocessor is used for receiving the extended instruction sent by the main processor, reading input data from the memory according to the received extended instruction, performing neural network calculation on the input data to obtain output data, and storing the output data into the memory.
The invention also provides a neural network acceleration processing method, which is applied to the processing system and comprises the following steps:
the main processor sends an extended instruction to the coprocessor;
the coprocessor receives the extended instruction sent by the main processor, reads input data from the memory according to the received extended instruction, performs neural network operation on the read input data to obtain output data, and writes the output data into the memory;
and the main processor reads the output data stored by the coprocessor from the memory to complete the neural network algorithm processing.
In any of the above embodiments, preferably, the main processor sends extended instructions to the coprocessor to control the coprocessor to configure the initialization convolution parameters and to execute the convolution operation;
the initialization convolution parameter configuration comprises configuring, in a single cycle, the addresses and parameters of the input and output data and the parameters of the convolution kernel; the execution period of the convolution operation is determined by the convolution-operation parameters set by the preceding configuration instructions.
In any of the above embodiments, preferably, when configuring the initialization convolution parameters, the method includes the following operations:
setting the number of input and output tensor channels; setting the sizes of the input and output tensors; setting the size and stride of the convolution kernel; setting the padding size and the filter weight data start address; setting the start addresses of the input and output data; and setting the bias value and the bias data start address.
Preferably, in any one of the above embodiments, the method further includes writing each extended instruction according to the following encoding format:
a. the first bit field of the instruction is the Opcode field; b. three bits are provided to control whether the two source registers need to be read and whether the destination register needs to be written.
In any of the above embodiments, preferably, when the received extended instruction is used to read input data from the memory, the coprocessor decodes the extended instruction, reads an operand according to the decoded extended instruction, and writes the read operand into the register; the operand is used for transmitting the storage address of the input data in the memory; input data is read from the memory according to the deposit address pointed to by the operand.
Advantageous effects
1. According to the neural network accelerated processing method, coprocessor and processing system, a coprocessor is provided so that the time-consuming operations in the convolutional neural network are handled by the coprocessor, while the main processor controls the coprocessor through extended instructions to perform neural network calculation on the input data; this reduces CPU load and, compared with a pure-software implementation, improves the efficiency of the convolution operation by more than 20 times;
2. When the coprocessor is configured through the extended instructions, the method simplifies the extension of the coprocessor instruction set: 7 extended instructions are defined for initializing the parameters and executing the convolution operation. The scheme is algorithmically simple, improves the robustness and stability of the system, and meets the flexible and changing computational requirements of the algorithm.
3. When the coprocessor decodes an extended instruction, the operands can be read out directly and sent to registers; the operands are used to transfer addresses, which gives a higher read/write speed and better utilization of the buffer structure than directly reading and writing data addresses.
4. In the extended instruction encoding format, the first bit field of an instruction (the lower seven bits) is used as the Opcode field; with the instruction groups and the extra encoding space of this field, more coprocessor instructions can be encoded, greatly improving the extensibility of the coprocessor.
Detailed Description
The present invention will be described in detail below with reference to the embodiments and the attached drawings. It should be noted that the embodiments and the features of the embodiments in the present application may be combined with each other without conflict.
The following detailed description is exemplary in nature and is intended to provide further details of the invention. Unless otherwise defined, all technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention.
As shown in fig. 2, an embodiment of an aspect of the present invention provides a neural network acceleration coprocessor, which includes a control module 301, an address generation module 302, a multiply-accumulate module 303, and an output saturation module 304;
the address generation module 302 is configured to match storage addresses for input data and corresponding output data;
the multiply-accumulate module 303 is used for performing neural network convolution operation;
the output saturation module 304 is configured to limit the range of the output data and output the operation result (a clamping sketch is given after this list);
the control module 301 is configured to receive the extended instruction sent by the main processor, control the address generation module to match addresses for the input data and the corresponding output data according to the extended instruction, read data from the memory according to the matched addresses, control the multiply-accumulate module to perform convolution calculation on the read data, control the output saturation module to output the calculation result, and store the output result in the memory according to the matched output data address.
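As an example of the range limiting performed by the output saturation module 304, a common choice is to clamp accumulator results to a fixed-point output range. The signed 8-bit width below is an assumption made for illustration; the embodiment does not specify the data width.

```c
#include <stdint.h>

/* Saturate a 32-bit accumulator to the signed 8-bit output range.
 * The int8 range is an assumed example; the actual output width
 * of the saturation module is not specified in this embodiment. */
static inline int8_t saturate_i8(int32_t acc)
{
    if (acc > INT8_MAX) return INT8_MAX;
    if (acc < INT8_MIN) return INT8_MIN;
    return (int8_t)acc;
}
```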
The extended instructions comprise configuration instructions for initializing convolution parameters and an operation instruction for executing the convolution operation. The configuration instructions are single-cycle instructions and are used for configuring the addresses and parameters of the input and output data and the parameters of the convolution kernel;
the operation instruction is a variable multi-cycle instruction, and its number of execution cycles is determined by the convolution-operation parameters set by the preceding configuration instructions.
The configuration instructions include the following first to sixth instructions;
the first instruction is used for setting the number of input and output tensor channels;
the second instruction is used for setting the sizes of input and output tensors;
the third instruction is used for setting the size and the step size of the convolution kernel;
the fourth instruction is used for setting the padding size and the filter weight data start address;
the fifth instruction is used for setting the initial addresses of input and output data;
the sixth instruction is used to set the bias value and the bias data start address.
As shown in fig. 5, in this embodiment, 7 extended instructions are defined for the implementation of the convolution coprocessor, of which 6 are configuration instructions: INIT_CH for setting the number of input/output tensor channels, INIT_IM for setting the input/output tensor size, INIT_FS for setting the filter kernel size and stride, INIT_PW for setting the padding size and the filter weight data start address, INIT_ADDR for setting the input/output data start addresses, and INIT_BIAS for setting the bias value and the bias data start address; these are the instructions for initializing the convolution parameters. The parameter-initialization instructions are single-cycle instructions: when the corresponding instruction is received, its operands are read out and sent to registers for subsequent calculation.
Specifically, the control module decodes the extended instruction, reads an operand according to the decoded extended instruction, and writes the read operand into a register; the operand is used for transmitting the storage address of the input data in the memory, and the input data is read from the memory according to the storage address pointed to by the operand.
The operation instruction, which performs the convolution operation, is the LOOP instruction. LOOP is a variable multi-cycle instruction whose number of execution cycles is determined by the parameters of the convolution operation.
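The dependence of the LOOP instruction's execution time on the configured parameters can be illustrated by counting the multiply-accumulate operations implied by formula (1). The function below is only a proxy for the cycle count, since the actual number of cycles depends on the hardware microarchitecture, which this embodiment does not detail.

```c
#include <stdint.h>

/* Multiply-accumulate operations implied by formula (1):
 * one MAC per (output element x input channel x kernel tap).
 * A lower-bound proxy for LOOP's runtime, not an exact cycle model. */
uint64_t conv_mac_count(uint32_t in_ch, uint32_t out_ch,
                        uint32_t out_h, uint32_t out_w, uint32_t k)
{
    return (uint64_t)out_ch * out_h * out_w * in_ch * k * k;
}
```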
The specific definition of each instruction is shown in table 1.
Table 1. Instruction list of the convolution coprocessor
In the above table, Opcode refers to the opcode field, which uses the custom-0, custom-1, custom-2, and custom-3 instruction groups. The xs1, xs2, and xd bits control whether the two source registers need to be read and whether the destination register needs to be written, respectively. The funct7 field can be used as extra encoding space to encode more instructions, so one custom instruction group can encode 128 instructions using the funct7 field.
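The fields named here (custom-0 to custom-3 opcode groups, xs1/xs2/xd, funct7) match the RISC-V custom instruction format, so the following sketch assembles a 32-bit instruction word under the assumption that the patent follows that standard layout; the per-instruction funct7 values of Table 1 are not reproduced here.

```c
#include <stdint.h>

/* Assumed RISC-V R-type custom layout:
 * [31:25] funct7  [24:20] rs2  [19:15] rs1
 * [14] xd  [13] xs1  [12] xs2  [11:7] rd  [6:0] opcode */
#define OPC_CUSTOM0 0x0BU  /* custom-1: 0x2B, custom-2: 0x5B, custom-3: 0x7B */

static inline uint32_t enc_custom(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                                  uint32_t xd, uint32_t xs1, uint32_t xs2,
                                  uint32_t rd, uint32_t opcode)
{
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (xd << 14) | (xs1 << 13) | (xs2 << 12) | (rd << 7) | opcode;
}
```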
In this embodiment, by providing the coprocessor, after the coprocessor receives the extended instructions, it handles the time-consuming operations in the convolutional neural network and performs the neural network convolution calculation on the input data; the convolution, pooling and activation operations of the convolutional neural network can be combined flexibly, making the scheme suitable for various lightweight convolutional neural networks.
As shown in fig. 3, the present invention further provides a neural network accelerated processing system, which comprises the above coprocessor 3, a main processor 1 and a memory 2; the memory 2 is used for storing data; the main processor 1 is configured to send extended instructions; and the coprocessor 3 is used for receiving the extended instructions sent by the main processor, reading input data from the memory according to the received extended instructions, performing neural network calculation on the input data to obtain output data, and storing the output data into the memory.
As shown in fig. 4, the present invention further provides a neural network acceleration processing method, which is applied to the processing system, and includes the following steps:
S1, the main processor sends an extended instruction to the coprocessor;
S2, the coprocessor receives the extended instruction sent by the main processor, reads input data from the memory according to the received extended instruction, performs neural network operation on the read input data to obtain output data, and writes the output data into the memory;
S3, the main processor reads the output data stored by the coprocessor from the memory to complete the neural network algorithm processing.
In steps S1 and S2, the extended instructions comprise configuration instructions for initializing the convolution parameters and an operation instruction for performing the convolution operation; the configuration instructions are single-cycle instructions and are used for configuring the addresses and parameters of the input and output data and the parameters of the convolution kernel; the operation instruction is a variable multi-cycle instruction, and its number of execution cycles is determined by the convolution-operation parameters set by the preceding configuration instructions.
When configuring the initialization convolution parameters, the method comprises the following operations: setting the number of input and output tensor channels; setting the sizes of the input and output tensors; setting the size and stride of the convolution kernel; setting the padding size and the filter weight data start address; setting the start addresses of the input and output data; and setting the bias value and the bias data start address.
As shown in Table 1 above, in this embodiment, 7 extended instructions are defined for the implementation of the convolution coprocessor, of which 6 are configuration instructions: INIT_CH for setting the number of input/output tensor channels, INIT_IM for setting the input/output tensor size, INIT_FS for setting the filter kernel size and stride, INIT_PW for setting the padding size and the filter weight data start address, INIT_ADDR for setting the input/output data start addresses, and INIT_BIAS for setting the bias value and the bias data start address; these are the instructions for initializing the convolution parameters. The parameter-initialization instructions are single-cycle instructions: when the corresponding instruction is received, its operands are read out and sent to registers for subsequent calculation. The operation instruction, which performs the convolution operation, is the LOOP instruction; LOOP is a variable multi-cycle instruction whose number of execution cycles is determined by the parameters of the convolution operation.
As shown in the above table, the encoding format of the extended instructions includes: 1. the first bit field of the instruction is the Opcode field; 2. three bits are provided to control whether the source registers need to be read and whether the destination register needs to be written. Specifically, the Opcode field uses the custom-0, custom-1, custom-2, and custom-3 instruction groups. The lower seven bits (bits 0 to 6) are used as the Opcode field, and the xs1, xs2, and xd bits control whether the two source registers need to be read and whether the destination register needs to be written, respectively. The funct7 field can serve as extra encoding space for more instructions, so one custom instruction group can encode 128 instructions using the funct7 field.
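Under the same RISC-V assumption, a host program built with a RISC-V toolchain could emit one of these instructions from C via the GNU assembler's .insn directive. The funct7 value 0 and the register choices below are placeholders, since the exact encodings of Table 1 are not reproduced here.

```c
/* Hypothetical issue of a configuration instruction via GCC inline asm.
 * ".insn r" takes: opcode, funct3, funct7, rd, rs1, rs2.
 * funct3 = 0x3 sets xs1 and xs2 (read both sources, no writeback);
 * funct7 = 0x0 is a placeholder value. */
static inline void issue_init_ch(unsigned long in_ch, unsigned long out_ch)
{
    asm volatile(".insn r 0x0B, 0x3, 0x0, x0, %0, %1"
                 :: "r"(in_ch), "r"(out_ch));
}
```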
When the method is applied to the above system for testing, the embodiment shown in Fig. 6 is obtained. The CIFAR-10 data set was collected by Hinton's students Alex Krizhevsky and Ilya Sutskever for general object recognition. The data set contains 10 classes of images, namely airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck, for a total of 60000 pictures, with 6000 pictures per class. Each picture in the data set is 32 x 32 pixels with 3 RGB channels.
The processing results are shown in Table 2 below.

Table 2. Processing results (cycle counts)

Processing stage | Coprocessor-based | Pure software | Speed-up ratio
Convolution      | 4675254           | 94595789      | 20.23
In this embodiment, comparing the convolution calculation stage, the current pure-software processing method requires 94595789 cycles, whereas the coprocessor-based processing method provided by the embodiment of the present invention requires only 4675254 cycles, giving a speed-up ratio of 94595789 / 4675254 ≈ 20.23.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be appreciated by those skilled in the art that the invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The embodiments disclosed above are therefore to be considered in all respects as illustrative and not restrictive. All changes which come within the scope of or equivalence to the invention are intended to be embraced therein.