CN111860819B - Spliced and sectionable full-connection neural network reasoning accelerator and acceleration method thereof


Info

Publication number
CN111860819B
CN111860819B (application CN202010731785.5A)
Authority
CN
China
Prior art keywords
calculation
full connection
mode
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010731785.5A
Other languages
Chinese (zh)
Other versions
CN111860819A (en)
Inventor
李丽
张永刚
傅玉祥
黄延
宋文清
何书专
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202010731785.5A priority Critical patent/CN111860819B/en
Publication of CN111860819A publication Critical patent/CN111860819A/en
Application granted granted Critical
Publication of CN111860819B publication Critical patent/CN111860819B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention relates to a spliced and sectionable full-connection neural network reasoning accelerator and an acceleration method thereof. The accelerator comprises three functional modules, namely a control module, a storage module and a calculation module, and the control module has three modes: configuration mode, carrying mode and calculation mode. The invention fully exploits the parallelism of full-connection computation and the shareability of the weights, and supports multi-batch processing and multi-path parallel computation; the control module controls the realization and acceleration of the whole full connection through mode jumps. First, the invention can splice input neurons and weights stored at independent addresses, so that a full connection that would otherwise require several computations is merged into a single full-connection operation. Second, it can segment the full-connection computation by temporarily storing intermediate results, so that large-scale full-connection neural networks can be computed under the condition of limited hardware resources.

Description

Spliced and sectionable full-connection neural network reasoning accelerator and acceleration method thereof
Technical Field
The invention relates to the field of hardware implementation of artificial intelligence algorithms, and in particular to a spliced and sectionable full-connection neural network reasoning accelerator and an acceleration method thereof.
Background
Deep neural networks perform strongly in vision, speech and other target recognition tasks; the deeper the network, the stronger its learning ability, but this comes at the cost of more memory and other computing resources. The fully connected layer is usually one layer of a convolutional neural network; it is characterized by each node being connected to all nodes of the previous layer, and it acts as the classifier of the whole convolutional neural network.
Under the limitations of hardware resources and cost, existing neural network accelerators do not exploit parallelism and segmentability, which reduces computing performance and lengthens the computing period.
Disclosure of Invention
The invention aims to: under the basic condition of limited computing resources and storage resources, fully exploit multi-path parallelism, data spliceability and the segmentability of full-connection computation, and provide a spliced and sectionable full-connection neural network reasoning accelerator. A further aim is to provide an acceleration method based on this accelerator, thereby solving the problems in the prior art.
The technical scheme is as follows: a spliced and sectionable full-connection neural network reasoning accelerator comprises a control module, a storage module and a calculation module. The control module controls the whole full-connection accelerator by jumping among three modes: a configuration mode, a carrying mode and a calculation mode.
In a further embodiment, in the configuration mode, the control module not only controls the input of configuration information but also controls input neuron splicing. Input neuron splicing joins two input vectors with independent addresses into one input vector, so that two full-connection computations are merged into one. The splicing alternates neurons, that is, the neurons at corresponding positions of the two independent addresses are interleaved, and the corresponding weights are spliced in the same way. Splicing saves one round of carrying input neurons and weights, one round of reading source data and outputting results, and one start of the calculation module, which shortens the time of the whole full-connection computation.
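As an illustration of the alternating splice, the following minimal NumPy sketch interleaves two independently addressed input vectors and their weight columns. It is a behavioral model under the assumption of equal input lengths, not the patent's hardware, and all names are illustrative.

    import numpy as np

    def splice_inputs(input_a, input_b):
        # Alternately splice two independently addressed input vectors:
        # a0, b0, a1, b1, ... (equal lengths assumed for simplicity).
        spliced = np.empty(input_a.size + input_b.size, dtype=input_a.dtype)
        spliced[0::2] = input_a
        spliced[1::2] = input_b
        return spliced

    def splice_weights(w_a, w_b):
        # Interleave the weight columns so they line up with the spliced
        # neurons; the two matrices share the same output dimension.
        w = np.empty((w_a.shape[0], w_a.shape[1] + w_b.shape[1]), dtype=w_a.dtype)
        w[:, 0::2] = w_a
        w[:, 1::2] = w_b
        return w

    # Two full-connection computations merged into a single pass:
    a, b = np.arange(4.0), np.arange(4.0, 8.0)
    w_a, w_b = np.ones((3, 4)), 2 * np.ones((3, 4))
    merged = splice_weights(w_a, w_b) @ splice_inputs(a, b)
    assert np.allclose(merged, w_a @ a + w_b @ b)

Because the weight columns are interleaved exactly like the neurons, the merged pass reproduces the two separate full-connection results in one operation.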
In a further embodiment, under a fixed storage capacity, when the input and output neurons exceed a preset length, the source data area cannot hold the next complete input neuron and its corresponding weights; under the premise of multi-batch processing and multi-path parallelism, the intermediate result is therefore stored into the intermediate result area, taken out during subsequent calculation, and multiply-accumulated on the basis of the previous pass through an adder, completing the calculation of a complete full-connection algorithm.
In a further embodiment, the implementation of the full-connection algorithm on this hardware architecture (the computing PEs) supports at most 32 batches of full-connection image processing, while full-connection computation proceeds 32-way in parallel inside a single computing PE. In the concrete implementation, the 32 computing PEs of the calculation module perform 32-way parallel pipelined computation to obtain 32 outputs for each of 32 images, and ping-pong operation finally yields the complete output.
In a further embodiment, because the full-connection algorithm requires a large number of weight coefficients, in the carrying mode the data bit width for writing and reading the storage module is set to 256 bits; in the calculation mode the data format of the calculation module is 16-bit fixed point. When the data length is not an integer multiple of 16, single input neurons and the weights of each input neuron are zero-filled up to a multiple of 16, to avoid confusion between input batches and between the weight matrices corresponding to each input during carrying.
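Since one 256-bit carry word holds exactly sixteen 16-bit fixed-point values, the zero-filling rule can be sketched as follows (a hypothetical helper, not the patent's carry logic):

    def pad_to_multiple_of_16(words, pad_value=0):
        # Zero-fill a list of 16-bit fixed-point words up to the next
        # multiple of 16, so a 256-bit carry never straddles two inputs
        # or two weight rows.
        remainder = len(words) % 16
        return words + [pad_value] * ((16 - remainder) % 16)

    assert len(pad_to_multiple_of_16(list(range(20)))) == 32  # 20 -> 32
    assert len(pad_to_multiple_of_16(list(range(32)))) == 32  # already aligned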
In a further embodiment, the acceleration method that realizes the full-connection algorithm with the full-connection neural network reasoning accelerator comprises the following steps (a behavioral sketch follows step 5):
step 1) the control module controls the normal operation of the other modules through mode jumps; in the configuration mode, the control module controls the splicing of the input neurons of two independent addresses according to whether input-vector splicing is required, and at the same time completes the configuration and transmission of information;
step 2) after information configuration and transmission are completed, the control module jumps to the carrying mode; in the carrying mode, the storage module controls address generation and Bank enabling according to characteristics such as data type and data length, carries the data into the SRAM, and then sends it to the calculation module;
step 3) after the storage module sends the data into the calculation module, the control module enters the calculation mode; according to the information transmitted by the main controller, the calculation module starts the full-connection reasoning calculation after receiving the source data, weight data and input data, supporting multi-batch processing and 32-way parallel computation within a single batch; depending on whether segmented calculation is needed, the bias or an intermediate result is selected for the final addition, and the current calculation result is output in a pipelined manner;
step 4) after the calculation is completed, the output of the calculation module is stored into the intermediate result area or the final result area according to whether segmented calculation is needed;
step 5) after the whole calculation finishes or the result area is full, the control module enters the carrying mode; the storage module receives the carry-out start signal and carries out the whole result or a partial result; when the output is too large, the calculation of the whole full connection is completed through segmented carrying out.
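The five steps reduce to a sequence of mode jumps. The sketch below (illustrative Python, assuming one carry-in per segment and a single final carry-out) only shows the order in which the control module would visit the three modes:

    from enum import Enum

    class Mode(Enum):
        CONFIG = "configuration"
        CARRY = "carrying"
        COMPUTE = "calculation"

    def mode_schedule(num_segments):
        # Step 1: configuration (and optional input-vector splicing).
        trace = [Mode.CONFIG]
        for _ in range(num_segments):
            trace.append(Mode.CARRY)    # step 2: DDR -> SRAM for this segment
            trace.append(Mode.COMPUTE)  # steps 3-4: MAC plus result routing
        trace.append(Mode.CARRY)        # step 5: carry the results out
        return trace

    print([m.value for m in mode_schedule(2)])
    # ['configuration', 'carrying', 'calculation', 'carrying', 'calculation', 'carrying']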
In a further embodiment, the carrying during segmented calculation comprises the following steps (modeled in code after the steps):
step 5-1, firstly, carrying 32 batches of input neurons, wherein the carrying length of each input neuron is sram_num, the sram_num is the maximum number of input neurons which can be stored in one Bank, storing the largest number of input neurons into 32 source data banks, and when carrying the weight, carrying the sram_num columns of 32 rows of weights into 32 banks respectively, and carrying the sram_num columns into the corresponding banks in an offset manner.
Step 5-2, the algorithm now computes an intermediate result that needs temporary storage. The input neurons and weights are multiply-accumulated and then the bias is added (assume the adder bit-expansion length is N and the adder bit-interception length is M). Since no activation function is needed yet, the activation function mode is set to 0; the intermediate results are obtained and stored into the intermediate result area, and each computing PE produces 32 intermediate results.
Step 5-3, carry, for the second time, the next sram_num-length segment of each of the 32 batches of input neurons; during weight carrying, carry the next sram_num columns of the 32 weight rows into the 32 Banks respectively. The 32 MACs (multiply-accumulate units) output 32 results; as each result emerges, the 32 x 32 intermediate results of the previous pass are read out of the SRAM inside the computing PE and added to the corresponding results by the adder. Since the adder now adds the previous intermediate results, the adder bit-expansion signal and bit-interception signal are both set to M to guarantee that corresponding bits are added.
Step 5-4, the subsequent calculation steps are the same as step 5-3; on every pass except the last, the intermediate result area is in use. That is, before one complete calculation result is output, the intermediate result area always holds 32 x 32 intermediate results, which are fetched and added after the next multiply-accumulate pass.
Step 5-5, in the last pass the 32 x 32 intermediate results are again taken from the intermediate result area; the activation function mode is now set to the required activation function, the adder bit-expansion and bit-interception signals are set to M when the intermediate results are added, and each computing PE obtains its complete final 32 outputs, which are stored into the 32 result Banks respectively.
The length of the output vector produced in this way is 32. When the weight matrix has more than 32 rows, the source data must be updated and the carrying repeated; finally, the segmentation method realizes the calculation of a large-scale full-connection algorithm under limited hardware resources.
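Numerically, steps 5-1 to 5-5 amount to a chunked multiply-accumulate with a parked partial sum. The sketch below models one 32-output row block in NumPy; ReLU stands in for "the required activation function", and sram_num plays the role of the per-Bank capacity (an illustrative model, not the RTL):

    import numpy as np

    def segmented_fc(w, x, bias, sram_num, act=lambda v: np.maximum(v, 0.0)):
        # Process the input vector and weight columns sram_num at a time,
        # parking partial sums in an "intermediate result area" and applying
        # the activation only on the last pass (steps 5-1 to 5-5).
        intermediate = np.zeros(w.shape[0])
        for start in range(0, x.size, sram_num):
            cols = slice(start, min(start + sram_num, x.size))
            partial = w[:, cols] @ x[cols]     # the 32-way MAC in hardware
            if start == 0:
                partial += bias                # bias is added on the first pass
            intermediate += partial            # accumulate with stored results
        return act(intermediate)               # activation on the last pass

    rng = np.random.default_rng(0)
    w, x, b = rng.normal(size=(32, 100)), rng.normal(size=100), rng.normal(size=32)
    assert np.allclose(segmented_fc(w, x, b, sram_num=16),
                       np.maximum(w @ x + b, 0.0))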
The beneficial effects are that: the spliced and sectionable full-connection neural network reasoning accelerator comprises three functional modules, namely a control module, a storage module and a calculation module, and the control module has three modes: configuration mode, carrying mode and calculation mode. The invention fully exploits the parallelism of full-connection computation and the shareability of the weights, and supports multi-batch processing and multi-path parallel computation; the control module controls the realization and acceleration of the whole full connection through mode jumps. First, the invention can splice input neurons and weights stored at independent addresses, so that a full connection that would otherwise require several computations is merged into a single full-connection operation; second, it can segment the full-connection computation by temporarily storing intermediate results, so that large-scale full-connection neural networks can be computed under limited hardware resources.
Drawings
Fig. 1 is a schematic diagram of a fully connected neural network reasoning calculation.
FIG. 2 is a diagram of the spliced and sectionable full-connection neural network reasoning accelerator hardware architecture.
Fig. 3 is a schematic diagram of input neuron stitching.
Fig. 4 is a schematic diagram of a weight matrix splice.
FIG. 5 is a schematic diagram of an internal computing process of computing PE.
Fig. 6 is a schematic diagram of the storage partitioning.
Fig. 7 is a schematic diagram of small-point ping-pong.
Fig. 8 is a schematic diagram of large-point ping-pong.
FIG. 9 is a schematic diagram of a segment calculation scheme.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the invention may be practiced without one or more of these details. In other instances, well-known features have not been described in detail in order to avoid obscuring the invention.
From fig. 1 it can be seen that the fully connected layer mainly involves multiply-accumulate operations. Since the main operator of the full connection is multiply-accumulate, a large amount of computation can be parallelized across paths and pipelined across stages, improving computing efficiency and saving time. Therefore, under the limitations of hardware resources and cost, to design a neural network accelerator with good performance and low power consumption and to cope with the huge calculation amount of the neural network, the parallelism of the algorithm and the segmentability of the fully connected network are fully exploited; at the same time, to improve performance and reduce the computing period, the spliceability of the fully connected network can also be exploited.
The invention provides a hardware implementation of a spliceable and segmentable full-connection neural network algorithm, which, under the basic condition of limited computing and storage resources, fully exploits multi-path parallelism, data spliceability and the segmentability of full-connection computation. The specific technical implementation scheme is as follows:
when the main control controls the input neurons to splice, a vector splicing mark is added into the API: vector_latching_flag. When vector_latching_flag=0, no splicing is required. And when the splicing mark is equal to 1, the input vector is spliced. Independent address input vector splicing is supported, and only one input_a is configured when input vector splicing is not needed. (two inputs are respectively denoted by input_a and input_b and have lengths of len_a and len_b), when the transfer input is a neuron, the data in DDR is formed by splicing the numbers in two input addresses, the data content is shown in figure 3, namely, when the data is transferred, the data with the length of len_a in input_a is transferred, then the data with the length of len_b in input_b is transferred, and then the data with the length of len_a in input_a is transferred (the address of the data is the address after the length of len_a is transferred); since the inputs are stitched, the weights corresponding to the inputs must also be stitched accordingly. The splicing mode is shown in fig. 4;
the fully connected neural network accelerator hardware architecture is shown in fig. 2. The system comprises a control module, a storage module and a calculation module. The control module has three modes: configuration mode, transport mode, calculation mode. The function required by the full-connection algorithm is realized mainly by using a calculation module, namely a multiply-accumulate unit and an addition unit in a calculation PE, wherein the reasoning parallelism is 32, and the full-connection calculation is realized by 32 MAC parallel calculation.
The hardware implementation of the spliced and sectionable full-connection neural network algorithm comprises the following modules:
(1) Control module
The control module controls the execution of the algorithm flow. In the configuration mode, it configures the storage module and the calculation module to form the full-connection calculation function; in the carrying mode, according to the algorithm execution-stage parameters, it is responsible for starting and controlling the addresses for reading source data, weights and intermediate results, and simultaneously controls and starts the result-writing and result-reading functions; in the calculation mode, the full-connection computation proceeds according to the algorithm execution-stage parameters and whether segmentation is required.
(2) Storage module
In the carrying mode, the storage module carries the source data, weights and bias from DDR into the SRAM according to the start signal of the control module and the configuration information distributed by the main control module. Data reading is divided into address-generation submodules for reading input neurons, reading weights, reading and writing intermediate results (when intermediate results must be temporarily stored), and writing results; each generates the corresponding read/write addresses according to the algorithm execution stage, the ping-pong flag, and the mapping of data in the storage array. The 32 paths of source data read from the source data area are sent to the individual computing PEs, while the weight and bias data are shared by all computing PEs. After the calculation finishes, the 32 PE calculation results are received and sent into the corresponding intermediate result area or final result area; a result distributed to the intermediate result area is accumulated with the next intermediate result. The result-reading function is started according to the start information transmitted by the control module.
(3) Calculation module
The interconnection structure inside a computing PE for the full-connection operation is shown in fig. 5. The full connection is realized mainly by multiply-accumulate operations using the MAC units inside the computing PE: the multiplier input of the MAC is multiplied by the weight value from the weight matrix, the adder accumulates the products of each group, and finally the adder adds either an intermediate result or the bias. To prevent the multiply-accumulate results from emerging out of step, the source data must be internally delayed beat by beat to form 32 paths of source data with different delays, which are sent to each computing PE; likewise, the 32 input paths of weight data must each be delayed and then sent to the computing PE.
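The beat-by-beat staggering can be modeled as follows; this is a behavioral reading of fig. 5 (not cycle-accurate), meant only to show why delaying MAC k one beat behind MAC k-1 makes the 32 sums leave the pipeline on consecutive beats:

    import numpy as np

    def pe_pipeline(weights, x, bias):
        # MAC k consumes source word t on beat t + k, i.e. one beat behind
        # MAC k-1; after the drain time all 32 accumulators hold their sums.
        n_mac, n_in = weights.shape
        acc = np.zeros(n_mac)
        for beat in range(n_in + n_mac):
            for k in range(n_mac):
                t = beat - k
                if 0 <= t < n_in:
                    acc[k] += weights[k, t] * x[t]
        return acc + bias  # the shared adder appends the bias (or an
                           # intermediate result when segmenting)

    rng = np.random.default_rng(1)
    w, x, b = rng.normal(size=(32, 8)), rng.normal(size=8), rng.normal(size=32)
    assert np.allclose(pe_pipeline(w, x, b), w @ x + b)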
The storage ping-pong scheme of the spliced and sectionable full-connection neural network accelerator is as follows:
the input neuron and the output neuron need to ping-pong up and down and ping-pong left and right by weight when in small points; when the output is greater than 32, the weight Bank needs to ping-pong left and right, and the specific scheme is as follows (assuming that the data capacity stored by one Bank is sram_num):
under the small dot condition, the input is smaller than 1/2sram_num, the output is smaller than 32, at the moment, the input Bank and the output Bank are ping-pong up and down according to depth, batch processing is supported, and a ping-pong schematic diagram is shown in fig. 7.
In the large-point case, when the input is greater than sram_num/2 and less than or equal to sram_num and the output is greater than 32, the source data of a single image can still be stored in one Bank and a group of weights fits in one weight Bank, but the weights must ping-pong; intermediate results need no temporary storage and a group of outputs is completed at a time. The ping-pong scheme is shown in fig. 8.
The big-point case, i.e. the case requiring segmented calculation, is shown in fig. 9. When the input is larger than sram_num, since every group of outputs is connected to every group of inputs, a group of outputs cannot be calculated in one pass and intermediate results must be temporarily stored. When the input neurons of the next sram_num length and their weights are to be calculated, the previously stored intermediate results are taken out and sent to the MAC. After the 32 rows of the sram_num-length weight-matrix segment have been calculated, the MAC must multiply-accumulate the remaining rows of the weight matrix corresponding to the sram_num-length source data with that source data; the intermediate results are stored in the SRAM of the intermediate result area, whose size is pe_sram_num.
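The three cases reduce to a dispatch on the input and output lengths. The thresholds below are taken from the text; the function itself is only a sketch of the selection, not the actual control logic:

    def storage_scheme(input_len, output_len, sram_num):
        # Select the ping-pong scheme described above (figs. 7-9).
        if input_len < sram_num // 2 and output_len < 32:
            return "small point: input/output Banks ping-pong up-down by depth"
        if input_len <= sram_num and output_len > 32:
            return "large point: weight Banks ping-pong left-right, no temp store"
        return "segmented: park intermediates in the pe_sram_num-deep area"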
The implementation flow of the full connection algorithm is as follows:
fig. 1 is a fully connected calculation principle, which is implemented by multiplying a weight matrix by an input vector and then adding an offset, and in the implementation process, the ping-pong operation of carrying-operation needs to be fully considered.
Fig. 2 is the overall hardware architecture of the fully connected neural network accelerator; following the mode jumps, the modules cooperate to perform configuration information input, data carrying, computation, output and the other operations. The main control module is responsible for splicing the input vectors; the specific splicing manner is shown in figs. 3 and 4. Let input neuron A have length LA, input neuron B length LB, and output neuron C length LC; the weight matrix is then (LA+LB) x LC, and the number of input neurons after vector splicing is L = LA + LB.
Fig. 5 shows the specific calculation flow inside a single computing PE. The fully connected computation is carried out by the calculation module, which consists of 32 computing PEs containing 32 MACs; the 32 MACs execute 16-bit parallel multiply-accumulate calculations and perform the full-connection operation according to whether segmented calculation is used.
FIG. 6 is a schematic diagram of the storage of source data, weights, biases and results in the SRAM. The input neurons are one-dimensional vectors, so one Bank stores one group of vectors, which are sent into the MACs to be multiply-accumulated with the weights. The weights are fetched, delayed beat by beat, and placed into the 32 MACs of the calculation module to be multiply-accumulated with the input data; the bias is then taken from the bias Bank, the delayed MACs output data beat by beat with the bias added, and 32 groups of outputs are obtained in sequence. Since the weights can be shared, the computing PEs share the weights.
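One possible reading of the fig. 6 partitioning, written out as data; the Bank counts follow the text, while the role descriptions are a paraphrase rather than the patent's exact map:

    # Assumed layout of the on-chip SRAM regions (illustrative only):
    SRAM_REGIONS = [
        ("source data area",  32, "one input vector per Bank, one Bank per PE"),
        ("weight area",       32, "sram_num columns of 32 weight rows, shared"),
        ("bias area",          1, "bias words, read out beat by beat"),
        ("intermediate area", 32, "pe_sram_num partial sums per PE (segmented)"),
        ("result area",       32, "32 outputs per PE, ping-ponged out"),
    ]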
Figs. 7 and 8 are the ping-pong schematics of the concrete implementation. The input Banks ping-pong up and down by depth, supporting multi-batch processing. For large points, only the weight Banks need to ping-pong left and right: the source data is multiplied once with the corresponding weights in the weight Bank, the intermediate result is temporarily stored in the SRAM of the intermediate result area, and the next source data is loaded after the weights corresponding to this group of source data have been calculated. After a group of multiply-accumulates is calculated, the bias is held in one Bank, and when the MACs in the PE output beat by beat they fetch the bias from that Bank and send it to the adder for addition. Finally the obtained results are stored into the output Banks.
When the full connection layer performs calculation, the calculation flow is as follows:
broadcasting the weight to each computing PE; for each row, the input neuron values beat left to right into 32 MACs, the summation is generated at the local PE.
Take the following fully connected layer as an example: 3 input neurons, 3 nodes in the second layer, and a processing batch of 3.
Taking the computation of PE0 as an example, the input neurons of fig. 1 flow into the computing PE together with the weights at the corresponding positions (given as formula images in the original). Each group of weights is replayed into a Bank, and each MAC completes the calculation of one point of the output neuron; with the 32 MACs fed in turn, the results are output from the MACs in sequence and sent, together with the bias, to the adder for addition.
The weight values can be continuously multiplexed; and because there are multiple output points, the input neurons are also continuously multiplexed. Carrying proceeds in image order.
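As a worked version of this small example (all numbers invented for illustration; the patent's figures use symbolic entries), each PE multiply-accumulates its own image against the shared weights:

    import numpy as np

    w = np.array([[1., 0., 2.],
                  [0., 1., 1.],
                  [2., 1., 0.]])      # second-layer weights, broadcast to PEs
    bias = np.array([0.5, -1.0, 0.0])
    batch = np.array([[1., 2., 3.],   # three input images, one per computing PE
                      [4., 5., 6.],
                      [7., 8., 9.]])
    print(batch @ w.T + bias)         # three outputs per image, bias added last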
Fig. 9 is a schematic diagram of the carrying when segmented calculation is required; the calculation flow is as follows:
step 1), firstly, conveying 32 batches of input neurons, wherein the conveying length of each input neuron is sram_num, storing the input neurons in 32 source data banks, and when the input neurons are conveyed in weight, respectively storing the sram_num columns with 32 rows of weights in 32 banks, and carrying the input neurons in the corresponding banks in an offset manner.
Step 2) the algorithm calculates an intermediate result and needs temporary storage. The input neurons and weights are multiplied and accumulated, and then bias is added (assuming that the length of the expansion bit of the adder is N and the length of the interception bit of the adder is M at the moment), because an activation function is not needed, the mode of the activation function is set to 0 at the moment, intermediate results are obtained, the intermediate results are stored in an intermediate result area, and 32 intermediate results are obtained by each calculation PE.
Step 3) carrying out second carrying on the neurons with the length of the next sram_num of each input neuron of 32 batches of input neurons, wherein during weight carrying, carrying the next sram_num column of 32 rows of weights respectively into 32 banks, 32 MAC (media access control) output 32 results, reading out the 32 x 32 intermediate results calculated last time from SRAM (static random access memory) in the calculated PE when the calculated result is obtained, and correspondingly adding the calculated result to the calculated result through an adder, wherein the adder adds the last intermediate result at the moment to ensure the addition of corresponding bits, so that an adder bit expansion signal, an adder bit interception signal and the like at the moment are set to be M.
The calculation steps after step 4) are the same as step 3, and the intermediate result area exists in the calculation of the non-last time. I.e. the intermediate result area always holds 32 x 32 intermediate results before the output of one complete calculation result and is fetched and added after the next multiply-accumulate operation.
And 5) still taking 32 x 32 intermediate results from the intermediate result area in the last time, setting the activation function mode as a required activation function at the moment, setting the adder bit-expanding signal and the adder bit-intercepting signal M when adding the intermediate results, obtaining complete final 32 outputs by each calculated PE at the moment, and storing the complete final 32 outputs into 32 result banks respectively.
The length of the output vector outputted at this time is 32. When the weight matrix is connected with 32 rows, the source data needs to be updated, so that the carrying is required to be repeated once, and finally, the calculation of a large-scale full-connection algorithm is realized under the condition of limited hardware resources by a segmentation method.
In summary, the invention is a spliced and sectionable full-connection neural network reasoning accelerator. First, it can splice input neurons and weights at independent addresses, so that a full connection that would otherwise require several computations is merged into one full-connection operation; second, it can segment the full-connection computation by temporarily storing intermediate results, so that large-scale full-connection neural networks can be computed under limited hardware resources. The invention mainly comprises a control module, a storage module and a calculation module, and the control module has three modes: configuration mode, carrying mode and calculation mode. In the configuration mode, the control module performs data distribution, information configuration, information transmission and the control of input neuron splicing; in the carrying mode, the storage module carries data from DDR to SRAM, carries the results out of SRAM, reads data from SRAM into the calculation module, and controls the output of intermediate or final results; the calculation module consists of 32 computing PEs and performs the full-connection computation in the calculation mode. In addition, the invention fully exploits the parallelism of full-connection computation and the shareability of weights, and supports multi-batch processing and multi-path parallel computation; the control module controls the realization and acceleration of the whole full connection through mode jumps. The accelerator supports splicing of input neurons; segmentable full-connection computation; multi-batch processing; multi-path parallel computation; configurable network scale and parameters; the four activation functions ReLU, Sigmoid, Tanh and Softmax; and ping-pong operation. Its full-connection reasoning is fast, and it can support large-scale fully connected neural networks.
As described above, although the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limiting the invention itself. Various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A spliced and sectionable full-connection neural network reasoning accelerator, characterized by comprising the following modules:
the control module, used for controlling the execution of the algorithm flow;
the storage module is used for carrying source data, weight and bias in the DDR into the SRAM according to the starting signal of the control module and the configuration information distributed by the main control module;
the computing module is used for carrying out full-connection computation;
the control module jumps through a configuration mode, a carrying mode and a calculation mode, so that the whole full-connection accelerator is controlled;
in the configuration mode, the control module not only controls the input of configuration information but also controls input neuron splicing: the input vectors of two independent addresses are spliced into one input vector, so that two full-connection computations are merged into one; the input neurons are spliced alternately, that is, the neurons at corresponding positions of the independent addresses are interleaved, and the corresponding weights are spliced at the same time;
under a fixed storage capacity, when the input and output neurons exceed a preset length, and under the premise of multi-batch processing and multi-path parallelism, the source data area does not store the next complete input neuron and its corresponding weights; the intermediate result is stored in the intermediate result area, taken out in subsequent calculation, and multiply-accumulated on the basis of the previous pass through the adder, completing the calculation of a complete full-connection algorithm.
2. The spliced and sectionable full-connection neural network reasoning accelerator of claim 1, wherein in the configuration mode the control module controls the splicing of input neurons and weights; in the carrying mode the control module accesses data according to the configuration information and the data characteristics; in the calculation mode the 32 computing PEs interconnect their weight and bias access paths, sharing the weights and biases for computation.
3. The spliced and sectionable full-connection neural network reasoning accelerator of claim 2, wherein, while jumping among the configuration mode, the carrying mode and the calculation mode, the control module starts the segmented calculation mode when one complete full-connection calculation cannot be performed at once.
4. The spliced and sectionable full-connection neural network reasoning accelerator of claim 2, wherein the computing PEs support at most 32 batches of full-connection image processing, while full-connection computation proceeds 32-way in parallel inside a single computing PE; the 32 computing PEs of the calculation module perform 32-way parallel pipelined computation to obtain 32 outputs of 32 images, and ping-pong operation finally yields the complete output.
5. The spliced and sectionable full-connection neural network reasoning accelerator of claim 1, wherein in the carrying mode the data bit width for writing and reading the storage module is set to 256 bits; in the calculation mode the data format of the calculation module is 16-bit fixed point, and when the data length is not an integer multiple of 16, zero filling is used to pad single input neurons and the weights of each input neuron to a multiple of 16.
6. An acceleration method of the spliced and sectionable full-connection neural network reasoning accelerator, characterized by comprising the following steps:
step 1, the control module controls the normal operation of the other modules through mode jumps; in the configuration mode, the control module controls the splicing of the input neurons of two independent addresses according to whether input-vector splicing is required, and at the same time completes the configuration and transmission of information;
step 2, after information configuration and transmission are completed, the control module jumps to the carrying mode; in the carrying mode, the storage module controls address generation and Bank enabling according to characteristics such as data type and data length, carries the data into the SRAM, and then sends it to the calculation module;
step 3, after the storage module sends the data into the calculation module, the control module enters the calculation mode; according to the information transmitted by the main controller, the calculation module starts the full-connection reasoning calculation after receiving the source data, weight data and input data, supporting multi-batch processing and 32-way parallel computation within a single batch; depending on whether segmented calculation is needed, the bias or an intermediate result is selected for the final addition, and the current calculation result is output in a pipelined manner;
step 4, after the calculation is completed, the output of the calculation module is stored into the intermediate result area or the final result area according to whether segmented calculation is needed;
step 5, after the whole calculation finishes or the result area is full, the control module enters the carrying mode; the storage module receives the carry-out start signal and carries out the whole result or a partial result; when the output is too large, the calculation of the whole full connection is completed through segmented carrying out;
in the configuration mode, the control module not only controls the input of configuration information but also controls input neuron splicing: the input vectors of two independent addresses are spliced into one input vector, so that two full-connection computations are merged into one; the input neurons are spliced alternately, that is, the neurons at corresponding positions of the independent addresses are interleaved, and the corresponding weights are spliced at the same time;
under a fixed storage capacity, when the input and output neurons exceed a preset length, and under the premise of multi-batch processing and multi-path parallelism, the source data area does not store the next complete input neuron and its corresponding weights; the intermediate result is stored in the intermediate result area, taken out in subsequent calculation, and multiply-accumulated on the basis of the previous pass through the adder, completing the calculation of a complete full-connection algorithm.
7. The acceleration method of the spliced and sectionable full-connection neural network reasoning accelerator of claim 6, wherein the carrying during segmented calculation further comprises the following steps:
step 5-1, carry 32 batches of input neurons, the carried length of each input neuron being sram_num, where sram_num is the maximum number of input neurons that one Bank can store, and store them in the 32 source data Banks; when carrying the weights, carry sram_num columns of 32 rows of weights into the 32 Banks respectively, and carry the bias into its corresponding Bank;
step 5-2, temporary storage of intermediate results: multiply-accumulate the input neurons and the weights and add the bias, assuming the adder bit-interception length is M; set the activation function mode to 0, obtain the intermediate results and store them into the intermediate result area, each computing PE producing 32 intermediate results;
step 5-3, carry, for the second time, the remaining sram_num-length neurons of each of the 32 batches of input neurons; during weight carrying, carry the next sram_num columns of the 32 weight rows into the 32 Banks respectively; the 32 MACs (multiply-accumulate units) output 32 results, and as each result emerges, the 32 x 32 intermediate results of the previous pass are read out of the SRAM inside the computing PE and added to the corresponding results by the adder, the adder bit-expansion signal and bit-interception signal being set to M;
step 5-4, repeat step 5-3; on every pass except the last, the intermediate result area is in use;
step 5-5, enter the last pass: take the 32 x 32 intermediate results out of the intermediate result area, set the activation function mode to the required activation function, set the adder bit-expansion and bit-interception signals to M when the intermediate results are added, and each computing PE obtains its complete final 32 outputs, which are stored into the 32 result Banks respectively.
CN202010731785.5A 2020-07-27 2020-07-27 Spliced and sectionable full-connection neural network reasoning accelerator and acceleration method thereof Active CN111860819B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010731785.5A CN111860819B (en) 2020-07-27 2020-07-27 Spliced and sectionable full-connection neural network reasoning accelerator and acceleration method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010731785.5A CN111860819B (en) 2020-07-27 2020-07-27 Spliced and sectionable full-connection neural network reasoning accelerator and acceleration method thereof

Publications (2)

Publication Number Publication Date
CN111860819A CN111860819A (en) 2020-10-30
CN111860819B (en) 2023-11-07

Family

ID=72947220

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010731785.5A Active CN111860819B (en) 2020-07-27 2020-07-27 Spliced and sectionable full-connection neural network reasoning accelerator and acceleration method thereof

Country Status (1)

Country Link
CN (1) CN111860819B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113128688B (en) * 2021-04-14 2022-10-21 北京航空航天大学 General AI parallel reasoning acceleration structure and reasoning equipment
CN115827211A (en) * 2021-09-17 2023-03-21 华为技术有限公司 Near memory computing accelerator, dual inline memory module and computing device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101789846A (en) * 2010-02-26 2010-07-28 联芯科技有限公司 Dissociation rate matching method and device
CN105701460A (en) * 2016-01-07 2016-06-22 王跃明 Video-based basketball goal detection method and device
CN109934339A (en) * 2019-03-06 2019-06-25 东南大学 A kind of general convolutional neural networks accelerator based on a dimension systolic array
CN110543936A (en) * 2019-08-30 2019-12-06 北京空间飞行器总体设计部 Multi-parallel acceleration method for CNN full-connection layer operation
CN110780923A (en) * 2019-10-31 2020-02-11 合肥工业大学 Hardware accelerator applied to binary convolution neural network and data processing method thereof
CN111008335A (en) * 2019-12-20 2020-04-14 腾讯科技(深圳)有限公司 Information processing method, device, equipment and storage medium
CN111079582A (en) * 2019-11-29 2020-04-28 安徽七天教育科技有限公司 Image recognition English composition running question judgment method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3640856A1 (en) * 2018-10-19 2020-04-22 Fujitsu Limited A method, apparatus and computer program to carry out a training procedure in a convolutional neural network

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101789846A (en) * 2010-02-26 2010-07-28 联芯科技有限公司 Dissociation rate matching method and device
CN105701460A (en) * 2016-01-07 2016-06-22 王跃明 Video-based basketball goal detection method and device
CN109934339A (en) * 2019-03-06 2019-06-25 东南大学 A kind of general convolutional neural networks accelerator based on a dimension systolic array
CN110543936A (en) * 2019-08-30 2019-12-06 北京空间飞行器总体设计部 Multi-parallel acceleration method for CNN full-connection layer operation
CN110780923A (en) * 2019-10-31 2020-02-11 合肥工业大学 Hardware accelerator applied to binary convolution neural network and data processing method thereof
CN111079582A (en) * 2019-11-29 2020-04-28 安徽七天教育科技有限公司 Image recognition English composition running question judgment method
CN111008335A (en) * 2019-12-20 2020-04-14 腾讯科技(深圳)有限公司 Information processing method, device, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Survey of Model Compression and Acceleration for Deep Neural Networks; Yu Cheng et al.; arXiv; 1-10 *
Research on an automated software development environment for a programmable neural network processor (可编程神经网络处理器自动化软件开发环境研究); 陈祥芬; China Master's Theses Full-text Database, Information Science and Technology (No. 02); I138-463 *

Also Published As

Publication number Publication date
CN111860819A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
CN108241890B (en) Reconfigurable neural network acceleration method and architecture
CN108205702B (en) Parallel processing method for multi-input multi-output matrix convolution
CN109948774B (en) Neural network accelerator based on network layer binding operation and implementation method thereof
CN109886400B (en) Convolution neural network hardware accelerator system based on convolution kernel splitting and calculation method thereof
CN112465110B (en) Hardware accelerator for convolution neural network calculation optimization
CN107301456B (en) Deep neural network multi-core acceleration implementation method based on vector processor
CN111860819B (en) Spliced and sectionable full-connection neural network reasoning accelerator and acceleration method thereof
CN107203807B (en) On-chip cache bandwidth balancing method, system and device of neural network accelerator
CN107689948A (en) Efficient data memory access managing device applied to neural network hardware acceleration system
CN107704916A (en) A kind of hardware accelerator and method that RNN neutral nets are realized based on FPGA
CN107229967A (en) A kind of hardware accelerator and method that rarefaction GRU neutral nets are realized based on FPGA
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
US20230068450A1 (en) Method and apparatus for processing sparse data
CN116541647A (en) Operation accelerator, processing method and related equipment
CN111445012A (en) FPGA-based packet convolution hardware accelerator and method thereof
CN107085562A (en) A kind of neural network processor and design method based on efficient multiplexing data flow
CN117032807A (en) AI acceleration processor architecture based on RISC-V instruction set
CN109190755B (en) Matrix conversion device and method for neural network
CN114724595B (en) Convolution operation accelerator and convolution operation method
CN116451755A (en) Acceleration method and device of graph convolution neural network and electronic equipment
CN116090518A (en) Feature map processing method and device based on systolic operation array and storage medium
CN115587622A (en) Multilayer perceptron device based on integrative device of photoelectric memory
CN111738432B (en) Neural network processing circuit supporting self-adaptive parallel computation
CN113962378A (en) Convolution hardware accelerator based on RS data stream and method thereof
CN116167419A (en) Architecture compatible with N-M sparse transducer accelerator and acceleration method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant