CN114968602A - Architecture, method and apparatus for a dynamically resource-allocated neural network chip


Info

Publication number: CN114968602A (granted as CN114968602B)
Application number: CN202210914053.9A
Original language: Chinese (zh)
Inventor: 贺新 (He Xin)
Applicant and current assignee: Chengdu Picture Film and Television Technology Co., Ltd.
Legal status: Granted, Active

Classifications

    • G06F9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F9/5016: Allocation of resources to service a request, the resource being the memory
    • G06F9/544: Interprogram communication using buffers, shared memory or pipes
    • G06N3/045: Neural network architectures, e.g. interconnection topology; combinations of networks
    • G06N3/063: Physical realisation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/08: Learning methods
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

An embodiment of the present application discloses an architecture, a method, and a device for a dynamically resource-allocated neural network chip, relating to the technical field of artificial intelligence. The architecture is provided with an on-chip resource optimization module, so that time arrays under different resource configuration modes can be obtained according to the neural network configuration parameters. Because the time arrays reflect the operating efficiency of the different resource configuration modes, the target resource configuration mode determined from them is a better-performing resource configuration mode, and processing data according to the target resource configuration mode naturally improves the data processing efficiency of the chip.

Description

Architecture, method and apparatus for a dynamically resource-allocated neural network chip
Technical Field
The present application relates to the field of artificial intelligence technology, and in particular to an architecture, a method, and an apparatus for a dynamically resource-allocated neural network chip.
Background
A neural network (in full, Artificial Neural Network, abbreviated ANN) is widely used in the field of machine learning, and especially in deep learning. Whether an integrated circuit chip follows an SoC, GPU, or ASIC structure, its design framework and the cost, functions, and methods of its concrete realization are determined at the very beginning of the design; the kinds and quantities of the various control, arithmetic, and storage units are fixed during detailed design; and once the design is complete the chip is handed to a fab for manufacturing and finally delivered to users. After the chip design is completed, the computing resources available in the chip are all fixed, so how to improve the data processing efficiency of a neural network under the chip's limited resources becomes a technical problem that urgently needs to be solved.
It is noted that the information disclosed in this background section is only intended to enhance understanding of the background of the application, and therefore may include information that does not constitute prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
The embodiments of the present application provide an architecture, a method, and a device for a dynamically resource-allocated neural network chip that effectively improve the data processing efficiency of a neural network.
In one aspect, an embodiment of the present application provides an architecture of a dynamically resource-allocated neural network chip, comprising:
a data input cache module, configured to cache first neural network data from the on-chip data bus and send the first neural network data to the neural network computation execution module;
a data output cache module, configured to cache second neural network data processed by the neural network computation execution module and send the second neural network data to the on-chip data bus;
an on-chip resource optimization module, configured to obtain a plurality of first time arrays under different resource configuration modes according to the different parameter values taken by the neural network configuration parameters in those modes, obtain a plurality of second time arrays based on the plurality of first time arrays, and determine a target resource configuration mode based on the plurality of second time arrays, wherein each first time array comprises a first time for reading sliced data from an external storage module and a second time required for computing the sliced data, and each second time array comprises a third time for which reading all the data occupies the on-chip data bus and a fourth time consumed in computing all the data of the current network layer;
and an on-chip resource allocation module, configured to configure the data input cache module, the neural network computation execution module, and the data output cache module according to the target resource configuration mode.
Optionally, the neural network configuration parameters include fixed parameters and configurable parameters; the on-chip resource optimization module is further configured to obtain a plurality of first time arrays according to the fixed parameters and the plurality of groups of configurable parameter values in different resource configuration modes.
Optionally, the first neural network data and the second neural network data are both image data.
Optionally, the fixed parameters include:
input image data width and input image data height;
number of input data channels and number of output data channels;
filter size;
number of on-chip parallel computing units and number of computing unit groups;
on-chip computation result cache size;
batch data read/write length of the external storage module;
time consumed by on-chip data exchange.
Optionally, the configurable parameters include: actual input image width, actual input image height, and number of image data buffer sets.
Optionally, the on-chip resource allocation module is respectively in communication connection with the data input cache module, the neural network computation execution module, the data output cache module, and the off-chip storage interface module through an on-chip configuration bus, and the off-chip storage interface module is in communication connection with the external storage module.
In another aspect, an embodiment of the present application provides a data processing method based on a dynamically resource-allocated neural network chip, comprising:
obtaining a plurality of first time arrays under different resource configuration modes according to the different parameter values taken by the neural network configuration parameters in those modes; wherein each first time array comprises a first time for reading sliced data from the external storage module and a second time required for computing the sliced data;
obtaining a plurality of second time arrays based on a plurality of first time arrays;
determining a target resource configuration mode based on the plurality of second time arrays; wherein each second time array comprises a third time for which reading all the data occupies the on-chip data bus and a fourth time consumed in computing all the data of the current network layer;
and configuring the data input cache module, the neural network computation execution module, and the data output cache module according to the target resource configuration mode.
Optionally, the neural network configuration parameters include fixed parameters and configurable parameters, and the step of obtaining a plurality of first time arrays under different resource configuration modes according to the different parameter values taken by the neural network configuration parameters in those modes comprises:
and obtaining a plurality of first time arrays according to the fixed parameters and a plurality of groups of configurable parameter values under different resource configuration modes.
Optionally, the sliced data is image data.
In still another aspect, an embodiment of the present application provides a computer device comprising a neural network chip, where the neural network chip includes the foregoing architecture.
The embodiments of the present application provide an architecture, a method, and a device for a dynamically resource-allocated neural network chip. The architecture comprises: a data input cache module, configured to cache first neural network data from the on-chip data bus and send it to the neural network computation execution module; a data output cache module, configured to cache second neural network data processed by the neural network computation execution module and send it to the on-chip data bus; an on-chip resource optimization module, configured to obtain a plurality of first time arrays under different resource configuration modes according to the different parameter values taken by the neural network configuration parameters in those modes, obtain a plurality of second time arrays based on the first time arrays, and determine a target resource configuration mode based on the second time arrays, wherein each first time array comprises a first time for reading sliced data from an external storage module and a second time required for computing the sliced data, and each second time array comprises a third time for which reading all the data occupies the on-chip data bus and a fourth time consumed in computing all the data of the current network layer; and an on-chip resource allocation module, configured to configure the data input cache module, the neural network computation execution module, and the data output cache module according to the target resource configuration mode. In other words, because the architecture is equipped with the on-chip resource optimization module, time arrays under different resource configuration modes can be obtained from the neural network configuration parameters; and because these time arrays reflect the operating efficiency of each mode, the target resource configuration mode determined from them is a better-performing mode, so processing data under it naturally improves the data processing efficiency of the chip.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained by those of ordinary skill in the art from these drawings without creative effort.
FIG. 1 is a schematic structural diagram of a computer device according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an architecture of a dynamically resource-allocated neural network chip according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of a data processing method based on a dynamically resource-allocated neural network chip according to an embodiment of the present application.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The main solution of the embodiments of the present application is as follows. An architecture, a method, and a device for a dynamically resource-allocated neural network chip are provided, wherein the architecture comprises: a data input cache module, configured to cache first neural network data from the on-chip data bus and send it to the neural network computation execution module; a data output cache module, configured to cache second neural network data processed by the neural network computation execution module and send it to the on-chip data bus; an on-chip resource optimization module, configured to obtain a plurality of first time arrays under different resource configuration modes according to the different parameter values taken by the neural network configuration parameters in those modes, obtain a plurality of second time arrays based on the first time arrays, and determine a target resource configuration mode based on the second time arrays, wherein each first time array comprises a first time for reading sliced data from an external storage module and a second time required for computing the sliced data, and each second time array comprises a third time for which reading all the data occupies the on-chip data bus and a fourth time consumed in computing all the data of the current network layer; and an on-chip resource allocation module, configured to configure the data input cache module, the neural network computation execution module, and the data output cache module according to the target resource configuration mode.
Analysis shows that improving the computational efficiency of the neural network has become a critical issue in neural network chip design. Typical directions are to improve or optimize the system hosting the neural network, the neural network framework, the network's computation precision, and the number of computing units. Concrete measures include raising the access speed of the external memory, improving the efficiency of data exchange between the chip and the external memory, optimizing the efficiency of the on-chip data bus, compressing the computation data, raising the chip's operating speed through better manufacturing processes, and improving the network's computational efficiency by changing its layer structure, reducing the number of channels per layer, or lowering the quantization precision of the computation data. In addition, whether an integrated circuit chip follows an SoC, GPU, or ASIC structure, its design framework and the cost, functions, and methods of its realization are determined at the start of the design; the kinds and quantities of the various control, arithmetic, and storage units are fixed during detailed design; and after the design is completed the chip is manufactured by a fab and finally delivered to users. Once the chip design is finished, the resources available in the chip are fixed, and thereafter the on-chip resources can only be configured through the control logic to meet the functional and efficiency requirements of a specific application.
Therefore, the present application provides a new method of dynamically allocating on-chip resources in neural network chip design, used to improve the computational efficiency of a neural network chip; it belongs to the class of methods that improve overall computational performance by improving the allocation of an existing neural network chip's computing resources. Specifically, the architecture of the dynamically resource-allocated neural network chip is provided with an on-chip resource optimization module, so that time arrays under different resource configuration modes can be obtained from the input parameter values (the neural network configuration parameters). Because these time arrays reflect the operating efficiency of the different resource configuration modes, the target resource configuration mode determined from them is a better-performing resource configuration mode, and processing data according to it naturally improves the data processing efficiency of the chip.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a computer device in a hardware operating environment according to an embodiment of the present application.
As shown in fig. 1, the computer device may include: a processor 1001, such as a Central Processing Unit (CPU) or a neural network chip / GPU, a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display (Display) and an input unit such as a keyboard (Keyboard), and optionally may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed Random Access Memory (RAM) or a Non-Volatile Memory (NVM) such as a disk memory, and may optionally be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration shown in FIG. 1 does not constitute a limitation of a computer device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a storage medium, may include therein an operating system, a data storage module, a network communication module, a user interface module, and an electronic program.
In the computer device shown in fig. 1, the network interface 1004 is mainly used for data communication with a server, and the user interface 1003 is mainly used for data interaction with a user. The computer device of the present application calls, through the processor 1001, the electronic program stored in the memory 1005 and executes the data processing method based on the dynamically resource-allocated neural network chip provided by the embodiments of the present application.
Referring to fig. 2, an embodiment of the present application provides an architecture of a dynamically resource-allocated neural network chip, comprising:
a data input cache module, configured to cache first neural network data from the on-chip data bus and send the first neural network data to the neural network computation execution module;
a data output cache module, configured to cache second neural network data processed by the neural network computation execution module and send the second neural network data to the on-chip data bus;
an on-chip resource optimization module, configured to obtain a plurality of first time arrays under different resource configuration modes according to the different parameter values taken by the neural network configuration parameters in those modes, obtain a plurality of second time arrays based on the plurality of first time arrays, and determine a target resource configuration mode based on the plurality of second time arrays, wherein each first time array comprises a first time for reading sliced data from an external storage module and a second time required for computing the sliced data, and each second time array comprises a third time for which reading all the data occupies the on-chip data bus and a fourth time consumed in computing all the data of the current network layer;
and an on-chip resource allocation module, configured to configure the data input cache module, the neural network computation execution module, and the data output cache module according to the target resource configuration mode.
It should be noted that this embodiment may be used in the field of image processing, in which case the architecture processes image data; of course, the architecture may also process other kinds of data, which are not described again here.
As shown in fig. 2, the on-chip resource optimization module is connected to the on-chip resource allocation module, and the on-chip resource allocation module is connected to the on-chip data bus through the on-chip configuration bus. The data input cache module, the neural network calculation execution module, the data output cache module and the external storage module are all connected with the on-chip data bus. And the on-chip resource allocation module is respectively in communication connection with the data input cache module, the neural network computation execution module, the data output cache module and the off-chip storage interface module through an on-chip configuration bus, and the off-chip storage interface module is in communication connection with the external storage module.
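To make the topology concrete, here is a minimal sketch in Python (the patent specifies hardware, not software). All class names, the dictionary-based configure call, and the example mode values are our own illustrative assumptions; only the wiring follows the description above: the data-path modules sit on the on-chip data bus, the configurable modules are reachable from the on-chip resource allocation module over the on-chip configuration bus, and the off-chip storage interface module fronts the external storage module.

    from dataclasses import dataclass, field

    @dataclass
    class Bus:
        name: str
        members: list = field(default_factory=list)  # modules attached to this bus

    @dataclass
    class Module:
        name: str
        params: dict = field(default_factory=dict)

        def configure(self, mode: dict) -> None:
            # In hardware this would be register writes over the configuration bus.
            self.params.update(mode)

    data_bus = Bus("on-chip data bus")
    cfg_bus = Bus("on-chip configuration bus")

    input_cache = Module("data input cache")
    nn_exec = Module("neural network computation execution")
    output_cache = Module("data output cache")
    storage_if = Module("off-chip storage interface")  # fronts the external storage module

    data_bus.members += [input_cache, nn_exec, output_cache, storage_if]
    cfg_bus.members += [input_cache, nn_exec, output_cache, storage_if]

    def apply_target_mode(bus: Bus, mode: dict) -> None:
        """On-chip resource allocation module: push the selected target
        resource configuration mode to every module on the configuration bus."""
        for module in bus.members:
            module.configure(mode)

    apply_target_mode(cfg_bus, {"AiMaxMbx": 48, "AiMaxMby": 10, "MbiRamNm": 2})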
The following description will be made for each module.
And the data input cache module is used for caching first neural network data of the on-chip data bus and sending the first neural network data to the neural network computation execution module.
In a specific implementation, the neural network may be any neural network, for example a convolutional neural network. A neural network generally includes a plurality of layers, and the on-chip data bus carries the data of a single network layer at a time. In the neural network chip, for each neural network layer, data needs to be read from the external storage module and input to the neural network computation execution module for computation.
And the data output cache module is used for caching the second neural network data processed by the neural network calculation execution module and sending the second neural network data to the on-chip data bus.
In the implementation, the calculated data is returned to the on-chip data bus accordingly.
The on-chip resource optimization module is configured to obtain a plurality of first time arrays under different resource configuration modes according to the different parameter values taken by the neural network configuration parameters in those modes, obtain a plurality of second time arrays based on the plurality of first time arrays, and determine a target resource configuration mode based on the plurality of second time arrays, wherein each first time array comprises a first time for reading sliced data from an external storage module and a second time required for computing the sliced data, and each second time array comprises a third time for which reading all the data occupies the on-chip data bus and a fourth time consumed in computing all the data of the current network layer.
In a specific implementation, the whole process of acquiring data from outside, computing, and outputting consumes a certain amount of time, and different configuration parameters affect the time consumed by the whole computation process, that is, the overall computational efficiency of the chip. As an optional embodiment, the neural network configuration parameters include fixed parameters and configurable parameters; the on-chip resource optimization module is further configured to obtain the plurality of first time arrays according to the fixed parameters and a plurality of groups of configurable parameter values under different resource configuration modes.
When the fixed parameters are determined, the on-chip resource optimization module can run simulated calculations for the different resource configuration modes to obtain the configurable parameter values that yield better computational efficiency, and each execution module is then configured accordingly; for a chip whose total resources are fixed, this dynamic parameter configuration improves computational efficiency.
As an optional embodiment, the first neural network data and the second neural network data are both image data, and the configurable parameters include: the actual input image width AiMaxMbx, the actual input image height AiMaxMby, and the number of image data buffer groups MbiRamNm. The values of these configurable parameters can be calculated from the neural network configuration parameters, or can be set according to the different resource configuration modes in order to improve computational efficiency; these resource configuration modes may be obtained from experimental tests. The fixed parameters include:
input image data width and input image data height;
number of input data channels and number of output data channels;
filter size;
number of on-chip parallel computing units and number of computing unit groups;
on-chip computation result cache size;
batch data read/write length of the external storage module;
time consumed by on-chip data exchange.
In a specific implementation, the different resource configuration modes refer to the resource configuration modes formed by the different values of each configurable parameter together with the values of the fixed parameters among the neural network configuration parameters. Specifically, for each resource configuration mode, a first time array can be calculated from the fixed parameter values and the different configurable parameter values, namely the time MBCLKCYC_DP for reading the data of one input data slice (MB) from the external storage module and the time MBCLKCYC_CA for the neural network to compute that slice; from the first time arrays, the second time arrays can be derived, namely the total time TOTDDR_CLKCYC for which reading all the data occupies the bus and the total time TOT_CLKTIMNS consumed by the current network layer. The on-chip resource optimization module then selects the optimal dynamic configuration parameters AiMaxMbx, AiMaxMby, and MbiRamNm (i.e., the target resource configuration mode of this embodiment) according to TOTDDR_CLKCYC and TOT_CLKTIMNS.
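A compact sketch of this selection loop is given below. It is a hypothetical rendition, not the patent's implementation: the cycle model inside first_time_array and second_time_array is a crude stand-in for the detailed model tabulated in Tables 1 to 4, the 300 MHz system clock is inferred from the table values (442368 cycles corresponding to 1.47 ms), and the candidate list and the tie-breaking order of TOT_CLKTIMNS before TOTDDR_CLKCYC are assumptions. Only the overall flow is taken from the text: enumerate resource configuration modes, compute the first time array (MBCLKCYC_DP, MBCLKCYC_CA) per slice, derive the second time array (TOTDDR_CLKCYC, TOT_CLKTIMNS) per layer, and pick the best mode.

    import math
    from dataclasses import dataclass
    from itertools import product

    @dataclass(frozen=True)
    class Config:
        ai_max_mbx: int   # AiMaxMbx: actual input image width of one slice
        ai_max_mby: int   # AiMaxMby: actual input image height of one slice
        mbi_ram_nm: int   # MbiRamNm: number of image data buffer groups

    def first_time_array(fx: dict, cfg: Config) -> tuple[int, int]:
        """Per-slice times (MBCLKCYC_DP, MBCLKCYC_CA). Crude stand-in model:
        read cycles scale with the slice footprint over the bus width and DDR
        efficiency; compute cycles with pixels x channels over the unit count."""
        words = cfg.ai_max_mbx * cfg.ai_max_mby * fx["ICha"] // fx["DinBus"]
        mbclkcyc_dp = math.ceil(words / fx["EmuifEff"])
        mbclkcyc_ca = (cfg.ai_max_mbx * cfg.ai_max_mby * fx["ICha"] * fx["OCha"]
                       // (fx["CuNum"] * fx["Group"]))
        return mbclkcyc_dp, mbclkcyc_ca

    def second_time_array(fx: dict, cfg: Config, dp: int, ca: int) -> tuple[int, float]:
        """Per-layer totals (TOTDDR_CLKCYC, TOT_CLKTIMNS in ms). With two or
        more buffer groups, reads and computation are assumed to overlap."""
        slices = (math.ceil(fx["PimgW"] / cfg.ai_max_mbx)
                  * math.ceil(fx["PimgH"] / cfg.ai_max_mby))
        per_slice = max(dp, ca) if cfg.mbi_ram_nm >= 2 else dp + ca
        totddr_clkcyc = slices * dp
        tot_clktimns = slices * per_slice / fx["ClkMHz"] / 1e3
        return totddr_clkcyc, tot_clktimns

    def select_target_mode(fx: dict, candidates) -> Config:
        """On-chip resource optimization module: pick the mode with the lowest
        layer time, breaking ties on bus occupancy (ordering assumed)."""
        def cost(cfg: Config):
            dp, ca = first_time_array(fx, cfg)
            totddr, tot_ms = second_time_array(fx, cfg, dp, ca)
            return (tot_ms, totddr)
        return min(candidates, key=cost)

    # Fixed parameters loosely following Table 1; ClkMHz is inferred.
    fixed = {"ICha": 32, "OCha": 64, "DinBus": 4, "CuNum": 32, "Group": 2,
             "PimgW": 192, "PimgH": 144, "EmuifEff": 0.75, "ClkMHz": 300}
    candidates = [Config(x, y, n)
                  for x, y, n in product((12, 24, 48), (5, 10, 20), (1, 2))]
    print(select_target_mode(fixed, candidates))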
And the on-chip resource allocation module is used for configuring the data input cache module, the neural network computation execution module and the data output cache module according to the target resource configuration mode.
In a specific implementation, the optimal dynamic configuration parameters AiMaxMbx, AiMaxMby, and MbiRamNm are obtained from the neural network configuration parameters, so the better (or even optimal) dynamic configuration parameters also correspond to the values of one group of neural network configuration parameters (including the fixed parameters and the configurable parameters): the input image data width and height, the numbers of input and output data channels, the filter size, the number of on-chip parallel computing units, the number of computing unit groups, and so on, which involve the data input cache module, the neural network computation execution module, and the data output cache module. Therefore, once the target resource configuration mode is obtained, the data input cache module, the neural network computation execution module, and the data output cache module can be configured so that the chip runs in a better mode and its computational efficiency is improved.
It should be understood that the above is only an example, and the technical solution of the present application is not limited in any way, and those skilled in the art can set the solution based on the needs in practical application, and the solution is not limited herein.
As is readily seen from the above description, the architecture of the dynamically resource-allocated neural network chip of this embodiment is equipped with the on-chip resource optimization module, so time arrays under different resource configuration modes can be obtained from the neural network configuration parameters. Because the time arrays reflect the operating efficiency of the different modes, the target resource configuration mode determined from them is a better-performing mode, and processing data under it naturally improves the data processing efficiency of the chip.
Referring to fig. 3, based on the same inventive concept, an embodiment of the present application further provides a data processing method based on a dynamically resource-allocated neural network chip. The method is implemented on the architecture of the dynamically resource-allocated neural network chip of the foregoing embodiment and includes:
S20, obtaining a plurality of first time arrays under different resource configuration modes according to the different parameter values taken by the neural network configuration parameters in those modes; wherein each first time array comprises a first time for reading sliced data from the external storage module and a second time required for computing the sliced data;
S40, obtaining a plurality of second time arrays based on the plurality of first time arrays;
S60, determining a target resource configuration mode based on the plurality of second time arrays; wherein each second time array comprises a third time for which reading all the data occupies the on-chip data bus and a fourth time consumed in computing all the data of the current network layer;
and S80, configuring the data input cache module, the neural network computation execution module, and the data output cache module according to the target resource configuration mode.
As an optional embodiment, the neural network configuration parameters include fixed parameters and configurable parameters, and the step of obtaining a plurality of first time arrays under different resource configuration modes according to the different parameter values taken by the neural network configuration parameters in those modes comprises:
and obtaining a plurality of first time arrays according to the fixed parameters and a plurality of groups of configurable parameter values under different resource configuration modes.
In a specific implementation, the different resource configuration modes refer to the resource configuration modes formed by the different values of each configurable parameter together with the values of the fixed parameters among the neural network configuration parameters. Specifically, for each resource configuration mode, a first time array can be calculated from the fixed parameter values and the different configurable parameter values, namely the time MBCLKCYC_DP for reading the data of one input data slice (MB) from the external storage module and the time MBCLKCYC_CA for the neural network to compute that slice; from the first time arrays, the second time arrays can be derived, namely the total time TOTDDR_CLKCYC for which reading all the data occupies the bus and the total time TOT_CLKTIMNS consumed by the current network layer. The on-chip resource optimization module then selects the optimal dynamic configuration parameters AiMaxMbx, AiMaxMby, and MbiRamNm (i.e., the target resource configuration mode of this embodiment) according to TOTDDR_CLKCYC and TOT_CLKTIMNS.
It should be noted that the execution bodies of the method of this embodiment are the on-chip resource optimization module and the on-chip resource allocation module of the foregoing embodiment, and the behavior of the method is consistent with the functions of those modules; therefore, for the various implementations and corresponding effects of the method of this embodiment, reference may be made to the foregoing embodiment, and they are not repeated here.
Nevertheless, to make the technical solution and its technical effects clearer, image data is taken as an example below, and the computational efficiency under different configurable parameter values is compared, so as to illustrate the efficiency gains of the architecture and data processing method of this embodiment.
In this example, comparisons of TOTDDR_CLKCYC and TOT_CLKTIMNS for a single-layer neural network are shown in Tables 1, 2, 3, and 4 below.
TABLE 1. Parameter configuration and TOTDDR_CLKCYC for a single-layer neural network processing image data, without optimization

ImgW (input image data width): 384
ImgH (input image data height): 288
ImgScale (image scaling factor): 2
Stride (sampling interval during convolution): 2
ICha (number of convolution layer input channels): 32
OCha (number of convolution layer output channels): 64
CuNum (number of on-chip parallel computing units): 32
FClaPClkNum (number of computations per clock cycle): 1
DDRMode (DDR read mode): 2
Group (number of computing unit groups): 2
FILTERSize (convolution filter size): 9
PimgW (actual processed image width, calculated from ImgScale): 192
PimgH (actual processed image height, calculated from ImgScale): 144
PDatsbW (internal data bus bit width): 32
SCDbW (single-datum processing precision): 8
FParsProcbW (parameter data bit width): 8
DinBus (number of data items per clock cycle): 4
BurstTransLength (external storage module batch data read/write length): 32
BurstInValidTCyc (invalid data transfer cycles introduced per batch): 2
DataSigRamLen8b (on-chip computation result cache size, bytes): 1024
DataTotRamLen8b (total RAM occupied by the computing units): 32768
DataRamLength (RAM size available to a single output channel pass): 512
ExpectMbxLen (expected computing input image width): 48
ExpectMbyHig (expected computing input image height): 20
ExpCurMbxLen (actual computing input image width): 48
BurstFixLen (fixed length selected when reading data from DDR, = BurstTransLength × DinBus): 128
BurstUnValidTCyc (invalid cycles when reading data from DDR): 2
AiMbZEdge (number of image boundary data): 2
AiMaxMbx (actual input image width): 48
AiMaxMby (actual input image height): 20
MbiRamNm (number of image data buffer groups): 1
MbLenDW (total cycles to process one read in DWORD mode): 352
FWbCUnLen (parameter attribute description word length): 1
Fx3CUnLen (parameter data length required for computation): 288
SUNFParsLen, bytes (total cycles required to read parameters from external DDR): 298
DMbCUnLen, DW (total data length required for the read computation): 22528
PMbCUnLen, DW (total parameter length required for the read computation): 4768
Emuif bus efficiency (DDR read efficiency): 0.75
MBCLKCYC_DP (maximum of the clock cycles required to read data and to read parameters): 30037
MBCLKCYC_CA (clock cycles required to compute one block of data): 30720
CNN_IMG_CALCYC (clock cycles required to compute the current image size): 442368
TOT_CLKTIMNS (computation time converted from the system clock and CNN_IMG_CALCYC), ms: 1.47
TOTDDR_CLKCYC (total clock cycles for the parameters and data required by the DDR read computation): 1,048,166
TABLE 2. Parameter configuration and TOTDDR_CLKCYC for a single-layer neural network processing image data, after optimization

ImgW (input image data width): 384
ImgH (input image data height): 288
ImgScale (image scaling factor): 2
Stride (sampling interval during convolution): 2
ICha (number of convolution layer input channels): 32
OCha (number of convolution layer output channels): 64
CuNum (number of on-chip parallel computing units): 32
FClaPClkNum (number of computations per clock cycle): 1
DDRMode (DDR read mode): 2
Group (number of computing unit groups): 2
FILTERSize (convolution filter size): 9
PimgW (actual processed image width, calculated from ImgScale): 192
PimgH (actual processed image height, calculated from ImgScale): 144
PDatsbW (internal data bus bit width): 32
SCDbW (single-datum processing precision): 8
FParsProcbW (parameter data bit width): 8
DinBus (number of data items per clock cycle): 4
BurstTransLength (external storage module batch data read/write length): 32
BurstInValidTCyc (invalid data transfer cycles introduced per batch): 2
DataSigRamLen8b (on-chip computation result cache size, bytes): 1024
DataTotRamLen8b (total RAM occupied by the computing units): 32768
DataRamLength (RAM size available to a single output channel pass): 512
ExpectMbxLen (expected computing input image width): 48
ExpectMbyHig (expected computing input image height): 20
ExpCurMbxLen (actual computing input image width): 48
BurstFixLen (fixed length selected when reading data from DDR, = BurstTransLength × DinBus): 128
BurstUnValidTCyc (invalid cycles when reading data from DDR): 2
AiMbZEdge (number of image boundary data): 2
AiMaxMbx (actual input image width): 48
AiMaxMby (actual input image height): 10
MbiRamNm (number of image data buffer groups): 2
MbLenDW (total cycles to process one read in DWORD mode): 192
FWbCUnLen (parameter attribute description word length): 1
Fx3CUnLen (parameter data length required for computation): 288
SUNFParsLen, bytes (total cycles required to read parameters from external DDR): 298
DMbCUnLen, DW (total data length required for the read computation): 6144
PMbCUnLen, DW (total parameter length required for the read computation): 4768
Emuif bus efficiency (DDR read efficiency): 0.75
MBCLKCYC_DP (maximum of the clock cycles required to read data and to read parameters): 8192
MBCLKCYC_CA (clock cycles required to compute one block of data): 15360
CNN_IMG_CALCYC (clock cycles required to compute the current image size): 442368
TOT_CLKTIMNS (computation time converted from the system clock and CNN_IMG_CALCYC), ms: 1.47
TOTDDR_CLKCYC (total clock cycles for the parameters and data required by the DDR read computation): 838,042
TABLE 3. Parameter configuration and TOT_CLKTIMNS for a single-layer neural network processing image data, without optimization

ImgW (input image data width): 384
ImgH (input image data height): 288
ImgScale (image scaling factor): 32
Stride (sampling interval during convolution): 1
ICha (number of convolution layer input channels): 512
OCha (number of convolution layer output channels): 256
CuNum (number of on-chip parallel computing units): 32
FClaPClkNum (number of computations per clock cycle): 1
DDRMode (DDR read mode): 1
Group (number of computing unit groups): 2
FILTERSize (convolution filter size): 1
PimgW (actual processed image width, calculated from ImgScale): 12
PimgH (actual processed image height, calculated from ImgScale): 9
PDatsbW (internal data bus bit width): 32
SCDbW (single-datum processing precision): 8
FParsProcbW (parameter data bit width): 8
DinBus (number of data items per clock cycle): 4
BurstTransLength (external storage module batch data read/write length): 32
BurstInValidTCyc (invalid data transfer cycles introduced per batch): 2
DataSigRamLen8b (on-chip computation result cache size, bytes): 1024
DataTotRamLen8b (total RAM occupied by the computing units): 32768
DataRamLength (RAM size available to a single output channel pass): 1024
ExpectMbxLen (expected computing input image width): 48
ExpectMbyHig (expected computing input image height): 20
ExpCurMbxLen (actual computing input image width): 12
BurstFixLen (fixed length selected when reading data from DDR, = BurstTransLength × DinBus): 128
BurstUnValidTCyc (invalid cycles when reading data from DDR): 2
AiMbZEdge (number of image boundary data): 2
AiMaxMbx (actual input image width): 12
AiMaxMby (actual input image height): 9
MbiRamNm (number of image data buffer groups): 1
MbLenDW (total cycles to process one read in DWORD mode): 77
FWbCUnLen (parameter attribute description word length): 1
Fx3CUnLen (parameter data length required for computation): 32
SUNFParsLen, bytes (total cycles required to read parameters from external DDR): 38
DMbCUnLen, DW (total data length required for the read computation): 315392
PMbCUnLen, DW (total parameter length required for the read computation): 38912
Emuif bus efficiency (DDR read efficiency): 0.75
MBCLKCYC_DP (maximum of the clock cycles required to read data and to read parameters): 420523
MBCLKCYC_CA (clock cycles required to compute one block of data): 110592
CNN_IMG_CALCYC (clock cycles required to compute the current image size): 210261.5
TOT_CLKTIMNS (computation time converted from the system clock and CNN_IMG_CALCYC), ms: 0.7
TOTDDR_CLKCYC (total clock cycles for the parameters and data required by the DDR read computation): 472,405
TABLE 4. Parameter configuration and TOT_CLKTIMNS for a single-layer neural network processing image data, after optimization

ImgW (input image data width): 384
ImgH (input image data height): 288
ImgScale (image scaling factor): 32
Stride (sampling interval during convolution): 1
ICha (number of convolution layer input channels): 512
OCha (number of convolution layer output channels): 256
CuNum (number of on-chip parallel computing units): 32
FClaPClkNum (number of computations per clock cycle): 1
DDRMode (DDR read mode): 1
Group (number of computing unit groups): 2
FILTERSize (convolution filter size): 1
PimgW (actual processed image width, calculated from ImgScale): 12
PimgH (actual processed image height, calculated from ImgScale): 9
PDatsbW (internal data bus bit width): 32
SCDbW (single-datum processing precision): 8
FParsProcbW (parameter data bit width): 8
DinBus (number of data items per clock cycle): 4
BurstTransLength (external storage module batch data read/write length): 32
BurstInValidTCyc (invalid data transfer cycles introduced per batch): 2
DataSigRamLen8b (on-chip computation result cache size, bytes): 1024
DataTotRamLen8b (total RAM occupied by the computing units): 32768
DataRamLength (RAM size available to a single output channel pass): 1024
ExpectMbxLen (expected computing input image width): 48
ExpectMbyHig (expected computing input image height): 20
ExpCurMbxLen (actual computing input image width): 12
BurstFixLen (fixed length selected when reading data from DDR, = BurstTransLength × DinBus): 128
BurstUnValidTCyc (invalid cycles when reading data from DDR): 2
AiMbZEdge (number of image boundary data): 2
AiMaxMbx (actual input image width): 12
AiMaxMby (actual input image height): 9
MbiRamNm (number of image data buffer groups): 9
MbLenDW (total cycles to process one read in DWORD mode): 77
FWbCUnLen (parameter attribute description word length): 1
Fx3CUnLen (parameter data length required for computation): 32
SUNFParsLen, bytes (total cycles required to read parameters from external DDR): 38
DMbCUnLen, DW (total data length required for the read computation): 35043.55556
PMbCUnLen, DW (total parameter length required for the read computation): 38912
Emuif bus efficiency (DDR read efficiency): 0.75
MBCLKCYC_DP (maximum of the clock cycles required to read data and to read parameters): 51883
MBCLKCYC_CA (clock cycles required to compute one block of data): 110592
CNN_IMG_CALCYC (clock cycles required to compute the current image size): 55296
TOT_CLKTIMNS (computation time converted from the system clock and CNN_IMG_CALCYC), ms: 0.18
TOTDDR_CLKCYC (total clock cycles for the parameters and data required by the DDR read computation): 98,607
In the above tables, a convolutional neural network is taken as an example, and DDR denotes the external storage module of this embodiment.
The configurable parameter AiMaxMbx may be calculated from PimgW and ExpectMbxLen; specifically, AiMaxMbx = MIN(PimgW, ExpectMbxLen), where PimgW is the actual processed image width. In this embodiment, to improve computational efficiency, its value may also be set according to the different resource configuration modes.
The configurable parameter AiMaxMby may be calculated as follows: when DDRMode = 2, AiMaxMby = MIN(PimgH, ExpectMbyHig, FLOOR(DataRamLength / AiMaxMbx, 1)); when DDRMode ≠ 2, AiMaxMby = MIN(PimgH, ExpectMbyHig), where PimgH is the actual processed image height calculated from ImgScale and FLOOR is the round-down function. Again, to improve computational efficiency, the value may be set according to the different resource configuration modes.
The configurable parameter MbiRamNm may be calculated as follows: when DDRMode = 0, MbiRamNm = 1; when DDRMode ≠ 0, MbiRamNm = FLOOR(DataSigRamLen8b / (AiMaxMbx × AiMaxMby), 1). As above, to improve computational efficiency, the value may be set according to the different resource configuration modes.
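Written out as code, the three selection rules above take the following form. This is a direct transcription under the stated definitions, with FLOOR(x, 1) read as rounding down to an integer, as in spreadsheet notation; the trailing check reproduces the optimized values of Table 2.

    def ai_max_mbx(pimg_w: int, expect_mbx_len: int) -> int:
        # AiMaxMbx = MIN(PimgW, ExpectMbxLen)
        return min(pimg_w, expect_mbx_len)

    def ai_max_mby(ddr_mode: int, pimg_h: int, expect_mby_hig: int,
                   data_ram_length: int, mbx: int) -> int:
        # DDRMode = 2: MIN(PimgH, ExpectMbyHig, FLOOR(DataRamLength / AiMaxMbx, 1))
        # otherwise:  MIN(PimgH, ExpectMbyHig)
        if ddr_mode == 2:
            return min(pimg_h, expect_mby_hig, data_ram_length // mbx)
        return min(pimg_h, expect_mby_hig)

    def mbi_ram_nm(ddr_mode: int, data_sig_ram_len_8b: int, mbx: int, mby: int) -> int:
        # DDRMode = 0: 1; otherwise FLOOR(DataSigRamLen8b / (AiMaxMbx * AiMaxMby), 1)
        if ddr_mode == 0:
            return 1
        return data_sig_ram_len_8b // (mbx * mby)

    # Check against Table 2 (DDRMode = 2, PimgW = 192, PimgH = 144,
    # ExpectMbxLen = 48, ExpectMbyHig = 20, DataRamLength = 512,
    # DataSigRamLen8b = 1024):
    mbx = ai_max_mbx(192, 48)                        # -> 48
    mby = ai_max_mby(2, 144, 20, 512, mbx)           # -> 10
    print(mbx, mby, mbi_ram_nm(2, 1024, mbx, mby))   # -> 48 10 2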
As shown in Tables 1 and 2, without optimization the current network layer needs to occupy the bus for 1,048,166 clock cycles (DAC1), while Table 2 shows that after AiMaxMby is optimized by the method of this embodiment the current network layer needs to occupy the bus for only 838,042 clock cycles (DAC2), so the utilization efficiency of the current layer's buses (the on-chip internal bus and the external storage bandwidth) is improved by about 25% ((DAC1/DAC2 - 1) × 100%).
as shown in table 3 and table 4, when the current network layer is not optimized, it takes 0.7 unit of computation time (DBC1) to calculate, and table 4 shows that the current network layer needs 0.18 unit of time (DBC 2) after the MbiRamNm is optimized according to the method of the present embodiment, and the computation speed of the current layer is increased by about 288% (DBC 1/DBC 2-1) × 100%). In addition, the current network layer in table 3 needs to occupy 472,405 clock cycle buses (DAC 1), i.e., time, and the optimized network layer in table 4 needs to occupy 98,607 clock cycle buses (DAC 2), so that the utilization efficiency of the current layer buses (on-chip internal bus and external storage bandwidth) is improved by about 379% ((DAC 1/DAC 2-1) × 100%).
Accordingly, taking a 23-layer neural network as an example, when the architecture of this embodiment is used to process image data, the overall computational efficiency can be improved by more than 25% and the bus utilization efficiency by more than 20%.
Therefore, with the architecture and data processing method of this embodiment, the computational efficiency of the neural network can be improved and the occupancy of the on-chip data bus (and of the off-chip memory) reduced while using on-chip computing and storage resources of the same scale; alternatively, where the efficiency already meets requirements, the resources used by the chip can be cut at the initial design stage, reducing chip area, cost, and operating power consumption. The approach is simple and clear to implement and has significant promotion value in the design of artificial intelligence neural network chips.
In addition, in an embodiment, the present application provides a neural network chip comprising the architecture of the dynamically resource-allocated neural network chip described above.
Furthermore, in an embodiment, the present application further provides a computer storage medium having a computer program stored thereon, where the computer program is executed by a processor to implement the steps of the method in the foregoing embodiments.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories. The computer may be a variety of computing devices including intelligent terminals and servers.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, for example in one or more scripts in a Hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and certainly also by hardware, but in many cases the former is the better implementation. Based on such understanding, the technical solution of the present application, or the portion contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, a magnetic disk, or an optical disk), including several instructions for enabling a multimedia terminal device (which may be a mobile phone, a computer, a television receiver, a network device, or the like) to execute the methods of the embodiments of the present application.
While the invention has been described with reference to a particular embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A framework of a dynamically resource-allocated neural network chip, comprising:
a data input cache module, configured to cache first neural network data from an on-chip data bus and send the first neural network data to a neural network computation execution module;
a data output cache module, configured to cache second neural network data processed by the neural network computation execution module and send the second neural network data to the on-chip data bus;
an on-chip resource optimization module, configured to obtain a plurality of first time arrays in different resource configuration modes according to different parameter values of neural network configuration parameters in the different resource configuration modes, obtain a plurality of second time arrays based on the plurality of first time arrays, and determine a target resource configuration mode based on the plurality of second time arrays, wherein each first time array comprises a first time for reading sliced data from an external storage module and a second time required for computing the sliced data, and each second time array comprises a third time for which reading all data occupies the on-chip data bus and a fourth time consumed in computing all data of the current network layer; and
an on-chip resource allocation module, configured to configure the data input cache module, the neural network computation execution module, and the data output cache module according to the target resource configuration mode.
2. The framework of claim 1, wherein the neural network configuration parameters comprise fixed parameters and configurable parameters, and the on-chip resource optimization module is further configured to obtain the plurality of first time arrays according to the fixed parameters and a plurality of groups of configurable parameter values in the different resource configuration modes.
3. The framework of claim 2, wherein the first neural network data and the second neural network data are both image data.
4. The framework of claim 3, wherein the fixed parameters comprise:
an input image data width and an input image data height;
a number of input data channels and a number of output data channels;
a filter size;
a number of on-chip parallel computing units and a number of computing unit groups;
an on-chip calculation result cache size;
a batch data length for reads and writes of the external storage module; and
an on-chip data exchange loss time.
5. The framework of claim 3 or 4, wherein the configurable parameters comprise: an actual input image width, an actual input image height, and a number of image data buffer sets.
6. The framework of claim 1, wherein the on-chip resource allocation module is communicatively coupled, via an on-chip configuration bus, to the data input cache module, the neural network computation execution module, the data output cache module, and an off-chip storage interface module, respectively, and the off-chip storage interface module is communicatively coupled to the external storage module.
7. A data processing method based on a dynamically resource-allocated neural network chip, comprising the following steps:
obtaining a plurality of first time arrays in different resource configuration modes according to different parameter values of neural network configuration parameters in the different resource configuration modes, wherein each first time array comprises a first time for reading sliced data from an external storage module and a second time required for computing the sliced data;
obtaining a plurality of second time arrays based on the plurality of first time arrays;
determining a target resource configuration mode based on the plurality of second time arrays, wherein each second time array comprises a third time for which reading all data occupies the on-chip data bus and a fourth time consumed in computing all data of the current network layer; and
configuring a data input cache module, a neural network computation execution module, and a data output cache module according to the target resource configuration mode.
8. The method of claim 7, wherein the neural network configuration parameters comprise fixed parameters and configurable parameters, and the step of obtaining a plurality of first time arrays in different resource configuration modes according to different parameter values of the neural network configuration parameters in the different resource configuration modes comprises:
obtaining the plurality of first time arrays according to the fixed parameters and a plurality of groups of configurable parameter values in the different resource configuration modes.
9. The method of claim 7, wherein the sliced data is image data.
10. A computer device, comprising a neural network chip that comprises the framework of any one of claims 1 to 6.
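For illustration of claims 4 and 5 only, the split between fixed and configurable parameters might be represented as follows; the field names, types, and the candidate values enumerated here are hypothetical assumptions of this sketch and are not taken from the embodiment.

# Hypothetical sketch of the parameter split in claims 4 and 5;
# field names and candidate values are illustrative assumptions.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class FixedParams:
    # Claim 4: fixed for a given network layer and chip design.
    input_width: int
    input_height: int
    input_channels: int
    output_channels: int
    filter_size: int
    parallel_units: int
    unit_groups: int
    result_cache_bytes: int
    batch_length: int        # external storage read/write batch data length
    exchange_loss_time: int  # on-chip data exchange loss time

@dataclass(frozen=True)
class ConfigurableParams:
    # Claim 5: varied to enumerate candidate resource configuration modes.
    actual_input_width: int
    actual_input_height: int
    buffer_sets: int

def candidate_modes(fixed: FixedParams):
    # Enumerate hypothetical configuration modes by tiling the input image
    # and varying the number of image data buffer sets.
    widths = [max(1, fixed.input_width // n) for n in (1, 2, 4)]
    heights = [max(1, fixed.input_height // n) for n in (1, 2, 4)]
    for w, h, sets in product(widths, heights, (2, 3, 4)):
        yield ConfigurableParams(w, h, sets)

Each ConfigurableParams instance produced this way would correspond to one resource configuration mode to be scored by the time arrays of claim 7.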
CN202210914053.9A 2022-08-01 2022-08-01 Architecture, method and apparatus for a dynamically resource-allocated neural network chip Active CN114968602B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210914053.9A CN114968602B (en) 2022-08-01 2022-08-01 Architecture, method and apparatus for a dynamically resource-allocated neural network chip

Publications (2)

Publication Number Publication Date
CN114968602A 2022-08-30
CN114968602B (en) 2022-10-21

Family

ID=82969468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210914053.9A Active CN114968602B (en) 2022-08-01 2022-08-01 Architecture, method and apparatus for a dynamically resource-allocated neural network chip

Country Status (1)

Country Link
CN (1) CN114968602B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20160005282A (en) * 2014-07-04 2016-01-14 서울대학교산학협력단 Microfluidic chip and a method for preparing artificial neuronal network thereon
CN107016175A (en) * 2017-03-23 2017-08-04 中国科学院计算技术研究所 It is applicable the Automation Design method, device and the optimization method of neural network processor
CN109582463A (en) * 2018-11-30 2019-04-05 Oppo广东移动通信有限公司 Resource allocation method, device, terminal and storage medium
WO2019141905A1 (en) * 2018-01-19 2019-07-25 Nokia Technologies Oy An apparatus, a method and a computer program for running a neural network
CN110321064A (en) * 2018-03-30 2019-10-11 北京深鉴智能科技有限公司 Computing platform realization method and system for neural network
CN112488305A (en) * 2020-12-22 2021-03-12 西北工业大学 Neural network storage organization structure and configurable management method thereof
CN112799599A (en) * 2021-02-08 2021-05-14 清华大学 Data storage method, computing core, chip and electronic equipment
CN113313243A (en) * 2021-06-11 2021-08-27 海宁奕斯伟集成电路设计有限公司 Method, device and equipment for determining neural network accelerator and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MICHAEL ZHANG et al.: "Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors", ISCA '05: Proceedings of the 32nd Annual International Symposium on Computer Architecture *
SHAO Wei et al.: "Design and Implementation of a QoS Control System in the iPhone System", Acta Electronica Sinica *

Similar Documents

Publication Publication Date Title
CN111079917B (en) Tensor data block access method and device
WO2022057420A1 (en) Data processing method and apparatus, electronic device, and storage medium
CN113743599B (en) Computing device and server of convolutional neural network
WO2020253117A1 (en) Data processing method and apparatus
CN114968602B (en) Architecture, method and apparatus for a dynamically resource-allocated neural network chip
US20210255793A1 (en) System and method for managing conversion of low-locality data into high-locality data
CN114492292A (en) Method and device for configuring chip, equipment and storage medium
CN116185545A (en) Page rendering method and device
CN114662689A (en) Pruning method, device, equipment and medium for neural network
CN102201817A (en) Low-power-consumption LDPC (low density parity check) decoder based on optimization of folding structure of memorizer
CN112130977B (en) Task scheduling method, device, equipment and medium
CN110766133B (en) Data processing method, device, equipment and storage medium in embedded equipment
CN113052291A (en) Data processing method and device
CN116303108B (en) Weight address arrangement method suitable for parallel computing architecture
CN112686269B (en) Pooling method, apparatus, device and storage medium
CN212873459U (en) System for data compression storage
CN117472295B (en) Memory, data processing method, device and medium
CN116541153B (en) Task scheduling method and system for edge calculation, readable storage medium and computer
CN116150212B (en) Data processing method and device
CN116805155B (en) LSTM network processing method, device, equipment and readable storage medium
KR20040108037A (en) Method of test scheduling for core-based system-on-chips
CN115344506B (en) Memory address mapping method, memory access method and device, chip and device
CN116151178A (en) Layout constraint method and device, electronic equipment and storage medium
CN115345287A (en) Method for calculating macro arrangement in memory, computer readable medium and electronic device
CN1852064A (en) Interruption processing apparatus and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant