CN215182115U - NVDLA artificial intelligence chip hardware system based on FPGA - Google Patents

NVDLA artificial intelligence chip hardware system based on FPGA

Info

Publication number
CN215182115U
CN215182115U CN202121614214.XU
Authority
CN
China
Prior art keywords
module
fpga
nvdla
chip
artificial intelligence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202121614214.XU
Other languages
Chinese (zh)
Inventor
刘之禹
石晴文
冯佳玮
李述
张经纬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202121614214.XU priority Critical patent/CN215182115U/en
Application granted granted Critical
Publication of CN215182115U publication Critical patent/CN215182115U/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Logic Circuits (AREA)

Abstract

The utility model provides an FPGA-based NVDLA artificial intelligence chip hardware system, to overcome the defect that the prior-art open-source accelerator targets only ASICs and is cumbersome to use. The system comprises a storage module, a main control module, an acceleration module, a power supply module, a crystal oscillator module, and a bus, wherein the storage module, the main control module, and the acceleration module send and receive data through the bus; the main control module and the acceleration module are both deployed on the FPGA chip; the power supply module is connected with the FPGA and the storage module; and the crystal oscillator module is externally connected to the FPGA chip. The utility model deploys the accelerator on an FPGA, exploiting both the FPGA's reconfigurability and its capability for fast parallel computation. The utility model is suitable for applying convolutional neural networks on mobile terminals.

Description

NVDLA artificial intelligence chip hardware system based on FPGA
Technical Field
The utility model relates to the field of artificial intelligence chips, and in particular to an FPGA-based NVDLA artificial intelligence chip hardware system.
Background
NVDLA (NVIDIA Deep Learning Accelerator) is an open-source deep learning accelerator released by NVIDIA in 2017 for handling data processing with convolutional neural networks.
It is a hardware accelerator with configurable parameters: the parameters of each layer of a convolutional neural network can be configured, and its structure is flexible. Because it is open source, it has been widely studied and used commercially.
However, the open-source accelerator officially released by NVIDIA targets only ASICs, and putting it to practical use requires cumbersome steps such as tape-out. A new hardware system architecture is therefore needed.
SUMMARY OF THE UTILITY MODEL
The utility model aims to overcome the defect that the prior-art open-source accelerator targets only ASICs and is cumbersome to use.
According to a first aspect of the utility model, an FPGA-based NVDLA artificial intelligence chip hardware system is provided, comprising a storage module, a main control module, an acceleration module, a power supply module, a crystal oscillator module, and a bus, wherein the storage module, the main control module, and the acceleration module send and receive data through the bus; the main control module and the acceleration module are both deployed on the FPGA chip; the power supply module is connected with the FPGA and the storage module; and the crystal oscillator module is externally connected to the FPGA chip.
Preferably, the main control module is a Nios II embedded processor.
Preferably, the storage module is an SDRAM.
Preferably, the SDRAM is of model W9864G6KH-6.
Preferably, the acceleration module is an NVDLA.
Preferably, the bus is an Avalon bus.
Preferably, the FPGA chip is an EP4CE115F29C7 of the Cyclone IV series.
Preferably, the power supply module is a TPS7A7001 power supply chip.
Preferably, the crystal oscillator module is a DSB535SG-50M crystal oscillator chip.
Preferably, the system further comprises a flash module, connected with the FPGA chip, of model W25Q64.
The beneficial effect of the utility model is that, by deploying part of the chip on an FPGA, it provides a new hardware circuit system that requires no cumbersome steps such as tape-out when used.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments of the invention, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a schematic block circuit diagram of an embodiment of the invention;
FIG. 2 is a connection block diagram of a small NVDLA system;
fig. 3 is an overall circuit diagram of an embodiment of the present invention;
fig. 4 is a block diagram of a circuit structure according to an embodiment of the present invention;
FIG. 5 is a pin diagram of an SDRAM chip according to an embodiment of the present invention;
fig. 6 is a circuit diagram of an FPGA configuration according to an embodiment of the present invention;
fig. 7 is a chip pin diagram of bank1 region of the FPGA according to an embodiment of the present invention;
fig. 8 is a circuit diagram of a linear power module according to an embodiment of the present invention;
fig. 9 is a peripheral circuit structure diagram of an external flash chip of the FPGA according to an embodiment of the present invention;
fig. 10 is a peripheral circuit structure diagram of a crystal oscillator chip according to an embodiment of the present invention;
fig. 11 is a circuit diagram of a voltage distribution area of a PLL module according to an embodiment of the present invention;
fig. 12 is a circuit diagram of an FPGA downloader interface according to an embodiment of the present invention;
fig. 13 is a BDMA interface diagram of an embodiment of the invention;
fig. 14 is a circuit diagram of a MAC unit according to an embodiment of the present invention;
fig. 15 is a circuit diagram of an activation function module according to an embodiment of the present invention;
fig. 16 is a circuit diagram of a pooling module according to one embodiment of the present invention;
fig. 17 is a circuit diagram of a fully connected module according to an embodiment of the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: unless specifically stated otherwise, the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present invention.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
The circuit schematic of an embodiment of the present utility model is shown in fig. 1. The system includes a storage module, a main control module, an acceleration module, a power supply module, a crystal oscillator module, and a bus, wherein the storage module, the main control module, and the acceleration module send and receive data through the bus; the main control module and the acceleration module are both deployed on the FPGA chip; the power supply module is connected with the FPGA and the storage module; and the crystal oscillator module is externally connected to the FPGA chip.
To make the NVDLA accelerator more convenient to use, in this embodiment the utility model deploys the accelerator onto an FPGA, which demonstrates both the FPGA's reconfigurability and its capability for fast parallel computation.
The utility model realizes this convolutional neural network accelerator through software-hardware co-design: a series of circuit structures were built and combined, and a hardware verification platform was set up for verification.
It should be noted that the utility model is directed only at providing a hardware circuit system; the circuit connections themselves constitute the significant advance and do not depend on software. The descriptions of images and computation in the specification merely illustrate usage scenarios, and the technical advance of the utility model lies in the hardware connections rather than in software.
The weight parameters of the network are trained with TensorFlow, and the resulting data are then placed in the off-chip memory.
A LeNet-5 network is employed, and a handwritten digit picture data set is run on the accelerator to verify that the accelerator has been deployed successfully.
The NVDLA model comes in large and small configurations; the small configuration is better suited to artificial intelligence chip applications and consumes less energy, so the utility model deploys the small NVDLA model on the FPGA.
The FPGA acceleration system is composed of a main control part, an acceleration part and a storage part.
The main control core consists of a NIOS II CPU, on-chip memory, external memory, and the like; the acceleration part consists of the convolution operation modules; and the storage part consists of a storage management module and memory.
The general flow of acceleration is as follows. The NVDLA and the NIOS II processor work together, with the NVDLA's internal modules corresponding to the operations of the convolutional neural network. The parameter data for convolving each layer structure is first input to the host processor, which sends the required parameters to the various configurable modules, and the modules are then configured. Each reconfigurable module has a dual-port buffer dedicated to storing parameters, so the modules can be linked to one another as required. After completing its operation, each configured module issues an interrupt, signaling task completion to the processor; subsequent operations are then executed according to the processor's instructions, and this flow repeats until all convolution operations are complete. A schematic diagram of the small NVDLA system is shown in fig. 2.
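As an illustration of this configure-then-interrupt pattern, a minimal Verilog sketch of one configurable module follows. It is not taken from the patent: the port names (cfg_wr, op_start, irq_done) are assumptions, and a fixed 256-cycle counter stands in for the real datapath.

```verilog
// Sketch of a configurable module: the host writes a small parameter
// register bank, pulses op_start, and receives irq_done on completion.
module configurable_unit #(
    parameter CFG_WORDS = 4,
    parameter DW        = 16
)(
    input  wire          clk,
    input  wire          rst_n,
    input  wire          cfg_wr,     // parameter write strobe from the host
    input  wire [1:0]    cfg_addr,
    input  wire [DW-1:0] cfg_wdata,
    input  wire          op_start,   // start pulse from the processor
    output reg           irq_done    // one-cycle completion interrupt
);
    reg [DW-1:0] cfg [0:CFG_WORDS-1]; // per-layer parameters (size, stride, ...)
    reg          busy;
    reg [7:0]    cycle_cnt;           // stand-in for the real operation latency

    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            busy      <= 1'b0;
            irq_done  <= 1'b0;
            cycle_cnt <= 8'd0;
        end else begin
            irq_done <= 1'b0;
            if (cfg_wr)
                cfg[cfg_addr] <= cfg_wdata;   // processor configures the module
            if (op_start && !busy) begin
                busy      <= 1'b1;
                cycle_cnt <= 8'd0;
            end else if (busy) begin
                cycle_cnt <= cycle_cnt + 8'd1;
                if (cycle_cnt == 8'd255) begin
                    busy     <= 1'b0;
                    irq_done <= 1'b1;         // signal task completion
                end
            end
        end
    end
endmodule
```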
Hardware modules in the NVDLA hardware architecture are divided into 5 groups: a convolution operation module, an activation module, a pooling module, a fully connected layer module, and a local response normalization module, each corresponding to different specific operations.
The system mainly consists of an FPGA chip connected to an off-chip SDRAM, with a NIOS II CPU integrated in the FPGA chip. NIOS II has the functional characteristics of an embedded CPU, so program functions can be extended as required, giving high flexibility. A convolution operation module, an activation module, a pooling module, a fully connected layer module, and a local response normalization module are designed in the FPGA in the hardware description language Verilog. NIOS II then schedules, through C code, the data transfers between each module in the FPGA and the external SDRAM.
Fig. 3 is the overall circuit diagram of the system. The chips are as follows:
FPGA module: the FPGA chip is an EP4CE115F29C7 of the Cyclone IV series, carried on Altera's DE2-115 development board together with a 50 MHz crystal oscillator, an EPCS serial configuration flash chip, and their peripheral circuits.
Power supply module: mainly provides 3.3 V, 2.5 V, and 1.2 V, used as the FPGA's bank voltages and core voltage (1.2 V).
SDRAM module: mainly comprises peripheral filter capacitors and an SDRAM chip of model W9864G6KH-6, presenting a high-speed bus interface.
Brief introduction to the SDRAM module: the circuit mainly comprises 13 address lines, 2 bank address lines, 16 data lines, 6 control lines, a clock line, and a clock enable line. SDRAM input and output are mainly governed by 4 control lines: SDRAM_CS, SDRAM_CAS, SDRAM_RAS, and SDRAM_WE. SDRAM_DQ in fig. 3 is a bidirectional port that can act as input or output; the device is read when SDRAM_WR is 1 and written when SDRAM_WR is 0.
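These four active-low control lines encode the standard SDRAM command set. The following Verilog sketch (signal and encoding names are assumptions, not the patent's RTL) decodes {CS#, RAS#, CAS#, WE#} into the usual commands:

```verilog
// Standard SDRAM command decode from the four control lines.
module sdram_cmd_decode (
    input  wire       sdram_cs_n,
    input  wire       sdram_ras_n,
    input  wire       sdram_cas_n,
    input  wire       sdram_we_n,
    output reg  [2:0] cmd
);
    localparam CMD_NOP     = 3'd0,
               CMD_ACTIVE  = 3'd1,  // open a row
               CMD_READ    = 3'd2,
               CMD_WRITE   = 3'd3,
               CMD_PRECHG  = 3'd4,  // close a row
               CMD_REFRESH = 3'd5,
               CMD_LOADMR  = 3'd6,  // load mode register
               CMD_DESEL   = 3'd7;  // chip deselected

    always @* begin
        case ({sdram_cs_n, sdram_ras_n, sdram_cas_n, sdram_we_n})
            4'b0111: cmd = CMD_NOP;
            4'b0011: cmd = CMD_ACTIVE;
            4'b0101: cmd = CMD_READ;
            4'b0100: cmd = CMD_WRITE;
            4'b0010: cmd = CMD_PRECHG;
            4'b0001: cmd = CMD_REFRESH;
            4'b0000: cmd = CMD_LOADMR;
            default: cmd = CMD_DESEL; // CS# high
        endcase
    end
endmodule
```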
FIG. 4 is a block diagram of the top level overall circuit architecture showing the various structural units, including the NIOS II processor, SDRAM, FPGA on-chip logic, Avalon bus interconnect, and controller.
The controller hangs on the Avalon bus, receives the configuration information of the convolution layers, and keeps the compute unit and the on-chip caches working in an orderly fashion. The input and output on-chip caches hang between the Avalon bus and the memory for data transfer; the compute unit is connected to the on-chip caches, produces the corresponding output result from the cached input image and weight data, and passes it to the output cache.
FIG. 5 is a chip pin diagram of an SDRAM of an embodiment.
Fig. 6 shows the FPGA configuration circuit. The upper four ports TDI, TDO, TCK, and TMS form the download interface; MSEL in the middle selects the download mode; nCE is the chip-select signal, low by default; the lowest signals DCLK, CONF_DONE, nCONFIG, and nSTATUS form the PS configuration interface, with pull-up resistors attached; and the LED is an indicator showing that the above configuration has completed.
Fig. 7 shows the bank1 region of the FPGA, which has 8 banks in total; the interface standard of each bank is determined by its interface voltage VCCO. The interfaces of bank1 are sufficient to connect the external SDRAM, and the interface connections are mostly concentrated in the bank1 region.
Fig. 8 shows the power module supplying the FPGA: an LDO linear power supply that generates 3.3 V, 2.5 V, and 1.2 V using a TPS7A7001 power chip.
Fig. 9 shows the flash chip externally connected to the FPGA, used to store the program loaded on the FPGA so that it is not lost at the next power-up and the accelerator can run directly every time the FPGA is powered on. The flash is a W25Q64 with 64 Mbit capacity, supporting the standard SPI protocol.
Fig. 10 shows the FPGA's external crystal oscillator chip, model DSB535SG-50M. It is very stable and well suited to communication, which benefits the accelerator's operation; common frequencies are 50 MHz and 10 MHz.
Fig. 11 shows the interface and voltage connections of the FPGA's PLL region. The PLL can generate multiple clock signals from one input clock. A stable PLL configuration is necessary because the accelerator contains designs that cross clock domains, with data transferred between different clock domains by the respective operations.
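Where single-bit signals cross between these clock domains, the usual building block is a two-flip-flop synchronizer. The generic Verilog sketch below illustrates the technique; it is standard practice, not circuitry disclosed in the patent:

```verilog
// Two-stage synchronizer for a single-bit signal entering clk_dst's domain.
module sync2 (
    input  wire clk_dst,   // destination clock
    input  wire rst_n,
    input  wire d_async,   // signal generated in another clock domain
    output wire d_sync
);
    reg q1, q2;
    always @(posedge clk_dst or negedge rst_n) begin
        if (!rst_n) begin
            q1 <= 1'b0;
            q2 <= 1'b0;
        end else begin
            q1 <= d_async; // first stage may go metastable
            q2 <= q1;      // second stage filters the metastability
        end
    end
    assign d_sync = q2;
endmodule
```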
Fig. 12 shows the FPGA downloader interface module; 5x2 means two rows of 5 pins each. It is externally connected to a serial-port circuit to implement serial communication between the FPGA and the outside and to load the program configuration onto the FPGA.
The NVDLA has four interfaces:
CSB: used by the processor to drive the NVDLA through register accesses;
IRQ: an interrupt signal asserted when an NVDLA operation completes or an internal error occurs;
DBBIF: the data backbone bus through which the NVDLA accesses each memory in DMA mode;
SRAMIF: an optional RAM interface for connecting an SRAM outside the system, mainly used by the on-chip cache system.
A controller for connecting internal and external data communication, namely the BDMA, is additionally designed on the FPGA. It comprises three interfaces, CSB, MCIF, and SRAMIF: the CSB is used by the CPU to configure registers, while the MCIF and the SRAMIF read and write the external memory and the internal SRAM, respectively.
BDMA: the input images and processing results are stored in external DRAM, but the external DRAM bandwidth and latency are typically insufficient for NVDLA to fully utilize its MAC array. Thus, NVDLA is configured with an internal memory interface of on-chip SRAM. To utilize on-chip SRAM, NVDLA needs to move data between external DRAM and internal SRAM. BDMA is used for this purpose. There are two directions of data path transfer, one from external DRAM to internal SRAM and the other from internal SRAM to external DRAM for data copying. Neither direction can work simultaneously. The BDMA may also move data from external DRAM to external DRAM, or from internal SRAM to internal SRAM. The BDMA interface module is shown in fig. 13.
The core logic of the NVDLA is the convolution pipeline, i.e. the convolution operation module, whose core is the MAC operation. It accelerates the convolution process, supports per-layer configurable convolution parameters, and raises MAC efficiency to improve overall performance. The MAC module reads the data and weights to be convolved from the SDRAM and then performs the corresponding operations on them.
The MAC circuit diagram is shown in fig. 14, taking a 3 × 3 convolution as an example. In the figure, the two-dimensional convolution is decomposed into the accumulated sum of several shifted one-dimensional convolutions: when a 3 × 3 convolution kernel is convolved with the input pixels in a convolution window, the one-dimensional convolution of each row of pixels in the input image is computed first, and the results of the 3 rows are then accumulated to give the two-dimensional convolution output. Accelerating the two-dimensional convolution therefore requires acquiring the 3 rows of input pixels in the convolution window simultaneously and storing them in RAM. Moreover, since the convolution window moves with a stride of 1, adjacent windows overlap, so a large amount of data is reused during the two-dimensional convolution of the input image. Data reuse reduces the number of memory accesses for data overlapping with the next convolution window and increases the speed of the two-dimensional convolution. The line buffer structure formed by the RAMs on the left of the figure therefore aligns the rows and columns of the image, and together with the pipeline structure after the convolution conversion forms an efficient parallel pipelined convolution circuit. Once a datum is computed, the pipeline registers the result in the result register to provide the partial accumulation of the previous clock cycle to the next stage. This circuit structure fully exploits data reuse, which is present in every stage of the pipeline.
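A minimal Verilog sketch of this line-buffer scheme follows. The interface and data widths are assumptions, not the patent's RTL; for brevity the two row buffers are written as shift registers rather than RAMs, and the nine multiply-adds appear as one expression where the real pipeline would register the three row sums stage by stage.

```verilog
// 3x3 sliding-window convolution fed one pixel per cycle.
module conv3x3 #(
    parameter IMG_W = 28,  // input image width
    parameter DW    = 16   // pixel/weight width
)(
    input  wire                   clk,
    input  wire                   en,        // one new pixel per cycle when high
    input  wire signed [DW-1:0]   pixel_in,
    input  wire [9*DW-1:0]        w_flat,    // nine kernel weights, packed
    output reg  signed [2*DW+3:0] conv_out   // wide enough for 9 products
);
    reg signed [DW-1:0] line1 [0:IMG_W-1];   // delays the stream by one row
    reg signed [DW-1:0] line2 [0:IMG_W-1];   // delays the stream by two rows
    reg signed [DW-1:0] win   [0:2][0:2];    // 3x3 window, column 0 is newest
    integer i;

    always @(posedge clk) if (en) begin
        // row buffers shift one pixel per cycle
        for (i = IMG_W-1; i > 0; i = i - 1) begin
            line1[i] <= line1[i-1];
            line2[i] <= line2[i-1];
        end
        line1[0] <= pixel_in;
        line2[0] <= line1[IMG_W-1];

        // slide the window across the three aligned rows
        for (i = 2; i > 0; i = i - 1) begin
            win[0][i] <= win[0][i-1];
            win[1][i] <= win[1][i-1];
            win[2][i] <= win[2][i-1];
        end
        win[0][0] <= pixel_in;           // current row
        win[1][0] <= line1[IMG_W-1];     // row above
        win[2][0] <= line2[IMG_W-1];     // two rows above

        // sum of the nine products (a real design pipelines the row sums)
        conv_out <= win[0][0]*$signed(w_flat[0*DW +: DW])
                  + win[0][1]*$signed(w_flat[1*DW +: DW])
                  + win[0][2]*$signed(w_flat[2*DW +: DW])
                  + win[1][0]*$signed(w_flat[3*DW +: DW])
                  + win[1][1]*$signed(w_flat[4*DW +: DW])
                  + win[1][2]*$signed(w_flat[5*DW +: DW])
                  + win[2][0]*$signed(w_flat[6*DW +: DW])
                  + win[2][1]*$signed(w_flat[7*DW +: DW])
                  + win[2][2]*$signed(w_flat[8*DW +: DW]);
    end
endmodule
```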
The activation function module uses the sigmoid function, whose main behavior is to output values approaching 0 for large negative inputs and approaching 1 for large positive inputs. To realize the sigmoid activation function on the FPGA, a piecewise linear approximation is used: the nonlinear activation function is divided into several intervals, and the curve on each interval is approximated by a line segment. The accuracy therefore depends on the number of intervals: the more intervals, the higher the accuracy. On the FPGA, only the endpoints of each interval and its slope and intercept need to be stored in RAM, consuming only a small amount of lookup-table resources; each evaluation needs only one multiplication and one addition, consuming few multipliers and adders. This implementation occupies few logic resources while running fast.
Implementing the piecewise linear function in hardware involves three aspects. First, determining the interval in which the input argument lies; the main operation is data comparison, realized with comparators. Second, computing the mapping of the piecewise linear function from input to output, which on each interval involves a multiplication and an addition, realized with a multiplier and an adder. Third, keeping the working clock beats of the hardware modules and the data transfers coordinated; the corresponding hardware structure is the data register module.
Fig. 15 shows the RTL view of the whole activation module. The input data x first enters the comparison module and is compared with the values stored in the comparison-module RAM to find the piecewise line segment on which x lies; the coefficients a and b of that segment's function y = ax + b are then output and sent to the multiplication module and the data register module, respectively. Because the data incurs a certain latency passing through the multiplier, the value b must be delayed by the same latency and fed to the adder together with the multiplication result, finally yielding the sigmoid function value.
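A minimal Verilog sketch of this piecewise linear evaluation follows. The Q8.8 fixed-point format and the four-segment table over |x| are illustrative assumptions, not the patent's coefficients; as described above, b (and here the sign bit) is delayed to match the multiplier latency before the final addition.

```verilog
// Piecewise-linear sigmoid: compare, multiply, add, over three pipeline stages.
module sigmoid_pwl #(
    parameter DW = 16  // Q8.8 signed fixed point (assumption)
)(
    input  wire                 clk,
    input  wire signed [DW-1:0] x,
    output reg  signed [DW-1:0] y
);
    localparam signed [DW-1:0] ONE = 16'sd256; // 1.0 in Q8.8

    // Illustrative segment table over |x|: upper endpoints, slopes, intercepts.
    reg signed [DW-1:0] x_hi [0:3];
    reg signed [DW-1:0] a_t  [0:3];
    reg signed [DW-1:0] b_t  [0:3];
    initial begin
        x_hi[0]=16'sd256;  a_t[0]=16'sd61; b_t[0]=16'sd128; // |x| in [0,1)
        x_hi[1]=16'sd640;  a_t[1]=16'sd33; b_t[1]=16'sd154; // |x| in [1,2.5)
        x_hi[2]=16'sd1280; a_t[2]=16'sd7;  b_t[2]=16'sd219; // |x| in [2.5,5)
        x_hi[3]=16'sd1280; a_t[3]=16'sd0;  b_t[3]=ONE;      // |x| >= 5: saturate
    end

    // Stage 1: take |x| and select the segment (the comparison module).
    wire signed [DW-1:0] abs_x = x[DW-1] ? -x : x;
    reg  signed [DW-1:0] a_s, b_s, x_s;
    reg                  neg_s;
    always @(posedge clk) begin
        x_s   <= abs_x;
        neg_s <= x[DW-1];
        if      (abs_x < x_hi[0]) begin a_s <= a_t[0]; b_s <= b_t[0]; end
        else if (abs_x < x_hi[1]) begin a_s <= a_t[1]; b_s <= b_t[1]; end
        else if (abs_x < x_hi[2]) begin a_s <= a_t[2]; b_s <= b_t[2]; end
        else                      begin a_s <= a_t[3]; b_s <= b_t[3]; end
    end

    // Stage 2: multiply; delay b and the sign to match the multiplier latency.
    reg signed [2*DW-1:0] prod;
    reg signed [DW-1:0]   b_d;
    reg                   neg_d;
    always @(posedge clk) begin
        prod  <= a_s * x_s;
        b_d   <= b_s;
        neg_d <= neg_s;
    end

    // Stage 3: add the intercept; sigmoid(-x) = 1 - sigmoid(x) for negatives.
    always @(posedge clk)
        y <= neg_d ? (ONE - ((prod >>> 8) + b_d))
                   : ((prod >>> 8) + b_d);
endmodule
```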
The pooling layer module uses maximum pooling, which outputs the most characteristically significant pixel, maximizes the pixel features, occupies fewer resources than average pooling, and has simpler logic. Maximum pooling compares the data in the pooling kernel one by one and outputs the maximum value. For a pooling kernel of size N × N, N × N − 1 comparators are typically needed. Fig. 16 shows the circuit of a maximum pooling calculation unit with a 3 × 3 pooling kernel: the data in the buffer are first loaded into shift registers, then fed into comparators simultaneously; the comparison results are compared pairwise until the final result is obtained.
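A minimal Verilog sketch of the 3 × 3 maximum pooling unit follows, with assumed port names. A pairwise comparator tree uses exactly N × N − 1 = 8 comparators, matching the count given above:

```verilog
// Comparator tree finding the maximum of a 3x3 window (8 comparators).
module maxpool3x3 #(
    parameter DW = 16
)(
    input  wire                 clk,
    input  wire signed [DW-1:0] d0, d1, d2, d3, d4, d5, d6, d7, d8,
    output reg  signed [DW-1:0] max_out
);
    function signed [DW-1:0] max2;
        input signed [DW-1:0] a, b;
        begin
            max2 = (a > b) ? a : b;  // one comparator
        end
    endfunction

    // Level 1: four comparators.
    wire signed [DW-1:0] m01 = max2(d0, d1);
    wire signed [DW-1:0] m23 = max2(d2, d3);
    wire signed [DW-1:0] m45 = max2(d4, d5);
    wire signed [DW-1:0] m67 = max2(d6, d7);
    // Level 2: two comparators.
    wire signed [DW-1:0] m03 = max2(m01, m23);
    wire signed [DW-1:0] m47 = max2(m45, m67);
    // Level 3: the last two comparators, then register the result.
    always @(posedge clk)
        max_out <= max2(max2(m03, m47), d8);
endmodule
```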
Fully connected layer: the fully connected layer uses the same multiply-accumulate operation as convolution but with a different connection structure, and its multiply-accumulate does not process a feature map. The weights and input feature values are stored in buffers, the inputs of the MAC unit are connected to the buffer outputs, and the multiply-accumulate control module handles the read/write transfers of each module. The circuit diagram of the fully connected module is shown in fig. 17.
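A minimal Verilog sketch of the fully connected multiply-accumulate follows. Port names and the accumulator width are assumptions; the multiply-accumulate control module described above would assert start once per neuron and valid for each weight/feature pair streamed from the buffers.

```verilog
// One multiply-accumulate per cycle for a fully connected neuron.
module fc_mac #(
    parameter DW  = 16,  // feature/weight width
    parameter ACC = 40   // accumulator width with growth headroom
)(
    input  wire                  clk,
    input  wire                  start,    // clears the accumulator
    input  wire                  valid,    // a weight/feature pair is present
    input  wire signed [DW-1:0]  feature,  // from the input-feature buffer
    input  wire signed [DW-1:0]  weight,   // from the weight buffer
    output reg  signed [ACC-1:0] acc       // neuron pre-activation sum
);
    always @(posedge clk) begin
        if (start)
            acc <= {ACC{1'b0}};
        else if (valid)
            acc <= acc + feature * weight;  // multiply-accumulate
    end
endmodule
```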
Hardware design verification:
after the hardware structure design is completed, next, it is verified whether the design is correct. The structure of the convolutional neural network is composed of an input layer, a hidden layer and an output layer. The input layer can process multidimensional image data, input images need to be subjected to normalization processing before learning data are transmitted to the network, input feature standardization is completed, and the learning efficiency and the expression capacity of the network are improved better. The hidden layer is constructed by a plurality of convolution layers, a pooling layer and a full-connection layer, and can effectively extract the characteristic information in the image. The network on the upper layer of the output layer is generally a full connection layer, so the structure and the principle of the network are the same as those of the output layer of the feedforward neural network, and the output result is directly a classification result, a label and the like.
First, the shape parameters of the LeNet-5 network are input to the FPGA, which then maps the corresponding layer architecture logic according to the input parameters.
The network is tested with the MNIST data set. Each picture is a 28 × 28 matrix with pixel values from 0 to 255; the matrix is first normalized in TensorFlow and then converted to 16-bit binary fixed-point numbers, since the FPGA can only process fixed-point numbers. The pixel values and the TensorFlow-trained weights are then stored in the corresponding memories of the FPGA and its peripherals. The waveform output is observed in ModelSim and compared with the input data. In the ModelSim waveform, xin is the pixel data input to the FPGA from the off-chip SDRAM, and y is the final output classification result, representing the digits 0-9 in one-hot code. The sure output signal represents the comparison of the output y with the label of each input pixel matrix: if y equals the label of the input data, sure outputs a high level 1, otherwise a low level 0. The partial waveforms meet the requirements.
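The described sure check reduces to comparing the one-hot output y with the expected label. A minimal Verilog sketch with assumed names:

```verilog
// Compare the one-hot classification output against the expected label.
module result_check #(
    parameter CLASSES = 10
)(
    input  wire [CLASSES-1:0] y,      // one-hot classification output
    input  wire [3:0]         label,  // expected digit 0-9
    output wire               sure    // 1 when y matches the label
);
    wire [CLASSES-1:0] onehot = {{(CLASSES-1){1'b0}}, 1'b1} << label;
    assign sure = (y == onehot);
endmodule
```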
In summary, the utility model realizes a convolutional neural network accelerator based on an open-source design. Drawing on the low-power, configurable, modular design characteristics of the NVDLA hardware accelerator, the convolution, pooling, and other modules were designed and deployed on an FPGA, with the DE2-115 development board serving as the experiment and verification platform. A software-hardware co-design approach was adopted through the NIOS II soft core, which schedules the data communication between the FPGA and the external SDRAM as well as the communication of each interface. Finally, the simulated circuit tested normally. By deploying different neural networks, the convolutional neural network accelerator can realize applications such as handwritten digit recognition, picture recognition, and face recognition, and is extremely well suited to applying convolutional neural networks on mobile terminals.
Although certain specific embodiments of the present invention have been described in detail by way of example, it should be understood by those skilled in the art that the foregoing examples are for purposes of illustration only and are not intended to limit the scope of the invention. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims (10)

1. An NVDLA artificial intelligence chip hardware system based on an FPGA is characterized by comprising a storage module, a main control module, an acceleration module, a power supply module, a crystal oscillator module and a bus, wherein the storage module, the main control module and the acceleration module can send and receive data through the bus; the main control module and the acceleration module are both arranged on the FPGA chip; the power supply module is connected with the FPGA and the storage module; the crystal oscillator module is externally connected on the FPGA chip.
2. The FPGA-based NVDLA artificial intelligence chip hardware system of claim 1, wherein the model of the main control module is a Nios II embedded processor.
3. The FPGA-based NVDLA artificial intelligence chip hardware system of claim 1, wherein the storage module is SDRAM.
4. The FPGA-based NVDLA artificial intelligence chip hardware system of claim 3, wherein the SDRAM model is W9864G6KH-6.
5. The FPGA-based NVDLA artificial intelligence chip hardware system of claim 1, wherein the acceleration module is an NVDLA.
6. The FPGA-based NVDLA artificial intelligence chip hardware system of claim 1, wherein the bus is an Avalon bus.
7. The NVDLA artificial intelligence chip hardware system based on FPGA of claim 1, wherein the model of the FPGA chip is EP4CE115F29C7 of Cyclone IV series.
8. The FPGA-based NVDLA artificial intelligence chip hardware system of claim 1, wherein the power module is a TPS7A7001 power chip.
9. The FPGA-based NVDLA artificial intelligence chip hardware system of claim 1, wherein the crystal oscillator module is a DSB535SG-50M crystal oscillator chip.
10. The NVDLA artificial intelligence chip hardware system based on FPGA of claim 1, further comprising a flash module connected with the FPGA chip and having a model number of W25Q64.
CN202121614214.XU 2021-07-14 2021-07-14 NVDLA artificial intelligence chip hardware system based on FPGA Active CN215182115U (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202121614214.XU CN215182115U (en) 2021-07-14 2021-07-14 NVDLA artificial intelligence chip hardware system based on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202121614214.XU CN215182115U (en) 2021-07-14 2021-07-14 NVDLA artificial intelligence chip hardware system based on FPGA

Publications (1)

Publication Number Publication Date
CN215182115U true CN215182115U (en) 2021-12-14

Family

ID=79395303

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202121614214.XU Active CN215182115U (en) 2021-07-14 2021-07-14 NVDLA artificial intelligence chip hardware system based on FPGA

Country Status (1)

Country Link
CN (1) CN215182115U (en)

Similar Documents

Publication Publication Date Title
CN109284817B (en) Deep separable convolutional neural network processing architecture/method/system and medium
CN109102065B (en) Convolutional neural network accelerator based on PSoC
JP6773568B2 (en) Arithmetic system and neural network arithmetic method
CN109447241B (en) Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things
US11544191B2 (en) Efficient hardware architecture for accelerating grouped convolutions
CN108665063B (en) Bidirectional parallel processing convolution acceleration system for BNN hardware accelerator
CN108537331A (en) A kind of restructural convolutional neural networks accelerating circuit based on asynchronous logic
CN108205704B (en) Neural network chip
CN112463719A (en) In-memory computing method realized based on coarse-grained reconfigurable array
CN108153190B (en) Artificial intelligence microprocessor
CN113051216B (en) MobileNet-SSD target detection device and method based on FPGA acceleration
WO2021007042A1 (en) Non-volatile memory based processors and dataflow techniques
CN111210019A (en) Neural network inference method based on software and hardware cooperative acceleration
CN111582465B (en) Convolutional neural network acceleration processing system and method based on FPGA and terminal
KR20210052188A (en) System and method for hierarchical sort acceleration near storage
CN103714044A (en) Efficient matrix transposition cluster and transposition method based on network-on-chip
CN115456155A (en) Multi-core storage and calculation processor architecture
CN112306951B (en) CNN-SVM resource efficient acceleration architecture based on FPGA
CN215182115U (en) NVDLA artificial intelligence chip hardware system based on FPGA
CN105955896A (en) Reconfigurable DBF algorithm hardware accelerator and control method
CN113127407A (en) Chip architecture for AI calculation based on NVM
Yoshida et al. The approach to multiple instruction execution in the GMICRO/400 processor
Lee et al. Accelerating Deep Neural Networks Using FPGAs and ZYNQ
CN113378115B (en) Near-memory sparse vector multiplier based on magnetic random access memory
Bai et al. An OpenCL-based FPGA accelerator with the Winograd’s minimal filtering algorithm for convolution neuron networks

Legal Events

Date Code Title Description
GR01 Patent grant