CN113139519A - Target detection system based on fully programmable system on chip

Info

Publication number: CN113139519A
Application number: CN202110529675.5A
Authority: CN
Inventors: 王明伟; 时凯胜; 陈凤兰; 黄叶祺; 闫瑞; 王钊; 王诗鹏; 罗宇; 迟青松; 田甜
Original assignee: Shaanxi University of Science and Technology
Current assignee: Shaanxi University of Science and Technology
Priority date: 2021-05-14
Filing date: 2021-05-14
Publication date: 2021-07-20
Anticipated expiration: 2041-05-14
Also published as: CN113139519B

Abstract

The invention discloses a target detection system based on a fully programmable system on chip, comprising a PL terminal and a PS terminal. The PS terminal acquires a video stream or pictures and preprocesses them, and the PL terminal uses a deep neural network to perform target detection on the video stream or pictures. The PS terminal is implemented on an ARM processor and the PL terminal on an FPGA. Combining the two technologies reduces the size and power consumption of the product while retaining strong performance, so the system can meet the target detection requirements of mobile platforms.

Description

Target detection system based on fully programmable system on chip

Technical Field

The invention relates to the technical field of image data processing, in particular to a target detection system based on a fully programmable system on a chip.

Background

An artificial neural network, also called a neural network, is a core technology of artificial intelligence. It is an adaptive system whose topological structure is designed by imitating the operation of a biological neural network. The network is formed by connecting many artificial neurons, each of which is activated by an activation function. When information outside the network changes and neurons are activated, signals can flow along new paths, which is how the network adapts.

However, in image-based or video-based object detection, an artificial neural network consumes a very large amount of storage and computing resources, so the processing of a video stream or a large number of pictures usually relies on a GPU (graphics processor) or APU (accelerated processor) server. Such servers are large and consume a lot of power, so neural-network-based object detection is not suitable for deployment on mobile platforms.

Disclosure of Invention

The embodiment of the invention provides a target detection system based on a fully programmable system on a chip, which addresses the problem in the prior art that artificial neural networks consume so many resources that neural-network-based target detection is not suitable for deployment on mobile terminals.

In one aspect, an embodiment of the present invention provides a target detection system based on a fully programmable system on a chip, including: a PL terminal and a PS terminal, implemented on an FPGA and an ARM processor respectively;

the PS terminal is used for acquiring video streams or pictures and sending the video streams or pictures to the PL terminal;

the PL terminal includes: a communication module, a data transfer module and a deep neural network module;

the communication module is used for sending the video stream or the picture acquired by the PS terminal to the data transfer module;

the data transfer module is used for storing the received video stream or picture in the storage unit and sending the video stream or picture stored in the storage unit to the deep neural network module;

the deep neural network module is used for carrying out target identification according to the video stream or the picture.

In one possible implementation, the deep neural network module may include: a neuron module; the neuron module is used for reading the network parameters stored in the storage unit and training the deep neural network in the deep neural network module.

In one possible implementation, the communication module may include: an AXI Stream slave station, an AXI Lite slave station, and an AXI Stream master station; the AXI Stream slave station is used for receiving a transmission command from the user logic and controlling the transfer operation of the data transfer module according to the transmission command; a lookup table is embedded in the AXI Lite slave station, the neuron module reads data from the lookup table, and the coefficients of the deep neural network are processed in the neurons; the AXI Stream master station is used for transmitting the data output by the deep neural network module to the data transfer module.

In one possible implementation, the PL terminal may further include: a BRAM module; the BRAM module uses dual-port input and output to speed up the passage of data through the deep neural network module.

In one possible implementation, the deep neural network module may include: a data path module and a control path module; the data path module includes an input routing network, a functional module and a result routing network; the input routing network is used for routing data input into the deep neural network module to the appropriate functional module, the functional module is used for performing arithmetic, logic and relational operations, and the result routing network is used for routing the data output by the functional module to the storage unit and storing it there; the control path module is used for managing the execution order of the parts of the data path module.

In one possible implementation, the PL terminal may further include: a control module; the control module is used for controlling the timing of each module in the deep neural network module.

In a possible implementation manner, the PS terminal obtains a video stream or a picture from the camera, performs preprocessing on the video stream or the picture, and sends a result after the preprocessing to the PL terminal.

In one possible implementation, the preprocessing of the video stream or picture by the PS terminal may include: converting the video stream into pictures, converting these pictures, together with the pictures acquired from the camera, into grayscale images, compressing the grayscale images, and sending the compressed grayscale images to the PL terminal.

In one possible implementation, the memory unit may be a DDR memory.

In a possible implementation manner, the platform can further comprise a display, and the PL end can further comprise a display module; and after the data transfer module sends the target recognition result output by the deep neural network module to the display module, the display module controls the display to display the target recognition result.

The target detection system based on the fully programmable system on chip has the following advantages:

The system combines the characteristics and advantages of both an ARM CPU and an FPGA, which makes it well suited to hardware/software co-design. It is small, powerful and low in power consumption, and is therefore well suited to deployment on mobile platforms such as cars, unmanned aerial vehicles, robots and medical equipment.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a functional block diagram of a target detection system based on a fully programmable system on a chip according to an embodiment of the present invention;

FIG. 2 is a functional block diagram of the PL terminal;

FIG. 3 is a schematic diagram of a deep neural network module;

FIG. 4 is a schematic diagram of a communication module;

FIG. 5 is a block diagram of an FPGA.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the prior art, when target detection is implemented with an artificial neural network, the network consumes a relatively large amount of hardware resources, so a GPU or APU server is needed when target detection is performed on a video stream or a large number of pictures. Both kinds of server offer strong performance but are large and consume a lot of power, so current neural-network-based target detection cannot be applied in constrained environments such as mobile platforms.

In order to solve the problems in the prior art, the invention provides a target detection system based on a fully programmable system on chip, comprising a PL terminal and a PS terminal. The PS terminal acquires a video stream or pictures and preprocesses them, and the PL terminal uses a deep neural network to perform target detection on the video stream or pictures. The PS terminal is implemented on an ARM processor and the PL terminal on an FPGA. Combining the two technologies reduces the size and power consumption of the product while retaining strong performance, so the system can meet the target detection requirements of mobile platforms.

FIG. 1 is a functional block diagram of a target detection system based on a fully programmable system on a chip according to an embodiment of the present invention, FIG. 2 is a functional block diagram of the PL terminal, FIG. 3 is a schematic diagram of a deep neural network module, FIG. 4 is a schematic diagram of a communication module, and FIG. 5 is a block diagram of an FPGA. In an embodiment of the present invention, a target detection system based on a fully programmable system on a chip includes a PL terminal and a PS terminal, implemented on an FPGA and an ARM processor respectively.

Illustratively, PL stands for Programmable Logic and PS stands for Processing System. In addition to the functions described above, the PS terminal also invokes the deep neural network module and enables and resets the whole system through instruction addresses.

In the embodiment of the present invention, Xilinx SDK (Software Development Kit) is used as Development Software for interaction between the PS side and the PL side. After the PS terminal based on ARM implementation acquires the video stream or the picture, the PL terminal based on FPGA implementation can be called to enter a working mode.

The data transfer module includes an MM2S (Memory Mapped to Stream) module and an S2MM (Stream to Memory Mapped) module. The MM2S module transfers data from the storage unit to the AXI Stream domain, and the S2MM module transfers data from the AXI Stream domain to the memory-mapped AXI domain; the data transfer module also has a reset block and error signals. The MM2S and S2MM modules operate independently in full-duplex mode. The data transfer module keeps each transfer within a 4 KB address boundary, partitions transfers automatically, can use the full AXI4-Stream bandwidth, and can handle multiple queued transfer requests. It supports byte-level data transfers and allows memory transfers to start at any specified address. The MM2S module and the S2MM module each have a separate command interface, and received commands are queued from one end, one per clock cycle. The width of the command word is chosen at design time so that the parts remain compatible for high-speed data transmission. Specifically, if the system uses 32-bit AXI addresses, the command word is 72 bits wide. If the system address space is wider than 32 bits, the command word is extended by the required number of bytes. For example, a system with 64-bit addresses requires a 104-bit command word to accommodate the wider address field. Because the command interface is an AXI4-Stream interface, the address field must be a whole number of bytes, that is, a multiple of 8 bits. If the address space is configured with 33 bits, the address field in the command is padded to 40 bits. This maintains compatibility for high-speed data transmission, and the command format is the same for the MM2S module and the S2MM module. The command format allows a single command to specify a transfer of 1 byte to 8,388,607 bytes.
The communication module automatically breaks large transfers down into bursts that comply with the requirements of the AXI4 protocol.
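The command word widths quoted above can be checked with a short sketch. It assumes that the command word consists of a fixed 40-bit part (inferred from the 72-bit word with a 32-bit address) plus the address field rounded up to a whole number of bytes; the function and constant names are hypothetical and are not part of the described design.

```python
# Assumption: non-address part of the command word, inferred from 72 - 32.
FIXED_BITS = 72 - 32
# Upper limit of a single command as stated in the text (2**23 - 1 bytes).
MAX_BYTES_PER_COMMAND = 8_388_607


def padded_address_bits(address_bits: int) -> int:
    """Round the address field up to the next multiple of 8 bits."""
    return -(-address_bits // 8) * 8


def command_word_bits(address_bits: int) -> int:
    """Total command word width for a given address width."""
    return FIXED_BITS + padded_address_bits(address_bits)


def commands_needed(total_bytes: int) -> int:
    """Number of commands needed to move total_bytes, one chunk per command."""
    return -(-total_bytes // MAX_BYTES_PER_COMMAND)


if __name__ == "__main__":
    for bits in (32, 33, 64):
        print(bits, padded_address_bits(bits), command_word_bits(bits))
```

Under these assumptions, a 33-bit address space gives a 40-bit address field, consistent with the padding described above.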

In one possible embodiment, the deep neural network module may include: a neuron module; the neuron module is used for reading the network parameters stored in the storage unit and training the deep neural network in the deep neural network module.

Illustratively, the neuron module includes a multiplier, an accumulator and a finite state machine, where the finite state machine serves as the activator. The neuron module trains the deep neural network with the network parameters to obtain the network weights and network biases, converts the obtained weights and biases into binary values, and stores them in the storage unit.

In an embodiment of the present invention, the deep neural network in the deep neural network module is a multilayer perceptron designed in a fixed-point number system. The number system represents both positive and negative numbers, with negative numbers stored in two's complement. In the multilayer perceptron, each neuron is connected to the neurons of the previous layer: the input data is multiplied in the multiplier, the products are summed in the accumulator, and the result is passed on to the next neuron, iterating and accumulating in sequence. In other embodiments, the deep neural network may also be a convolutional neural network, a YOLO (You Only Look Once) network, an SSD (Single Shot MultiBox Detector) network, and the like. The activation function used in the deep neural network is the sigmoid function.

The multilayer perceptron consists of an input layer, hidden layers, a fully connected layer and an output layer. To fit the actual resource and performance limits of the development board, the number of neurons in each layer must be chosen carefully.
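The multiply-accumulate behaviour of one fixed-point neuron can be illustrated with the following minimal sketch. The 16-bit word length, the 8 fractional bits, the saturation on overflow and all names are assumptions made for illustration; the sigmoid is evaluated in floating point here, whereas the described hardware uses its own activator.

```python
import math

WORD_BITS = 16  # assumed word length
FRAC_BITS = 8   # assumed number of fractional bits


def to_fixed(x: float) -> int:
    """Quantize x and return its two's-complement bit pattern (WORD_BITS wide)."""
    raw = int(round(x * (1 << FRAC_BITS)))
    lo, hi = -(1 << (WORD_BITS - 1)), (1 << (WORD_BITS - 1)) - 1
    raw = max(lo, min(hi, raw))            # saturate on overflow (assumption)
    return raw & ((1 << WORD_BITS) - 1)    # two's-complement encoding


def signed(bits: int) -> int:
    """Interpret a WORD_BITS-wide bit pattern as a signed integer."""
    return bits - (1 << WORD_BITS) if bits & (1 << (WORD_BITS - 1)) else bits


def from_fixed(bits: int) -> float:
    """Convert a two's-complement bit pattern back to a real value."""
    return signed(bits) / (1 << FRAC_BITS)


def neuron(inputs, weights, bias):
    """Multiply-accumulate over all inputs, add the bias, apply the sigmoid."""
    acc = 0  # wide accumulator holding 2 * FRAC_BITS fractional bits
    for x, w in zip(inputs, weights):
        acc += signed(to_fixed(x)) * signed(to_fixed(w))
    acc += signed(to_fixed(bias)) << FRAC_BITS  # align bias with the products
    z = acc / (1 << (2 * FRAC_BITS))
    return 1.0 / (1.0 + math.exp(-z))
```

For example, `to_fixed(-1.0)` yields the bit pattern `0xFF00` in this format, which shows how a negative value is held in two's complement.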

In one possible embodiment, the communication module may include: an AXI Stream slave station, an AXI Lite slave station, and an AXI Stream master station; the AXI Stream slave station is used for receiving a transmission command from the user logic and controlling the transfer operation of the data transfer module according to the transmission command; a lookup table is embedded in the AXI Lite slave station, the neuron module reads data from the lookup table, and the coefficients of the deep neural network are processed in the neurons; the AXI Stream master station is used for transmitting the data output by the deep neural network module to the data transfer module.

Illustratively, the AXI4 bus protocol is the most important part of the AMBA (Advanced Microcontroller Bus Architecture) 3.0 protocol proposed by ARM, and is an on-chip bus oriented to high performance, high bandwidth and low latency. Commonly used variants include AXI4-Lite and AXI4-Stream. AXI4-Lite is a lightweight memory-mapped interface for single-word transfers that occupies few logic cells. AXI4-Stream is oriented to high-speed streaming data; because it has no address channel, it allows bursts of unlimited size.

In the embodiment of the present invention, the AXI Lite slave station is a custom module created in Vivado, in which a new memory-mapped interface is defined and a lookup table is embedded. To prevent part of the logic from being optimized away during synthesis, a (* dont_touch = "true" *) attribute is also added.

When the communication module is designed, the AXI4-Stream protocol is combined with AXI memory-mapped IP cores, so that data can be sent to and from the storage unit using DMA (Direct Memory Access) and IP cores of the AXI protocol. Within the communication module, the AXI Stream master and slave stations are memory mapped. They are among the IP cores connected through the Xilinx AXI interconnect, which can be used to exchange data between one or more AXI masters and slaves.

In one possible embodiment, the PL terminal further comprises: a BRAM module; the BRAM module uses dual-port input and output to speed up the passage of data through the deep neural network module.

Illustratively, BRAM, i.e. Block RAM, is the RAM on the PL side of the ZYNQ device. The PS terminal accesses the BRAM module through an instruction address. Because dual-port input and output is used, data passes through the IP core in the deep neural network module twice as fast as with a single-port BRAM.

The deep neural network must be initialized before use. Initialization is controlled by the PS terminal: Matlab and Python scripts write the network biases and network weights into a C header file, so that the coefficients of the deep neural network can be conveniently initialized, referenced, and sent to the BRAM module for loading.

In one possible embodiment, the deep neural network module includes: a data path module and a control path module; the data path module includes an input routing network, a functional module and a result routing network; the input routing network is used for routing data input into the deep neural network module to the appropriate functional module, the functional module is used for performing arithmetic, logic and relational operations, and the result routing network is used for routing the data output by the functional module to the storage unit and storing it there; the control path module is used for managing the execution order of the parts of the data path module.

Illustratively, the deep neural network module is designed by jointly designing the control unit and control path with the data unit and data path, which allows the system to implement more complex control behavior. Each state of the finite state machine determines the configuration of the data path, and a condition computed in the data path determines the next state of the state machine.

The data path module has the following interfaces: an input data interface, through which the data to be processed is supplied; an output data interface, which delivers the data processing results; a control interface, through which the control path supplies the control signals of the data path; and a state output interface, which reports the current state of the data path to the control path.

Since most of the computation is data conversion, and this conversion takes several processing steps, the control path module sequences these steps over multiple clock cycles in the hardware that implements the process. The interfaces of the control path module include a state input interface and a state output interface: the state input interface receives data from the control interface, and the state output interface reports the current state of the system to the off-chip environment.
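The header-generation step can be pictured with the following minimal Python sketch. The array names, the include guard and the `int16_t` element type are hypothetical choices for illustration; the source only states that scripts place the biases and weights into a C header file.

```python
def c_array(name, values):
    """Render one list of integer coefficients as a C array definition."""
    body = ", ".join(str(v) for v in values)
    return f"static const int16_t {name}[{len(values)}] = {{ {body} }};"


def make_header(guard, arrays):
    """Build the text of a C header holding every coefficient array."""
    lines = [f"#ifndef {guard}", f"#define {guard}", "", "#include <stdint.h>", ""]
    for name, values in arrays.items():
        lines.append(c_array(name, values))
    lines += ["", f"#endif /* {guard} */", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical fixed-point coefficients for one layer.
    coefficients = {"layer0_weights": [256, -64, 128], "layer0_bias": [0]}
    print(make_header("NET_COEFFS_H", coefficients))
```

Keeping the coefficients in a generated header means the PS-side C program can reference them by name when it loads them into the BRAM module.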
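The interplay between the state machine and the data path can be sketched in software as follows. The three states, the class and the function names are hypothetical; the sketch only shows the general pattern in which the current state selects the data path operation and a condition computed in the data path selects the next state.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    ACCUMULATE = auto()
    DONE = auto()


class Datapath:
    """Holds the operands and performs the operation selected by the state."""

    def __init__(self, inputs, weights):
        self.inputs, self.weights = list(inputs), list(weights)
        self.index = 0
        self.acc = 0

    def step(self, state):
        """Execute one cycle and return the status flag 'all inputs consumed'."""
        if state is State.IDLE:
            self.index, self.acc = 0, 0
        elif state is State.ACCUMULATE:
            self.acc += self.inputs[self.index] * self.weights[self.index]
            self.index += 1
        return self.index >= len(self.inputs)


def next_state(state, all_consumed):
    """Control path: choose the next state from the data path's status flag."""
    if state in (State.IDLE, State.ACCUMULATE):
        return State.DONE if all_consumed else State.ACCUMULATE
    return State.DONE


def run(inputs, weights):
    """Clock the controller and data path together until DONE is reached."""
    datapath = Datapath(inputs, weights)
    state, trace = State.IDLE, [State.IDLE]
    while state is not State.DONE:
        flag = datapath.step(state)
        state = next_state(state, flag)
        trace.append(state)
    return datapath.acc, trace
```

Calling `run([1, 2, 3], [4, 5, 6])` steps through one idle cycle and three accumulate cycles before reaching the final state.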

In one possible embodiment, the PL terminal further comprises: a control module; the control module is used for controlling the timing of each module in the deep neural network module.

Illustratively, the control module includes a state machine with 12 states, implemented with 3 independent processes, of which 2 are sequential and 1 is combinational. Of the 12 states, some are idle states, some are synchronization states, and the others are states in which data is processed.

In a possible embodiment, the PS terminal obtains a video stream or picture from the camera, performs preprocessing on the video stream or picture, and sends a result of the preprocessing to the PL terminal.

Illustratively, the preprocessing of the video stream or picture by the PS terminal includes: converting the video stream into pictures, converting these pictures, together with the pictures acquired from the camera, into grayscale images, compressing the grayscale images, and sending the compressed grayscale images to the PL terminal.

Specifically, in the preprocessing process, the PS terminal converts the RGB picture into a grayscale image with gray levels from 0 to 255, and scales the grayscale image by a factor of 4.

In an embodiment of the present invention, the Xilinx SDK tool is used to write the predefined values, such as the input values, the number of hidden-layer neurons, the number of hidden layers, the output values, and the sizes of the network biases and network weights, together with pointers to the previous layer, the current layer, the network biases, the network weights and the hidden-layer storage, and the various initial values of the preprocessed image.
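The preprocessing step can be sketched as follows. The source does not specify how the grayscale value is computed or how the scaling is performed, so the luminance weights (0.299, 0.587, 0.114), the interpretation of the scale value as a downscale by 4 in each dimension, and the block-averaging method are all assumptions; the function names are hypothetical.

```python
def rgb_to_gray(image):
    """Convert rows of (r, g, b) tuples into rows of gray levels in 0..255."""
    return [
        [min(255, int(round(0.299 * r + 0.587 * g + 0.114 * b))) for (r, g, b) in row]
        for row in image
    ]


def downscale(gray, factor=4):
    """Shrink a grayscale image by averaging factor x factor blocks."""
    height = len(gray) // factor
    width = len(gray[0]) // factor if gray else 0
    out = []
    for by in range(height):
        row = []
        for bx in range(width):
            block = [
                gray[by * factor + dy][bx * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            ]
            row.append(sum(block) // len(block))
        out.append(row)
    return out


if __name__ == "__main__":
    # An 8 x 8 test picture: left half black, right half white.
    picture = [[(0, 0, 0)] * 4 + [(255, 255, 255)] * 4 for _ in range(8)]
    print(downscale(rgb_to_gray(picture)))
```

With a factor of 4, an 8 x 8 picture is reduced to 2 x 2 values, which reduces the amount of data sent to the PL terminal by a factor of 16 under these assumptions.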

In one possible embodiment, the memory unit is a DDR memory.

Illustratively, the DDR memory is Double Data Rate SDRAM, i.e. double data rate synchronous dynamic random access memory.

In a possible embodiment, the platform further comprises a display, and the PL terminal further comprises a display module; and after the data transfer module sends the target recognition result output by the deep neural network module to the display module, the display module controls the display to display the target recognition result.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A system for target detection based on a fully programmable system on a chip, comprising: a PL terminal and a PS terminal, implemented on an FPGA and an ARM processor respectively;

the data transfer module is used for storing the received video stream or picture in a storage unit and sending the video stream or picture stored in the storage unit to the deep neural network module;

2. The system for target detection based on a fully programmable system on a chip of claim 1, wherein the deep neural network module comprises: a neuron module;

the neuron module is used for reading the network parameters stored in the storage unit and training the deep neural network in the deep neural network module.

3. The system for target detection based on a fully programmable system on a chip of claim 2, wherein the communication module comprises: an AXI Stream slave station, an AXI Lite slave station, and an AXI Stream master station;

the AXI Stream slave station is used for receiving a transmission command from user logic and controlling the transfer operation of the data transfer module according to the transmission command;

a lookup table is embedded in the AXI Lite slave station, the neuron module reads data in the lookup table, and coefficients of the deep neural network are processed on neurons;

the AXI Stream master station is used for transmitting the data output by the deep neural network module to the data transfer module.

4. The system for target detection based on a fully programmable system on a chip of claim 1, wherein the PL terminal further comprises: a BRAM module;

the BRAM module uses dual-port input and output to speed up the passage of data through the deep neural network module.

5. The system for target detection based on a fully programmable system on a chip of claim 1, wherein the deep neural network module comprises: a data path module and a control path module;

the data path module includes an input routing network, a functional module and a result routing network; the input routing network is used for routing data input into the deep neural network module to the appropriate functional module, the functional module is used for performing arithmetic, logic and relational operations, and the result routing network is used for routing the data output by the functional module to the storage unit and storing it there;

the control path module is used for managing the execution sequence of each part in the data path module.

6. The system for target detection based on a fully programmable system on a chip of claim 5, wherein the PL terminal further comprises: a control module;

the control module is used for controlling the timing of each module in the deep neural network module.

7. The system for target detection based on a fully programmable system on a chip of claim 1, wherein the PS terminal obtains the video stream or picture from a camera, preprocesses the video stream or picture, and sends the preprocessed result to the PL terminal.

8. The system for target detection based on a fully programmable system on a chip of claim 7, wherein the preprocessing of the video stream or picture by the PS terminal comprises: converting the video stream into pictures, converting these pictures, together with the pictures acquired from the camera, into grayscale images, compressing the grayscale images, and sending the compressed grayscale images to the PL terminal.

9. The system for target detection based on a fully programmable system on a chip of claim 1, wherein the storage unit is a DDR memory.

10. The system for target detection based on a fully programmable system on a chip of claim 1, further comprising a display, wherein the PL terminal further comprises a display module;

and after the data transfer module sends the target recognition result output by the deep neural network module to the display module, the display module controls the display to display the target recognition result.