CN115457363B - Image target detection method and system - Google Patents

Image target detection method and system

Info

Publication number
CN115457363B
CN115457363B (application CN202210957661.8A / CN202210957661A)
Authority
CN
China
Prior art keywords
network model
module
data
image
feature extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210957661.8A
Other languages
Chinese (zh)
Other versions
CN115457363A (en)
Inventor
骆爱文
李媛
路畅
刘旭彬
陈之奂
郑烨
易清明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan University
Original Assignee
Jinan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan University filed Critical Jinan University
Priority to CN202210957661.8A priority Critical patent/CN115457363B/en
Publication of CN115457363A publication Critical patent/CN115457363A/en
Application granted granted Critical
Publication of CN115457363B publication Critical patent/CN115457363B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/94 Hardware or software architectures specially adapted for image or video understanding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/94 Hardware or software architectures specially adapted for image or video understanding
    • G06V 10/955 Hardware or software architectures specially adapted for image or video understanding using specific electronic processors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Neurology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of machine vision and provides an image target detection method and system. The method comprises the following steps: constructing a first network model, a second network model and a third network model for image target detection, each comprising a feature extraction module, a feature fusion module and an output module, wherein the feature extraction module of the first network model is obtained through network compression, the feature extraction module of the second network model introduces a bottleneck structure, and the feature extraction and feature fusion modules of the third network model adopt an FPN structure; generating corresponding IP cores from the first, second and third network models, and mounting the designed IP cores on a hardware system; and acquiring an image to be detected, preprocessing it, calling the IP core on the hardware system adapted to the image's specification to execute image target detection, and outputting the target detection result.

Description

Image target detection method and system
Technical Field
The invention relates to the technical field of machine vision, in particular to an image target detection method and system.
Background
Image processing is a key technology of the information age that gives electronic systems the ability to perceive, analyze, and shape the world. Traditional image processing techniques focused mainly on handcrafted mathematical algorithms or feature descriptors; since the beginning of this century, biologically inspired deep neural networks (DNNs) have gained widespread popularity. As one of the basic directions in the field of machine vision, target detection technology has been developed for decades and has accumulated a large body of mature research results.
Current image target detection methods mainly rely on deep learning with neural networks. In recent years a number of landmark networks for target detection have emerged, and the representative YOLO series of algorithms shows good results on the two key indicators of detection speed and accuracy. Most current computing platforms for accelerating deep learning use GPUs, which achieve real-time processing but suffer from high power consumption and low resource utilization. For large-size image target detection, the YOLO series needs many convolution layers for feature extraction: as network depth increases, mid-level features and comprehensive high-level features must be extracted progressively, and the number of convolution layers required for feature extraction at each scale reaches three or more. Current neural network algorithms for target detection therefore struggle to meet the low-cost, low-power requirements of practical settings such as network edges and mobile application scenarios.
Disclosure of Invention
The invention provides an image target detection method and system to overcome the high computation cost and high power consumption of prior-art neural network algorithms for image target detection.
In order to solve the technical problems, the technical scheme of the invention is as follows:
an image object detection method, comprising the steps of:
S1, constructing a first network model, a second network model and a third network model for image target detection; the first, second and third network models each comprise a feature extraction module, a feature fusion module and an output module; the feature extraction modules of the first and second network models both comprise a lightweight feature extraction backbone network; the feature extraction module of the second network model introduces a bottleneck structure on the lightweight backbone, and the feature extraction module and feature fusion module of the third network model adopt an FPN structure;
S2, generating corresponding digital integrated circuit IP cores from the first, second and third network models respectively, and mounting the IP cores on a hardware system;
S3, acquiring an image to undergo target detection, preprocessing it, calling the digital integrated circuit IP core on the hardware system adapted to the image's specification to execute the image target detection operation, and outputting the target detection result.
Furthermore, the invention also provides an image target detection system to which the above image target detection method is applied. The image target detection system comprises a hardware system carrying a digital integrated circuit IP core and is used to perform image target detection on an input image. The digital integrated circuit IP core comprises a calculation module based on any one of the first, second and third network models and realized through digital circuit design. The feature extraction modules of the first and second network models both comprise a lightweight feature extraction backbone network; the feature extraction module of the second network model introduces a bottleneck structure into the lightweight backbone, and the feature extraction module and feature fusion module of the third network model adopt an FPN structure.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects: the invention constructs lightweight network models for image target detection, applies acceleration optimization to the software algorithm, and ports the result to low-power, low-cost heterogeneous acceleration platforms, thereby reducing the computation cost and power consumption of the neural network algorithm for image target detection. The invention also accelerates computation through hardware design, achieving real-time target detection with low computation cost and low power consumption on a resource-limited embedded development board.
Drawings
Fig. 1 is a flowchart of an image object detection method of the present invention.
Fig. 2 is a schematic diagram of a first network model Lite-1 according to the present invention.
FIG. 3 is a diagram of a second network model Lite-2 of the present invention.
FIG. 4 is a schematic diagram of a third network model Lite-3 according to the present invention.
Fig. 5 is a graph showing the comparison of the performance of the first network model and the second network model in example 1.
Fig. 6 is a graph of speed versus accuracy for the first network model and the second network model of example 1.
Fig. 7 is a flowchart of the overall design of the software and hardware of the object detection system in embodiment 2.
Fig. 8 is a flow chart of the software control program in embodiment 2.
Fig. 9 is a diagram of the hardware system in embodiment 2.
Fig. 10 is a diagram of the internal resource distribution architecture of the ARM+FPGA heterogeneous computing platform in example 2.
Fig. 11 to 13 are graphs comparing the detection results of the pure software platform and the heterogeneous platform carrying the third network model in example 2.
Fig. 14 is the overall architecture and design flow chart of the hardware system based on HLS and network-port (Ethernet) access design in embodiment 3.
Fig. 15 is a comprehensive circuit structure of the object detection hardware system in embodiment 3.
FIG. 16 is a schematic diagram showing the connection between the PL logic top layer design and the external memory DDR in embodiment 3.
Fig. 17 is a schematic diagram of the functional module and the connection relationship inside the target detection hardware acceleration IP core in embodiment 3.
Fig. 18 is a schematic diagram of the internal structure of the object detection hardware acceleration IP core and its connection to the external IP core in embodiment 3.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
Some well-known structures in the drawings, and their descriptions, may be omitted and will be understood by those skilled in the art.
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
Example 1
The present embodiment proposes an image target detection method, as shown in fig. 1, which is a flowchart of the image target detection method of the present embodiment.
The image target detection method provided by the embodiment comprises the following steps:
s1, constructing a first network model Lite-1, a second network model Lite-2 and a third network model Lite-3 for image target detection.
S2, designing the corresponding calculation modules of the first, second and third network models with an HLS tool to generate IP cores, and mounting the designed IP cores on a hardware system.
S3, acquiring an image to undergo target detection, preprocessing it, calling the operation IP core on the hardware system adapted to the image's specification to execute image target detection, and outputting the target detection result.
In this embodiment, the feature extraction modules in the first network model and the second network model both include a lightweight feature extraction backbone network; and the feature extraction module in the second network model introduces a bottleneck structure into the lightweight feature extraction backbone network, and the feature extraction module and the feature fusion module in the third network model adopt FPN structures.
In a specific embodiment, the first network model is a YOLO v2 neural network subjected to network compression, the second network model is a YOLO v2 neural network introducing a bottleneck structure, and the third network model is a YOLO v4 neural network adopting an FPN structure. As shown in fig. 2 to 4, the first network model, the second network model, and the third network model are respectively constructed.
The feature extraction of a traditional YOLO v2 neural network is mainly performed by the Darknet-19 network, together with a Route layer and a Reorg layer that combine the low- and mid-level features extracted by the network front end from high-resolution feature maps with the comprehensive high-level features extracted by the network back end from low-resolution feature maps. This greatly reduces feature loss during network iteration and preserves recognition accuracy under fast recognition.
In an alternative embodiment, the lightweight feature extraction backbone network in the first network model and the second network model comprises a plurality of feature extraction convolution layers, wherein the feature extraction convolution layers of all resolution sizes are compressed into 1 layer.
In a specific embodiment, traditional YOLO v2 is compressed using a neural network compression method based on layer-wise pruning. In the first network model obtained after pruning, the feature extraction convolution layers at each resolution are reduced to a single layer, which greatly reduces the number of network parameters and the computational complexity, and substantially shortens the time consumed from feature extraction, dimension reduction and dimension change through to feature integration, classification and localization.
In one embodiment, the network architecture parameters of the first network model are shown in Table 1 below, where B is the number of bounding boxes and C is the number of detectable target categories.
Table 1 Network architecture parameters of the first network model
In an alternative embodiment, the second network model is obtained by taking the first network model and replacing the standard convolution layers used for feature extraction in its network with a bottleneck structure.
The bottleneck structure is an hourglass-shaped structure, currently implemented mainly by stacking 1×1 and 3×3 convolution layers. To achieve the same feature extraction result, the bottleneck structure requires far fewer parameters than a standard convolution structure. For example, for an input feature map of 26×26×64 (length × width × channels), the channel count can be adjusted by one 3×3 convolution with 32 output channels (channel dimension reduction) followed by one 1×1 convolution with 128 output channels (channel dimension increase/recovery), which also enables interaction of feature information across channels.
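As illustrative arithmetic for the 26×26×64 example above (ignoring biases, and assuming the standard alternative is a single 3×3 convolution from 64 to 128 channels), the parameter counts are:

```latex
P_{\mathrm{std}} = 3 \cdot 3 \cdot 64 \cdot 128 = 73{,}728
\qquad
P_{\mathrm{bottleneck}} = \underbrace{3 \cdot 3 \cdot 64 \cdot 32}_{18{,}432} + \underbrace{1 \cdot 1 \cdot 32 \cdot 128}_{4{,}096} = 22{,}528
```

That is, roughly 3.3 times fewer parameters for the same output shape.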
In a specific embodiment, the architecture of the second network model is shown in fig. 3. The second network model replaces all standard convolution layers used for feature extraction in the first network model with three-layer bottleneck blocks, except for the first convolution layer, whose input dimension is only 3 and which is kept unchanged so that it provides a receptive field while extracting features.
In the second network model of this embodiment, each feature extraction stage first reduces the dimension with a 1×1 single-kernel convolution layer, halving the channel dimension of the input feature map and lowering the computational complexity of the subsequent convolution; a 3×3 convolution is then applied to the reduced feature map to extract features, with input and output dimensions unchanged and, again, fewer parameters than a standard convolution; finally, another single-kernel convolution layer raises the dimension so that the output reaches the expected depth. This dimension-raising step integrates the features, extends the extracted features to the normal depth, and ensures sufficient feature information. Table 2 below lists the network architecture parameters of the second network model, where B is the number of bounding boxes and C is the number of detectable target categories.
Table 2 Network architecture parameters of the second network model
In one implementation, the software algorithm performance of the first network model (Lite-1) and the second network model (Lite-2) is verified. The results of performance verification and comparison for the two optimized networks are shown in table 3 and fig. 5 below.
Table 3 Performance comparison of the lightweight improved networks
As can be seen from Table 3 and fig. 5, Lite-1 and Lite-2 of this embodiment change in the same direction in network performance: recognition accuracy decreases, the loss function increases, and inference speed improves greatly. After two rounds of optimization, the recognition speed of Lite-2 reaches 111.5 pictures/second, nearly twice that before the improvement; recognition accuracy (mAP) still reaches 62.4%, which, although somewhat lower than that of YOLO v2, remains a high level of recognition capability.
To observe and compare the network performance of Lite-1 and Lite-2 more intuitively, they are compared with currently popular one-stage networks; fig. 6 plots target detection network speed versus accuracy. As can be seen, the speed of Lite-2 far exceeds that of the other networks, approaching 2 times that of the original YOLO v2 network and more than 3.4 times that of YOLO, making it better suited for deployment on hardware systems with limited resources and power budgets.
In an alternative embodiment, the FPN structure in the third network model includes a bottom-up feature extraction path and a top-down feature fusion path; the feature extraction module sits on the feature extraction path and extracts and samples features from feature maps of decreasing size, while the feature fusion module sits on the feature fusion path and fuses features across feature maps of increasing size.
Further, in an alternative embodiment, the feature extraction module comprises a Conv basic convolution unit and a ResBlock residual unit. The output module comprises several parallel output-end structures used to generate feature map decoding information at the corresponding scales for target prediction; each output-end structure comprises at least a feature map channel adjustment layer, a normalization layer, a nonlinear conversion layer and a feature integration layer connected in sequence. The output module obtains the final target detection result by fusing the decoding information output by the multiple output-end structures. The ResBlock residual unit is the basic building block of a residual neural network; the Conv basic convolution unit may be implemented with a standard convolution or a depthwise separable convolution, followed in sequence by two common operations: a BN batch normalization operation and a Leaky ReLU activation function.
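A minimal C++ sketch of the Conv basic unit's post-convolution stages described above (BN followed by Leaky ReLU), applied per output element; the 0.1 negative slope and the epsilon value are illustrative assumptions rather than values taken from the patent:

```cpp
#include <cmath>

// Conv basic unit post-processing: batch normalization followed by a
// Leaky ReLU activation, applied to each convolution output element.
// gamma/beta/mean/var are per-channel BN parameters; the 0.1 negative
// slope is an assumption (commonly used in YOLO variants).
float bn_leaky_relu(float conv_out, float gamma, float beta,
                    float mean, float var, float eps = 1e-5f) {
    float norm = gamma * (conv_out - mean) / std::sqrt(var + eps) + beta;
    return norm > 0.0f ? norm : 0.1f * norm;  // Leaky ReLU
}
```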
In a specific embodiment, the third network model is obtained by improving the YOLOv4 network. Its feature extraction module comprises a Conv basic convolution unit and a ResBlock residual unit and follows the bottom-up path of the FPN structure; the feature fusion module follows the top-down path of the FPN structure; the output module consists of a basic convolution layer and a 3×3 convolution layer.
The network architecture of the third network model is shown in fig. 4. The feature extraction module is responsible for extracting rich feature information from the input image; to counter the vanishing-gradient problem, the base-layer feature map is duplicated and the copy is passed on to the next stage. The third network model integrates the gradient changes into the feature map from end to end, so the model architecture stays concise and lightweight while retaining the original advantages.
The middle part mainly fuses feature information across feature maps of different sizes. The Lite-3 network adopts an FPN (feature pyramid network) structure. The FPN is a feature extractor designed around the feature pyramid concept to improve accuracy and speed; unlike the spatial pyramid pooling network and path aggregation network used in YOLOv4, the FPN structure of this embodiment includes a bottom-up feature extraction path and a top-down feature fusion path. The bottom-up path is the usual process of extracting features with a convolutional network, in which spatial resolution gradually decreases while the semantic information of each layer increases as higher-level structure is detected.
The network output part combines the basic convolution (Conv+BN+Leaky ReLU) with a 3×3 Conv. Feature integration is mainly performed by the 3×3 convolution, the channel count is adjusted by a 1×1 convolution, and the network output end consists of two convolution layers and an activation function layer. The multiple output-end structures generate the corresponding outputs, namely feature maps at multiple different scales, for subsequent picture prediction; this is also the decoding process of the network. The two convolution operations at the network end (Conv+BN+Leaky ReLU + 3×3 Conv), while accounting for the smallest share of the network, further reduce network parameters, which is one of the core elements of the lightweight network model.
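For context, the decoding referred to above follows the standard YOLO box parameterization (these are the published YOLOv2/YOLOv4 decode equations, quoted here as background rather than taken from the patent text); with grid-cell offset (c_x, c_y) and anchor size (p_w, p_h):

```latex
b_x = \sigma(t_x) + c_x, \qquad b_y = \sigma(t_y) + c_y, \qquad
b_w = p_w e^{t_w}, \qquad b_h = p_h e^{t_h}, \qquad
\sigma(t_o) = \Pr(\text{object}) \cdot \mathrm{IoU}
```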
In this embodiment, by constructing lightweight network models for image target detection, applying acceleration optimization to the software algorithm, and porting the result to low-power, low-cost heterogeneous acceleration platforms, the computation cost and power consumption of the neural network algorithm for image target detection are reduced.
Example 2
The present embodiment is improved on the basis of the image object detection method proposed in embodiment 1.
The tools or software used in this example are examples for clarity of illustration of the invention and are not limiting of the embodiments of the invention.
In step S2, the corresponding computing modules are designed with an HLS tool to generate the IP cores, which are then ported to the hardware system. This comprises the following steps:
S2.1, designing FPGA hardware circuits with the HLS tool according to the first, second and/or third network models and, once the hardware circuit design is complete, synthesizing them into HDL-level digital integrated circuit IP cores; importing the generated IP cores into Vivado software, building the corresponding FPGA project, creating a top-level file, adding computing resource constraints, running synthesis and simulation, finally generating the bit stream file of the target detection hardware circuit, and exporting it to the SDK software.
S2.2, carrying out the software design in the SDK: configuring the system environment according to the bit stream file, including creating a computing resource support package and building the application project for function development and testing, and generating an executable file in ELF format.
S2.3, downloading the bit stream file and the executable file to the on-board chip to obtain the hardware system for image target detection.
Fig. 7 is a flowchart of the overall design of the software and hardware of the target detection system in this embodiment.
Using the above steps, the bit stream file and the executable file are downloaded to the development board chip; the resulting hardware system for image target detection is a device that has been acceleration-optimized in both software and hardware.
The software part mainly generates control signals to direct the data computation end (the FPGA end) to perform data computation and transmission. The flow chart of the software control program is shown in fig. 8 and mainly comprises network initialization, image data processing, starting and calling the hardware IP cores, and displaying the result. The FPGA end is mainly responsible for the computation of the convolution and sampling modules. Using the HLS tool, this embodiment applies 16-bit fixed-point quantization of the data, loop blocking and unrolling, pipeline design, input/output channel parallelism and a ping-pong caching mechanism to the convolution and sampling operations at the FPGA end, and uses these optimizations to accelerate convolution and sampling computation.
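A minimal sketch of the 16-bit fixed-point quantization mentioned above; the Q8.8 split (8 integer bits, 8 fractional bits) is an assumption, since the patent only specifies a 16-bit fixed-point format:

```cpp
#include <cstdint>
#include <algorithm>

// 16-bit fixed-point conversion. The Q8.8 layout is an assumption; the
// patent states only that 16-bit fixed-point quantization is applied.
constexpr int FRAC_BITS = 8;

int16_t to_fixed(float x) {
    float scaled = x * (1 << FRAC_BITS);
    // Saturate to the representable range instead of wrapping around.
    scaled = std::min(std::max(scaled, -32768.0f), 32767.0f);
    return static_cast<int16_t>(scaled);
}

float to_float(int16_t q) {
    return static_cast<float>(q) / (1 << FRAC_BITS);
}
```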
Further, in an alternative embodiment, the hardware system includes a state control end for data transmission and distribution, a functional operation end for convolution and sampling computation, and a data bus responsible for data reads/writes and timing logic control. Fig. 9 shows the hardware system architecture of this embodiment. The state control end uses an ARM processor, the functional operation end uses an FPGA, and the data bus uses the AXI bus.
The FPGA end comprises a convolution IP core and a sampling IP core which are used for accelerating network calculation, wherein the convolution IP core and the sampling IP core comprise functions of data loading, scale transformation, calculation, transmission and the like.
In one embodiment, the convolution IP core is primarily responsible for the 3×3 and 1×1 convolution computations in the third network model, and the sampling IP core is mainly used for the up-sampling and down-sampling computations. Fig. 10 shows the internal resource distribution architecture of the ARM+FPGA heterogeneous computing platform of this embodiment. This embodiment realizes the hardware acceleration IP core of the target detection system on the FPGA. The hardware acceleration IP core includes a data input module, a data output module, an input/output buffer module, a weight buffer module, and a target detection calculation module; the target detection calculation module comprises the convolution IP core and the sampling IP core. The sampling IP core further comprises a sampling calculation module executing the up-sampling and down-sampling computations of the first, second or third network model, and the convolution IP core further comprises a convolution calculation module executing the convolution computations of the first, second or third network model.
Before the IP cores in the FPGA end are started for computational acceleration, the ARM end initializes and configures them over the AXI bus according to the structure of the current network model, and the data input module stores the input picture data and weight data into the corresponding buffers by address. The convolution calculation module and/or the sampling calculation module then fetch data from the buffers to work, and the convolution and/or sampling results are stored back into the corresponding buffers by address as they are produced. The data output module writes the buffered results to the off-chip DDR memory over the AXI bus by address; after all data have been output, the IP core issues a stop signal over the AXI bus to pause or end the operation of the device.
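A bare-metal-style sketch of this ARM-side control sequence for one accelerator run; the base address, register offsets and bit positions are hypothetical stand-ins for the HLS-generated AXI register map:

```cpp
#include <cstdint>

// Hypothetical AXI-Lite register map of the acceleration IP core.
volatile uint32_t* const IP_BASE =
    reinterpret_cast<volatile uint32_t*>(0x43C00000);  // assumed base address
constexpr int REG_CTRL     = 0x00 / 4;  // bit0 = start, bit1 = done (assumed)
constexpr int REG_IN_ADDR  = 0x10 / 4;  // DDR address of input data
constexpr int REG_W_ADDR   = 0x18 / 4;  // DDR address of weights
constexpr int REG_OUT_ADDR = 0x20 / 4;  // DDR address of the output buffer

void run_layer(uint32_t in_addr, uint32_t w_addr, uint32_t out_addr) {
    IP_BASE[REG_IN_ADDR]  = in_addr;   // configure buffers for this layer
    IP_BASE[REG_W_ADDR]   = w_addr;
    IP_BASE[REG_OUT_ADDR] = out_addr;
    IP_BASE[REG_CTRL]     = 0x1;       // assert start
    while ((IP_BASE[REG_CTRL] & 0x2) == 0) { /* poll until done */ }
}
```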
In a specific embodiment, the third network model (Lite-3, n=6) is designed with the HLS tool to generate an IP core, which is then mounted on a hardware system to build the FPGA+ARM heterogeneous platform. The board file for model xc7z020clg400-1 is selected in the HLS and Vivado tools for synthesis and simulation. After synthesizing the Lite-3 (n=6) target detection network, the estimated resource results are: LUT (look-up table) consumption 84%; BRAM on-chip memory consumption 57%; FF (flip-flop) consumption 50%; DSP consumption 71%.
The PYNQ-z2 development board has a dual-core ARM processor plus FPGA architecture; its chip, designed by Xilinx, uses a 28 nm process, and based on this process the system clock frequency is set to 100 MHz. The Lite-3 (n=6) target detection system is built in Vivado, and simulation and synthesis yield the power consumption estimate: the dynamic and static power required by the chip are about 2.162 W and 0.172 W respectively, for a total system power of about 2.334 W, which gives the system a clear low-power advantage over other systems.
Further, the detection speed and accuracy of the target detection system are compared and analyzed between an existing pure-software computation mode and the accelerated FPGA+ARM heterogeneous platform used in this specific embodiment.
In terms of detection speed, the FPGA+ARM computation-accelerated target detection model of this embodiment is compared with an existing target detection system using a pure-software computation mode. The software baseline was developed on a CPU with the Dev-C++ tool: the Lite-3 network was built in the C language to detect pictures, using the same weights as the hardware system designed in this embodiment. The prediction output of this CPU-based pure-software algorithm is compared with the heterogeneous platform output to verify correctness; the average time to process one frame is about 8.865 s.
The FPGA+ARM heterogeneous platform operates by having the PS end call the PL-end IP cores repeatedly; the ARM clock is 650 MHz and the FPGA clock is 100 MHz. Pictures are predicted with the hardware-optimized Lite-3 (n=6) network and the average time to predict one frame is measured. Owing to computational limitations, pre-trained weight files are used directly (dataset VOC2007 + VOC2012 + COCO). Testing single-image prediction, the average detection time on the FPGA+ARM heterogeneous platform is 777.674 ms, an average speed-up of 11.4 times.
Regarding picture recognition accuracy: because the FPGA+ARM heterogeneous platform adopts 16-bit fixed-point quantization to gain speed, the recognition results differ slightly from the accuracy obtained under a traditional deep learning framework. To isolate the change in detection accuracy, the pure-software platform of this embodiment performs target detection with a Lite-3 (n=6) convolutional neural network built on the PyTorch deep learning framework; comparisons of the detection results are shown in figs. 11-13. Figs. 11(a), 12(a) and 13(a) are the pure-software results, and figs. 11(b), 12(b) and 13(b) are the heterogeneous platform results.
In fig. 11, the pure-software platform detects: car 97%, bicycle 94%, dog 92%; the heterogeneous platform detects: car 96%, bicycle 92%, dog 93%. In fig. 12, the pure-software platform detects: car 96%, horse 76%, dog 97%; the heterogeneous platform detects: person 100%, horse 74%, dog 97%. Fig. 13 shows multi-object image target recognition: compared with predicting a few non-overlapping objects, the fixed-point quantization of the heterogeneous platform introduces a small error when detecting images with many, overlapping objects, but the overall recognition accuracy remains high and the error is within an acceptable range.
In summary, the detection results show that the target boxes produced by the heterogeneous-platform-accelerated target detection system of this embodiment are slightly shifted relative to the pure-software baseline, so target detection accuracy decreases slightly, but overall the detection effect is consistent with that before quantization.
Table 4 below compares the comprehensive performance parameters of the software platform and the heterogeneous platform; the software platform runs at 1.80 GHz with 8 GB of memory, and the heterogeneous platform runs at 650+100 MHz with 512 MB of memory. When running the Lite-3 (n=6) network, the power and time figures show the low power consumption of the ARM+FPGA platform when processing pictures, roughly one fifth of the CPU's; in terms of processing time, detecting one picture takes about 0.778 s after hardware acceleration, 11.4 times faster than the software platform (which does not use a deep learning framework).
Table 4 Lite-3 (n=6): comparison of pure-CPU software platform and heterogeneous platform performance
In summary, in terms of both speed and accuracy, the design of this embodiment maintains good recognition performance, with the speed gain being the most pronounced: by calling the hardware modules to accelerate the basic operations of the convolution and sampling network, the run time is greatly reduced, while the quantized network still retains good recognition accuracy after hardware acceleration.
Example 3
The present embodiment is improved on the basis of the image object detection method proposed in embodiment 1.
In step S2, the corresponding computing modules are designed with an HLS tool to generate the IP cores, which are then ported to the hardware device. This comprises the following steps:
S2.1, designing the calculation modules of the first, second and third network models with the HLS tool, verifying them, and packaging them into independent operation IP cores.
S2.2, assembling each operation IP core, wiring it up and simulating it on the Vivado platform, and generating a hardware configuration file.
S2.3, importing the hardware configuration file provided by the Vivado platform into the Jupyter Notebook platform over an Ethernet connection to the development board, and completing the integration of the hardware device for image target detection through Python routines for data import, image preprocessing, parameter setting, post-processing, and feature extraction function calls.
Fig. 14 shows the overall architecture and design flow of the hardware system of this embodiment, designed around HLS and network-port (Ethernet) access.
In this embodiment, the calculation modules of the neural network are designed with the C/C++ programming features of the Vivado HLS tool to generate the operation IP cores. The hardware system is then wired up on the Vivado 2019.1 platform, and finally the functional modules and resources on the hardware are called from the Jupyter Notebook of the PYNQ system to form a fully functional target detection system.
With this design, the upper computer connects to the board's network port and opens the Jupyter Notebook tool in a browser, so the network model can be designed and adjusted directly; network training parameters need not be burned into the SD card.
Further, the hardware system of this embodiment includes a functional operation end (FPGA) for performing image target detection and an off-chip DDR memory for storing weight and picture data. Fig. 15 shows the integrated circuit structure of the target detection hardware system of this embodiment.
In this embodiment, the FPGA terminal includes a calculation module, a timing control module, a logic control module, and a transmission protocol module; the computing module comprises a digital integrated circuit IP core of the first network model, the second network model and/or the third network model, wherein the digital integrated circuit IP core comprises a data loading unit, a convolution unit, a pooling unit, a reforming unit, an output unit and a plurality of storage blocks. The transmission protocol module acquires image data and weight data from the off-chip memory through the Ethernet and transmits the image data and the weight data to the calculation module.
The FPGA end in this embodiment mainly realizes the control and coordination of the whole board through the logic control module, obtains the weights, pictures and other data from the DDR memory through the transmission protocol part, and then performs the computing operations such as feature extraction and dimension reduction of the first, second and/or third network models designed in this embodiment; the timing of the computing process is closely tied to the timing control module for reset and related signals.
A schematic diagram of the connection of the PL logic top layer design to the external memory DDR is shown in FIG. 16. The calculation module belonging to the PL part in this embodiment is mainly implemented based on the HLS platform. The computing module receives the input image data and the weight data through different data loading units respectively, stores the image data in 4 storage blocks and stores the weight data in 2 storage blocks. The convolution unit extracts needed image data and weight data from the corresponding storage blocks to perform feature extraction on images with different sizes or convolution operation on feature images, and interacts with the pooling unit in the network iteration process to realize feature dimension reduction.
In the working process of the convolution unit and the pooling unit, the size and the dimension of the feature map are adjusted through the reforming unit, and finally the convolution unit transmits a data result containing the classification and positioning information of the image object to be detected to the output unit, and the output unit stores and outputs the detection result through the two storage blocks.
Further, the convolution unit of this embodiment operates as follows (see the sketch after this list):
1) compute a feature value from the input feature map and the corresponding convolution kernel within a single channel, the kernel traversing the whole image from the starting position of the feature map to produce a new feature map;
2) repeat the first operation on the input feature map across all channels, accumulating the feature values at corresponding positions to form a complete usable single-channel output feature map;
3) repeat the two steps above for every convolution kernel to obtain the complete multi-channel output feature map.
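A nested-loop sketch of the three steps above, assuming stride 1 and no padding (tensors are flattened to 1-D here, as the next paragraph explains); IC/OC are input/output channel counts, OH/OW the output size, K the kernel size:

```cpp
// Steps 1-3 as loops: traverse one channel (step 1), accumulate over
// all input channels (step 2), repeat for every kernel (step 3).
void conv(const float* in, const float* w, float* out,
          int IC, int OC, int OH, int OW, int K) {
    for (int oc = 0; oc < OC; ++oc)                // step 3: all kernels
        for (int oh = 0; oh < OH; ++oh)
            for (int ow = 0; ow < OW; ++ow) {      // step 1: traverse image
                float acc = 0.0f;
                for (int ic = 0; ic < IC; ++ic)    // step 2: all channels
                    for (int kh = 0; kh < K; ++kh)
                        for (int kw = 0; kw < K; ++kw)
                            acc += in[(ic * (OH + K - 1) + oh + kh)
                                      * (OW + K - 1) + (ow + kw)]
                                 * w[((oc * IC + ic) * K + kh) * K + kw];
                out[(oc * OH + oh) * OW + ow] = acc;
            }
}
```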
Since the input feature information and weight information are multidimensional variables that cannot be fetched and transferred quickly, they must be flattened into one-dimensional variables in advance.
In addition, to fully exploit the parallelism of the FPGA end, this embodiment cuts the one-dimensional input feature information into four parts and performs 4-way parallel multiply-accumulate operations simultaneously, greatly accelerating the convolution unit.
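A minimal HLS-style sketch of this 4-way split, assuming the flattened length LEN is a multiple of 4; pragma placement follows common Vivado HLS usage:

```cpp
// 4-way parallel multiply-accumulate over a flattened 1-D input.
float mac4(const float* in, const float* w, int LEN) {
    float part[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int i = 0; i < LEN; i += 4) {
#pragma HLS PIPELINE II=1
        for (int j = 0; j < 4; ++j) {
#pragma HLS UNROLL
            part[j] += in[i + j] * w[i + j];  // four MACs per iteration
        }
    }
    return part[0] + part[1] + part[2] + part[3];
}
```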
Further, in this embodiment, the data generated by the convolution unit and the pooling unit are stored in an array form, the array is completely divided, and a register is allocated for each array element to store the data.
The convolution and pooling modules of the HLS design use a large number of arrays to store data, each array holding a large amount of data information. In HLS designs, FPGAs typically store arrays in BRAMs, and a BRAM has at most two read ports, which greatly limits the transfer rate of large-scale data.
An array is normally stored in one BRAM, sustaining data traffic through at most two ports. After the array is partitioned, its data can be stored across several BRAMs, increasing the number of interfaces for moving data out and therefore the throughput. This embodiment partitions the arrays in complete mode, assigning each array element to its own register; this consumes a large amount of on-board resources but also maximizes operating efficiency.
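A sketch of both partitioning modes discussed here and later, in Vivado HLS pragma syntax; the buffer sizes and the partition factor are illustrative assumptions:

```cpp
void buffers_example() {
    // Complete partitioning: every element becomes its own register,
    // removing the two-read-port BRAM bottleneck.
    float line_buf[64];
#pragma HLS ARRAY_PARTITION variable=line_buf complete dim=1

    // Partial partitioning of a 3-D feature-map buffer: only the
    // channel dimension is split across several BRAMs, as discussed
    // later for high-dimensional data on resource-limited boards.
    static float fmap[32][26][26];
#pragma HLS ARRAY_PARTITION variable=fmap cyclic factor=4 dim=1
    // ... computation reading line_buf and fmap in parallel ...
}
```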
Further, the pooling unit in this embodiment is a maximum pooling layer with flexibly configurable parameters: input data input, output data output, pooling size Kernel_size, pooling stride Kernel_stride, output channel count TM, output feature height H, and output feature width W. Pooling does not change the channel count of the processed feature data, only the feature map size, and the size change depends mainly on the two parameters Kernel_size and Kernel_stride.
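A max-pooling sketch using the configurable parameters named above; the float type is illustrative (the real design uses 16-bit fixed-point data), and the index arithmetic assumes no padding:

```cpp
// Max pooling with configurable Kernel_size/Kernel_stride; TM, H, W
// describe the output shape. Pooling leaves the channel count TM
// unchanged and only shrinks the spatial size.
void max_pool(const float* input, float* output,
              int Kernel_size, int Kernel_stride, int TM, int H, int W) {
    int in_h = (H - 1) * Kernel_stride + Kernel_size;  // input height
    int in_w = (W - 1) * Kernel_stride + Kernel_size;  // input width
    for (int tm = 0; tm < TM; ++tm)
        for (int h = 0; h < H; ++h)
            for (int w = 0; w < W; ++w) {
#pragma HLS PIPELINE II=1
                float best = -3.4e38f;
                for (int kh = 0; kh < Kernel_size; ++kh)
                    for (int kw = 0; kw < Kernel_size; ++kw) {
                        float v = input[(tm * in_h + h * Kernel_stride + kh)
                                        * in_w + (w * Kernel_stride + kw)];
                        best = v > best ? v : best;
                    }
                output[(tm * H + h) * W + w] = best;
            }
}
```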
Further, the convolution unit and pooling unit in this embodiment use the loop unrolling technique and the pipeline technique to optimize the intra-layer convolution and pooling loops in parallel.
The input feature maps of the convolution and pooling units are three-dimensional variables (length, width, channels), and each layer's result feeds the next, so computation across layers cannot run simultaneously. However, there is no computational dependence between different positions and different channels of a feature map, so the results are unaffected, and different convolution kernels are mutually parallel; the intra-layer convolution and pooling loops can therefore be optimized in parallel with the loop unrolling (UNROLL) technique and the PIPELINE technique to improve computational efficiency.
Reflecting the distinction between the UNROLL and PIPELINE directives, in an alternative embodiment UNROLL is applied to loops with moderate, fixed trip counts, such as data loading, array assignment or value transformation, to maximize parallel operation. For deeply nested loops with many iterations, or loops whose trip count varies with external input, such as convolution and pooling computation, PIPELINE is used instead, raising data-processing parallelism as far as resources allow and trading a certain amount of computational resources for higher operating efficiency.
In this embodiment, the pipeline, loop unrolling and data splitting techniques contribute to the optimization of the network computation acceleration part, maximizing computational efficiency while keeping computation cost and power consumption low.
Further, in step S2, the designed and packaged first, second and/or third network models are made into an IP core with efficient convolutional neural network computing capability.
As shown in fig. 17, the functional module and the connection relationship of the functional module in the target detection hardware acceleration IP core according to this embodiment are shown.
In this embodiment, the IP core receives data stored in the DDR from the PS end through five AXI4 protocol ports, receives control instructions through a further AXI protocol port, and also has a dedicated clock port and a system reset port. While the target detection system runs, the IP core is responsible for feature extraction and for integrating the classification and localization data. It first loads the weight data and the picture data preprocessed by the PS through the receiving ports, performs feature extraction in the convolution part, and alternates data between the convolution and pooling parts as the features iterate, transforming the feature map size and channels, until the computation finishes with data containing the target classes and predicted positions for the original picture, which are returned to the PS part through the five AXI ports for subsequent processing.
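A top-level interface sketch matching the port description above (five AXI4 master data ports plus one control port), in Vivado HLS pragma syntax. The argument names, bundle labels and 16-bit data type are assumptions; the clock and reset ports are generated automatically by HLS:

```cpp
#include <cstdint>

// Five m_axi data ports (weights, image, bias, feature maps in/out)
// plus an s_axilite control port carrying layer configuration.
void detect_top(volatile int16_t* weights, volatile int16_t* img,
                volatile int16_t* bias, volatile int16_t* fm_in,
                volatile int16_t* fm_out, int layer_cfg) {
#pragma HLS INTERFACE m_axi port=weights offset=slave bundle=AXI_W
#pragma HLS INTERFACE m_axi port=img     offset=slave bundle=AXI_IMG
#pragma HLS INTERFACE m_axi port=bias    offset=slave bundle=AXI_B
#pragma HLS INTERFACE m_axi port=fm_in   offset=slave bundle=AXI_FIN
#pragma HLS INTERFACE m_axi port=fm_out  offset=slave bundle=AXI_FOUT
#pragma HLS INTERFACE s_axilite port=layer_cfg bundle=CTRL
#pragma HLS INTERFACE s_axilite port=return    bundle=CTRL
    // ... load data, run convolution/pooling, write results to DDR ...
}
```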
In addition, a system interface is required to control the information interaction between the PS and PL parts of the development board and the PS end's instruction control and parameter configuration of the PL end; the AXI Interconnect IP core and the AXI SmartConnect IP core connect the processing system with the specific functions of the FPGA hardware; and the system reset IP and clock IP provide unified clock signals and reset operations for the system.
Fig. 18 shows the internal structure of the target detection hardware acceleration IP core required by the PL-end FPGA hardware system, with the above modules wired and interconnected according to the functional requirements, and its connections to external IP cores. The wired IP core functions are integrated together, and the mutual coordination of the various interconnect resources realizes the on-board resource arrangement of the hardware acceleration scheme for the large-scale computation of the target detection network. The design is passed through Vivado's Generate Bitstream function to produce a Bitstream file containing the layout information for the PYNQ-z2 development board, so that the host computer can direct the PL part, via the PS, to perform the computing operations.
In step S2, the system operating in the Jupyter Notebook connection mode first connects to the upper computer through a network cable for the Ethernet link, and the dedicated Jupyter Notebook interface then connects to the IP address of the PYNQ-z2 development board; the Bitstream file containing the Overlay and IP design, the .hwh file containing the parameter configuration information and the TCL script file containing the IP information are uploaded to the development board memory, and Python code is then written to adjust and integrate the hardware function modules of the development board.
In an alternative embodiment, the hardware system performs the following steps at runtime:
(1) Data import: import the bit stream file, read the hardware configuration file, and allocate memory for the weight, image and bias data to be used.
(2) Image preprocessing: process and reorganize the picture data read in using Python's NumPy library, scale a picture of arbitrary size to the preset size and store it as an array; compress the array holding the picture data from a three-dimensional variable to a one-dimensional variable, and at the same time convert the values from floating point to 16-bit unsigned integers that the hardware can compute with.
(3) Network setup and parameter setting: perform the hardware-level calls and parameter settings required for network operation for each module of the first, second and third network models; the parameters include layer count, layer type, feature map input/output sizes and hardware resource address information.
(4) Feature extraction: run the convolution and pooling operations on the preprocessed pictures according to the first, second or third network model and the set parameters to obtain the classification and localization information of the targets to be detected.
(5) Image post-processing: parse the data containing the classification and localization information, and eliminate redundant prediction boxes with high overlap (see the IoU sketch after this list); from the classification and localization information, compute the prediction values of the prediction boxes and correct them accordingly to obtain accurate localization of the detected targets, displaying the detection boxes in the image.
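A sketch of the overlap test used to eliminate redundant prediction boxes in step (5); the (x1, y1, x2, y2) box layout and the 0.5 threshold are illustrative assumptions:

```cpp
#include <algorithm>

struct Box { float x1, y1, x2, y2, score; };

// Intersection-over-union of two boxes; boxes whose IoU with a
// higher-scoring box exceeds a threshold (e.g. 0.5) are discarded.
float iou(const Box& a, const Box& b) {
    float iw = std::max(0.0f, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
    float ih = std::max(0.0f, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
    float inter = iw * ih;
    float uni = (a.x2 - a.x1) * (a.y2 - a.y1)
              + (b.x2 - b.x1) * (b.y2 - b.y1) - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}
```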
In the specific implementation, an ARM+FPGA heterogeneous platform is used (a ZYNQ development board with an XC7Z020 chip as the core chip); the connection between the ZYNQ development board and the upper computer (a PC) is realized over Ethernet, the development board is controlled and called for functions and data, image target detection is performed by invoking the hardware system described above, and the target detection result is output; in this embodiment the detection result is displayed directly on the Jupyter platform.
In a specific implementation, a PYNQ-z2 development board is selected as the device platform for FPGA hardware computation acceleration, and the on-board ARM (i.e. CPU) handles data transfer control. This embodiment applies the UNROLL and PIPELINE techniques for loop unrolling and pipelining; loop unrolling greatly improves parallel computing capability at the cost of higher resource occupation. Table 5 below shows the measured effect of the UNROLL and PIPELINE operations on running speed, where FF is a flip-flop and LUT is a look-up table.
Table 5 Effect of the UNROLL and PIPELINE operations on running speed
As the table shows, the occupancy of all hardware resources except BRAM rises markedly after loop optimization, meaning that loop unrolling and pipelining need more computing resources to support them; the data read and storage patterns are unchanged, so no additional BRAM storage is used.
To raise the data throughput of the hardware system, the array interfaces are optimized here. The large arrays holding loaded data are split with the ARRAY_PARTITION technique; from the hardware perspective, a single dual-port BRAM carrying a large amount of single-element data is replaced by registers, which adds data transfer interfaces and improves parallelism. Table 6 below shows the verification test results for these interface optimizations.
Table 6 Verification test results for array interface optimization
As the table shows, after partitioning with the ARRAY_PARTITION technique, BRAM usage drops slightly and flip-flop usage rises. In this embodiment, one- and two-dimensional arrays are split completely, converting their loading from BRAM to registers and greatly reducing BRAM utilization (resource consumption). However, for three- or even four-dimensional arrays such as feature map data (three dimensions: width, height, channels) and weight data (four dimensions: input channel, width, height, output channel), registers cannot efficiently store such high-dimensional data. Therefore, given the limited on-board resources of a low-cost edge device, this embodiment uses partial partitioning, completely splitting one or two dimensions so that data loading expands from a single BRAM to several BRAM memories. It can also be seen that although the resource utilization of the optimized arrays drops somewhat, once the loop optimization and array optimization are combined, the system's overall resource utilization rises greatly, including the utilization of the look-up tables (LUTs).
After the above tests verified each optimization scheme, the performance and functionality of the whole system were tested as follows. First, the power consumption of Lite-1 in this embodiment is estimated with the Xilinx Power Estimator (XPE), which derives its estimate from the resource utilization, logic usage, clock frequency, and other information of the system design. The simulation gives an estimated total power consumption of 2.447 W for the hardware system of this embodiment, of which dynamic power is 2.272 W (93%) and static power is 0.176 W (7%), so the hardware system can serve low-power application scenarios. Within the dynamic power, the processing-system module (PS7) containing the ARM processor of the PYNQ board consumes 1.284 W, the largest share. The power consumption of the remaining modules of the hardware acceleration system ranks, from largest to smallest: signals (15%), clocks and mixed clocks (11%), combinational logic (9%), DSP computation (5%), and BRAM memory access (2%).
The overall results of this embodiment show that single-frame detection on the Lite-1-loaded hardware system takes 2.6 seconds, of which the FPGA computation takes 1.2 seconds, picture preprocessing on the ARM (size conversion, dimensional reshaping, etc.) takes 0.3 seconds, and picture post-processing (NMS non-maximum suppression, determination of detection boxes, etc.; a sketch of the NMS step follows Table 7) takes 1.1 seconds. The comparison with other platforms is shown in Table 7.
Table 7 Comparison results of resource and performance across platforms
As can be seen from the above table, the same convolutional neural network model runs fastest on the GPU used in this comparison, reaching 0.018 seconds per frame, but its power consumption is also enormous, up to 218 W. Because it must supply sufficient memory resources and power, the GeForce RTX 2080Ti platform is large in volume and power draw and is unsuitable for low-power, low-cost mobile application scenarios. The dual-core ARM-A9 processor of the PYNQ-z2 embedded platform draws only 1.28 W, but, limited by its computing resources and parallelism, it needs 154 seconds for target detection on a single picture, so its low computational efficiency gives poor timeliness. The invention accelerates the target detection system with a low-power FPGA digital hardware circuit while exploiting the combined advantages of ARM and FPGA: the overall system power consumption is 2.447 W, a 98.8% reduction relative to the large GPU, and the single-frame image processing time is 2.6 seconds, a 59-fold speedup over the ARM-A9 alone, achieving both low power consumption and speed.
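For reference, the NMS step of the ARM-side post-processing mentioned above can be sketched as a standard greedy procedure; the box layout and the IoU threshold of 0.45 are generic assumptions, not the exact code of this embodiment.

    #include <algorithm>
    #include <vector>

    struct Box { float x1, y1, x2, y2, score; };

    // Intersection-over-union of two axis-aligned boxes.
    static float iou(const Box& a, const Box& b) {
        float ix = std::max(0.0f, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
        float iy = std::max(0.0f, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
        float inter = ix * iy;
        float uni = (a.x2 - a.x1) * (a.y2 - a.y1)
                  + (b.x2 - b.x1) * (b.y2 - b.y1) - inter;
        return uni > 0.0f ? inter / uni : 0.0f;
    }

    // Greedy NMS: keep the highest-scoring box, drop any box that overlaps a
    // kept box beyond the threshold, and repeat over the remainder.
    std::vector<Box> nms(std::vector<Box> boxes, float iou_thresh = 0.45f) {
        std::sort(boxes.begin(), boxes.end(),
                  [](const Box& a, const Box& b) { return a.score > b.score; });
        std::vector<Box> kept;
        for (const Box& cand : boxes) {
            bool redundant = false;
            for (const Box& k : kept)
                if (iou(cand, k) > iou_thresh) { redundant = true; break; }
            if (!redundant) kept.push_back(cand);
        }
        return kept;
    }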
Example 4
This embodiment proposes an image target detection system to which the image target detection method described in Embodiment 1, 2, or 3 is applied.
The image target detection system provided in this embodiment includes a hardware system with a digital integrated circuit IP core, and is configured to perform image target detection on an input image.
The digital integrated circuit IP core comprises a calculation module which is based on any one of a first network model, a second network model and a third network model and is designed through digital circuit connection.
In this embodiment, the first network model, the second network model, and the third network model include a feature extraction module, a feature fusion module, and an output module; the feature extraction modules in the first network model and the second network model both comprise a lightweight feature extraction backbone network; and the feature extraction module in the second network model introduces a bottleneck structure into the lightweight feature extraction backbone network, and the feature extraction module and the feature fusion module in the third network model adopt FPN structures.
Optionally, the lightweight feature extraction backbone network includes a number of feature extraction convolution layers therein, wherein the feature extraction convolution layers of all resolution sizes are compressed to 1 layer.
Optionally, the FPN structure in the third network model comprises a bottom-up feature extraction path and a top-down feature fusion path; the feature extraction module is arranged on the feature extraction path to perform feature extraction and sampling on feature graphs with decreasing sizes, and the feature fusion module is arranged on the feature fusion path to perform feature fusion on the feature graphs with increasing sizes.
Further, the feature extraction module comprises a Conv basic convolution unit and a ResBlock residual unit; the output module comprises a plurality of parallel output end structures and is used for generating feature map decoding information with corresponding sizes so as to realize target prediction; the output end structure at least comprises a characteristic map channel number adjusting layer, a normalization layer, a nonlinear conversion layer and a characteristic integration layer which are connected in sequence; and the output module obtains a final target detection result by carrying out fusion judgment on the feature map decoding information output by the plurality of output end structures.
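As a concrete illustration of one output-end structure, the following sketch chains the named layers for a single spatial position; the channel counts, the LeakyReLU activation, and the epsilon value are illustrative assumptions, since the patent does not fix them, and the feature integration layer that fuses decoded information across output ends is omitted.

    #include <cmath>

    const int C_IN = 128, C_OUT = 18;  // assumed channel counts

    // One spatial position of an output end: a 1x1 convolution adjusts the
    // channel count, followed by normalization and nonlinear conversion.
    void output_head_pixel(const float in[C_IN], const float w[C_OUT][C_IN],
                           const float gamma[C_OUT], const float beta[C_OUT],
                           const float mean[C_OUT], const float var[C_OUT],
                           float out[C_OUT]) {
        for (int o = 0; o < C_OUT; o++) {
            float acc = 0.0f;
            for (int i = 0; i < C_IN; i++)  // channel-number adjustment layer
                acc += w[o][i] * in[i];
            float bn = gamma[o] * (acc - mean[o])          // normalization layer
                     / std::sqrt(var[o] + 1e-5f) + beta[o];
            out[o] = (bn > 0.0f) ? bn : 0.1f * bn;  // nonlinear layer (assumed LeakyReLU)
        }
    }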
In a specific implementation process, the hardware system performs FPGA hardware circuit design with an HLS tool according to the first, second, and third network models; after the circuit design is completed, it is synthesized into an HDL-level hardware IP core. The generated hardware IP core is imported into Vivado software to build the FPGA project; a top-level file is created, board-level constraints are added, synthesis and simulation are run, and finally a bitstream file is generated and exported to the SDK software. The software part is then designed in the SDK: a board-level support package is created and an application project is built for software development and testing, generating an executable file in .elf format. Finally, the bitstream file and the executable file are downloaded to the development-board chip, yielding the hardware system for image target detection.
In another implementation process, the hardware system designs the calculation modules of the first, second, and third network models with an HLS tool, verifies them, and packages them into independently running IP cores; each IP core is then instantiated, connected, and simulated on the Vivado platform to generate a hardware configuration file. Finally, the hardware configuration file produced by the Vivado platform is imported into the Jupyter Notebook platform through the Ethernet-connected development board, and Python calling programs for data import, image preprocessing, parameter setting, post-processing, and the feature extraction functions are designed to integrate the hardware system for image target detection.
In another implementation process, hardware acceleration of the first, second, or third network model (Lite-1/2/3) is performed on the FPGA+ARM heterogeneous platform based on the PYNQ-z2 development board. The weights and the inputs/outputs are quantized to 16-bit fixed point in the HLS software, and the IP cores are optimized with loop unrolling, loop tiling, pipelining, a ping-pong buffer mechanism, and input/output channel parallelism to suit hardware implementation. The system project is developed and built in Vivado software and comprises the two acceleration IP cores generated by Vivado HLS together with the ZYNQ core; the ZYNQ core forms the PS end of the system and is responsible for scheduling the whole system, an AXI bus part is responsible for reading and writing data, and a clock control part handles the sequential logic of the system. Finally, the upper computer controls the development board through either a serial port or a network port, completing a target detection function with low computational cost and low power consumption.
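The 16-bit fixed-point quantization can be sketched in HLS C++ as follows; the 8-bit integer / 8-bit fraction split is an illustrative assumption, since only the 16-bit total width is stated above.

    #include "ap_fixed.h"  // Vivado HLS arbitrary-precision fixed-point types

    typedef ap_fixed<16, 8> data_t;  // 16 bits total: 8 integer, 8 fractional (assumed split)

    // Fixed-point multiply-accumulate over a 3x3 window, as used inside a
    // convolution unit; UNROLL instantiates the nine multipliers in parallel.
    data_t mac9(const data_t in[9], const data_t w[9], data_t bias) {
        data_t acc = bias;
        for (int i = 0; i < 9; i++) {
    #pragma HLS UNROLL
            acc += in[i] * w[i];
        }
        return acc;
    }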
It should be understood that the above examples of the present invention are provided by way of illustration only and do not limit the embodiments of the present invention. Other variations or modifications of the above teachings will be apparent to those of ordinary skill in the art; it is neither necessary nor possible to list all embodiments exhaustively here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention is intended to be protected by the following claims.

Claims (9)

1. An image target detection method, characterized by comprising the following steps:
S1, constructing a first network model, a second network model and a third network model for image target detection; the first network model, the second network model and the third network model comprise a feature extraction module, a feature fusion module and an output module; the feature extraction modules in the first network model and the second network model both comprise a lightweight feature extraction backbone network; the feature extraction module in the second network model introduces a bottleneck structure into the lightweight feature extraction backbone network, and the feature extraction module and the feature fusion module in the third network model adopt an FPN structure;
S2, respectively generating corresponding digital integrated circuit IP cores according to the first network model, the second network model and the third network model, and then mounting the digital integrated circuit IP cores on a hardware system;
the hardware system comprises a functional operation end for executing image target detection; the functional operation end comprises a calculation module, a time sequence control module, a logic control module and a transmission protocol module; wherein the calculation module comprises the digital integrated circuit IP core of the first network model, the second network model, and/or the third network model; the digital integrated circuit IP core comprises a data loading unit, a convolution unit, a pooling unit, a reforming unit, an output unit and a plurality of storage blocks;
S3, acquiring an image to be subjected to target detection, preprocessing the image, calling the digital integrated circuit IP core adapted on the hardware system to execute an image target detection operation according to the specification of the image, and outputting a target detection result; the process of calling the digital integrated circuit IP core adapted on the hardware system to execute the image target detection operation comprises the following steps:
the transmission protocol module acquires image data and weight data from the off-chip memory through the Ethernet and transmits the image data and the weight data to the calculation module;
the calculation module receives the input image data and weight data through different data loading units respectively, and stores the image data and the weight data in a plurality of parallel storage blocks respectively;
the convolution unit extracts the needed image data and weight data from the corresponding storage blocks to perform feature extraction on images of different sizes or perform convolution operations on feature maps, and interacts with the pooling unit during network iteration to realize feature dimension reduction;
parallel optimization is performed on the intra-layer convolution of the convolution unit by a pipelining technique, and parallel optimization is performed on the pooling loop of the pooling unit by a loop unrolling technique;
storing data generated by the convolution unit and the pooling unit in an array form, completely partitioning the array, and allocating a register to each array element to store the data;
in the working process of the convolution unit and the pooling unit, the size and the dimension of the feature map are adjusted through the reforming unit;
and finally, the convolution unit transmits a data result containing the classification and positioning information of the target to be detected of the image to an output unit, and the output unit stores and outputs the detection result through one or more storage blocks.
2. The image target detection method according to claim 1, wherein the lightweight feature extraction backbone network includes a plurality of feature extraction convolution layers, wherein the feature extraction convolution layers of all resolution sizes are compressed to 1 layer.
3. The image target detection method according to claim 1, wherein the FPN structure in the third network model comprises a bottom-up feature extraction path and a top-down feature fusion path; the feature extraction module is arranged on the feature extraction path to perform feature extraction and sampling on feature graphs with decreasing sizes, and the feature fusion module is arranged on the feature fusion path to perform feature fusion on the feature graphs with increasing sizes.
4. The image target detection method according to claim 1, wherein the feature extraction module includes a Conv basic convolution unit and a ResBlock residual unit; the output module comprises a plurality of parallel output end structures and is used for generating feature map decoding information with corresponding sizes so as to realize target prediction; the output end structure at least comprises a characteristic map channel number adjusting layer, a normalization layer, a nonlinear conversion layer and a characteristic integration layer which are connected in sequence; and the output module obtains a final target detection result by carrying out fusion judgment on the feature map decoding information output by the plurality of output end structures.
5. The image target detection method according to claim 1, wherein in the step S3, the hardware system is operated to perform the steps of:
(1) Data import: importing a bit stream file, reading a hardware configuration file, and applying for memory for weight, image and bias data to be used;
(2) Image preprocessing: processing and reorganizing the read picture data, scaling a picture of arbitrary size to a preset size, and storing it in an array; compressing the array storing the picture data from a three-dimensional spatial variable to a one-dimensional variable, while performing numerical conversion from floating-point numbers to unsigned integers that the hardware can calculate;
(3) Setting up a network and setting parameters: performing hardware level calling and parameter setting required by network operation based on each module in the first network model, the second network model and the third network model, wherein the parameters comprise layer numbers, layer types, feature map input and output sizes and hardware resource address access information;
(4) Feature extraction: the convolution and pooling operation of the preprocessed pictures are carried out according to the first network model, the second network model and the third network model and the set parameters, so that the classification and positioning information of the targets to be detected of the images are obtained;
(5) Image post-processing: analyzing the data result of the classification and positioning information of the target to be detected of the image, and eliminating redundant prediction frames with a high degree of overlap; according to the classification and positioning information of the target to be detected of the image, calculating the predicted value of the prediction frame, correcting according to the predicted value to obtain the accurate positioning of the detected target, and displaying the detection frame in the image.
6. The image target detection method according to any one of claims 1 to 5, characterized in that the step S2 comprises:
respectively carrying out digital hardware circuit design according to the first network model, the second network model and/or the third network model, and synthesizing into a digital integrated circuit IP core after completing the hardware circuit design;
importing the generated digital integrated circuit IP core, constructing a corresponding digital circuit computing unit, creating a top file, adding computing resource constraint, and then performing comprehensive simulation to finally generate a bit stream file of the target detection hardware circuit;
performing system environment configuration according to the bit stream file, including creating a computing resource support package and creating application engineering for function development and test, and generating an executable file;
and downloading the bit stream file and the executable file to an on-board chip to obtain a hardware system for detecting the image target.
7. The method according to claim 6, wherein the hardware system comprises a state control end for controlling data transmission and distribution flow, a functional operation end for convolution and sampling calculation, and a data bus responsible for data read-write and sequential logic control;
the functional operation end comprises a convolution IP core and a sampling IP core for accelerating network calculation, wherein the convolution IP core and the sampling IP core each comprise a data input module, a data output module, an input/output buffer module and a weight buffer module; the sampling IP core further comprises a sampling calculation module for executing up-sampling and down-sampling calculation in the first network model, the second network model or the third network model, and the convolution IP core further comprises a convolution calculation module for executing convolution calculation in the first network model, the second network model or the third network model;
before all IP cores in the functional operation end are started to perform calculation acceleration, the state control end performs initialization configuration through the data bus according to the structure information of the current network model, and the data input module stores input picture data and weight data into the corresponding caches according to their addresses respectively;
The convolution calculation module and/or the sampling calculation module call data in the buffer to work, and the convolution and/or the sampling result is respectively stored in the corresponding buffer according to the address after being generated;
the data output module outputs the result in the cache to the off-chip memory through the data bus according to the address; after all the data are output, all the IP cores output stop signals through the data bus, and the operation of the equipment is paused or ended.
8. The image target detection method according to any one of claims 1 to 5, characterized in that the step S2 comprises:
designing and verifying calculation modules of the first network model, the second network model and/or the third network model, and packaging the calculation modules into a digital integrated circuit IP core for independent operation;
building and connecting simulation is carried out on each independently operated digital integrated circuit IP core, and a hardware configuration file is generated;
and importing the hardware configuration file through an Ethernet connection development board, designing calling programs of data importing, image preprocessing, parameter setting, post-processing and feature extraction functions, and integrating to complete a hardware system for image target detection.
9. An image target detection system, characterized in that the image target detection method according to any one of claims 1 to 8 is applied thereto; the system comprises a hardware system mounted with a digital integrated circuit IP core for performing target detection on an input image;
The digital integrated circuit IP core comprises a calculation module which is based on any one of a first network model, a second network model and a third network model and is designed through digital circuit connection;
the first network model, the second network model and the third network model comprise a feature extraction module, a feature fusion module and an output module; the feature extraction modules in the first network model and the second network model both comprise a lightweight feature extraction backbone network; and the feature extraction module in the second network model introduces a bottleneck structure into the lightweight feature extraction backbone network, and the feature extraction module and the feature fusion module in the third network model adopt FPN structures.
CN202210957661.8A 2022-08-10 2022-08-10 Image target detection method and system Active CN115457363B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210957661.8A CN115457363B (en) 2022-08-10 2022-08-10 Image target detection method and system

Publications (2)

Publication Number Publication Date
CN115457363A CN115457363A (en) 2022-12-09
CN115457363B (en) 2023-08-04






Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant