CN112950673A - Target object detection and tracking system and method - Google Patents

Target object detection and tracking system and method

Info

Publication number
CN112950673A
Authority
CN
China
Prior art keywords
target object
video image
module
target
image data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110247972.0A
Other languages
Chinese (zh)
Inventor
徐雅楠
张燕
陈广辉
陈峰
杨玉宽
赵明建
焉保卿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Vt Electron Technology Co ltd
Original Assignee
Shandong Vt Electron Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Vt Electron Technology Co ltd filed Critical Shandong Vt Electron Technology Co ltd
Priority to CN202110247972.0A
Publication of CN112950673A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a detection and tracking system for a target object, comprising: a multi-processor system-on-chip (MPSoC) module, a lens, a holder (pan-tilt head), a communication module and a peripheral module. The holder is used for receiving control commands sent by the MPSoC module and driving the lens to rotate through 360 degrees in the horizontal range and/or pitch through 180 degrees in the vertical range. The lens is mounted on the holder and is used for collecting the LVDS video signal of the target object and sending the collected LVDS signal to a video image processing unit (VIPU) of the MPSoC module through an LVDS interface. The MPSoC module comprises a processing system (PS) and programmable logic (PL); the PS comprises a quad-core processing system and a graphics processing unit (GPU), and the PL comprises the VIPU, a deep learning processing unit (DPU) and a video coding unit (VCU). The communication module is used for wireless communication of the detection and tracking system and comprises an Ethernet module, a 5G module and a WiFi module. The peripheral module comprises a memory, a power supply, a keyboard, a mouse and a display.

Description

Target object detection and tracking system and method
Technical Field
The present application relates to the field of image processing, and more particularly, to a system and method for detecting and tracking a target object.
Background
In recent years, research on detecting and tracking target objects has mostly been based on deep learning technology. However, because deep learning requires high-performance computation and a large operating memory to maintain good detection and tracking performance, conventional X86-architecture systems suffer from large volume, poor mobility, high cost and high image-transmission delay. The YOLO algorithm based on deep learning achieves a good detection effect and high precision, but in practical applications its positioning is not accurate enough and its time efficiency is low, so real-time processing cannot be achieved.
Therefore, in order to solve the problems of low detection efficiency and high detection cost in the prior art, a system and a method for detecting and tracking a target object are needed.
Disclosure of Invention
The application provides a system and a method for detecting and tracking a target object, which can improve the efficiency of detecting and tracking the target object and reduce the system cost.
In a first aspect, a system for detecting and tracking a target object is provided, comprising a multi-processor system-on-chip (MPSoC) module, a lens, a holder (pan-tilt head), a communication module and a peripheral module. The holder is connected with the MPSoC module through an RS485 interface; it is used for receiving control commands sent by the MPSoC module and driving the lens to rotate through 360 degrees in the horizontal range and/or pitch through 180 degrees in the vertical range, and is also used for adjusting the focal length of the lens. The lens is carried on the holder, is connected with the MPSoC module through a low-voltage differential signaling (LVDS) interface, and is used for collecting the LVDS video signal of the target object and sending the collected LVDS signal to a video image processing unit (VIPU) of the MPSoC module through the LVDS interface. The MPSoC module comprises a processing system (PS) and programmable logic (PL), wherein the PS comprises a quad-core processing system and a graphics processing unit (GPU), and the quad-core processing system is used for running the C++ application programs and the control program of the holder; the PL comprises the VIPU, a deep learning processing unit (DPU) and a video coding unit (VCU), and the PS and the PL are connected through an AXI high-speed bus. The communication module is used for wireless communication of the detection and tracking system and comprises an Ethernet module, a 5G module and a WiFi module. The peripheral module comprises a memory, a power supply, a keyboard, a mouse and a display.
With reference to the first aspect, in a first possible implementation manner of the first aspect, the lens is an LVDS camera module configured to acquire the LVDS video signal; the VIPU is used for converting the LVDS of the target object into 24-bit RGB video image data with a size of 1600 x 900, converting the RGB video image data from the RGB format into BGR video image data, reducing the BGR video image data from 1600 x 900 to 608 x 608, and sending the reduced BGR video image data to the DPU; the DPU is used for receiving the reduced BGR video image data and performing yolov3 algorithm convolution processing on it to obtain the yolo layer data of the target object; the four-core processor is used for screening out N candidate frames meeting a threshold condition from the yolo layer data of the target object, sorting the N candidate frames by confidence from large to small, and screening out a target frame from the sorted N candidate frames, where N is a positive integer; the four-core processor is used for determining the coordinate information (x, y, w, h) of the target object according to the target frame; the four-core processor is used for reading the target object image stored in the memory and framing the target object in the image according to the coordinate information of the target object, so as to obtain a framed video image of the target object; the four-core processor is further used for calculating the horizontal and vertical rotation angles of the holder and the zoom factor of the lens according to the coordinate information of the target object, obtaining a control command for the holder and sending the control command to the holder; the VCU is used for encoding the framed video image of the target object to obtain an H264-encoded video image of the target object and sending it to a background server through the communication module; and the holder is used for driving the lens to rotate and adjusting the focal length of the lens after receiving the control command.
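For reference, the following Python sketch illustrates one way the candidate-box post-processing described above could be realised in software: filtering by a confidence threshold, sorting by confidence, and suppressing overlapping boxes of the same target. The 0.3 and 0.45 defaults mirror thresholds mentioned elsewhere in this application; the center-based (x, y, w, h) box convention and the IoU-based suppression details are illustrative assumptions rather than the exact on-device implementation.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one (x, y, w, h) box and an (M, 4) array of boxes; (x, y) is the box center."""
    x1 = np.maximum(box[0] - box[2] / 2, boxes[:, 0] - boxes[:, 2] / 2)
    y1 = np.maximum(box[1] - box[3] / 2, boxes[:, 1] - boxes[:, 3] / 2)
    x2 = np.minimum(box[0] + box[2] / 2, boxes[:, 0] + boxes[:, 2] / 2)
    y2 = np.minimum(box[1] + box[3] / 2, boxes[:, 1] + boxes[:, 3] / 2)
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = box[2] * box[3] + boxes[:, 2] * boxes[:, 3] - inter
    return inter / np.maximum(union, 1e-9)

def select_target_boxes(candidates, conf_thresh=0.3, overlap_thresh=0.45):
    """candidates: (N, 5) array of rows (x, y, w, h, confidence) decoded from the yolo layer."""
    # 1. keep only the candidate boxes whose confidence meets the threshold
    kept = candidates[candidates[:, 4] >= conf_thresh]
    # 2. sort the remaining candidates by confidence, from large to small
    kept = kept[np.argsort(-kept[:, 4])]
    # 3. greedily keep the best box and drop overlapping boxes of the same target
    targets = []
    while len(kept) > 0:
        best, rest = kept[0], kept[1:]
        targets.append(best)
        if len(rest) == 0:
            break
        kept = rest[iou(best[:4], rest[:, :4]) < overlap_thresh]
    return np.array(targets)  # each row: (x, y, w, h, confidence)
```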
With reference to the first aspect and the foregoing implementation manner, in a second possible implementation manner of the first aspect, a yolov3 algorithm model of the target object is pre-stored in the DPU, and the DPU is configured to perform yolov3 algorithm convolution processing on the reduced BGR video image data according to the yolov3 algorithm model of the target object, wherein the frame regression function in the yolov3 algorithm convolution processing is modified from IOU to CIOU. The yolov3 algorithm model of the target object is obtained by the background server by training on video images of the target object, and the background server is specifically configured to: construct a training data set from the video image data of the target object; change the frame regression function of the yolov3 algorithm from IOU to CIOU; put the training data set into the yolov3 network to extract target features and the N candidate boxes;
determine the initial number, width and height of the anchor candidate boxes, learn image features from the training data set, perform cluster analysis on the N candidate boxes using a K-means clustering algorithm, and take the K value as the number of anchor boxes; use the CIOU as the evaluation criterion for judging the position of the target bounding box, with a CIOU threshold of 0.45 and a confidence threshold of 0.3; obtain, through forward propagation, the target bounding box and the target category information output by the model; compute the total loss value of all loss functions from the regression loss between the output detection-box information and the actual detection-box position, the regression loss between the predicted object center point and the actual center point position, the classification loss between the target category information and the actual category label, and the regression loss between the predicted object confidence and the actual object confidence, and adjust the parameter values of the yolov3 algorithm model using a gradient descent algorithm and a back-propagation algorithm; and perform W iterations of training to obtain the trained yolov3 algorithm model of the target object, where W is a positive integer.
With reference to the first aspect and the foregoing implementation manner, in a third possible implementation manner of the first aspect, before a training data set is constructed according to RGB image data of the target object, the background server is configured to select M images from video images of the target object, where the M images are discontinuously captured images, and M is a positive integer; the constructing of the training data set according to the video image data of the target object comprises: marking the position and the category of the target object by using a labelimg marking tool to obtain the training data set; wherein, the CIOU regression loss is calculated as follows:
$$L_{CIOU} = 1 - IOU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha\upsilon$$

where ρ(b, b^gt) denotes the Euclidean distance between the center points b and b^gt of the prediction box and the real box, and c denotes the diagonal length of the minimum closure region that simultaneously contains the prediction box and the real box;

α is a weighting function, defined as follows:

$$\alpha = \frac{\upsilon}{(1 - IOU) + \upsilon}$$

υ is a measure of the similarity of the aspect ratios, defined as follows:

$$\upsilon = \frac{4}{\pi^{2}}\left(\arctan\frac{\omega^{gt}}{h^{gt}} - \arctan\frac{\omega}{h}\right)^{2}$$

where ω^gt and h^gt denote the width and height of the real box, ω and h denote the width and height of the prediction box, and the superscript gt denotes the label data.
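To make the formulas above concrete, the short Python sketch below evaluates the CIOU regression loss for one prediction box and one real box, each given as (x, y, w, h) with (x, y) as the box center. It is an illustrative reading of the definitions above; the small epsilon terms for numerical safety are assumptions and are not part of this application.

```python
import math

def ciou_loss(pred, gt):
    """CIOU regression loss for two (x, y, w, h) boxes, following the formulas above."""
    def corners(b):
        x, y, w, h = b
        return x - w / 2, y - h / 2, x + w / 2, y + h / 2

    px1, py1, px2, py2 = corners(pred)
    gx1, gy1, gx2, gy2 = corners(gt)

    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    iou = inter / max(union, 1e-9)

    # rho^2: squared distance between the two box centers
    rho2 = (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2
    # c: diagonal of the smallest region enclosing both the prediction and the real box
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2 + 1e-9

    # upsilon and alpha as defined above
    upsilon = (4 / math.pi ** 2) * (math.atan(gt[2] / gt[3]) - math.atan(pred[2] / pred[3])) ** 2
    alpha = upsilon / ((1 - iou) + upsilon + 1e-9)

    return 1 - iou + rho2 / c2 + alpha * upsilon
```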
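The anchor clustering step mentioned in the training description above (K-means on the labelled boxes, with the K value taken as the number of anchors) can be illustrated as follows. The 1 - IOU style distance and the example value k = 9 in the usage comment are common yolov3 practice and are assumptions here; the application itself only states that a K-means clustering algorithm is applied.

```python
import numpy as np

def kmeans_anchors(wh, k, iters=100, seed=0):
    """Cluster labelled box (width, height) pairs into k anchors using an IoU-based distance."""
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # IoU between every box and every anchor, assuming the boxes share a common corner
        inter = np.minimum(wh[:, None, 0], anchors[None, :, 0]) * \
                np.minimum(wh[:, None, 1], anchors[None, :, 1])
        union = wh[:, None, 0] * wh[:, None, 1] + anchors[None, :, 0] * anchors[None, :, 1] - inter
        assign = np.argmax(inter / union, axis=1)          # nearest anchor = highest IoU
        new = np.array([wh[assign == i].mean(axis=0) if np.any(assign == i) else anchors[i]
                        for i in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors[np.argsort(anchors.prod(axis=1))]       # sorted by area, as yolov3 expects

# e.g. widths/heights (in pixels of the 608 x 608 input) gathered from the training labels:
# anchors = kmeans_anchors(label_wh, k=9)
```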
With reference to the first aspect and the foregoing implementation manners, in a fourth possible implementation manner of the first aspect, the memory is a DDR or an SD card and is used for storing the video image data of the target object acquired by the lens; the power supply is used for providing working power for the detection and tracking system; the keyboard is used for inputting operation commands to the detection and tracking system; the mouse is used for inputting positioning or selection commands to the detection and tracking system; and the display is used for displaying the operation commands and processing results of the detection and tracking system.
In a second aspect, a method for detecting and tracking a target object is provided. The target object detection and tracking system comprises a multi-processor system-on-chip (MPSoC) module, a lens, a holder, a communication module and a peripheral module, wherein the holder is connected with the MPSoC module through an RS485 interface; the lens is carried on the holder and is connected with the MPSoC module through a low-voltage differential signaling (LVDS) interface; the MPSoC module comprises a processing system PS and programmable logic PL, the PS includes a quad-core processing system and a graphics processing unit (GPU), the PL includes a video image processing unit (VIPU), a deep learning processing unit (DPU) and a video coding unit (VCU), and the PS and the PL are connected through an AXI high-speed bus; the communication module comprises an Ethernet module, a 5G module and a WiFi module; the peripheral module comprises a memory, a power supply, a keyboard, a mouse and a display; and the lens is an LVDS camera module. The method comprises the following steps: the LVDS camera module collects the LVDS video signal of the target object; the VIPU converts the LVDS of the target object into 24-bit RGB video image data with a size of 1600 x 900, converts the RGB video image data from the RGB format into BGR video image data, reduces the BGR video image data from 1600 x 900 to 608 x 608, and sends the reduced BGR video image data to the DPU; the DPU receives the reduced BGR video image data and performs yolov3 algorithm convolution processing on it to obtain the yolo layer data of the target object; the four-core processor screens out N candidate frames meeting a threshold condition from the yolo layer data of the target object, sorts the N candidate frames by confidence from large to small, and screens out a target frame from the sorted N candidate frames, where N is a positive integer; the four-core processor determines the coordinate information (x, y, w, h) of the target object according to the target frame; the four-core processor reads the target object image stored in the memory and frames the target object in the image according to the coordinate information of the target object to obtain a framed video image of the target object; the four-core processor further calculates the horizontal and vertical rotation angles of the holder and the zoom factor of the lens according to the coordinate information of the target object, obtains a control command for the holder, and sends the control command to the holder; the VCU encodes the framed video image of the target object to obtain an H264-encoded video image of the target object and sends it to a background server through the communication module; and after receiving the control command, the holder drives the lens to rotate and adjusts the focal length of the lens.
With reference to the second aspect, in a first possible implementation manner of the second aspect, the DPU receiving the reduced BGR video image data and performing yolov3 algorithm convolution processing on it to obtain the yolo layer data of the target object includes: performing yolov3 algorithm convolution processing on the reduced BGR video image data according to the yolov3 algorithm model of the target object, wherein the frame regression function in the yolov3 algorithm convolution processing is modified from IOU to CIOU.
With reference to the second aspect and the foregoing implementation manner of the second aspect, in a second possible implementation manner of the second aspect, the method further includes: the background server constructs a training data set from the RGB image data of the target object; changes the frame regression function of the yolov3 algorithm from IOU to CIOU; puts the training data set into the yolov3 network to extract target features and the N candidate boxes; determines the initial number, width and height of the anchor candidate boxes, learns image features from the training data set, performs cluster analysis on the N candidate boxes using a K-means clustering algorithm, and takes the K value as the number of anchor boxes; uses the CIOU as the evaluation criterion for judging the position of the target bounding box, with a CIOU threshold of 0.45 and a confidence threshold of 0.3; obtains, through forward propagation, the target bounding box and the target category information output by the model; computes the total loss value of all loss functions from the regression loss between the output detection-box information and the actual detection-box position, the regression loss between the predicted object center point and the actual center point position, the classification loss between the target category information and the actual category label, and the regression loss between the predicted object confidence and the actual object confidence, and adjusts the parameter values of the yolov3 algorithm model using a gradient descent algorithm and a back-propagation algorithm; and performs W iterations of training to obtain the trained yolov3 algorithm model of the target object, where W is a positive integer.
With reference to the second aspect and the foregoing implementation manner of the second aspect, in a third possible implementation manner of the second aspect, the constructing, by the backend server, a training data set according to the RGB image data of the target object includes: the background server selects M images from the video images of the target object, wherein the M images are discontinuously shot images, and M is a positive integer; marking the position and the category of the target object by using a labelimg marking tool to obtain the training data set; wherein, the CIOU regression loss is calculated as follows:
$$L_{CIOU} = 1 - IOU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha\upsilon$$

where ρ(b, b^gt) denotes the Euclidean distance between the center points b and b^gt of the prediction box and the real box, and c denotes the diagonal length of the minimum closure region that simultaneously contains the prediction box and the real box;

α is a weighting function, defined as follows:

$$\alpha = \frac{\upsilon}{(1 - IOU) + \upsilon}$$

υ is a measure of the similarity of the aspect ratios, defined as follows:

$$\upsilon = \frac{4}{\pi^{2}}\left(\arctan\frac{\omega^{gt}}{h^{gt}} - \arctan\frac{\omega}{h}\right)^{2}$$

where ω^gt and h^gt denote the width and height of the real box, ω and h denote the width and height of the prediction box, and the superscript gt denotes the label data.
Therefore, with the detection and tracking system, the delay of the video image acquisition and processing pipeline formed by combining the lens and the MPSoC module is low, about 10 ms, which greatly reduces the overall system delay. Because an embedded scheme is adopted, the system is small, low-cost and easy to carry. In addition, on the image processing side, the frame regression processing in the yolov3 algorithm is modified, improving detection precision, and the deep learning algorithm is accelerated with the FPGA, taking only about 5 ms, which greatly increases detection speed and improves time efficiency.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a schematic structural diagram of an embodiment of the present application;
FIG. 2 is a diagram of a hardware configuration according to another embodiment of the present application;
FIG. 3 is a schematic flow chart diagram of a method of one embodiment of the present application;
FIG. 4 is a system interface of a software terminal according to an embodiment of the present application;
fig. 5 is a system interface of a software terminal according to another embodiment of the present application.
Detailed Description
The technical solution in the present application will be described below with reference to the accompanying drawings.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The invention is further described with reference to the following figures and detailed description of embodiments.
Fig. 1 is a schematic structural diagram of an embodiment of the present application, and as shown in fig. 1, a system for detecting and tracking a target object includes a lens, a VIPU (Video Image processing Unit), a DPU (Deep learning processing Unit), an ARM, a VCU (Video coding Unit), a pan-tilt, and a background server.
The lens, namely the LVDS camera module, may for example be a Basler module, and is used to collect the low-voltage differential signal (LVDS);
the VIPU is used for processing the low-voltage differential signal LVDS;
the DPU completes the convolution part operation of the deep learning algorithm;
the ARM is mainly used for processing yolo layer data and generating control commands and operating image frames and the like;
the VCU is used for coding video images;
the holder mainly drives the lens to shoot and adjusts the focal length of the lens;
the background server is mainly used for completing initialization setting of the holder, storage and display of the coded video image and the like.
It should be understood that in this embedded system the ARM serves as the CPU and the FPGA as the coprocessor, an Ubuntu system runs on the hardware, and the hardware devices together with the software system form the complete embedded system.
Specifically, fig. 2 is a schematic diagram of a hardware structure according to another embodiment of the present application. As shown in the figure, the detection and tracking system for the target object provided by the application comprises an MPSoC module, a lens, a holder, a communication module and a peripheral module.
The MPSoC (Multi-Processor System on Chip) module comprises: a PS (Processing System) and PL (Programmable Logic). The PS includes an ARM Cortex-A53 quad-core processing system (clock rate up to 1.5 GHz), an ARM Cortex-R5 quad-core processing system (clock rate up to 600 MHz) and a GPU (Graphics Processing Unit, clock rate up to 667 MHz). The PL includes the VIPU (Video Image Processing Unit), the DPU (Deep Learning Processing Unit) and the VCU (Video Coding Unit). The PS and the PL are connected by an AXI high-speed bus.
The ARM Cortex-A53 serves as the four-core processor mentioned above and mainly runs the C++ application programs and the holder control program, and the ARM Cortex-R5 shown in the figure is also a four-core processor. The VIPU converts the LVDS collected by the lens into a 24-bit RGB video signal, stores it on the one hand, and on the other hand performs color space conversion (RGB to BGR) and scaling, reducing the image from 1600 x 900 to 608 x 608. The DPU mainly runs the convolution part of the yolov3 algorithm to accelerate it. The VCU encodes the video image to be transmitted over the network, greatly shortening the time taken to transmit the video image.
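In the described system this conversion and scaling is done in programmable logic by the VIPU. Purely for reference, the Python/OpenCV sketch below performs the equivalent transformation in software on a 24-bit 1600 x 900 RGB frame; it is an illustrative analogue, not the hardware implementation.

```python
import cv2
import numpy as np

def preprocess_for_dpu(rgb_frame):
    """Software analogue of the VIPU preprocessing: RGB -> BGR, then 1600 x 900 -> 608 x 608."""
    # assumes a 24-bit RGB frame of width 1600 and height 900
    assert rgb_frame.shape == (900, 1600, 3) and rgb_frame.dtype == np.uint8
    bgr = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2BGR)                       # color space conversion
    resized = cv2.resize(bgr, (608, 608), interpolation=cv2.INTER_LINEAR)  # scale to the yolov3 input size
    return resized                                                          # what the DPU receives
```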
the holder is connected with the MPSoC module through an RS485 interface, receives a control command sent by the MPSoC module, drives the lens to carry out 360-degree rotation shooting in a monitored horizontal range and 180-degree pitching shooting in a vertical range, and adjusts the focal length of the lens;
the lens is connected with the MPSoC module through an LVDS interface, and transmits the collected LVDS to the VIPU through the LVDS interface for further processing by the VIPU;
the communication module comprises Ethernet, 5G and WiFi and mainly realizes the online transmission of data;
the peripheral comprises a memory, a power supply, a keyboard, a mouse and a display;
the memory can be a DDR or SD card and is used for storing video image data;
the power supply provides a stable working power supply for the work of the system;
the keyboard is used as input equipment for inputting operation commands;
the mouse is used as an input device for positioning and selecting;
the display displays the operation command and the processing result in real time.
Fig. 3 is a flowchart of the method. As shown in the figure, the present application provides an embedded target detection and tracking method in which the holder drives the lens to perform 360-degree rotating shooting within the monitored horizontal range and 180-degree pitching shooting within the vertical range. The lens captures an LVDS video signal, which is transmitted through the LVDS interface to the VIPU for processing: the LVDS is converted into 24-bit RGB data, of size 1600 x 900, that can be transmitted over the AXIS bus; the RGB data is stored in the memory for subsequent data fusion and is also further processed and converted into BGR data of size 608 x 608; the processed data is transmitted to the DPU for convolution processing to obtain the yolo layer data; the yolo layer data is transmitted to the ARM over the bus, and the ARM program post-processes the yolo layer data to obtain the target coordinate position (x, y, w, h), the horizontal and vertical rotation angles of the holder and the lens zoom factor. On the ARM side, the target coordinate position is superimposed on the image read from the memory, the image is sent to the VCU for H264 video encoding, and finally the encoded video image is sent to the background server over the network for the staff to view in real time. The holder control program sends the horizontal and vertical rotation angles of the holder and the lens zoom factor to the holder through a control data protocol, controls the holder to rotate and adjusts the focal length of the lens, so that the target is clearly locked in the lens. The specific implementation steps are as follows:
step 1: the method comprises the steps that a system is powered on, a background server is remotely logged in, initial configuration is carried out, and the initial configuration comprises equipment selection and connection, parameter setting and cradle head rotation, wherein the equipment selection comprises different cradle head IP addresses, as shown in figure 4, the parameter setting comprises equipment detailed information, adding equipment and system setting, wherein the system setting comprises the horizontal and vertical moving speeds of the cradle head, and video storage paths and names, as shown in figure 5;
step 2: manually controlling the holder to rotate to align the lens with the target;
and step 3: acquiring LVDS of a target by a lens;
and 4, step 4: the lens sends the LVDS to the VIPU through the LVDS interface;
and 5: VIPU converts LVDS differential video signals into 24-bit RGB video image data which can be transmitted through AXIS bus, and the size of the RGB video image data is 1600 x 900;
step 6, buffering RGB video image data in a memory on one hand to facilitate subsequent data superposition; on the other hand, further pretreatment is carried out;
and 7: RGB video image data is converted from RGB to BGR;
and 8: reducing the BGR video image data from 1600 × 900 to 608 × 608, and meeting the image size required by yolov3 algorithm;
and step 9: sending the reduced image to a DPU (dual-processing unit), and performing yolov3 algorithm convolution processing to obtain yolo layer data;
step 10: and the ARM receives the yolo layer data processed by the DPU and performs subsequent processing. Firstly, preliminarily screening candidate frames meeting a condition (setting a threshold value of 0.3); secondly, sorting the candidate frames according to the confidence degree; finally, performing final screening on the sorted candidate frames, mainly for eliminating overlapped frames of the same target; it should be understood that the set threshold may be other values, and the application is not limited thereto.
Step 11: after ARM processing, the coordinates (x, y, w, h) of the target are obtained. On one hand, the image stored in the memory is read and the target is framed in the image according to the coordinate information, giving a framed video image; on the other hand, the horizontal and vertical rotation angles of the holder and the zoom factor of the lens are calculated from the coordinate information, yielding the control command for the holder (an illustrative angle calculation is sketched after step 13);
step 12: the VCU encodes the video image after the picture frame to obtain an H264 encoded video image, and then transmits the video image to a background server through a network;
step 13: and after the pan-tilt receives the control command, the pan-tilt drives the lens to rotate, adjusts the focal length of the lens and shoots the next frame.
In step 9, the training step of yolov3 algorithm includes:
step 901: constructing a data set;
step 902: modifying the frame regression function IOU in the yolov3 method into CIOU;
step 903: putting the training data set into a network of yolov3 algorithm to extract target features and candidate boxes;
step 904: determining the initial number and width and height of candidate frames anchor, learning image features from a training data set, and performing cluster analysis on the candidate frames by using a K-means clustering algorithm; taking the K value as the number of the candidate frames anchor; the CIOU is used as an evaluation standard for judging the position of the target boundary frame, the CIOU takes the distance between the target and the frame anchor, the overlapping rate, the scale and the penalty into consideration, so that the regression of the target frame becomes more stable, the CIOU threshold is set to be 0.45, and the confidence coefficient threshold is set to be 0.3;
step 905: obtaining the type information of a model output target boundary frame and a target through forward propagation;
step 906: calculating the total loss values of all loss functions according to the regression loss function of the detection frame information output result and the actual detection frame position information, the regression loss function of the object prediction center point and the actual center point position information, the classification loss function of the target class information and the actual class label and the regression loss function of the object prediction confidence coefficient and the actual object confidence coefficient, and adjusting the values of parameters in the yolov3 algorithm model according to a gradient descent algorithm and a back propagation algorithm;
step 907: and after 20000 times of iterative training, obtaining a trained model of yolov3 algorithm.
In the step 901, the specific steps of constructing the training set include:
step 9011: collecting a video image of a target to be detected;
collecting targets to be detected under the conditions of different angles, different postures, different backgrounds, different distances, different shelters, different illumination, different weather and the like of the targets to be detected shot under a camera;
step 9012: 3000-4000 images are selected from the collected video images, and different images are non-continuously shot images, so that model training is prevented from being over-fitted; the number of images in each case is the same, so that the accuracy of the trained model is prevented from being low;
step 9013: marking the position and the category of the target object by using a labelimg marking tool to obtain the training data set; wherein, the CIOU regression loss is calculated as follows:
$$L_{CIOU} = 1 - IOU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha\upsilon$$

where ρ(b, b^gt) denotes the Euclidean distance between the center points b and b^gt of the prediction box and the real box, and c denotes the diagonal length of the minimum closure region that simultaneously contains the prediction box and the real box;

α is a weighting function, defined as follows:

$$\alpha = \frac{\upsilon}{(1 - IOU) + \upsilon}$$

υ is a measure of the similarity of the aspect ratios, defined as follows:

$$\upsilon = \frac{4}{\pi^{2}}\left(\arctan\frac{\omega^{gt}}{h^{gt}} - \arctan\frac{\omega}{h}\right)^{2}$$

where ω^gt and h^gt denote the width and height of the real box, ω and h denote the width and height of the prediction box, and the superscript gt denotes the label data.
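Step 9013 marks target positions and categories with the labelimg tool; labelimg commonly saves Pascal VOC style XML files, and the sketch below shows one way such an annotation could be converted into the normalized (class, x, y, w, h) rows used for yolov3 training. The XML layout assumed here is labelimg's default VOC format; if labelimg's YOLO export option is used instead, no conversion is needed.

```python
import xml.etree.ElementTree as ET

def voc_to_yolo(xml_path, class_names):
    """Convert one labelimg (Pascal VOC XML) annotation into yolov3-style rows.

    Returns rows of (class_index, x_center, y_center, width, height), all normalized
    to [0, 1] by the image size stored in the annotation.
    """
    root = ET.parse(xml_path).getroot()
    img_w = float(root.find("size/width").text)
    img_h = float(root.find("size/height").text)
    rows = []
    for obj in root.iter("object"):
        cls = class_names.index(obj.find("name").text)
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        rows.append((cls,
                     (xmin + xmax) / 2 / img_w, (ymin + ymax) / 2 / img_h,
                     (xmax - xmin) / img_w, (ymax - ymin) / img_h))
    return rows
```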
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a second device) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (9)

1. A system for detecting and tracking a target object, comprising: a multi-processor system-on-chip (MPSoC) module, a lens, a holder, a communication module and a peripheral module;
the holder is connected with the MPSoC module through an RS485 interface, is used for receiving control commands sent by the MPSoC module and driving the lens to rotate through 360 degrees in the horizontal range and/or pitch through 180 degrees in the vertical range, and is also used for adjusting the focal length of the lens;
the lens is carried on the holder, is connected with the MPSoC module through a low-voltage differential signaling (LVDS) interface, is used for collecting the LVDS video signal of the target object, and is also used for sending the collected LVDS signal to a video image processing unit VIPU of the MPSoC module through the LVDS interface;
the MPSoC module comprises a processing system PS and programmable logic PL, wherein the PS comprises a quad-core processing system and a graphics processing unit GPU, and the quad-core processing system is used for running C++ application programs and the control program of the holder; the PL comprises the VIPU, a deep learning processing unit (DPU) and a video coding unit (VCU), and the PS and the PL are connected through an AXI high-speed bus;
the communication module is used for wireless communication of the detection tracking system, and the communication module comprises an Ethernet module, a 5G module and a WiFi module;
the peripheral module comprises a memory, a power supply, a keyboard, a mouse and a display.
2. The system according to claim 1, wherein the lens is an LVDS camera module for capturing LVDS;
the VIPU is used for converting LVDS collected by the lens into 24-bit RGB video image data, the size of the RGB video image data is 1600 x 900, the VIPU is further used for converting the RGB video image data from an RGB format into BGR video image data, reducing the BGR video image data from 1600 x 900 to 608 x 608, and sending the reduced BGR video image data to the DPU;
the DPU is used for receiving the reduced BGR video image data and carrying out yolov3 algorithm convolution processing on the reduced BGR video image data to obtain yolo layer data of the target object;
the four-core processor is used for screening N candidate frames meeting a threshold condition from the yolo layer data of the target object; sorting the N candidate boxes from large to small according to the confidence degrees; screening out a target frame from the N sorted candidate frames, wherein N is a positive integer;
the four-core processor is used for determining coordinate information (x, y, w, h) of the target object according to the target frame;
the four-core processor is used for reading the target object image stored in the memory and framing the target object in the target object image according to the coordinate information of the target object, so as to obtain a framed video image of the target object;
the four-core processor is further used for calculating the horizontal and vertical rotation angles of the holder and the zoom factor of the lens according to the coordinate information of the target object, obtaining a control command for the holder, and sending the control command to the holder;
the VCU is used for encoding the framed video image of the target object to obtain an H264-encoded video image of the target object, and sending the H264-encoded video image of the target object to a background server through the communication module;
and the holder is used for driving the lens to rotate and adjusting the focal length of the lens after receiving the control command.
3. The detection and tracking system of claim 2, wherein the DPU pre-stores a model of the yolov3 algorithm for the target, the DPU configured to:
carrying out yolov3 algorithm convolution processing on the reduced BGR video image data according to the yolov3 algorithm model of the target object, wherein the frame regression function in the yolov3 algorithm convolution processing is modified from IOU to CIOU;
wherein the yolov3 algorithm model of the target object is obtained by the background server according to the video image training of the target object, and the background server is specifically configured to:
constructing a training data set according to the video image data of the target object;
changing the frame regression function of the yolov3 method from IOU to CIOU;
putting the training data set into a network of yolov3 algorithm to extract a target feature and the N candidate boxes;
determining the initial number and width and height of candidate frames anchor, learning image features from the training data set, and performing cluster analysis on the N candidate frames by using a K-means clustering algorithm; determining the K value as the number of the candidate frames anchor; setting a CIOU threshold value of 0.45 and a confidence coefficient threshold value of 0.3 by taking the CIOU as an evaluation standard for judging the position of the target boundary frame;
obtaining, through forward propagation, the target bounding box and the target category information output by the model;
calculating the total loss values of all loss functions according to the regression loss function of the detection frame information output result and the actual detection frame position information, the regression loss function of the object prediction center point and the actual center point position information, the classification loss function of the target class information and the actual class label and the regression loss function of the object prediction confidence coefficient and the actual object confidence coefficient, and adjusting the values of parameters in the yolov3 algorithm model according to a gradient descent algorithm and a back propagation algorithm;
and performing W iterations of training to obtain the trained yolov3 algorithm model of the target object, wherein W is a positive integer.
4. The detection and tracking system of claim 3, wherein the background server is configured to select M images from the video images of the target object before constructing the training data set according to the RGB image data of the target object, wherein the M images are non-consecutively taken images, and M is a positive integer;
the constructing of the training data set according to the video image data of the target object comprises:
marking the position and the category of the target object by using a labelimg marking tool to obtain the training data set;
wherein, the CIOU regression loss is calculated as follows:
$$L_{CIOU} = 1 - IOU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha\upsilon$$

where ρ(b, b^gt) denotes the Euclidean distance between the center points b and b^gt of the prediction box and the real box, and c denotes the diagonal length of the minimum closure region that simultaneously contains the prediction box and the real box;

α is a weighting function, defined as follows:

$$\alpha = \frac{\upsilon}{(1 - IOU) + \upsilon}$$

υ is a measure of the similarity of the aspect ratios, defined as follows:

$$\upsilon = \frac{4}{\pi^{2}}\left(\arctan\frac{\omega^{gt}}{h^{gt}} - \arctan\frac{\omega}{h}\right)^{2}$$

where ω^gt and h^gt denote the width and height of the real box, ω and h denote the width and height of the prediction box, and the superscript gt denotes the label data.
5. The detection and tracking system of claim 4, wherein the memory is a DDR or SD card for storing video image data of the target object captured by the lens; the power supply is used for providing a working power supply for the detection tracking system; the keyboard is used for inputting an operation command to the detection tracking system; the mouse is used for inputting a positioning or selecting operation command by the detection tracking system; and the display is used for displaying the operation command and the processing result of the detection tracking system.
6. A method for detecting and tracking a target object, characterized in that a system for detecting and tracking the target object comprises a multi-processor system-on-chip (MPSoC) module, a lens, a holder, a communication module and a peripheral module, wherein the holder is connected with the MPSoC module through an RS485 interface, the lens is carried on the holder and connected with the MPSoC module through a low-voltage differential signaling (LVDS) interface, the MPSoC module comprises a processing system PS and programmable logic PL, the PS includes a quad-core processing system and a graphics processing unit (GPU), the PL includes a video image processing unit (VIPU), a deep learning processing unit (DPU) and a video coding unit (VCU), the PS and the PL are connected through an AXI high-speed bus, the communication module comprises an Ethernet module, a 5G module and a WiFi module, the peripheral module comprises a memory, a power supply, a keyboard, a mouse and a display, and the lens is an LVDS camera module, the method comprising the following steps:
the LVDS camera module collects LVDS of the target object;
the VIPU converts LVDS of the target object into RGB video image data of 24 bits, the size of the RGB video image data is 1600 x 900, the RGB video image data is converted into BGR video image data from an RGB format, the BGR video image data is reduced from 1600 x 900 to 608 x 608, and the reduced BGR video image data is sent to the DPU;
the DPU receives the reduced BGR video image data, and performs yolov3 algorithm convolution processing on the reduced BGR video image data to obtain yolo layer data of the target object;
the four-core processor screens out N candidate frames meeting a threshold condition from yolo layer data of the target object; sorting the N candidate boxes from large to small according to the confidence degrees; screening out a target frame from the N sorted candidate frames, wherein N is a positive integer;
the four-core processor determines coordinate information (x, y, w, h) of the target object according to the target frame;
the four-core processor reads the target object image stored in the memory, and frames the target object in the target object image according to the coordinate information of the target object to obtain a framed video image of the target object;
the four-core processor further calculates the horizontal and vertical rotation angles of the holder and the zoom factor of the lens according to the coordinate information of the target object, obtains a control command for the holder, and sends the control command to the holder;
the VCU encodes the framed video image of the target object to obtain an H264-encoded video image of the target object, and sends the H264-encoded video image of the target object to a background server through the communication module;
and the holder is used for driving the lens to rotate and adjusting the focal length of the lens after receiving the control command.
7. The detection and tracking method according to claim 6, wherein the DPU receiving the reduced BGR video image data and performing yolov3 algorithm convolution processing on the reduced BGR video image data to obtain the yolo layer data of the target object comprises:
and carrying out yolov3 algorithm convolution processing on the reduced BGR video image data according to a yolov3 algorithm model of the target, wherein the frame regression function in the yolov3 algorithm convolution processing is modified to CIOU by an IOU.
8. The detection and tracking method according to claim 7, further comprising:
the background server constructs a training data set according to the RGB image data of the target object; changes the frame regression function of the yolov3 algorithm from IOU to CIOU; puts the training data set into the yolov3 network to extract target features and the N candidate boxes; determines the initial number, width and height of the anchor candidate boxes, learns image features from the training data set, performs cluster analysis on the N candidate boxes using a K-means clustering algorithm, and takes the K value as the number of anchor boxes; uses the CIOU as the evaluation criterion for judging the position of the target bounding box, with a CIOU threshold of 0.45 and a confidence threshold of 0.3; obtains, through forward propagation, the target bounding box and the target category information output by the model; computes the total loss value of all loss functions from the regression loss between the output detection-box information and the actual detection-box position, the regression loss between the predicted object center point and the actual center point position, the classification loss between the target category information and the actual category label, and the regression loss between the predicted object confidence and the actual object confidence, and adjusts the parameter values of the yolov3 algorithm model using a gradient descent algorithm and a back-propagation algorithm; and performs W iterations of training to obtain the trained yolov3 algorithm model of the target object, wherein W is a positive integer;
a model of yolov3 algorithm for the target is pre-stored in the DPU.
9. The detection and tracking method according to claim 8, wherein the background server constructing a training data set according to the RGB image data of the target object comprises:
the background server selects M images from the video images of the target object, wherein the M images are discontinuously shot images, and M is a positive integer;
marking the position and the category of the target object by using a labelimg marking tool to obtain the training data set;
wherein the CIOU regression loss is calculated as follows:
$$L_{CIOU} = 1 - IOU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha\upsilon$$

where ρ(b, b^gt) denotes the Euclidean distance between the center points b and b^gt of the prediction box and the real box, and c denotes the diagonal length of the minimum closure region that simultaneously contains the prediction box and the real box;

α is a weighting function, defined as follows:

$$\alpha = \frac{\upsilon}{(1 - IOU) + \upsilon}$$

υ is a measure of the similarity of the aspect ratios, defined as follows:

$$\upsilon = \frac{4}{\pi^{2}}\left(\arctan\frac{\omega^{gt}}{h^{gt}} - \arctan\frac{\omega}{h}\right)^{2}$$

where ω^gt and h^gt denote the width and height of the real box, ω and h denote the width and height of the prediction box, and the superscript gt denotes the annotation data.
CN202110247972.0A 2021-03-06 2021-03-06 Target object detection and tracking system and method Pending CN112950673A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110247972.0A CN112950673A (en) 2021-03-06 2021-03-06 Target object detection and tracking system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110247972.0A CN112950673A (en) 2021-03-06 2021-03-06 Target object detection and tracking system and method

Publications (1)

Publication Number Publication Date
CN112950673A true CN112950673A (en) 2021-06-11

Family

ID=76229550

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110247972.0A Pending CN112950673A (en) 2021-03-06 2021-03-06 Target object detection and tracking system and method

Country Status (1)

Country Link
CN (1) CN112950673A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113452918A (en) * 2021-07-20 2021-09-28 山东万腾电子科技有限公司 Target object detection and tracking system and method
CN116740507A (en) * 2023-08-02 2023-09-12 中科星图测控技术股份有限公司 ARM architecture-based space target detection model construction method
WO2024021484A1 (en) * 2022-07-25 2024-02-01 亿航智能设备(广州)有限公司 Onboard visual computing apparatus and aircraft

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111008994A (en) * 2019-11-14 2020-04-14 山东万腾电子科技有限公司 Moving target real-time detection and tracking system and method based on MPSoC
CN111429486A (en) * 2020-04-27 2020-07-17 山东万腾电子科技有限公司 DNNDK model-based moving object real-time detection tracking system and method
CN111985621A (en) * 2020-08-24 2020-11-24 西安建筑科技大学 Method for building neural network model for real-time detection of mask wearing and implementation system
CN112287788A (en) * 2020-10-20 2021-01-29 杭州电子科技大学 Pedestrian detection method based on improved YOLOv3 and improved NMS

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111008994A (en) * 2019-11-14 2020-04-14 山东万腾电子科技有限公司 Moving target real-time detection and tracking system and method based on MPSoC
CN111429486A (en) * 2020-04-27 2020-07-17 山东万腾电子科技有限公司 DNNDK model-based moving object real-time detection tracking system and method
CN111985621A (en) * 2020-08-24 2020-11-24 西安建筑科技大学 Method for building neural network model for real-time detection of mask wearing and implementation system
CN112287788A (en) * 2020-10-20 2021-01-29 杭州电子科技大学 Pedestrian detection method based on improved YOLOv3 and improved NMS

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113452918A (en) * 2021-07-20 2021-09-28 山东万腾电子科技有限公司 Target object detection and tracking system and method
WO2024021484A1 (en) * 2022-07-25 2024-02-01 亿航智能设备(广州)有限公司 Onboard visual computing apparatus and aircraft
CN116740507A (en) * 2023-08-02 2023-09-12 中科星图测控技术股份有限公司 ARM architecture-based space target detection model construction method

Similar Documents

Publication Publication Date Title
US11727594B2 (en) Augmented reality for three-dimensional model reconstruction
US20230154193A1 (en) License plate detection and recognition system
CN112950673A (en) Target object detection and tracking system and method
US10198823B1 (en) Segmentation of object image data from background image data
US9965865B1 (en) Image data segmentation using depth data
CN110400352B (en) Camera calibration with feature recognition
EP2874097A2 (en) Automatic scene parsing
US10242294B2 (en) Target object classification using three-dimensional geometric filtering
CN106529538A (en) Method and device for positioning aircraft
CN107589758A (en) A kind of intelligent field unmanned plane rescue method and system based on double source video analysis
CN106716443A (en) Feature computation in a sensor element array
CN109801265B (en) Real-time transmission equipment foreign matter detection system based on convolutional neural network
CN109117838B (en) Target detection method and device applied to unmanned ship sensing system
EP4050305A1 (en) Visual positioning method and device
CN111339976B (en) Indoor positioning method, device, terminal and storage medium
CN111008994A (en) Moving target real-time detection and tracking system and method based on MPSoC
CN112702481A (en) Table tennis track tracking device and method based on deep learning
CN112183148A (en) Batch bar code positioning method and identification system
CN117197388A (en) Live-action three-dimensional virtual reality scene construction method and system based on generation of antagonistic neural network and oblique photography
CN114241012A (en) High-altitude parabolic determination method and device
CN110992297A (en) Multi-commodity image synthesis method and device, electronic equipment and storage medium
CN113452918A (en) Target object detection and tracking system and method
CN115393962A (en) Motion recognition method, head-mounted display device, and storage medium
CN115035466A (en) Infrared panoramic radar system for safety monitoring
Tian Effective image enhancement and fast object detection for improved UAV applications

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210611)