CN112950673A - Target object detection and tracking system and method - Google Patents

Target object detection and tracking system and method

Info

Publication number
CN112950673A
Authority
CN
China
Prior art keywords
target object
video image
module
target
image data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110247972.0A
Other languages
Chinese (zh)
Inventor
徐雅楠
张燕
陈广辉
陈峰
杨玉宽
赵明建
焉保卿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Vt Electron Technology Co ltd
Original Assignee
Shandong Vt Electron Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Vt Electron Technology Co ltd filed Critical Shandong Vt Electron Technology Co ltd
Priority to CN202110247972.0A
Publication of CN112950673A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a detection and tracking system for a target object, comprising: a multi-processor system-on-chip (MPSoC) module, a lens, a holder (pan-tilt head), a communication module and a peripheral module. The holder is used for receiving control commands sent by the MPSoC module and driving the lens to rotate through 360 degrees in the horizontal range and/or pitch through 180 degrees in the vertical range. The lens is mounted on the holder and is used for collecting the LVDS video signal of the target object and sending the collected LVDS signal to a video image processing unit (VIPU) of the MPSoC module through an LVDS interface. The MPSoC module comprises a processing system (PS) and programmable logic (PL); the PS comprises a quad-core processing system and a graphics processing unit (GPU), and the PL comprises the VIPU, a deep learning processing unit (DPU) and a video coding unit (VCU). The communication module is used for wireless communication of the detection and tracking system and comprises an Ethernet module, a 5G module and a WiFi module. The peripheral module comprises a memory, a power supply, a keyboard, a mouse and a display.

Description

Target object detection and tracking system and method
Technical Field
The present application relates to the field of image processing, and more particularly, to a system and method for detecting and tracking a target object.
Background
In recent years, research on detecting and tracking target objects has mostly been based on deep learning technology. However, because deep learning requires high-performance computation and a large operating memory to maintain good detection and tracking performance, conventional X86-architecture systems suffer from large volume, poor mobility, high cost and high image-transmission delay. The YOLO algorithm based on deep learning achieves a good detection effect and high precision, but in practical applications its positioning is not accurate enough and its time efficiency is low, so real-time processing cannot be achieved.
Therefore, in order to solve the problems of low detection efficiency and high detection cost in the prior art, a system and a method for detecting and tracking a target object are needed.
Disclosure of Invention
The application provides a system and a method for detecting and tracking a target object, which can improve the efficiency of detecting and tracking the target object and reduce the system cost.
In a first aspect, a system for detecting and tracking a target object is provided, comprising a multi-processor system-on-chip (MPSoC) module, a lens, a holder (pan-tilt head), a communication module and a peripheral module. The holder is connected with the MPSoC module through an RS485 interface; it is used for receiving control commands sent by the MPSoC module and driving the lens to rotate through 360 degrees in the horizontal range and/or pitch through 180 degrees in the vertical range, and is also used for adjusting the focal length of the lens. The lens is carried on the holder, is connected with the MPSoC module through a low-voltage differential signaling (LVDS) interface, and is used for collecting the LVDS video signal of the target object and sending the collected LVDS signal to a video image processing unit (VIPU) of the MPSoC module through the LVDS interface. The MPSoC module comprises a processing system (PS) and programmable logic (PL), wherein the PS comprises a quad-core processing system and a graphics processing unit (GPU), and the quad-core processing system is used for running the C++ application programs and the control program of the holder; the PL comprises the VIPU, a deep learning processing unit (DPU) and a video coding unit (VCU), and the PS and the PL are connected through an AXI high-speed bus. The communication module is used for wireless communication of the detection and tracking system and comprises an Ethernet module, a 5G module and a WiFi module. The peripheral module comprises a memory, a power supply, a keyboard, a mouse and a display.
With reference to the first aspect, in a first possible implementation manner of the first aspect, the lens is an LVDS camera module configured to acquire the LVDS video signal; the VIPU is used for converting the LVDS of the target object into 24-bit RGB video image data with a size of 1600 x 900, converting the RGB video image data from the RGB format into BGR video image data, reducing the BGR video image data from 1600 x 900 to 608 x 608, and sending the reduced BGR video image data to the DPU; the DPU is used for receiving the reduced BGR video image data and performing yolov3 algorithm convolution processing on it to obtain the yolo layer data of the target object; the four-core processor is used for screening out N candidate frames meeting a threshold condition from the yolo layer data of the target object, sorting the N candidate frames by confidence from large to small, and screening out a target frame from the sorted N candidate frames, where N is a positive integer; the four-core processor is used for determining the coordinate information (x, y, w, h) of the target object according to the target frame; the four-core processor is used for reading the target object image stored in the memory and framing the target object in the image according to the coordinate information of the target object, so as to obtain a framed video image of the target object; the four-core processor is further used for calculating the horizontal and vertical rotation angles of the holder and the zoom factor of the lens according to the coordinate information of the target object, obtaining a control command for the holder and sending the control command to the holder; the VCU is used for encoding the framed video image of the target object to obtain an H264-encoded video image of the target object and sending it to a background server through the communication module; and the holder is used for driving the lens to rotate and adjusting the focal length of the lens after receiving the control command.
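For reference, the following Python sketch illustrates one way the candidate-box post-processing described above could be realised in software: filtering by a confidence threshold, sorting by confidence, and suppressing overlapping boxes of the same target. The 0.3 and 0.45 defaults mirror thresholds mentioned elsewhere in this application; the center-based (x, y, w, h) box convention and the IoU-based suppression details are illustrative assumptions rather than the exact on-device implementation.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one (x, y, w, h) box and an (M, 4) array of boxes; (x, y) is the box center."""
    x1 = np.maximum(box[0] - box[2] / 2, boxes[:, 0] - boxes[:, 2] / 2)
    y1 = np.maximum(box[1] - box[3] / 2, boxes[:, 1] - boxes[:, 3] / 2)
    x2 = np.minimum(box[0] + box[2] / 2, boxes[:, 0] + boxes[:, 2] / 2)
    y2 = np.minimum(box[1] + box[3] / 2, boxes[:, 1] + boxes[:, 3] / 2)
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = box[2] * box[3] + boxes[:, 2] * boxes[:, 3] - inter
    return inter / np.maximum(union, 1e-9)

def select_target_boxes(candidates, conf_thresh=0.3, overlap_thresh=0.45):
    """candidates: (N, 5) array of rows (x, y, w, h, confidence) decoded from the yolo layer."""
    # 1. keep only the candidate boxes whose confidence meets the threshold
    kept = candidates[candidates[:, 4] >= conf_thresh]
    # 2. sort the remaining candidates by confidence, from large to small
    kept = kept[np.argsort(-kept[:, 4])]
    # 3. greedily keep the best box and drop overlapping boxes of the same target
    targets = []
    while len(kept) > 0:
        best, rest = kept[0], kept[1:]
        targets.append(best)
        if len(rest) == 0:
            break
        kept = rest[iou(best[:4], rest[:, :4]) < overlap_thresh]
    return np.array(targets)  # each row: (x, y, w, h, confidence)
```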
With reference to the first aspect and the foregoing implementation manner, in a second possible implementation manner of the first aspect, a yolov3 algorithm model of the target object is pre-stored in the DPU, and the DPU is configured to perform yolov3 algorithm convolution processing on the reduced BGR video image data according to the yolov3 algorithm model of the target object, wherein the frame regression function in the yolov3 algorithm convolution processing is modified from IOU to CIOU. The yolov3 algorithm model of the target object is obtained by the background server by training on video images of the target object, and the background server is specifically configured to: construct a training data set from the video image data of the target object; change the frame regression function of the yolov3 algorithm from IOU to CIOU; put the training data set into the yolov3 network to extract target features and the N candidate boxes;
determine the initial number, width and height of the anchor candidate boxes, learn image features from the training data set, perform cluster analysis on the N candidate boxes using a K-means clustering algorithm, and take the K value as the number of anchor boxes; use the CIOU as the evaluation criterion for judging the position of the target bounding box, with a CIOU threshold of 0.45 and a confidence threshold of 0.3; obtain, through forward propagation, the target bounding box and the target category information output by the model; compute the total loss value of all loss functions from the regression loss between the output detection-box information and the actual detection-box position, the regression loss between the predicted object center point and the actual center point position, the classification loss between the target category information and the actual category label, and the regression loss between the predicted object confidence and the actual object confidence, and adjust the parameter values of the yolov3 algorithm model using a gradient descent algorithm and a back-propagation algorithm; and perform W iterations of training to obtain the trained yolov3 algorithm model of the target object, where W is a positive integer.
With reference to the first aspect and the foregoing implementation manner, in a third possible implementation manner of the first aspect, before a training data set is constructed according to RGB image data of the target object, the background server is configured to select M images from video images of the target object, where the M images are discontinuously captured images, and M is a positive integer; the constructing of the training data set according to the video image data of the target object comprises: marking the position and the category of the target object by using a labelimg marking tool to obtain the training data set; wherein, the CIOU regression loss is calculated as follows:
$$L_{CIOU} = 1 - IOU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha\upsilon$$

where ρ(b, b^gt) denotes the Euclidean distance between the center points b and b^gt of the prediction box and the real box, and c denotes the diagonal length of the minimum closure region that simultaneously contains the prediction box and the real box;

α is a weighting function, defined as follows:

$$\alpha = \frac{\upsilon}{(1 - IOU) + \upsilon}$$

υ is a measure of the similarity of the aspect ratios, defined as follows:

$$\upsilon = \frac{4}{\pi^{2}}\left(\arctan\frac{\omega^{gt}}{h^{gt}} - \arctan\frac{\omega}{h}\right)^{2}$$

where ω^gt and h^gt denote the width and height of the real box, ω and h denote the width and height of the prediction box, and the superscript gt denotes the label data.
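To make the formulas above concrete, the short Python sketch below evaluates the CIOU regression loss for one prediction box and one real box, each given as (x, y, w, h) with (x, y) as the box center. It is an illustrative reading of the definitions above; the small epsilon terms for numerical safety are assumptions and are not part of this application.

```python
import math

def ciou_loss(pred, gt):
    """CIOU regression loss for two (x, y, w, h) boxes, following the formulas above."""
    def corners(b):
        x, y, w, h = b
        return x - w / 2, y - h / 2, x + w / 2, y + h / 2

    px1, py1, px2, py2 = corners(pred)
    gx1, gy1, gx2, gy2 = corners(gt)

    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    iou = inter / max(union, 1e-9)

    # rho^2: squared distance between the two box centers
    rho2 = (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2
    # c: diagonal of the smallest region enclosing both the prediction and the real box
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2 + 1e-9

    # upsilon and alpha as defined above
    upsilon = (4 / math.pi ** 2) * (math.atan(gt[2] / gt[3]) - math.atan(pred[2] / pred[3])) ** 2
    alpha = upsilon / ((1 - iou) + upsilon + 1e-9)

    return 1 - iou + rho2 / c2 + alpha * upsilon
```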
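The anchor clustering step mentioned in the training description above (K-means on the labelled boxes, with the K value taken as the number of anchors) can be illustrated as follows. The 1 - IOU style distance and the example value k = 9 in the usage comment are common yolov3 practice and are assumptions here; the application itself only states that a K-means clustering algorithm is applied.

```python
import numpy as np

def kmeans_anchors(wh, k, iters=100, seed=0):
    """Cluster labelled box (width, height) pairs into k anchors using an IoU-based distance."""
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # IoU between every box and every anchor, assuming the boxes share a common corner
        inter = np.minimum(wh[:, None, 0], anchors[None, :, 0]) * \
                np.minimum(wh[:, None, 1], anchors[None, :, 1])
        union = wh[:, None, 0] * wh[:, None, 1] + anchors[None, :, 0] * anchors[None, :, 1] - inter
        assign = np.argmax(inter / union, axis=1)          # nearest anchor = highest IoU
        new = np.array([wh[assign == i].mean(axis=0) if np.any(assign == i) else anchors[i]
                        for i in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors[np.argsort(anchors.prod(axis=1))]       # sorted by area, as yolov3 expects

# e.g. widths/heights (in pixels of the 608 x 608 input) gathered from the training labels:
# anchors = kmeans_anchors(label_wh, k=9)
```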
With reference to the first aspect and the foregoing implementation manners, in a fourth possible implementation manner of the first aspect, the memory is a DDR or an SD card and is used for storing the video image data of the target object acquired by the lens; the power supply is used for providing working power for the detection and tracking system; the keyboard is used for inputting operation commands to the detection and tracking system; the mouse is used for inputting positioning or selection commands to the detection and tracking system; and the display is used for displaying the operation commands and processing results of the detection and tracking system.
In a second aspect, a method for detecting and tracking a target object is provided. The target object detection and tracking system comprises a multi-processor system-on-chip (MPSoC) module, a lens, a holder, a communication module and a peripheral module, wherein the holder is connected with the MPSoC module through an RS485 interface; the lens is carried on the holder and is connected with the MPSoC module through a low-voltage differential signaling (LVDS) interface; the MPSoC module comprises a processing system PS and programmable logic PL, the PS includes a quad-core processing system and a graphics processing unit (GPU), the PL includes a video image processing unit (VIPU), a deep learning processing unit (DPU) and a video coding unit (VCU), and the PS and the PL are connected through an AXI high-speed bus; the communication module comprises an Ethernet module, a 5G module and a WiFi module; the peripheral module comprises a memory, a power supply, a keyboard, a mouse and a display; and the lens is an LVDS camera module. The method comprises the following steps: the LVDS camera module collects the LVDS video signal of the target object; the VIPU converts the LVDS of the target object into 24-bit RGB video image data with a size of 1600 x 900, converts the RGB video image data from the RGB format into BGR video image data, reduces the BGR video image data from 1600 x 900 to 608 x 608, and sends the reduced BGR video image data to the DPU; the DPU receives the reduced BGR video image data and performs yolov3 algorithm convolution processing on it to obtain the yolo layer data of the target object; the four-core processor screens out N candidate frames meeting a threshold condition from the yolo layer data of the target object, sorts the N candidate frames by confidence from large to small, and screens out a target frame from the sorted N candidate frames, where N is a positive integer; the four-core processor determines the coordinate information (x, y, w, h) of the target object according to the target frame; the four-core processor reads the target object image stored in the memory and frames the target object in the image according to the coordinate information of the target object to obtain a framed video image of the target object; the four-core processor further calculates the horizontal and vertical rotation angles of the holder and the zoom factor of the lens according to the coordinate information of the target object, obtains a control command for the holder, and sends the control command to the holder; the VCU encodes the framed video image of the target object to obtain an H264-encoded video image of the target object and sends it to a background server through the communication module; and after receiving the control command, the holder drives the lens to rotate and adjusts the focal length of the lens.
With reference to the second aspect, in a first possible implementation manner of the second aspect, the DPU receiving the reduced BGR video image data and performing yolov3 algorithm convolution processing on it to obtain the yolo layer data of the target object includes: performing yolov3 algorithm convolution processing on the reduced BGR video image data according to the yolov3 algorithm model of the target object, wherein the frame regression function in the yolov3 algorithm convolution processing is modified from IOU to CIOU.
With reference to the second aspect and the foregoing implementation manner of the second aspect, in a second possible implementation manner of the second aspect, the method further includes: the background server constructs a training data set from the RGB image data of the target object; changes the frame regression function of the yolov3 algorithm from IOU to CIOU; puts the training data set into the yolov3 network to extract target features and the N candidate boxes; determines the initial number, width and height of the anchor candidate boxes, learns image features from the training data set, performs cluster analysis on the N candidate boxes using a K-means clustering algorithm, and takes the K value as the number of anchor boxes; uses the CIOU as the evaluation criterion for judging the position of the target bounding box, with a CIOU threshold of 0.45 and a confidence threshold of 0.3; obtains, through forward propagation, the target bounding box and the target category information output by the model; computes the total loss value of all loss functions from the regression loss between the output detection-box information and the actual detection-box position, the regression loss between the predicted object center point and the actual center point position, the classification loss between the target category information and the actual category label, and the regression loss between the predicted object confidence and the actual object confidence, and adjusts the parameter values of the yolov3 algorithm model using a gradient descent algorithm and a back-propagation algorithm; and performs W iterations of training to obtain the trained yolov3 algorithm model of the target object, where W is a positive integer.
With reference to the second aspect and the foregoing implementation manner of the second aspect, in a third possible implementation manner of the second aspect, the constructing, by the backend server, a training data set according to the RGB image data of the target object includes: the background server selects M images from the video images of the target object, wherein the M images are discontinuously shot images, and M is a positive integer; marking the position and the category of the target object by using a labelimg marking tool to obtain the training data set; wherein, the CIOU regression loss is calculated as follows:
$$L_{CIOU} = 1 - IOU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha\upsilon$$

where ρ(b, b^gt) denotes the Euclidean distance between the center points b and b^gt of the prediction box and the real box, and c denotes the diagonal length of the minimum closure region that simultaneously contains the prediction box and the real box;

α is a weighting function, defined as follows:

$$\alpha = \frac{\upsilon}{(1 - IOU) + \upsilon}$$

υ is a measure of the similarity of the aspect ratios, defined as follows:

$$\upsilon = \frac{4}{\pi^{2}}\left(\arctan\frac{\omega^{gt}}{h^{gt}} - \arctan\frac{\omega}{h}\right)^{2}$$

where ω^gt and h^gt denote the width and height of the real box, ω and h denote the width and height of the prediction box, and the superscript gt denotes the label data.
Therefore, with the detection and tracking system, the delay of the video image acquisition and processing pipeline formed by combining the lens and the MPSoC module is low, about 10 ms, which greatly reduces the overall system delay. Because an embedded scheme is adopted, the system is small, low-cost and easy to carry. In addition, on the image processing side, the frame regression processing in the yolov3 algorithm is modified, improving detection precision, and the deep learning algorithm is accelerated with the FPGA, taking only about 5 ms, which greatly increases detection speed and improves time efficiency.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a schematic structural diagram of an embodiment of the present application;
FIG. 2 is a diagram of a hardware configuration according to another embodiment of the present application;
FIG. 3 is a schematic flow chart diagram of a method of one embodiment of the present application;
FIG. 4 is a system interface of a software terminal according to an embodiment of the present application;
fig. 5 is a system interface of a software terminal according to another embodiment of the present application.
Detailed Description
The technical solution in the present application will be described below with reference to the accompanying drawings.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The invention is further described with reference to the following figures and detailed description of embodiments.
Fig. 1 is a schematic structural diagram of an embodiment of the present application, and as shown in fig. 1, a system for detecting and tracking a target object includes a lens, a VIPU (Video Image processing Unit), a DPU (Deep learning processing Unit), an ARM, a VCU (Video coding Unit), a pan-tilt, and a background server.
The lens, namely the LVDS camera module, may for example be a Basler module, and is used to collect the low-voltage differential signal (LVDS);
the VIPU is used for processing the low-voltage differential signal LVDS;
the DPU completes the convolution part operation of the deep learning algorithm;
the ARM is mainly used for processing yolo layer data and generating control commands and operating image frames and the like;
the VCU is used for coding video images;
the holder mainly drives the lens to shoot and adjusts the focal length of the lens;
the background server is mainly used for completing initialization setting of the holder, storage and display of the coded video image and the like.
It should be understood that in this embedded system the ARM serves as the CPU and the FPGA as the coprocessor, an Ubuntu system runs on the hardware, and the hardware devices together with the software system form the complete embedded system.
Specifically, fig. 2 is a schematic diagram of a hardware structure according to another embodiment of the present application. As shown in the figure, the detection and tracking system for the target object provided by the application comprises an MPSoC module, a lens, a holder, a communication module and a peripheral module.
The MPSoC (Multi-Processor System on Chip) module comprises: a PS (Processing System) and PL (Programmable Logic). The PS includes an ARM Cortex-A53 quad-core processing system (clock rate up to 1.5 GHz), an ARM Cortex-R5 quad-core processing system (clock rate up to 600 MHz) and a GPU (Graphics Processing Unit, clock rate up to 667 MHz). The PL includes the VIPU (Video Image Processing Unit), the DPU (Deep Learning Processing Unit) and the VCU (Video Coding Unit). The PS and the PL are connected by an AXI high-speed bus.
The ARM Cortex-A53 serves as the four-core processor mentioned above and mainly runs the C++ application programs and the holder control program, and the ARM Cortex-R5 shown in the figure is also a four-core processor. The VIPU converts the LVDS collected by the lens into a 24-bit RGB video signal, stores it on the one hand, and on the other hand performs color space conversion (RGB to BGR) and scaling, reducing the image from 1600 x 900 to 608 x 608. The DPU mainly runs the convolution part of the yolov3 algorithm to accelerate it. The VCU encodes the video image to be transmitted over the network, greatly shortening the time taken to transmit the video image.
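In the described system this conversion and scaling is done in programmable logic by the VIPU. Purely for reference, the Python/OpenCV sketch below performs the equivalent transformation in software on a 24-bit 1600 x 900 RGB frame; it is an illustrative analogue, not the hardware implementation.

```python
import cv2
import numpy as np

def preprocess_for_dpu(rgb_frame):
    """Software analogue of the VIPU preprocessing: RGB -> BGR, then 1600 x 900 -> 608 x 608."""
    # assumes a 24-bit RGB frame of width 1600 and height 900
    assert rgb_frame.shape == (900, 1600, 3) and rgb_frame.dtype == np.uint8
    bgr = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2BGR)                       # color space conversion
    resized = cv2.resize(bgr, (608, 608), interpolation=cv2.INTER_LINEAR)  # scale to the yolov3 input size
    return resized                                                          # what the DPU receives
```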
the holder is connected with the MPSoC module through an RS485 interface, receives a control command sent by the MPSoC module, drives the lens to carry out 360-degree rotation shooting in a monitored horizontal range and 180-degree pitching shooting in a vertical range, and adjusts the focal length of the lens;
the lens is connected with the MPSoC module through an LVDS interface, and transmits the collected LVDS to the VIPU through the LVDS interface for further processing by the VIPU;
the communication module comprises Ethernet, 5G and WiFi and mainly realizes the online transmission of data;
the peripheral comprises a memory, a power supply, a keyboard, a mouse and a display;
the memory can be a DDR or SD card and is used for storing video image data;
the power supply provides a stable working power supply for the work of the system;
the keyboard is used as input equipment for inputting operation commands;
the mouse is used as an input device for positioning and selecting;
the display displays the operation command and the processing result in real time.
Fig. 3 is a flowchart of the method. As shown in the figure, the present application provides an embedded target detection and tracking method in which the holder drives the lens to perform 360-degree rotating shooting within the monitored horizontal range and 180-degree pitching shooting within the vertical range. The lens captures an LVDS video signal, which is transmitted through the LVDS interface to the VIPU for processing: the LVDS is converted into 24-bit RGB data, of size 1600 x 900, that can be transmitted over the AXIS bus; the RGB data is stored in the memory for subsequent data fusion and is also further processed and converted into BGR data of size 608 x 608; the processed data is transmitted to the DPU for convolution processing to obtain the yolo layer data; the yolo layer data is transmitted to the ARM over the bus, and the ARM program post-processes the yolo layer data to obtain the target coordinate position (x, y, w, h), the horizontal and vertical rotation angles of the holder and the lens zoom factor. On the ARM side, the target coordinate position is superimposed on the image read from the memory, the image is sent to the VCU for H264 video encoding, and finally the encoded video image is sent to the background server over the network for the staff to view in real time. The holder control program sends the horizontal and vertical rotation angles of the holder and the lens zoom factor to the holder through a control data protocol, controls the holder to rotate and adjusts the focal length of the lens, so that the target is clearly locked in the lens. The specific implementation steps are as follows:
step 1: the method comprises the steps that a system is powered on, a background server is remotely logged in, initial configuration is carried out, and the initial configuration comprises equipment selection and connection, parameter setting and cradle head rotation, wherein the equipment selection comprises different cradle head IP addresses, as shown in figure 4, the parameter setting comprises equipment detailed information, adding equipment and system setting, wherein the system setting comprises the horizontal and vertical moving speeds of the cradle head, and video storage paths and names, as shown in figure 5;
step 2: manually controlling the holder to rotate to align the lens with the target;
and step 3: acquiring LVDS of a target by a lens;
and 4, step 4: the lens sends the LVDS to the VIPU through the LVDS interface;
and 5: VIPU converts LVDS differential video signals into 24-bit RGB video image data which can be transmitted through AXIS bus, and the size of the RGB video image data is 1600 x 900;
step 6, buffering RGB video image data in a memory on one hand to facilitate subsequent data superposition; on the other hand, further pretreatment is carried out;
and 7: RGB video image data is converted from RGB to BGR;
and 8: reducing the BGR video image data from 1600 × 900 to 608 × 608, and meeting the image size required by yolov3 algorithm;
and step 9: sending the reduced image to a DPU (dual-processing unit), and performing yolov3 algorithm convolution processing to obtain yolo layer data;
step 10: and the ARM receives the yolo layer data processed by the DPU and performs subsequent processing. Firstly, preliminarily screening candidate frames meeting a condition (setting a threshold value of 0.3); secondly, sorting the candidate frames according to the confidence degree; finally, performing final screening on the sorted candidate frames, mainly for eliminating overlapped frames of the same target; it should be understood that the set threshold may be other values, and the application is not limited thereto.
Step 11: after ARM processing, the coordinates (x, y, w, h) of the target are obtained. On one hand, the image stored in the memory is read and the target is framed in the image according to the coordinate information, giving a framed video image; on the other hand, the horizontal and vertical rotation angles of the holder and the zoom factor of the lens are calculated from the coordinate information, yielding the control command for the holder (an illustrative angle calculation is sketched after step 13);
step 12: the VCU encodes the video image after the picture frame to obtain an H264 encoded video image, and then transmits the video image to a background server through a network;
step 13: and after the pan-tilt receives the control command, the pan-tilt drives the lens to rotate, adjusts the focal length of the lens and shoots the next frame.
In step 9, the training step of yolov3 algorithm includes:
step 901: constructing a data set;
step 902: modifying the frame regression function IOU in the yolov3 method into CIOU;
step 903: putting the training data set into a network of yolov3 algorithm to extract target features and candidate boxes;
step 904: determining the initial number and width and height of candidate frames anchor, learning image features from a training data set, and performing cluster analysis on the candidate frames by using a K-means clustering algorithm; taking the K value as the number of the candidate frames anchor; the CIOU is used as an evaluation standard for judging the position of the target boundary frame, the CIOU takes the distance between the target and the frame anchor, the overlapping rate, the scale and the penalty into consideration, so that the regression of the target frame becomes more stable, the CIOU threshold is set to be 0.45, and the confidence coefficient threshold is set to be 0.3;
step 905: obtaining the type information of a model output target boundary frame and a target through forward propagation;
step 906: calculating the total loss values of all loss functions according to the regression loss function of the detection frame information output result and the actual detection frame position information, the regression loss function of the object prediction center point and the actual center point position information, the classification loss function of the target class information and the actual class label and the regression loss function of the object prediction confidence coefficient and the actual object confidence coefficient, and adjusting the values of parameters in the yolov3 algorithm model according to a gradient descent algorithm and a back propagation algorithm;
step 907: and after 20000 times of iterative training, obtaining a trained model of yolov3 algorithm.
In the step 901, the specific steps of constructing the training set include:
step 9011: collecting a video image of a target to be detected;
collecting targets to be detected under the conditions of different angles, different postures, different backgrounds, different distances, different shelters, different illumination, different weather and the like of the targets to be detected shot under a camera;
step 9012: 3000-4000 images are selected from the collected video images, and different images are non-continuously shot images, so that model training is prevented from being over-fitted; the number of images in each case is the same, so that the accuracy of the trained model is prevented from being low;
step 9013: marking the position and the category of the target object by using a labelimg marking tool to obtain the training data set; wherein, the CIOU regression loss is calculated as follows:
$$L_{CIOU} = 1 - IOU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha\upsilon$$

where ρ(b, b^gt) denotes the Euclidean distance between the center points b and b^gt of the prediction box and the real box, and c denotes the diagonal length of the minimum closure region that simultaneously contains the prediction box and the real box;

α is a weighting function, defined as follows:

$$\alpha = \frac{\upsilon}{(1 - IOU) + \upsilon}$$

υ is a measure of the similarity of the aspect ratios, defined as follows:

$$\upsilon = \frac{4}{\pi^{2}}\left(\arctan\frac{\omega^{gt}}{h^{gt}} - \arctan\frac{\omega}{h}\right)^{2}$$

where ω^gt and h^gt denote the width and height of the real box, ω and h denote the width and height of the prediction box, and the superscript gt denotes the label data.
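Step 9013 marks target positions and categories with the labelimg tool; labelimg commonly saves Pascal VOC style XML files, and the sketch below shows one way such an annotation could be converted into the normalized (class, x, y, w, h) rows used for yolov3 training. The XML layout assumed here is labelimg's default VOC format; if labelimg's YOLO export option is used instead, no conversion is needed.

```python
import xml.etree.ElementTree as ET

def voc_to_yolo(xml_path, class_names):
    """Convert one labelimg (Pascal VOC XML) annotation into yolov3-style rows.

    Returns rows of (class_index, x_center, y_center, width, height), all normalized
    to [0, 1] by the image size stored in the annotation.
    """
    root = ET.parse(xml_path).getroot()
    img_w = float(root.find("size/width").text)
    img_h = float(root.find("size/height").text)
    rows = []
    for obj in root.iter("object"):
        cls = class_names.index(obj.find("name").text)
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        rows.append((cls,
                     (xmin + xmax) / 2 / img_w, (ymin + ymax) / 2 / img_h,
                     (xmax - xmin) / img_w, (ymax - ymin) / img_h))
    return rows
```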
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a second device) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (9)

1. A system for detecting and tracking a target object, comprising: a multi-processor system-on-chip (MPSoC) module, a lens, a holder, a communication module and a peripheral module;
the holder is connected with the MPSoC module through an RS485 interface, is used for receiving control commands sent by the MPSoC module and driving the lens to rotate through 360 degrees in the horizontal range and/or pitch through 180 degrees in the vertical range, and is also used for adjusting the focal length of the lens;
the lens is carried on the holder, is connected with the MPSoC module through a low-voltage differential signaling (LVDS) interface, is used for collecting the LVDS video signal of the target object, and is also used for sending the collected LVDS signal to a video image processing unit VIPU of the MPSoC module through the LVDS interface;
the MPSoC module comprises a processing system PS and programmable logic PL, wherein the PS comprises a quad-core processing system and a graphics processing unit GPU, and the quad-core processing system is used for running C++ application programs and the control program of the holder; the PL comprises the VIPU, a deep learning processing unit (DPU) and a video coding unit (VCU), and the PS and the PL are connected through an AXI high-speed bus;
the communication module is used for wireless communication of the detection tracking system, and the communication module comprises an Ethernet module, a 5G module and a WiFi module;
the peripheral module comprises a memory, a power supply, a keyboard, a mouse and a display.
2. The system according to claim 1, wherein the lens is an LVDS camera module for capturing LVDS;
the VIPU is used for converting LVDS collected by the lens into 24-bit RGB video image data, the size of the RGB video image data is 1600 x 900, the VIPU is further used for converting the RGB video image data from an RGB format into BGR video image data, reducing the BGR video image data from 1600 x 900 to 608 x 608, and sending the reduced BGR video image data to the DPU;
the DPU is used for receiving the reduced BGR video image data and carrying out yolov3 algorithm convolution processing on the reduced BGR video image data to obtain yolo layer data of the target object;
the four-core processor is used for screening N candidate frames meeting a threshold condition from the yolo layer data of the target object; sorting the N candidate boxes from large to small according to the confidence degrees; screening out a target frame from the N sorted candidate frames, wherein N is a positive integer;
the four-core processor is used for determining coordinate information (x, y, w, h) of the target object according to the target frame;
the four-core processor is used for reading the target object image stored in the memory and framing the target object in the target object image according to the coordinate information of the target object, so as to obtain a framed video image of the target object;
the four-core processor is further used for calculating the horizontal and vertical rotation angles of the holder and the zoom factor of the lens according to the coordinate information of the target object, obtaining a control command for the holder, and sending the control command to the holder;
the VCU is used for encoding the framed video image of the target object to obtain an H264-encoded video image of the target object, and sending the H264-encoded video image of the target object to a background server through the communication module;
and the holder is used for driving the lens to rotate and adjusting the focal length of the lens after receiving the control command.
3. The detection and tracking system of claim 2, wherein the DPU pre-stores a model of the yolov3 algorithm for the target, the DPU configured to:
carrying out yolov3 algorithm convolution processing on the reduced BGR video image data according to the yolov3 algorithm model of the target object, wherein the frame regression function in the yolov3 algorithm convolution processing is modified from IOU to CIOU;
wherein the yolov3 algorithm model of the target object is obtained by the background server according to the video image training of the target object, and the background server is specifically configured to:
constructing a training data set according to the video image data of the target object;
changing the frame regression function of the yolov3 method from IOU to CIOU;
putting the training data set into a network of yolov3 algorithm to extract a target feature and the N candidate boxes;
determining the initial number and width and height of candidate frames anchor, learning image features from the training data set, and performing cluster analysis on the N candidate frames by using a K-means clustering algorithm; determining the K value as the number of the candidate frames anchor; setting a CIOU threshold value of 0.45 and a confidence coefficient threshold value of 0.3 by taking the CIOU as an evaluation standard for judging the position of the target boundary frame;
obtaining, through forward propagation, the target bounding box and the target category information output by the model;
calculating the total loss values of all loss functions according to the regression loss function of the detection frame information output result and the actual detection frame position information, the regression loss function of the object prediction center point and the actual center point position information, the classification loss function of the target class information and the actual class label and the regression loss function of the object prediction confidence coefficient and the actual object confidence coefficient, and adjusting the values of parameters in the yolov3 algorithm model according to a gradient descent algorithm and a back propagation algorithm;
and performing W iterations of training to obtain the trained yolov3 algorithm model of the target object, wherein W is a positive integer.
4. The detection and tracking system of claim 3, wherein the background server is configured to select M images from the video images of the target object before constructing the training data set according to the RGB image data of the target object, wherein the M images are non-consecutively taken images, and M is a positive integer;
the constructing of the training data set according to the video image data of the target object comprises:
marking the position and the category of the target object by using a labelimg marking tool to obtain the training data set;
wherein, the CIOU regression loss is calculated as follows:
$$L_{CIOU} = 1 - IOU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha\upsilon$$

where ρ(b, b^gt) denotes the Euclidean distance between the center points b and b^gt of the prediction box and the real box, and c denotes the diagonal length of the minimum closure region that simultaneously contains the prediction box and the real box;

α is a weighting function, defined as follows:

$$\alpha = \frac{\upsilon}{(1 - IOU) + \upsilon}$$

υ is a measure of the similarity of the aspect ratios, defined as follows:

$$\upsilon = \frac{4}{\pi^{2}}\left(\arctan\frac{\omega^{gt}}{h^{gt}} - \arctan\frac{\omega}{h}\right)^{2}$$

where ω^gt and h^gt denote the width and height of the real box, ω and h denote the width and height of the prediction box, and the superscript gt denotes the label data.
5. The detection and tracking system of claim 4, wherein the memory is a DDR or SD card for storing video image data of the target object captured by the lens; the power supply is used for providing a working power supply for the detection tracking system; the keyboard is used for inputting an operation command to the detection tracking system; the mouse is used for inputting a positioning or selecting operation command by the detection tracking system; and the display is used for displaying the operation command and the processing result of the detection tracking system.
6. A method for detecting and tracking a target object, characterized in that a system for detecting and tracking the target object comprises a multi-processor system-on-chip (MPSoC) module, a lens, a holder, a communication module and a peripheral module, wherein the holder is connected with the MPSoC module through an RS485 interface, the lens is carried on the holder and connected with the MPSoC module through a low-voltage differential signaling (LVDS) interface, the MPSoC module comprises a processing system PS and programmable logic PL, the PS includes a quad-core processing system and a graphics processing unit (GPU), the PL includes a video image processing unit (VIPU), a deep learning processing unit (DPU) and a video coding unit (VCU), the PS and the PL are connected through an AXI high-speed bus, the communication module comprises an Ethernet module, a 5G module and a WiFi module, the peripheral module comprises a memory, a power supply, a keyboard, a mouse and a display, and the lens is an LVDS camera module, the method comprising the following steps:
the LVDS camera module collects LVDS of the target object;
the VIPU converts LVDS of the target object into RGB video image data of 24 bits, the size of the RGB video image data is 1600 x 900, the RGB video image data is converted into BGR video image data from an RGB format, the BGR video image data is reduced from 1600 x 900 to 608 x 608, and the reduced BGR video image data is sent to the DPU;
the DPU receives the reduced BGR video image data, and performs yolov3 algorithm convolution processing on the reduced BGR video image data to obtain yolo layer data of the target object;
the four-core processor screens out N candidate frames meeting a threshold condition from yolo layer data of the target object; sorting the N candidate boxes from large to small according to the confidence degrees; screening out a target frame from the N sorted candidate frames, wherein N is a positive integer;
the four-core processor determines coordinate information (x, y, w, h) of the target object according to the target frame;
the four-core processor reads the target object image stored in the memory, and frames the target object in the target object image according to the coordinate information of the target object to obtain a framed video image of the target object;
the four-core processor further calculates the horizontal and vertical rotation angles of the holder and the zoom factor of the lens according to the coordinate information of the target object, obtains a control command for the holder, and sends the control command to the holder;
the VCU encodes the framed video image of the target object to obtain an H264-encoded video image of the target object, and sends the H264-encoded video image of the target object to a background server through the communication module;
and the holder is used for driving the lens to rotate and adjusting the focal length of the lens after receiving the control command.
7. The detection and tracking method according to claim 6, wherein the DPU receiving the reduced BGR video image data and performing yolov3 algorithm convolution processing on the reduced BGR video image data to obtain the yolo layer data of the target object comprises:
and carrying out yolov3 algorithm convolution processing on the reduced BGR video image data according to a yolov3 algorithm model of the target, wherein the frame regression function in the yolov3 algorithm convolution processing is modified to CIOU by an IOU.
8. The detection and tracking method according to claim 7, further comprising:
the background server constructs a training data set according to the RGB image data of the target object; changes the frame regression function of the yolov3 algorithm from IOU to CIOU; puts the training data set into the yolov3 network to extract target features and the N candidate boxes; determines the initial number, width and height of the anchor candidate boxes, learns image features from the training data set, performs cluster analysis on the N candidate boxes using a K-means clustering algorithm, and takes the K value as the number of anchor boxes; uses the CIOU as the evaluation criterion for judging the position of the target bounding box, with a CIOU threshold of 0.45 and a confidence threshold of 0.3; obtains, through forward propagation, the target bounding box and the target category information output by the model; computes the total loss value of all loss functions from the regression loss between the output detection-box information and the actual detection-box position, the regression loss between the predicted object center point and the actual center point position, the classification loss between the target category information and the actual category label, and the regression loss between the predicted object confidence and the actual object confidence, and adjusts the parameter values of the yolov3 algorithm model using a gradient descent algorithm and a back-propagation algorithm; and performs W iterations of training to obtain the trained yolov3 algorithm model of the target object, wherein W is a positive integer;
a model of yolov3 algorithm for the target is pre-stored in the DPU.
9. The detection and tracking method according to claim 8, wherein the background server constructing a training data set according to the RGB image data of the target object comprises:
the background server selects M images from the video images of the target object, wherein the M images are discontinuously shot images, and M is a positive integer;
marking the position and the category of the target object by using a labelimg marking tool to obtain the training data set;
wherein the CIOU regression loss is calculated as follows:
$$L_{CIOU} = 1 - IOU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha\upsilon$$

where ρ(b, b^gt) denotes the Euclidean distance between the center points b and b^gt of the prediction box and the real box, and c denotes the diagonal length of the minimum closure region that simultaneously contains the prediction box and the real box;

α is a weighting function, defined as follows:

$$\alpha = \frac{\upsilon}{(1 - IOU) + \upsilon}$$

υ is a measure of the similarity of the aspect ratios, defined as follows:

$$\upsilon = \frac{4}{\pi^{2}}\left(\arctan\frac{\omega^{gt}}{h^{gt}} - \arctan\frac{\omega}{h}\right)^{2}$$

where ω^gt and h^gt denote the width and height of the real box, ω and h denote the width and height of the prediction box, and the superscript gt denotes the annotation data.
CN202110247972.0A 2021-03-06 2021-03-06 Target object detection and tracking system and method Pending CN112950673A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110247972.0A CN112950673A (en) 2021-03-06 2021-03-06 Target object detection and tracking system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110247972.0A CN112950673A (en) 2021-03-06 2021-03-06 Target object detection and tracking system and method

Publications (1)

Publication Number Publication Date
CN112950673A true CN112950673A (en) 2021-06-11

Family

ID=76229550

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110247972.0A Pending CN112950673A (en) 2021-03-06 2021-03-06 Target object detection and tracking system and method

Country Status (1)

Country Link
CN (1) CN112950673A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113452918A (en) * 2021-07-20 2021-09-28 山东万腾电子科技有限公司 Target object detection and tracking system and method
CN116740507A (en) * 2023-08-02 2023-09-12 中科星图测控技术股份有限公司 ARM architecture-based space target detection model construction method
WO2024021484A1 (en) * 2022-07-25 2024-02-01 亿航智能设备(广州)有限公司 Onboard visual computing apparatus and aircraft

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111008994A (en) * 2019-11-14 2020-04-14 山东万腾电子科技有限公司 Moving target real-time detection and tracking system and method based on MPSoC
CN111429486A (en) * 2020-04-27 2020-07-17 山东万腾电子科技有限公司 DNNDK model-based moving object real-time detection tracking system and method
CN111985621A (en) * 2020-08-24 2020-11-24 西安建筑科技大学 Method for building neural network model for real-time detection of mask wearing and implementation system
CN112287788A (en) * 2020-10-20 2021-01-29 杭州电子科技大学 Pedestrian detection method based on improved YOLOv3 and improved NMS

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111008994A (en) * 2019-11-14 2020-04-14 山东万腾电子科技有限公司 Moving target real-time detection and tracking system and method based on MPSoC
CN111429486A (en) * 2020-04-27 2020-07-17 山东万腾电子科技有限公司 DNNDK model-based moving object real-time detection tracking system and method
CN111985621A (en) * 2020-08-24 2020-11-24 西安建筑科技大学 Method for building neural network model for real-time detection of mask wearing and implementation system
CN112287788A (en) * 2020-10-20 2021-01-29 杭州电子科技大学 Pedestrian detection method based on improved YOLOv3 and improved NMS

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113452918A (en) * 2021-07-20 2021-09-28 山东万腾电子科技有限公司 Target object detection and tracking system and method
WO2024021484A1 (en) * 2022-07-25 2024-02-01 亿航智能设备(广州)有限公司 Onboard visual computing apparatus and aircraft
CN116740507A (en) * 2023-08-02 2023-09-12 中科星图测控技术股份有限公司 ARM architecture-based space target detection model construction method

Similar Documents

Publication Publication Date Title
US11727594B2 (en) Augmented reality for three-dimensional model reconstruction
US20230154193A1 (en) License plate detection and recognition system
CN112950673A (en) Target object detection and tracking system and method
US10198823B1 (en) Segmentation of object image data from background image data
US9965865B1 (en) Image data segmentation using depth data
CN110400352B (en) Camera calibration with feature recognition
EP2874097A2 (en) Automatic scene parsing
US10242294B2 (en) Target object classification using three-dimensional geometric filtering
CN106529538A (en) Method and device for positioning aircraft
CN107589758A (en) A kind of intelligent field unmanned plane rescue method and system based on double source video analysis
CN106716443A (en) Feature computation in a sensor element array
CN109801265B (en) Real-time transmission equipment foreign matter detection system based on convolutional neural network
CN109117838B (en) Target detection method and device applied to unmanned ship sensing system
EP4050305A1 (en) Visual positioning method and device
CN111339976B (en) Indoor positioning method, device, terminal and storage medium
CN111008994A (en) Moving target real-time detection and tracking system and method based on MPSoC
CN112702481A (en) Table tennis track tracking device and method based on deep learning
CN112183148A (en) Batch bar code positioning method and identification system
CN117197388A (en) Live-action three-dimensional virtual reality scene construction method and system based on generation of antagonistic neural network and oblique photography
CN114241012A (en) High-altitude parabolic determination method and device
CN110992297A (en) Multi-commodity image synthesis method and device, electronic equipment and storage medium
CN113452918A (en) Target object detection and tracking system and method
CN115393962A (en) Motion recognition method, head-mounted display device, and storage medium
CN115035466A (en) Infrared panoramic radar system for safety monitoring
Tian Effective image enhancement and fast object detection for improved UAV applications

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210611)