CN112598707A - Real-time video stream object detection and tracking method - Google Patents

Real-time video stream object detection and tracking method

Info

Publication number
CN112598707A
Authority
CN
China
Prior art keywords
tracking
detection
subset
frames
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011532140.5A
Other languages
Chinese (zh)
Inventor
羊爱英
燕硕
梁劲
张亚斌
张泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Daoziling Electromechanical Equipment Co ltd
Original Assignee
Nanjing Daoziling Electromechanical Equipment Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Daoziling Electromechanical Equipment Co ltd filed Critical Nanjing Daoziling Electromechanical Equipment Co ltd
Priority to CN202011532140.5A priority Critical patent/CN112598707A/en
Publication of CN112598707A publication Critical patent/CN112598707A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/20 - Analysis of motion
    • G06T7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G06T7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10016 - Video; Image sequence
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/30 - Subject of image; Context of image processing
    • G06T2207/30196 - Human being; Person
    • G06T2207/30201 - Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Studio Devices (AREA)

Abstract

The invention discloses a real-time video stream object detection and tracking method comprising the following steps: A. performing object detection on a first subset of frames of an input video; B. detecting an object and its position in a first detection frame of the first subset of frames; C. after the first detection frame, tracking the detected object to update its position on a second subset of frames of the input video, wherein the first subset of frames and the second subset of frames do not overlap. The structure of the neural networks used for detection and tracking is optimized, which reduces the amount of computation, allows the networks to be trained quickly with a rapidly decreasing loss, and improves tracking accuracy.

Description

Real-time video stream object detection and tracking method
Technical Field
The invention relates to the technical field of video image analysis, in particular to a real-time video stream object detection and tracking method.
Background
Cameras are almost ubiquitous on mobile electronic devices such as cell phones. Images and video captured by a camera can generally be improved by understanding the content of the scene the camera captures. For example, detection of an object such as a face allows camera parameters such as focus and white balance to be controlled based on the position, movement and lighting conditions of the detected object. However, reliable object detection techniques are typically computationally intensive, power hungry, and performed offline. The present invention arranges the distribution of detection frames and tracking frames sensibly and sets reasonable detection and tracking rules, which reduces the overall amount of computation and improves computation speed and real-time tracking performance; and the structure of the neural networks used for detection and tracking is optimized, which further reduces computation, allows the networks to be trained quickly with a rapidly decreasing loss, and improves tracking accuracy.
Disclosure of Invention
The present invention is directed to a real-time video stream object detection and tracking method, so as to solve the problems mentioned in the background art.
In order to achieve this aim, the invention provides the following technical solution: a real-time video stream object detection and tracking method comprising the following steps:
A. performing object detection on a first subset of frames of an input video;
B. detecting an object and an object position in a first detection frame in a first subset of frames;
C. after the first detected frame, the detected object is tracked to update the object position on a second subset of frames of the input video, wherein the first subset of frames and the second subset of frames are non-overlapping.
Preferably, the input video is divided such that the first subset of frames corresponds to every Nth frame, where N is a selected number, and the second subset of frames corresponds to the remaining frames.
Preferably, the tracking is ended when no object is detected in a selected number of consecutive frames after the first detection frame; the tracking is ended when the tracking score of the object is below a tracking threshold.
Preferably, a tracking score is assigned based on the features of the object detected in the first detection frame; a tracking threshold for each detected object is determined over the first subset of frames based on the attributes of the detected object.
Preferably, an ID is associated with each object detected in the first subset of frames; objects detected in different frames of the first subset are associated based on their IDs; a bounding box of the object is determined over the first subset of frames, and the change of the bounding box is determined over the second subset of frames; when an object is not detected on a second detection frame in the first subset, the object is tracked on that second detection frame.
Preferably, the real-time video stream object tracking system comprises an object detection unit, an object tracking unit and a data association unit. The object detection unit is configured to perform object detection on a first subset of frames of the input video; the object tracking unit tracks the position of an object previously detected by the detection unit on a second subset of frames of the input video based on a tracking threshold for each detected object, wherein the second subset and the first subset are mutually exclusive. The object detection unit comprises a frame memory, neural network weights, a detection neural network and a cropping unit. The detection neural network has a structure optimized on the basis of the cascaded convolutional neural network MTCNN and comprises three sub-networks, called P-Net, R-Net and O-Net, which form a cascade. A detection frame stored in the frame memory is cropped based on the object position determined by the detection neural network, and the cropped object image is supplied to the object tracking unit, the tracking neural network, and the object analysis unit.
Preferably, the object tracking unit comprises neural network weights and a tracking neural network, wherein the weight information consists of pre-trained parameters; the tracking neural network likewise has a structure optimized on the basis of the cascaded convolutional neural network MTCNN and comprises three sub-networks, called P-Net, R-Net and O-Net, which form a cascade.
Preferably, P-Net consists of four convolutional layers. The first layer has a 3 × 3 kernel; the second layer has a 3 × 3 kernel; the third layer has a 1 × 1 kernel; the fourth layer comprises two convolutional layers: the first has a 1 × 1 kernel and outputs one channel, called the confidence, which is activated by a sigmoid and used to detect whether an object is present (a threshold is set, and an object is judged to be present if the output value exceeds the threshold), and the second has a 1 × 1 kernel and outputs four channels, called the offsets, which are activated by ReLU and used to determine the object position. R-Net consists of five convolutional layers. The first layer has a 3 × 3 kernel; the second layer has a 3 × 3 kernel; the third layer has a 2 × 2 kernel; the fourth layer has a 2 × 2 kernel; the fifth layer likewise comprises two convolutional layers: a 1 × 1 layer outputting the sigmoid-activated confidence channel, compared against a threshold to decide whether an object is present, and a 1 × 1 layer outputting the four ReLU-activated offset channels for determining the object position. O-Net consists of five convolutional layers. The first four layers each have a 3 × 3 kernel; the fifth layer again comprises two convolutional layers: a 1 × 1 layer outputting the sigmoid-activated confidence channel and a 1 × 1 layer outputting the four ReLU-activated offset channels for the object position.
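For illustration, the following is a minimal PyTorch sketch of the three-stage cascade described above. The kernel sizes, the sigmoid confidence head and the ReLU offset head follow the description; the channel widths, the PReLU activations between layers, and the 12 × 12 example input size are assumptions borrowed from the original MTCNN, since they are not fixed here.

    import torch
    import torch.nn as nn

    class Stage(nn.Module):
        """Common pattern of P-Net, R-Net and O-Net: a small convolutional trunk
        followed by two 1x1 heads, a 1-channel sigmoid confidence and a
        4-channel ReLU box offset."""
        def __init__(self, trunk, channels):
            super().__init__()
            self.trunk = trunk
            self.conf = nn.Conv2d(channels, 1, kernel_size=1)    # object present?
            self.offset = nn.Conv2d(channels, 4, kernel_size=1)  # box regression
        def forward(self, x):
            f = self.trunk(x)
            return torch.sigmoid(self.conf(f)), torch.relu(self.offset(f))

    def conv_trunk(kernel_sizes, channels):
        layers, in_ch = [], 3
        for k in kernel_sizes:
            layers += [nn.Conv2d(in_ch, channels, k), nn.PReLU(channels)]
            in_ch = channels
        return nn.Sequential(*layers)

    p_net = Stage(conv_trunk([3, 3, 1], 16), 16)     # 3x3, 3x3, 1x1 + two 1x1 heads
    r_net = Stage(conv_trunk([3, 3, 2, 2], 32), 32)  # 3x3, 3x3, 2x2, 2x2 + heads
    o_net = Stage(conv_trunk([3, 3, 3, 3], 64), 64)  # 3x3, 3x3, 3x3, 3x3 + heads

    # Example: run P-Net fully convolutionally on a 12x12 patch (assumed size).
    confidence, offsets = p_net(torch.randn(1, 3, 12, 12))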
Preferably, the data association unit comprises an object analysis unit and a control unit. The object analysis unit may determine attributes of an object other than its position on a detection frame by analyzing the object image provided by the cropping unit; the object attributes determined by the object analysis unit may include facial illumination, face pose or angle relative to the camera, eye position, and whether the eyes are closed or blinking. The control unit determines whether an object detected by the object detection unit in a first detection frame is the same as an object detected in another detection frame. The control unit further associates the object attributes determined by the object analysis unit on detection frames with the objects tracked by the object tracking unit on non-detection frames.
Preferably, the input video is divided such that the first subset of frames includes every Nth frame of the input video, N being a predetermined number, and the remaining frames are included in the second subset. The tracking unit stops tracking an object when the object is not detected in a predetermined number of consecutive frames of the first subset, and ends tracking of an object when its tracking score falls below the tracking threshold of that object. The detection unit determines the tracking threshold of each detected object on the first subset of frames based on the attributes of the respective object and the background of the respective object. The system further comprises a data association unit for associating an ID with each object detected in the first subset of frames. The detection unit determines a bounding box of the object on the first subset of frames, and the tracking unit determines the change of the bounding box over the second subset of frames. When an object is not detected in a detection frame, the tracking unit tracks the previously detected object on that detection frame.
Compared with the prior art, the invention has the following beneficial effects: the distribution of detection frames and tracking frames is arranged sensibly and reasonable detection and tracking rules are set, which reduces the overall amount of computation and improves computation speed and real-time tracking performance; and the structure of the neural networks used for detection and tracking is optimized, which further reduces computation, allows the networks to be trained quickly with a rapidly decreasing loss, and improves tracking accuracy.
Drawings
Fig. 1 is a block diagram of a video image acquisition and analysis system according to the present invention.
FIG. 2 is a block diagram of the detection and tracking system of the present invention.
Fig. 3 is a detection and tracking flow diagram of the present invention.
Fig. 4 is a block diagram showing how the individual units of the present invention are connected.
FIG. 5 is a diagram of exemplary detection and tracking of a video sequence with moving facial objects in accordance with the present invention.
FIG. 6 is a schematic diagram of an exemplary data set for tracking termination in accordance with the present invention.
FIG. 7 is a schematic diagram of a further exemplary data set for tracking termination in accordance with the present invention.
FIG. 8 is a block diagram of a cascaded neural network used to train the weights of the neural network in accordance with the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides the following technical solution: a real-time video stream object detection and tracking method comprising the following steps:
A. performing object detection on a first subset of frames of an input video;
B. detecting an object and an object position in a first detection frame in a first subset of frames;
C. after the first detected frame, the detected object is tracked to update the object position on a second subset of frames of the input video, wherein the first subset of frames and the second subset of frames are non-overlapping.
In the present invention, the input video is divided such that the first subset of frames corresponds to every Nth frame, where N is a selected number, and the second subset of frames corresponds to the remaining frames.
In the present invention, tracking is ended when no object is detected in a selected number of consecutive frames following the first detection frame; the tracking is ended when the tracking score of the object is below a tracking threshold.
In the present invention, a tracking score is assigned based on a feature of an object detected in a first detection frame; a tracking threshold for each detected object is determined over the first subset of frames based on the attributes of the detected object.
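As an illustration only, such a per-object tracking threshold might be derived from attributes observed on the detection frame, for example the detection confidence and the relative size of the bounding box; the formula below is a hypothetical sketch, not something fixed by the invention.

    def tracking_threshold(det_confidence, bbox, frame_shape,
                           base=0.6, size_weight=0.2):
        """Hypothetical per-object tracking threshold.
        bbox is (x1, y1, x2, y2); frame_shape is (height, width)."""
        h, w = frame_shape
        rel_size = ((bbox[2] - bbox[0]) * (bbox[3] - bbox[1])) / float(h * w)
        # Lower the threshold for small objects so that hard targets are not
        # terminated too aggressively; scale with the detection confidence.
        return base * det_confidence - size_weight * (1.0 - rel_size)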
In the invention, an ID is associated with each object detected in the first subset of frames; objects detected in different frames of the first subset are associated based on their IDs; a bounding box of the object is determined over the first subset of frames, and the change of the bounding box is determined over the second subset of frames; when an object is not detected on a second detection frame in the first subset, the object is tracked on that second detection frame.
Fig. 1 depicts a video image acquisition and analysis system 100 of the present disclosure. The system 100 includes a camera 102, an image analysis unit 108, and a camera controller 110. The camera 102 may capture a video image 106 of a scene, which may contain objects 104.1 and 104.2, such as faces. The camera 102 may provide the captured images as a video 106 data stream to the image analysis unit 108, and the image analysis unit 108 may analyze the images in the video 106 and detect predetermined objects within their content. The camera controller 110 may control the camera 102 in response to data output from the image analysis unit 108. The image analysis unit 108 may detect objects 104.1, 104.2 within the captured video 106 and identify the locations of the detected objects. The image analysis unit 108 may assign attributes to the object data. For example, when the detected object is a person, the attribute data may be a motion characteristic, the lighting of the face, the pose or angle of the face with respect to the camera, the eye position, and the state of the face (e.g., whether the eyes are closed or blinking, whether the face is smiling, and the like). The camera controller 110 may use the image analysis results (e.g., object attributes) to control camera capture parameters, such as focus or the capture time of other images.
Fig. 2 depicts a detection and tracking system framework 200 of the present disclosure. The image analysis system 200 includes an object detection unit 220 and an object tracking unit 240. The object detection unit 220 may process a subset of the frames of the input video 206 and identify a predetermined type of object (e.g., a human face, a human body, etc.) from its content. The object tracking unit 240 may be responsive to data from the object detection unit 220 and may track detected objects in other frames of the input video 206. The object tracking unit 240 may output data identifying the location of the tracked object in the input video.
The processing performed by the image analysis system 200 may save processing resources and reduce latency compared to known image processing techniques. Object detection 220 may require more processing resources, more power, and longer latency than object tracking 240. The object detection unit 220 therefore processes frames of the input video 206 only intermittently, reducing the required processing resources and delay compared to detecting on every frame. The operations performed by the object tracking unit 240 are expected to be of lower complexity and lower latency than those of the object detection unit 220; the image analysis system 200 can therefore provide position data for all frames of the input video sequence 206 without incurring the processing cost of detecting objects in all of those frames. For example, the object tracking unit 240 may require only 10% of the resources and 10% of the latency of the object detection unit 220 to process a frame, and by operating the detection unit only intermittently, power consumption and delay may be reduced by roughly 65%.
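These figures can be checked with a short calculation, assuming that tracking a frame costs roughly 10% of detecting a frame and that exactly one frame in N is a detection frame (the 10% figure and the values of N are illustrative):

    def relative_cost(n, track_cost=0.1):
        """Average per-frame cost relative to running detection on every frame."""
        return (1.0 + (n - 1) * track_cost) / n

    for n in (2, 3, 4, 5):
        print(f"N={n}: saving = {1.0 - relative_cost(n):.0%}")
    # N=2: 45%, N=3: 60%, N=4: 68%, N=5: 72% -- a roughly 65% reduction
    # corresponds to detecting about one frame in every three to four.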
The object detection unit operates on a predetermined subset of 1/N of the frames, where N is a predetermined integer constant. The object detection unit 220 may process input video frames at a fixed rate, for example one out of every three consecutive frames (when N = 3). The object detection unit 220 may identify objects and their locations and may distinguish objects, for example, by assigning a unique ID to each object detected in the image content. The ID may be used to determine whether an object detected in one detection frame is the same as an object detected in a different detection frame; for example, the object ID may be used to determine whether a face detected in one frame is the same as a face detected in another frame. The object tracking unit 240 may track objects previously detected by the object detection unit 220. As shown in fig. 2, object tracking may operate on any frame of the input video 206. The object tracking unit 240 may receive indications of the objects identified in a detection frame from the object detection unit 220 and then track changes to these objects in subsequent frames.
Object detection 220 and object tracking 240 may identify the location of each object in a frame; by themselves, however, they may not establish which object in one frame corresponds to which object in another.
The image analysis system 200 includes a data association unit 260 that assigns IDs to detected objects. The data association unit 260 may respond to the location data output by the object tracking unit 240 and the object detection unit 220 and assign IDs based on the correlation between locations.
The data association unit 260 may also determine, through analysis of the object image, additional attributes of an object that are not provided by the detection unit 220 or the tracking unit 240. For example, the data association unit 260 may identify attributes of the objects located by the object detection unit 220 on detection frames. The data association unit 260 may then associate the attributes determined for an object on a detection frame with the same object tracked on non-detection frames. Accordingly, the data association unit 260 may provide object attributes 215 on both detection frames and non-detection frames.
Fig. 3 depicts a detection and tracking flow and scheme 300. The method 300 may identify objects from the captured video data and output data describing their spatial locations. The method 300 may first detect an object on a first one of the detection frames (block 310). Using the object's ID in the detection frame as a location reference, the detected object may be tracked over one or more frames following the detection frame (block 315). As the object's location is identified in the tracking frames, the method 300 may output location data for the object in each tracking frame (block 320). As previously mentioned, the detection frames form a predetermined subset of the frames of the video sequence. Thus, if one frame in N is chosen as a detection frame, blocks 315 and 320 may be performed N-1 times each time a detection frame is processed at block 310. For example, detection may be performed on 1 frame out of every 5 frames, while tracking is performed on the remaining 4 of every 5 frames.
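This schedule can be summarized with the following Python sketch, assuming one detection frame in every N frames; detector and tracker stand in for the detection and tracking neural networks and are not defined here.

    from itertools import count

    def process_stream(frames, detector, tracker, n=5):
        tracks = {}                  # object ID -> last known bounding box
        next_id = count(1)
        for idx, frame in enumerate(frames):
            if idx % n == 0:                         # detection frame (block 310)
                # ID association is simplified to "fresh IDs each time" here; an
                # IoU-based association that reuses IDs is sketched below, and the
                # termination rules are sketched further below.
                tracks = {next(next_id): bbox for bbox in detector(frame)}
            else:                                    # tracking frame (blocks 315, 320)
                for obj_id, bbox in list(tracks.items()):
                    tracks[obj_id] = tracker(frame, bbox)
            yield idx, dict(tracks)                  # position data for every frame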
The method 300 may compare the data of objects detected in the current iteration (block 310) with the data of objects detected in a previous iteration and determine whether a correlation exists (block 325). If a correlation between the objects of the two iterations is detected, the method 300 may assign the same ID to the object in the new iteration (block 330). If no correlation is detected for a tracked object in the current iteration, a new ID may be assigned to the object (block 335).
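A sketch of this ID-association step (blocks 325 to 335) is given below, under the assumption that correlation between objects is measured by bounding-box overlap (IoU); the invention does not mandate this particular measure, and the 0.3 threshold is illustrative.

    def iou(a, b):
        """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    def associate(prev_tracks, detections, next_id, thresh=0.3):
        """prev_tracks: {id: bbox}; detections: [bbox];
        next_id: an iterator of fresh IDs, e.g. itertools.count(1)."""
        new_tracks = {}
        for det in detections:
            matches = [(iou(det, bbox), obj_id)
                       for obj_id, bbox in prev_tracks.items()
                       if obj_id not in new_tracks]
            best_score, best_id = max(matches, default=(0.0, None))
            if best_score > thresh:
                new_tracks[best_id] = det            # correlated: reuse the ID (block 330)
            else:
                new_tracks[next(next_id)] = det      # no correlation: new ID (block 335)
        return new_tracks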
In one aspect, tracking of an object may be terminated based on the detection results on detection frames. The object detection results of successive iterations may be compared (block 310) to determine when an object from a previous iteration is no longer detected (block 340). The method 300 may then determine whether the object's miss count exceeds a predetermined detection threshold (block 345); if it does, the method 300 may terminate tracking of the object (block 350). In another aspect, tracking may be terminated based on the tracking results on tracking frames. A tracking score may be determined for each tracked object (block 355); if the tracking score does not exceed the tracking threshold (block 360), tracking of the object may be terminated (block 350).
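The two termination rules (blocks 340 to 360) might be implemented as follows; the Track fields and the miss limit of 3 consecutive detection frames are illustrative (the limit of 3 matches the example of fig. 6).

    from dataclasses import dataclass

    @dataclass
    class Track:
        bbox: tuple
        threshold: float   # per-object tracking threshold
        misses: int = 0    # consecutive detection frames without this object

    def update_on_detection_frame(track, detected, miss_limit=3):
        """Return False when tracking of this object should terminate (block 350)."""
        track.misses = 0 if detected else track.misses + 1
        return track.misses < miss_limit

    def update_on_tracking_frame(track, new_bbox, tracking_score):
        """Return False when the tracking score falls below the threshold (block 360)."""
        if tracking_score < track.threshold:
            return False
        track.bbox = new_bbox
        return True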
Fig. 4 depicts how the individual units of a system 400 are connected. The system 400 includes an input video 402, an object detection unit 420, an object tracking unit 440, a data association unit 460, and object attributes 415. The object detection unit 420 includes a detection neural network 422, detection weights 424 for controlling and training the detection neural network 422, a frame memory 426, and an image cropping unit 428. The object tracking unit 440 includes a tracking neural network 442 and tracking weights 444 for controlling and training the tracking neural network 442. The data association unit 460 includes an object analysis unit 462 and a control unit 464.
The detection neural network 422 may run on a subset of the frames of the input video 402. These detection frames may be stored in the frame memory buffer 426, and the detection neural network 422 may detect the locations of objects (e.g., faces) in the detection frames. The cropping unit 428 may crop a detection frame stored in the frame memory 426 based on the object location determined by the detection neural network 422. The cropped object image may be provided to the object tracking unit 440, the tracking neural network 442, and the object analysis unit 462. The tracking neural network 442 may track changes in the detected position of an object based on the object images of the current and previously detected frames to determine a new position and tracking score.
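The cropping step of unit 428 could look like the sketch below, assuming frames are array-like images in (height, width, channels) layout and boxes are given in pixel coordinates; the small context margin is an assumption.

    def crop_object(frame, bbox, margin=0.1):
        """Crop the region of `frame` around bbox = (x1, y1, x2, y2),
        expanded by `margin` so the tracker sees a little context."""
        h, w = frame.shape[:2]
        x1, y1, x2, y2 = bbox
        mx, my = int((x2 - x1) * margin), int((y2 - y1) * margin)
        x1, y1 = max(0, x1 - mx), max(0, y1 - my)
        x2, y2 = min(w, x2 + mx), min(h, y2 + my)
        return frame[y1:y2, x1:x2].copy()

The crop taken on the most recent detection frame, together with a crop at the same location in the current frame, can then be passed to the tracking network, which returns the refined position and the tracking score.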
The object analysis unit 462 may determine attributes of an object other than its position on a detection frame by analyzing the object image supplied by the cropping unit 428. The object attributes determined by the object analysis unit 462 may include facial illumination, face pose or angle relative to the camera, eye position, and whether the eyes are closed or blinking. The control unit 464 may determine whether an object detected by the object detection unit 420 in a first detection frame is the same as an object detected in another detection frame. The control unit 464 may also associate the object attributes determined by the object analysis unit 462 on detection frames with the objects tracked by the object tracking unit 440 on non-detection frames. Object attributes 415 may thus be provided for objects in all frames, whether detection frames or non-detection frames.
The detection weights 424 and tracking weights 444 may be trained in advance, before analysis of the input video 402 begins; alternatively, they may also be updated during processing of the input video 402.
Fig. 5 depicts an example of detection and tracking 500 of a video sequence with moving facial objects. The video sequence 500 comprises frames 501 through 505 containing facial objects. In this example application of the video sequence 500 to the system 200 of fig. 2, every other frame is a detection frame (N = 2). Thus, the object detection unit 220 operates on frames 501, 503, and 505, while the object tracking unit 240 operates on the frames between detection frames, including frames 502 and 504. The video sequence 500 begins with two face objects in frame 501, and an optional data association unit may associate an ID with each detected face. In fig. 5, a detection is represented by a box surrounding the face, and the associated ID number is indicated under each detected face image. The tracking unit may operate on the next frame 502. In frame 502, the faces with IDs 1 and 2 can be successfully tracked. A third face, a frowning face, appears in frame 502 but is not tracked there because it was not detected in the previous detection frame. In the second detection frame 503, all three faces are detected, and an ID is associated with each face. In tracking frame 504, the face object with ID 2 has become partially occluded, and thus may not be tracked even though part of the object is present in frame 504. In the third detection frame 505, only the face with ID 3 is detected: the face with ID 2 has disappeared, while the face with ID 1 is still partially present but has changed enough not to be detected. An object may fail to be detected or tracked on a frame, for example, when the object disappears from the frame, is partially occluded while entering or exiting the frame, is partially occluded by other objects in the frame, or is still fully visible in the frame but has changed visually in some manner.
Fig. 6 and 7 describe example data sets for tracking termination. In fig. 6, four objects (IDs 1, 2, 3, 4) are tracked over a series of frames. In this example, the number of tracking frames between detection frames is N = 3, and tracking is stopped when an object is missing in 3 consecutive detection frames after its first detection. Successful detection or tracking is indicated by a check mark, and failed detection or tracking is indicated by an X. For example, all four objects (IDs 1-4) are detected on detection frame 1, while the objects with IDs 2 and 3 are not detected on detection frame 2.
The object with ID 1 is detected and tracked on all frames, so tracking of object 1 never terminates. Tracking of object 3 is terminated on detection frame 4 because the object is absent on 3 consecutive detection frames. In contrast to object 3, tracking of object 2 is not terminated: after object 2 is detected in detection frame 1, it is not detected in detection frames 2, 3, and 5, but detection frames 3 and 5 are not consecutive, and so tracking is not terminated at detection frame 5.
Tracking of object 4 terminates when tracking fails; tracking may fail, for example, when an object becomes blurred or leaves the image frame.
As further shown in fig. 7, the object with ID 4 fails tracking in tracking frame 2.2, which lies between detection frames 2 and 3.
In conclusion, the invention arranges the distribution of detection frames and tracking frames sensibly and sets reasonable detection and tracking rules, which reduces the overall amount of computation and improves computation speed and real-time tracking performance; and the structure of the neural networks used for detection and tracking is optimized, which further reduces computation, allows the networks to be trained quickly with a rapidly decreasing loss, and improves tracking accuracy.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims (10)

1. A real-time video stream object detection and tracking method is characterized in that: the method comprises the following steps:
A. performing object detection on a first subset of frames of an input video;
B. detecting an object and an object position in a first detection frame in a first subset of frames;
C. after the first detected frame, the detected object is tracked to update the object position on a second subset of frames of the input video, wherein the first subset of frames and the second subset of frames are non-overlapping.
2. A real-time video stream object detection and tracking method according to claim 1, wherein: the input video is divided such that the first subset of frames corresponds to every Nth frame, where N is a selected number, and the second subset of frames corresponds to the remaining frames.
3. A real-time video stream object detection and tracking method according to claim 1, wherein: ending the tracking when no object is detected in a selected number of consecutive frames after the first detection frame; the tracking is ended when the tracking score of the object is below a tracking threshold.
4. A real-time video stream object detection and tracking method according to claim 3, wherein: assigning a tracking score based on features of the object detected in the first detection frame; a tracking threshold for each detected object is determined over the first subset of frames based on the attributes of the detected object.
5. A real-time video stream object detection and tracking method according to claim 1, wherein: an ID is associated with each object detected in the first subset of frames; objects detected in different frames of the first subset are associated based on their IDs; a bounding box of the object is determined over the first subset of frames, and the change of the bounding box is determined over the second subset of frames; when an object is not detected on a second detection frame in the first subset, the object is tracked on that second detection frame.
6. A real-time video stream object tracking system, characterized in that: the system comprises an object detection unit, an object tracking unit and a data association unit; the object detection unit is configured to perform object detection on a first subset of frames of the input video; the object tracking unit tracks the position of an object previously detected by the detection unit on a second subset of frames of the input video based on a tracking threshold for each detected object, wherein the second subset and the first subset are mutually exclusive; the object detection unit comprises a frame memory, neural network weights, a detection neural network and a cropping unit; the detection neural network has a structure optimized on the basis of the cascaded convolutional neural network MTCNN and comprises three sub-networks, called P-Net, R-Net and O-Net, which form a cascade; a detection frame stored in the frame memory is cropped based on the object position determined by the detection neural network, and the cropped object image is supplied to the object tracking unit, the tracking neural network, and the object analysis unit.
7. The real-time video stream object tracking system of claim 6, wherein: the object tracking unit comprises neural network weights and a tracking neural network, the weight information consisting of pre-trained parameters; the tracking neural network has a structure optimized on the basis of the cascaded convolutional neural network MTCNN and comprises three sub-networks, called P-Net, R-Net and O-Net, which form a cascade.
8. A real-time video stream object tracking system according to claim 6 or 7, characterized in that: P-Net consists of four convolutional layers, the first layer having a 3 × 3 kernel, the second layer a 3 × 3 kernel, and the third layer a 1 × 1 kernel, while the fourth layer comprises two convolutional layers: the first has a 1 × 1 kernel and outputs one channel, called the confidence, which is activated by a sigmoid and used to detect whether an object is present (a threshold is set, and an object is judged to be present if the output value exceeds the threshold), and the second has a 1 × 1 kernel and outputs four channels, called the offsets, which are activated by ReLU and used to determine the object position; R-Net consists of five convolutional layers, the first layer having a 3 × 3 kernel, the second layer a 3 × 3 kernel, the third layer a 2 × 2 kernel, and the fourth layer a 2 × 2 kernel, while the fifth layer likewise comprises two convolutional layers: a 1 × 1 layer outputting the sigmoid-activated confidence channel, compared against a threshold to decide whether an object is present, and a 1 × 1 layer outputting the four ReLU-activated offset channels for determining the object position; O-Net consists of five convolutional layers, the first four layers each having a 3 × 3 kernel, while the fifth layer again comprises two convolutional layers: a 1 × 1 layer outputting the sigmoid-activated confidence channel and a 1 × 1 layer outputting the four ReLU-activated offset channels for the object position.
9. The real-time video stream object tracking system of claim 6, wherein: the data association unit comprises an object analysis unit and a control unit; the object analysis unit may determine attributes of an object other than its position on a detection frame by analyzing the object image provided by the cropping unit; the object attributes determined by the object analysis unit may include facial illumination, face pose or angle relative to the camera, eye position, and whether the eyes are closed or blinking; the control unit determines whether an object detected by the object detection unit in a first detection frame is the same as an object detected in another detection frame; the control unit further associates the object attributes determined by the object analysis unit on detection frames with the objects tracked by the object tracking unit on non-detection frames.
10. The real-time video stream object tracking system of claim 6, wherein: the input video is divided such that the first subset of frames includes every Nth frame of the input video, N being a predetermined number, and the remaining frames are included in the second subset; the tracking unit stops tracking an object when the object is not detected in a predetermined number of consecutive frames of the first subset, and ends tracking of an object when its tracking score falls below the tracking threshold of that object; the detection unit determines the tracking threshold of each detected object on the first subset of frames based on the attributes of the respective object and the background of the respective object; the system further comprises a data association unit for associating an ID with each object detected in the first subset of frames; the detection unit determines a bounding box of the object on the first subset of frames, and the tracking unit determines the change of the bounding box over the second subset of frames; when an object is not detected in a detection frame, the tracking unit tracks the previously detected object on that detection frame.
CN202011532140.5A 2020-12-23 2020-12-23 Real-time video stream object detection and tracking method Pending CN112598707A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011532140.5A CN112598707A (en) 2020-12-23 2020-12-23 Real-time video stream object detection and tracking method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011532140.5A CN112598707A (en) 2020-12-23 2020-12-23 Real-time video stream object detection and tracking method

Publications (1)

Publication Number Publication Date
CN112598707A true CN112598707A (en) 2021-04-02

Family

ID=75200530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011532140.5A Pending CN112598707A (en) 2020-12-23 2020-12-23 Real-time video stream object detection and tracking method

Country Status (1)

Country Link
CN (1) CN112598707A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792697A (en) * 2021-09-23 2021-12-14 重庆紫光华山智安科技有限公司 Target detection method and device, electronic equipment and readable storage medium
CN113792697B (en) * 2021-09-23 2023-09-05 重庆紫光华山智安科技有限公司 Target detection method, target detection device, electronic equipment and readable storage medium

Similar Documents

Publication Publication Date Title
US20220417590A1 (en) Electronic device, contents searching system and searching method thereof
US11010905B2 (en) Efficient object detection and tracking
CN108830252B (en) Convolutional neural network human body action recognition method fusing global space-time characteristics
AU2016352215B2 (en) Method and device for tracking location of human face, and electronic equipment
CN101470809B (en) Moving object detection method based on expansion mixed gauss model
CN110287907B (en) Object detection method and device
US20200082156A1 (en) Efficient face detection and tracking
CN109583355B (en) People flow counting device and method based on boundary selection
CN107295296B (en) Method and system for selectively storing and recovering monitoring video
CN115880784A (en) Scenic spot multi-person action behavior monitoring method based on artificial intelligence
Putro et al. Real-time face tracking for human-robot interaction
US20220100658A1 (en) Method of processing a series of events received asynchronously from an array of pixels of an event-based light sensor
CN109086725B (en) Hand tracking method and machine-readable storage medium
CN111191535A (en) Pedestrian detection model construction method based on deep learning and pedestrian detection method
CN114463368A (en) Target tracking method and device, electronic equipment and computer readable storage medium
CN112598707A (en) Real-time video stream object detection and tracking method
US20110222759A1 (en) Information processing apparatus, information processing method, and program
CN111797652A (en) Object tracking method, device and storage medium
WO2020019353A1 (en) Tracking control method, apparatus, and computer-readable storage medium
AU2020294281A1 (en) Target tracking method and apparatus, electronic device, and storage medium
CN108181989B (en) Gesture control method and device based on video data and computing equipment
CN111382705A (en) Reverse behavior detection method and device, electronic equipment and readable storage medium
Wang et al. Research on target detection and recognition algorithm based on deep learning
Kanagamalliga et al. Advancements in Real-Time Face Recognition Algorithms for Enhanced Smart Video Surveillance
CN114816044A (en) Method and device for determining interaction gesture and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination