CN112598707A - Real-time video stream object detection and tracking method - Google Patents

Real-time video stream object detection and tracking method

Info

Publication number
CN112598707A
Authority
CN
China
Prior art keywords
tracking
detection
subset
frames
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011532140.5A
Other languages
Chinese (zh)
Inventor
羊爱英
燕硕
梁劲
张亚斌
张泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Daoziling Electromechanical Equipment Co ltd
Original Assignee
Nanjing Daoziling Electromechanical Equipment Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Daoziling Electromechanical Equipment Co ltd filed Critical Nanjing Daoziling Electromechanical Equipment Co ltd
Priority to CN202011532140.5A priority Critical patent/CN112598707A/en
Publication of CN112598707A publication Critical patent/CN112598707A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/20 - Analysis of motion
    • G06T7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G06T7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10016 - Video; Image sequence
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/30 - Subject of image; Context of image processing
    • G06T2207/30196 - Human being; Person
    • G06T2207/30201 - Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Studio Devices (AREA)

Abstract

The invention discloses a real-time video stream object detection and tracking method comprising the following steps: A. performing object detection on a first subset of frames of an input video; B. detecting an object and its position in a first detection frame of the first subset of frames; C. after the first detection frame, tracking the detected object to update its position on a second subset of frames of the input video, wherein the first subset of frames and the second subset of frames do not overlap. The structure of the neural networks used for detection and tracking is optimized, which reduces the amount of computation, allows the networks to be trained quickly with a rapidly decreasing loss, and improves tracking accuracy.

Description

Real-time video stream object detection and tracking method
Technical Field
The invention relates to the technical field of video image analysis, in particular to a real-time video stream object detection and tracking method.
Background
Cameras are almost ubiquitous on mobile electronic devices such as cell phones. Images and video captured by a camera can generally be improved by understanding the content of the scene the camera captures. For example, detection of an object such as a face allows camera parameters such as focus and white balance to be controlled based on the position, movement and lighting conditions of the detected object. However, reliable object detection techniques are typically computationally intensive, power hungry, and performed offline. The present invention arranges the distribution of detection frames and tracking frames sensibly and sets reasonable detection and tracking rules, which reduces the overall amount of computation and improves computation speed and real-time tracking performance; and the structure of the neural networks used for detection and tracking is optimized, which further reduces computation, allows the networks to be trained quickly with a rapidly decreasing loss, and improves tracking accuracy.
Disclosure of Invention
The present invention is directed to a real-time video stream object detection and tracking method, so as to solve the problems mentioned in the background art.
In order to achieve this aim, the invention provides the following technical solution: a real-time video stream object detection and tracking method comprising the following steps:
A. performing object detection on a first subset of frames of an input video;
B. detecting an object and an object position in a first detection frame in a first subset of frames;
C. after the first detected frame, the detected object is tracked to update the object position on a second subset of frames of the input video, wherein the first subset of frames and the second subset of frames are non-overlapping.
Preferably, the input video is divided such that the first subset of frames corresponds to every Nth frame, where N is a selected number, and the second subset of frames corresponds to the remaining frames.
Preferably, the tracking is ended when no object is detected in a selected number of consecutive frames after the first detection frame; the tracking is ended when the tracking score of the object is below a tracking threshold.
Preferably, a tracking score is assigned based on the features of the object detected in the first detection frame; a tracking threshold for each detected object is determined over the first subset of frames based on the attributes of the detected object.
Preferably, an ID is associated with each object detected in the first subset of frames; objects detected in different frames of the first subset are associated based on their IDs; a bounding box of the object is determined over the first subset of frames, and the change of the bounding box is determined over the second subset of frames; when an object is not detected on a second detection frame in the first subset, the object is tracked on that second detection frame.
Preferably, the real-time video stream object tracking system comprises an object detection unit, an object tracking unit and a data association unit. The object detection unit is configured to perform object detection on a first subset of frames of the input video; the object tracking unit tracks the position of an object previously detected by the detection unit on a second subset of frames of the input video based on a tracking threshold for each detected object, wherein the second subset and the first subset are mutually exclusive. The object detection unit comprises a frame memory, neural network weights, a detection neural network and a cropping unit. The detection neural network has a structure optimized on the basis of the cascaded convolutional neural network MTCNN and comprises three sub-networks, called P-Net, R-Net and O-Net, which form a cascade. A detection frame stored in the frame memory is cropped based on the object position determined by the detection neural network, and the cropped object image is supplied to the object tracking unit, the tracking neural network, and the object analysis unit.
Preferably, the object tracking unit comprises neural network weights and a tracking neural network, wherein the weight information consists of pre-trained parameters; the tracking neural network likewise has a structure optimized on the basis of the cascaded convolutional neural network MTCNN and comprises three sub-networks, called P-Net, R-Net and O-Net, which form a cascade.
Preferably, P-Net consists of four convolutional layers. The first layer has a 3 × 3 kernel; the second layer has a 3 × 3 kernel; the third layer has a 1 × 1 kernel; the fourth layer comprises two convolutional layers: the first has a 1 × 1 kernel and outputs one channel, called the confidence, which is activated by a sigmoid and used to detect whether an object is present (a threshold is set, and an object is judged to be present if the output value exceeds the threshold), and the second has a 1 × 1 kernel and outputs four channels, called the offsets, which are activated by ReLU and used to determine the object position. R-Net consists of five convolutional layers. The first layer has a 3 × 3 kernel; the second layer has a 3 × 3 kernel; the third layer has a 2 × 2 kernel; the fourth layer has a 2 × 2 kernel; the fifth layer likewise comprises two convolutional layers: a 1 × 1 layer outputting the sigmoid-activated confidence channel, compared against a threshold to decide whether an object is present, and a 1 × 1 layer outputting the four ReLU-activated offset channels for determining the object position. O-Net consists of five convolutional layers. The first four layers each have a 3 × 3 kernel; the fifth layer again comprises two convolutional layers: a 1 × 1 layer outputting the sigmoid-activated confidence channel and a 1 × 1 layer outputting the four ReLU-activated offset channels for the object position.
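For illustration, the following is a minimal PyTorch sketch of the three-stage cascade described above. The kernel sizes, the sigmoid confidence head and the ReLU offset head follow the description; the channel widths, the PReLU activations between layers, and the 12 × 12 example input size are assumptions borrowed from the original MTCNN, since they are not fixed here.

    import torch
    import torch.nn as nn

    class Stage(nn.Module):
        """Common pattern of P-Net, R-Net and O-Net: a small convolutional trunk
        followed by two 1x1 heads, a 1-channel sigmoid confidence and a
        4-channel ReLU box offset."""
        def __init__(self, trunk, channels):
            super().__init__()
            self.trunk = trunk
            self.conf = nn.Conv2d(channels, 1, kernel_size=1)    # object present?
            self.offset = nn.Conv2d(channels, 4, kernel_size=1)  # box regression
        def forward(self, x):
            f = self.trunk(x)
            return torch.sigmoid(self.conf(f)), torch.relu(self.offset(f))

    def conv_trunk(kernel_sizes, channels):
        layers, in_ch = [], 3
        for k in kernel_sizes:
            layers += [nn.Conv2d(in_ch, channels, k), nn.PReLU(channels)]
            in_ch = channels
        return nn.Sequential(*layers)

    p_net = Stage(conv_trunk([3, 3, 1], 16), 16)     # 3x3, 3x3, 1x1 + two 1x1 heads
    r_net = Stage(conv_trunk([3, 3, 2, 2], 32), 32)  # 3x3, 3x3, 2x2, 2x2 + heads
    o_net = Stage(conv_trunk([3, 3, 3, 3], 64), 64)  # 3x3, 3x3, 3x3, 3x3 + heads

    # Example: run P-Net fully convolutionally on a 12x12 patch (assumed size).
    confidence, offsets = p_net(torch.randn(1, 3, 12, 12))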
Preferably, the data association unit comprises an object analysis unit and a control unit. The object analysis unit may determine attributes of an object other than its position on a detection frame by analyzing the object image provided by the cropping unit; the object attributes determined by the object analysis unit may include facial illumination, face pose or angle relative to the camera, eye position, and whether the eyes are closed or blinking. The control unit determines whether an object detected by the object detection unit in a first detection frame is the same as an object detected in another detection frame. The control unit further associates the object attributes determined by the object analysis unit on detection frames with the objects tracked by the object tracking unit on non-detection frames.
Preferably, the input video is divided such that the first subset of frames includes every Nth frame of the input video, N being a predetermined number, and the remaining frames are included in the second subset. The tracking unit stops tracking an object when the object is not detected in a predetermined number of consecutive frames of the first subset, and ends tracking of an object when its tracking score falls below the tracking threshold of that object. The detection unit determines the tracking threshold of each detected object on the first subset of frames based on the attributes of the respective object and the background of the respective object. The system further comprises a data association unit for associating an ID with each object detected in the first subset of frames. The detection unit determines a bounding box of the object on the first subset of frames, and the tracking unit determines the change of the bounding box over the second subset of frames. When an object is not detected in a detection frame, the tracking unit tracks the previously detected object on that detection frame.
Compared with the prior art, the invention has the following beneficial effects: the distribution of detection frames and tracking frames is arranged sensibly and reasonable detection and tracking rules are set, which reduces the overall amount of computation and improves computation speed and real-time tracking performance; and the structure of the neural networks used for detection and tracking is optimized, which further reduces computation, allows the networks to be trained quickly with a rapidly decreasing loss, and improves tracking accuracy.
Drawings
Fig. 1 is a block diagram of a video image acquisition and analysis system according to the present invention.
FIG. 2 is a block diagram of the detection and tracking system of the present invention.
Fig. 3 is a detection and tracking flow diagram of the present invention.
Fig. 4 is a block diagram showing how the individual units of the present invention are connected.
FIG. 5 is a diagram of exemplary detection and tracking of a video sequence with moving facial objects in accordance with the present invention.
FIG. 6 is a schematic diagram of an exemplary data set for tracking termination in accordance with the present invention.
FIG. 7 is a schematic diagram of a further exemplary data set for tracking termination in accordance with the present invention.
FIG. 8 is a block diagram of a cascaded neural network used to train the weights of the neural network in accordance with the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides the following technical solution: a real-time video stream object detection and tracking method comprising the following steps:
A. performing object detection on a first subset of frames of an input video;
B. detecting an object and an object position in a first detection frame in a first subset of frames;
C. after the first detected frame, the detected object is tracked to update the object position on a second subset of frames of the input video, wherein the first subset of frames and the second subset of frames are non-overlapping.
In the present invention, the input video is divided such that the first subset of frames corresponds to every Nth frame, where N is a selected number, and the second subset of frames corresponds to the remaining frames.
In the present invention, tracking is ended when no object is detected in a selected number of consecutive frames following the first detection frame; the tracking is ended when the tracking score of the object is below a tracking threshold.
In the present invention, a tracking score is assigned based on a feature of an object detected in a first detection frame; a tracking threshold for each detected object is determined over the first subset of frames based on the attributes of the detected object.
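As an illustration only, such a per-object tracking threshold might be derived from attributes observed on the detection frame, for example the detection confidence and the relative size of the bounding box; the formula below is a hypothetical sketch, not something fixed by the invention.

    def tracking_threshold(det_confidence, bbox, frame_shape,
                           base=0.6, size_weight=0.2):
        """Hypothetical per-object tracking threshold.
        bbox is (x1, y1, x2, y2); frame_shape is (height, width)."""
        h, w = frame_shape
        rel_size = ((bbox[2] - bbox[0]) * (bbox[3] - bbox[1])) / float(h * w)
        # Lower the threshold for small objects so that hard targets are not
        # terminated too aggressively; scale with the detection confidence.
        return base * det_confidence - size_weight * (1.0 - rel_size)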
In the invention, an ID is associated with each object detected in the first subset of frames; objects detected in different frames of the first subset are associated based on their IDs; a bounding box of the object is determined over the first subset of frames, and the change of the bounding box is determined over the second subset of frames; when an object is not detected on a second detection frame in the first subset, the object is tracked on that second detection frame.
Fig. 1 depicts a video image acquisition and analysis system 100 of the present disclosure. The system 100 includes a camera 102, an image analysis unit 108, and a camera controller 110. The camera 102 may capture a video image 106 of a scene, which may contain objects 104.1 and 104.2, such as faces. The camera 102 may provide the captured images as a video 106 data stream to the image analysis unit 108, and the image analysis unit 108 may analyze the images in the video 106 and detect predetermined objects within their content. The camera controller 110 may control the camera 102 in response to data output from the image analysis unit 108. The image analysis unit 108 may detect objects 104.1, 104.2 within the captured video 106 and identify the locations of the detected objects. The image analysis unit 108 may assign attributes to the object data. For example, when the detected object is a person, the attribute data may be a motion characteristic, the lighting of the face, the pose or angle of the face with respect to the camera, the eye position, and the state of the face (e.g., whether the eyes are closed or blinking, whether the face is smiling, and the like). The camera controller 110 may use the image analysis results (e.g., object attributes) to control camera capture parameters, such as focus or the capture time of other images.
Fig. 2 depicts a detection and tracking system framework 200 of the present disclosure. The image analysis system 200 includes an object detection unit 220 and an object tracking unit 240. The object detection unit 220 may process a subset of the frames of the input video 206 and identify a predetermined type of object (e.g., a human face, a human body, etc.) from its content. The object tracking unit 240 may be responsive to data from the object detection unit 220 and may track detected objects in other frames of the input video 206. The object tracking unit 240 may output data identifying the location of the tracked object in the input video.
The processing performed by the image analysis system 200 may save processing resources and reduce latency compared to known image processing techniques. Object detection 220 may require more processing resources, more power, and longer latency than object tracking 240. The object detection unit 220 therefore processes frames of the input video 206 only intermittently, reducing the required processing resources and delay compared to detecting on every frame. The operations performed by the object tracking unit 240 are expected to be of lower complexity and lower latency than those of the object detection unit 220; the image analysis system 200 can therefore provide position data for all frames of the input video sequence 206 without incurring the processing cost of detecting objects in all of those frames. For example, the object tracking unit 240 may require only 10% of the resources and 10% of the latency of the object detection unit 220 to process a frame, and by operating the detection unit only intermittently, power consumption and delay may be reduced by roughly 65%.
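These figures can be checked with a short calculation, assuming that tracking a frame costs roughly 10% of detecting a frame and that exactly one frame in N is a detection frame (the 10% figure and the values of N are illustrative):

    def relative_cost(n, track_cost=0.1):
        """Average per-frame cost relative to running detection on every frame."""
        return (1.0 + (n - 1) * track_cost) / n

    for n in (2, 3, 4, 5):
        print(f"N={n}: saving = {1.0 - relative_cost(n):.0%}")
    # N=2: 45%, N=3: 60%, N=4: 68%, N=5: 72% -- a roughly 65% reduction
    # corresponds to detecting about one frame in every three to four.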
The object detection unit operates on a predetermined subset of 1/N of the frames, where N is a predetermined integer constant. The object detection unit 220 may process input video frames at a fixed rate, for example one out of every three consecutive frames (when N = 3). The object detection unit 220 may identify objects and their locations and may distinguish objects, for example, by assigning a unique ID to each object detected in the image content. The ID may be used to determine whether an object detected in one detection frame is the same as an object detected in a different detection frame; for example, the object ID may be used to determine whether a face detected in one frame is the same as a face detected in another frame. The object tracking unit 240 may track objects previously detected by the object detection unit 220. As shown in fig. 2, object tracking may operate on any frame of the input video 206. The object tracking unit 240 may receive indications of the objects identified in a detection frame from the object detection unit 220 and then track changes to these objects in subsequent frames.
Object detection 220 and object tracking 240 may identify the location of each object in a frame; by themselves, however, they may not establish which object in one frame corresponds to which object in another.
The image analysis system 200 includes a data association unit 260 that assigns IDs to detected objects. The data association unit 260 may respond to the location data output by the object tracking unit 240 and the object detection unit 220 and assign IDs based on the correlation between locations.
The data association unit 260 may also determine, through analysis of the object image, additional attributes of an object that are not provided by the detection unit 220 or the tracking unit 240. For example, the data association unit 260 may identify attributes of the objects located by the object detection unit 220 on detection frames. The data association unit 260 may then associate the attributes determined for an object on a detection frame with the same object tracked on non-detection frames. Accordingly, the data association unit 260 may provide object attributes 215 on both detection frames and non-detection frames.
Fig. 3 depicts a detection and tracking flow and scheme 300. The method 300 may identify objects from the captured video data and output data describing their spatial locations. The method 300 may first detect an object on a first one of the detection frames (block 310). Using the object's ID in the detection frame as a location reference, the detected object may be tracked over one or more frames following the detection frame (block 315). As the object's location is identified in the tracking frames, the method 300 may output location data for the object in each tracking frame (block 320). As previously mentioned, the detection frames form a predetermined subset of the frames of the video sequence. Thus, if one frame in N is chosen as a detection frame, blocks 315 and 320 may be performed N-1 times each time a detection frame is processed at block 310. For example, detection may be performed on 1 frame out of every 5 frames, while tracking is performed on the remaining 4 of every 5 frames.
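This schedule can be summarized with the following Python sketch, assuming one detection frame in every N frames; detector and tracker stand in for the detection and tracking neural networks and are not defined here.

    from itertools import count

    def process_stream(frames, detector, tracker, n=5):
        tracks = {}                  # object ID -> last known bounding box
        next_id = count(1)
        for idx, frame in enumerate(frames):
            if idx % n == 0:                         # detection frame (block 310)
                # ID association is simplified to "fresh IDs each time" here; an
                # IoU-based association that reuses IDs is sketched below, and the
                # termination rules are sketched further below.
                tracks = {next(next_id): bbox for bbox in detector(frame)}
            else:                                    # tracking frame (blocks 315, 320)
                for obj_id, bbox in list(tracks.items()):
                    tracks[obj_id] = tracker(frame, bbox)
            yield idx, dict(tracks)                  # position data for every frame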
The method 300 may compare the data of objects detected in the current iteration (block 310) with the data of objects detected in a previous iteration and determine whether a correlation exists (block 325). If a correlation between the objects of the two iterations is detected, the method 300 may assign the same ID to the object in the new iteration (block 330). If no correlation is detected for a tracked object in the current iteration, a new ID may be assigned to the object (block 335).
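A sketch of this ID-association step (blocks 325 to 335) is given below, under the assumption that correlation between objects is measured by bounding-box overlap (IoU); the invention does not mandate this particular measure, and the 0.3 threshold is illustrative.

    def iou(a, b):
        """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    def associate(prev_tracks, detections, next_id, thresh=0.3):
        """prev_tracks: {id: bbox}; detections: [bbox];
        next_id: an iterator of fresh IDs, e.g. itertools.count(1)."""
        new_tracks = {}
        for det in detections:
            matches = [(iou(det, bbox), obj_id)
                       for obj_id, bbox in prev_tracks.items()
                       if obj_id not in new_tracks]
            best_score, best_id = max(matches, default=(0.0, None))
            if best_score > thresh:
                new_tracks[best_id] = det            # correlated: reuse the ID (block 330)
            else:
                new_tracks[next(next_id)] = det      # no correlation: new ID (block 335)
        return new_tracks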
In one aspect, tracking of an object may be terminated based on the detection results on detection frames. The object detection results of successive iterations may be compared (block 310) to determine when an object from a previous iteration is no longer detected (block 340). The method 300 may then determine whether the object's miss count exceeds a predetermined detection threshold (block 345); if it does, the method 300 may terminate tracking of the object (block 350). In another aspect, tracking may be terminated based on the tracking results on tracking frames. A tracking score may be determined for each tracked object (block 355); if the tracking score does not exceed the tracking threshold (block 360), tracking of the object may be terminated (block 350).
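The two termination rules (blocks 340 to 360) might be implemented as follows; the Track fields and the miss limit of 3 consecutive detection frames are illustrative (the limit of 3 matches the example of fig. 6).

    from dataclasses import dataclass

    @dataclass
    class Track:
        bbox: tuple
        threshold: float   # per-object tracking threshold
        misses: int = 0    # consecutive detection frames without this object

    def update_on_detection_frame(track, detected, miss_limit=3):
        """Return False when tracking of this object should terminate (block 350)."""
        track.misses = 0 if detected else track.misses + 1
        return track.misses < miss_limit

    def update_on_tracking_frame(track, new_bbox, tracking_score):
        """Return False when the tracking score falls below the threshold (block 360)."""
        if tracking_score < track.threshold:
            return False
        track.bbox = new_bbox
        return True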
Fig. 4 depicts how the individual units of a system 400 are connected. The system 400 includes an input video 402, an object detection unit 420, an object tracking unit 440, a data association unit 460, and object attributes 415. The object detection unit 420 includes a detection neural network 422, detection weights 424 for controlling and training the detection neural network 422, a frame memory 426, and an image cropping unit 428. The object tracking unit 440 includes a tracking neural network 442 and tracking weights 444 for controlling and training the tracking neural network 442. The data association unit 460 includes an object analysis unit 462 and a control unit 464.
The detection neural network 422 may run on a subset of the frames of the input video 402. These detection frames may be stored in the frame memory buffer 426, and the detection neural network 422 may detect the locations of objects (e.g., faces) in the detection frames. The cropping unit 428 may crop a detection frame stored in the frame memory 426 based on the object location determined by the detection neural network 422. The cropped object image may be provided to the object tracking unit 440, the tracking neural network 442, and the object analysis unit 462. The tracking neural network 442 may track changes in the detected position of an object based on the object images of the current and previously detected frames to determine a new position and tracking score.
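The cropping step of unit 428 could look like the sketch below, assuming frames are array-like images in (height, width, channels) layout and boxes are given in pixel coordinates; the small context margin is an assumption.

    def crop_object(frame, bbox, margin=0.1):
        """Crop the region of `frame` around bbox = (x1, y1, x2, y2),
        expanded by `margin` so the tracker sees a little context."""
        h, w = frame.shape[:2]
        x1, y1, x2, y2 = bbox
        mx, my = int((x2 - x1) * margin), int((y2 - y1) * margin)
        x1, y1 = max(0, x1 - mx), max(0, y1 - my)
        x2, y2 = min(w, x2 + mx), min(h, y2 + my)
        return frame[y1:y2, x1:x2].copy()

The crop taken on the most recent detection frame, together with a crop at the same location in the current frame, can then be passed to the tracking network, which returns the refined position and the tracking score.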
The object analysis unit 462 may determine attributes of an object other than its position on a detection frame by analyzing the object image supplied by the cropping unit 428. The object attributes determined by the object analysis unit 462 may include facial illumination, face pose or angle relative to the camera, eye position, and whether the eyes are closed or blinking. The control unit 464 may determine whether an object detected by the object detection unit 420 in a first detection frame is the same as an object detected in another detection frame. The control unit 464 may also associate the object attributes determined by the object analysis unit 462 on detection frames with the objects tracked by the object tracking unit 440 on non-detection frames. Object attributes 415 may thus be provided for objects in all frames, whether detection frames or non-detection frames.
The detection weights 424 and tracking weights 444 may be trained in advance, before analysis of the input video 402 begins; alternatively, they may also be updated during processing of the input video 402.
Fig. 5 depicts an example of detection and tracking 500 of a video sequence with moving facial objects. The video sequence 500 comprises frames 501 through 505 containing facial objects. In this example application of the video sequence 500 to the system 200 of fig. 2, every other frame is a detection frame (N = 2). Thus, the object detection unit 220 operates on frames 501, 503, and 505, while the object tracking unit 240 operates on the frames between detection frames, including frames 502 and 504. The video sequence 500 begins with two face objects in frame 501, and an optional data association unit may associate an ID with each detected face. In fig. 5, a detection is represented by a box surrounding the face, and the associated ID number is indicated under each detected face image. The tracking unit may operate on the next frame 502. In frame 502, the faces with IDs 1 and 2 can be successfully tracked. A third face, a frowning face, appears in frame 502 but is not tracked there because it was not detected in the previous detection frame. In the second detection frame 503, all three faces are detected, and an ID is associated with each face. In tracking frame 504, the face object with ID 2 has become partially occluded, and thus may not be tracked even though part of the object is present in frame 504. In the third detection frame 505, only the face with ID 3 is detected: the face with ID 2 has disappeared, while the face with ID 1 is still partially present but has changed enough not to be detected. An object may fail to be detected or tracked on a frame, for example, when the object disappears from the frame, is partially occluded while entering or exiting the frame, is partially occluded by other objects in the frame, or is still fully visible in the frame but has changed visually in some manner.
Fig. 6 and 7 describe example data sets for tracking termination. In fig. 6, four objects (IDs 1, 2, 3, 4) are tracked over a series of frames. In this example, the number of tracking frames between detection frames is N = 3, and tracking is stopped when an object is missing in 3 consecutive detection frames after its first detection. Successful detection or tracking is indicated by a check mark, and failed detection or tracking is indicated by an X. For example, all four objects (IDs 1-4) are detected on detection frame 1, while the objects with IDs 2 and 3 are not detected on detection frame 2.
The object with ID 1 is detected and tracked on all frames, so tracking of object 1 never terminates. Tracking of object 3 is terminated on detection frame 4 because the object is absent on 3 consecutive detection frames. In contrast to object 3, tracking of object 2 is not terminated: after object 2 is detected in detection frame 1, it is not detected in detection frames 2, 3, and 5, but detection frames 3 and 5 are not consecutive, and so tracking is not terminated at detection frame 5.
Tracking of object 4 terminates when tracking fails; tracking may fail, for example, when an object becomes blurred or leaves the image frame.
As further shown in fig. 7, the object with ID 4 fails tracking in tracking frame 2.2, which lies between detection frames 2 and 3.
In conclusion, the invention arranges the distribution of detection frames and tracking frames sensibly and sets reasonable detection and tracking rules, which reduces the overall amount of computation and improves computation speed and real-time tracking performance; and the structure of the neural networks used for detection and tracking is optimized, which further reduces computation, allows the networks to be trained quickly with a rapidly decreasing loss, and improves tracking accuracy.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims (10)

1. A real-time video stream object detection and tracking method is characterized in that: the method comprises the following steps:
A. performing object detection on a first subset of frames of an input video;
B. detecting an object and an object position in a first detection frame in a first subset of frames;
C. after the first detected frame, the detected object is tracked to update the object position on a second subset of frames of the input video, wherein the first subset of frames and the second subset of frames are non-overlapping.
2. A real-time video stream object detection and tracking method according to claim 1, wherein: the input video is divided such that the first subset of frames corresponds to every Nth frame, where N is a selected number, and the second subset of frames corresponds to the remaining frames.
3. A real-time video stream object detection and tracking method according to claim 1, wherein: ending the tracking when no object is detected in a selected number of consecutive frames after the first detection frame; the tracking is ended when the tracking score of the object is below a tracking threshold.
4. A real-time video stream object detection and tracking method according to claim 3, wherein: assigning a tracking score based on features of the object detected in the first detection frame; a tracking threshold for each detected object is determined over the first subset of frames based on the attributes of the detected object.
5. A real-time video stream object detection and tracking method according to claim 1, wherein: an ID is associated with each object detected in the first subset of frames; objects detected in different frames of the first subset are associated based on their IDs; a bounding box of the object is determined over the first subset of frames, and the change of the bounding box is determined over the second subset of frames; when an object is not detected on a second detection frame in the first subset, the object is tracked on that second detection frame.
6. A real-time video stream object tracking system, characterized in that: the system comprises an object detection unit, an object tracking unit and a data association unit; the object detection unit is configured to perform object detection on a first subset of frames of the input video; the object tracking unit tracks the position of an object previously detected by the detection unit on a second subset of frames of the input video based on a tracking threshold for each detected object, wherein the second subset and the first subset are mutually exclusive; the object detection unit comprises a frame memory, neural network weights, a detection neural network and a cropping unit; the detection neural network has a structure optimized on the basis of the cascaded convolutional neural network MTCNN and comprises three sub-networks, called P-Net, R-Net and O-Net, which form a cascade; a detection frame stored in the frame memory is cropped based on the object position determined by the detection neural network, and the cropped object image is supplied to the object tracking unit, the tracking neural network, and the object analysis unit.
7. The real-time video stream object tracking system of claim 6, wherein: the object tracking unit comprises neural network weights and a tracking neural network, the weight information consisting of pre-trained parameters; the tracking neural network has a structure optimized on the basis of the cascaded convolutional neural network MTCNN and comprises three sub-networks, called P-Net, R-Net and O-Net, which form a cascade.
8. A real-time video stream object tracking system according to claim 6 or 7, characterized in that: P-Net consists of four convolutional layers, the first layer having a 3 × 3 kernel, the second layer a 3 × 3 kernel, and the third layer a 1 × 1 kernel, while the fourth layer comprises two convolutional layers: the first has a 1 × 1 kernel and outputs one channel, called the confidence, which is activated by a sigmoid and used to detect whether an object is present (a threshold is set, and an object is judged to be present if the output value exceeds the threshold), and the second has a 1 × 1 kernel and outputs four channels, called the offsets, which are activated by ReLU and used to determine the object position; R-Net consists of five convolutional layers, the first layer having a 3 × 3 kernel, the second layer a 3 × 3 kernel, the third layer a 2 × 2 kernel, and the fourth layer a 2 × 2 kernel, while the fifth layer likewise comprises two convolutional layers: a 1 × 1 layer outputting the sigmoid-activated confidence channel, compared against a threshold to decide whether an object is present, and a 1 × 1 layer outputting the four ReLU-activated offset channels for determining the object position; O-Net consists of five convolutional layers, the first four layers each having a 3 × 3 kernel, while the fifth layer again comprises two convolutional layers: a 1 × 1 layer outputting the sigmoid-activated confidence channel and a 1 × 1 layer outputting the four ReLU-activated offset channels for the object position.
9. The real-time video stream object tracking system of claim 6, wherein: the data association unit comprises an object analysis unit and a control unit; the object analysis unit may determine attributes of an object other than its position on a detection frame by analyzing the object image provided by the cropping unit; the object attributes determined by the object analysis unit may include facial illumination, face pose or angle relative to the camera, eye position, and whether the eyes are closed or blinking; the control unit determines whether an object detected by the object detection unit in a first detection frame is the same as an object detected in another detection frame; the control unit further associates the object attributes determined by the object analysis unit on detection frames with the objects tracked by the object tracking unit on non-detection frames.
10. The real-time video stream object tracking system of claim 6, wherein: the input video is divided such that the first subset of frames includes every Nth frame of the input video, N being a predetermined number, and the remaining frames are included in the second subset; the tracking unit stops tracking an object when the object is not detected in a predetermined number of consecutive frames of the first subset, and ends tracking of an object when its tracking score falls below the tracking threshold of that object; the detection unit determines the tracking threshold of each detected object on the first subset of frames based on the attributes of the respective object and the background of the respective object; the system further comprises a data association unit for associating an ID with each object detected in the first subset of frames; the detection unit determines a bounding box of the object on the first subset of frames, and the tracking unit determines the change of the bounding box over the second subset of frames; when an object is not detected in a detection frame, the tracking unit tracks the previously detected object on that detection frame.
CN202011532140.5A 2020-12-23 2020-12-23 Real-time video stream object detection and tracking method Pending CN112598707A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011532140.5A CN112598707A (en) 2020-12-23 2020-12-23 Real-time video stream object detection and tracking method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011532140.5A CN112598707A (en) 2020-12-23 2020-12-23 Real-time video stream object detection and tracking method

Publications (1)

Publication Number Publication Date
CN112598707A true CN112598707A (en) 2021-04-02

Family

ID=75200530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011532140.5A Pending CN112598707A (en) 2020-12-23 2020-12-23 Real-time video stream object detection and tracking method

Country Status (1)

Country Link
CN (1) CN112598707A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792697A (en) * 2021-09-23 2021-12-14 重庆紫光华山智安科技有限公司 Target detection method and device, electronic equipment and readable storage medium
CN113792697B (en) * 2021-09-23 2023-09-05 重庆紫光华山智安科技有限公司 Target detection method, target detection device, electronic equipment and readable storage medium

Similar Documents

Publication Publication Date Title
US20220417590A1 (en) Electronic device, contents searching system and searching method thereof
US11010905B2 (en) Efficient object detection and tracking
CN108830252B (en) Convolutional neural network human body action recognition method fusing global space-time characteristics
AU2016352215B2 (en) Method and device for tracking location of human face, and electronic equipment
CN101470809B (en) Moving object detection method based on expansion mixed gauss model
CN110287907B (en) Object detection method and device
US20200082156A1 (en) Efficient face detection and tracking
CN109583355B (en) People flow counting device and method based on boundary selection
CN107295296B (en) Method and system for selectively storing and recovering monitoring video
CN115880784A (en) Scenic spot multi-person action behavior monitoring method based on artificial intelligence
Putro et al. Real-time face tracking for human-robot interaction
US20220100658A1 (en) Method of processing a series of events received asynchronously from an array of pixels of an event-based light sensor
CN109086725B (en) Hand tracking method and machine-readable storage medium
CN111191535A (en) Pedestrian detection model construction method based on deep learning and pedestrian detection method
CN114463368A (en) Target tracking method and device, electronic equipment and computer readable storage medium
CN112598707A (en) Real-time video stream object detection and tracking method
US20110222759A1 (en) Information processing apparatus, information processing method, and program
CN111797652A (en) Object tracking method, device and storage medium
WO2020019353A1 (en) Tracking control method, apparatus, and computer-readable storage medium
AU2020294281A1 (en) Target tracking method and apparatus, electronic device, and storage medium
CN108181989B (en) Gesture control method and device based on video data and computing equipment
CN111382705A (en) Reverse behavior detection method and device, electronic equipment and readable storage medium
Wang et al. Research on target detection and recognition algorithm based on deep learning
Kanagamalliga et al. Advancements in Real-Time Face Recognition Algorithms for Enhanced Smart Video Surveillance
CN114816044A (en) Method and device for determining interaction gesture and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination