CN108734735B - Object shape tracking device and method, and image processing system - Google Patents

Object shape tracking device and method, and image processing system

Info

Publication number
CN108734735B
CN108734735B (application CN201710249742.1A)
Authority
CN
China
Prior art keywords
video frame
pose
shape
determined
current video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710249742.1A
Other languages
Chinese (zh)
Other versions
CN108734735A (en)
Inventor
金浩
黄耀海
陈存建
赵东悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc
Priority to CN201710249742.1A
Publication of CN108734735A
Application granted
Publication of CN108734735B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30241 Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an object shape tracking device and method and an image processing system. The apparatus for tracking the shape of an object includes: means configured to determine pose change information of a first object from a previous video frame to a current video frame based on a pose of the first object in at least one video frame prior to the current video frame; means configured to determine a pose of the first object in the current video frame based on the pose of the first object in the previous video frame and the determined pose change information; means configured to determine a shape of the first object in the current video frame based on the determined pose of the first object; and means configured to update the determined shape of the first object and update the determined pose of the first object based on the updated shape of the first object.

Description

Object shape tracking device and method, and image processing system
Technical Field
The present invention relates to image processing, and more particularly, to an apparatus and method for tracking the shape of an object and an image processing system.
Background
In tracking the shape of a moving object in a video, such as the shape of a moving face or of a moving human joint, the initial shape of the object in one video frame (e.g., the current video frame) is typically initialized using the object pose determined from video frames preceding that frame, in order to obtain the shape of the object in that frame more accurately. The final shape of the object in the video frame may then be determined based on the initialized initial shape. Here, the pose of an object in a video frame represents the orientation/tilt and position of the object in the geometric space of the video frame.
An exemplary technique is disclosed in "Facial Shape Tracking via Spatio-Temporal Cascade Shape Regression" (J. Yang, J. Deng, K. Zhang, and Q. Liu, The IEEE International Conference on Computer Vision (ICCV) Workshops, 2015, pp. 41-49). This exemplary technique mainly discloses the following process: for the current video frame of a video, first, the object shape and object pose determined from the previous video frame are used directly to generate the initial shape of the object in the current video frame; then, a shape regression method (e.g., a Cascaded Shape Regression (CSR) method) is performed on the initial shape to determine the final shape of the object in the current video frame. These steps are repeated until the end of the video is reached.
In other words, when tracking the shape of a moving object in a video according to the above technique, the object shape and object pose determined from a previous video frame are passed to the subsequent video frame to determine the corresponding initial shape of the object. Thus, the accuracy of at least the object pose determined from the previous video frame directly affects the accuracy of the object shape to be determined in the subsequent video frame. The above technique only considers the pose of the object determined from the previous video frame; therefore, when the object moves quickly in the video, and especially when it rotates quickly, the pose of the object changes rapidly from frame to frame, which causes a large difference between the pose of the object in the previous video frame and its pose in the current video frame. In other words, under fast motion, the pose of the object in the previous video frame no longer fits the appearance of the object in the current video frame; that is, it is not accurate for the object in the current video frame. The initial shape of the object in the current video frame, determined based on this inaccurate pose, will therefore also be inaccurate, resulting in an inaccurate final shape of the object in the current video frame. As a result, the accuracy of the object tracking result for that video frame, and even for the whole video, is affected.
Disclosure of Invention
Accordingly, the present disclosure is directed to solving the above-described problems in view of the above description in the background art.
According to an aspect of the present invention, there is provided an apparatus for tracking a shape of an object in a video, the apparatus comprising: a change information determination unit configured to determine posture change information of a first object from a previous video frame to a current video frame based on a posture of the first object in at least one video frame previous to the current video frame; a pose determination unit configured to determine a pose of the first object in the current video frame based on the pose of the first object in the previous video frame and the pose change information determined by the change information determination unit; a shape determination unit configured to determine a shape of the first object in the current video frame based on the pose of the first object determined by the pose determination unit; and an updating unit configured to update the shape of the first object determined by the shape determining unit and update the posture of the first object determined by the posture determining unit based on the updated shape of the first object.
With the present apparatus and method, the accuracy of the object shape and the accuracy of object tracking can be improved when tracking the shape of a moving object in a video.
Other characteristic features and advantages of the present invention will be apparent from the following description with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 schematically shows an exemplary target object and an exemplary associated main object (dependent master object) according to the present invention.
Fig. 2 is a block diagram schematically showing a hardware configuration in which the technology according to the embodiment of the present invention can be implemented.
Fig. 3 is a block diagram illustrating a configuration of an object shape tracking apparatus according to a first embodiment of the present invention.
Fig. 4 schematically shows a flow chart of object shape tracking according to a first embodiment of the present invention.
Fig. 5 schematically shows a flowchart of step S420 shown in fig. 4 according to the first embodiment of the present invention.
Fig. 6A to 6C schematically show an exemplary target object in a video in motion.
Fig. 7 schematically shows another flowchart of step S420 shown in fig. 4 according to the first embodiment of the present invention.
Fig. 8 schematically shows a flowchart of step S425 as shown in fig. 7 according to the present invention.
Fig. 9 is a block diagram illustrating a configuration of an object shape tracking apparatus according to a second embodiment of the present invention.
Fig. 10 is a block diagram illustrating a configuration of an object shape tracking apparatus according to a third embodiment of the present invention.
Fig. 11 schematically shows a flowchart of step S420 shown in fig. 4 according to a third embodiment of the present invention.
Fig. 12 schematically shows an exemplary target object that is not visible between certain video frames of a video.
Fig. 13 illustrates an arrangement of an exemplary image processing system according to the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that the following description is merely illustrative and exemplary in nature and is in no way intended to limit the invention, its application, or uses. The relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in the embodiments do not limit the scope of the present invention unless it is specifically stated otherwise. Additionally, techniques, methods, and apparatus known to those skilled in the art may not be discussed in detail but are intended to be part of the present specification where appropriate.
Note that like reference numerals and letters refer to like items in the drawings, and thus, once an item is defined in a drawing, it is not necessarily discussed in the following drawings.
As described above, in the process of tracking the shape of an object in motion in a video, in the case of rapid motion of the object in the video, especially in the case of rapid rotation in the video, the object pose in one video frame for determining the object initial shape of a subsequent video frame will not be able to fit to the appearance of the object in the subsequent video frame, and thus the obtained object initial shape and object final shape of the subsequent video frame will be inaccurate.
To assist in object shape tracking in video, in one aspect, the present invention finds that the pose of an object ultimately determined in a corresponding video frame can be used to predict a pose change trend (i.e., pose motion trend) of the object from one video frame to a subsequent video frame due to motion (particularly rotation) of the object in the video, and that the pose change trend can be used to improve the pose of the object ultimately determined in one video frame and can be used to determine the initial shape of the object in the subsequent video frame. In the present invention, an object whose shape is to be tracked in a video is referred to as a target object. For example, the target object is a human body, a human face, a human joint (e.g., a human hand), or the like. In the present invention, the pose of a target object in a video frame represents the orientation/tilt and position of the target object in the geometric space of the video frame. In other words, the pose of a target object in one video frame describes the spatial properties of the target object in the video frame, wherein the spatial properties are represented by direction-dependent spatial properties (e.g. Pitch angle of the target object, Yaw angle of the target object, Roll angle of the target object) and position-dependent spatial properties (e.g. coordinates of feature points of the shape of the target object). In the present invention, the trend of the change of the pose of the target object from one video frame to the subsequent video frame describes the motion of the spatial attribute of the target object in the video, wherein the motion of the spatial attribute is represented by motion-related attributes (such as the angular velocity and/or angular acceleration of direction-related spatial attributes, the velocity and/or acceleration of position-related spatial attributes).
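To make the terms above concrete, the following is a minimal sketch (in Python) of how the pose of a target object and its pose change trend might be represented as data structures. The field names and types are illustrative assumptions and are not part of the claimed apparatus.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Pose:
    """Spatial attributes of the target object in one video frame."""
    pitch: float = 0.0   # direction-related spatial attributes, in degrees
    yaw: float = 0.0
    roll: float = 0.0
    # position-related spatial attributes: (N, 2) array of feature-point coordinates
    landmarks: np.ndarray = field(default_factory=lambda: np.zeros((0, 2)))

@dataclass
class PoseChangeTrend:
    """Motion of the spatial attributes between consecutive video frames."""
    angular_velocity: np.ndarray = field(default_factory=lambda: np.zeros(3))      # deg / frame
    angular_acceleration: np.ndarray = field(default_factory=lambda: np.zeros(3))  # deg / frame^2
    point_velocity: np.ndarray = field(default_factory=lambda: np.zeros((0, 2)))   # px / frame
```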
Since the pose change trend of the target object from one video frame to a subsequent video frame may describe the motion of the spatial attributes of the target object in the video, the finally determined pose of the object in one video frame, which is improved based on the pose change trend, may be fitted to the appearance of the object in the subsequent video frame. Thus, based on the improved object pose in one video frame, an accurate initial object shape and an accurate final object shape can be obtained for subsequent video frames.
Further, in practical situations, the change in pose of the target object is always constrained by the object to which the target object belongs or is connected. In other words, the object to which the target object belongs or is connected restricts the range over which the pose of the target object can change in one video frame. In the present invention, the object to which a target object belongs or is connected is referred to as the associated main object. In general, the range over which the pose of the target object can change in one video frame is larger when the pose of the target object is in the same direction as the pose of the associated main object than when it is in the opposite direction. For example, in the case where the target object is a human face or a human head, the associated main object is the upper half of the human body. In the case where the target object is a human hand, its associated main object is the person's forearm. In the case where the target object is a person's forearm, its associated main object is the person's upper arm.
Therefore, in order to obtain a more accurate initial shape of an object and a more accurate final shape of an object in a video frame, on the other hand, the present invention finds that a pose change constraint (i.e., a pose motion constraint) of a target object caused by a pose of an associated main object of the target object in a video can be further used to improve a pose change trend of the object from one video frame to a subsequent video frame. In the present invention, the pose change constraint of a target object in a video frame indicates the available change range (available changing range) in which the pose of the target object in the video frame may change or move relative to its associated main object. For example, as shown in FIG. 1, the face 110 is the target object, the person's upper body 120 is the associated main object, and the sector region 130, shown in dashed lines, is the pose change constraint of the face 110. In the present invention, the pose change constraint of a target object is described by a predetermined rule or predetermined model that is determined using statistical or training methods based on multiple samples of the target object and its associated master object at different poses. For example, the pose change constraint of the target object may be represented by the following equation (1):
Pose change constraint = M(Di) … (1)
where Di represents the pose of the associated main object in a video frame, and M(Di) represents a trained model describing the constraint relationship between the target object and its associated main object.
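The following is a minimal sketch of one way equation (1) could be realized. Here the trained model M(Di) is stubbed out by a fixed relative range of ±40 degrees around the associated main object's yaw, which matches the worked example given later in the description; the function name and values are illustrative assumptions.

```python
def pose_change_constraint(main_object_yaw_deg: float,
                           relative_range_deg: tuple = (-40.0, 40.0)) -> tuple:
    """M(Di): available range of the target object's yaw, given the pose Di of
    its associated main object. A trained model is stubbed out here by a fixed
    relative range, as in the worked example in the description."""
    low, high = relative_range_deg
    return (main_object_yaw_deg + low, main_object_yaw_deg + high)

# Example from the description: associated main object at 20 degrees -> (-20, 60)
print(pose_change_constraint(20.0))   # (-20.0, 60.0)
```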
For a target object under a motion condition in a video, when determining an initial shape of the target object in a video frame, in addition to considering a posture of the target object finally determined in a previous video frame of the video frame, the present invention also considers a posture change trend of the target object from the previous video frame to the video frame due to the motion (especially rotation) of the target object in the video and a posture change constraint of the target object due to a posture of an associated main object of the target object in the video. Therefore, the accuracy of the initial shape of the target object determined for one video frame will be improved. Thus, based on a more accurate initial shape, the accuracy of the final shape of the target object determined for the corresponding video frame will also be improved. Therefore, according to the present invention, the accuracy of the object shape and the accuracy of the object tracking will be improved.
(hardware configuration)
A hardware configuration that can implement the technology described hereinafter will be described first with reference to fig. 2.
Hardware configuration 200 includes, for example, a Central Processing Unit (CPU) 210, a Random Access Memory (RAM) 220, a Read Only Memory (ROM) 230, a hard disk 240, an input device 250, an output device 260, a network interface 270, and a system bus 280. Further, the hardware configuration 200 may be implemented by a device such as a camera, a Personal Digital Assistant (PDA), a mobile phone, a tablet computer, a notebook computer, a desktop computer, or other suitable electronic device.
In a first implementation, the process of tracking object shapes in video according to the present invention is configured by hardware or firmware and used as a module or component of hardware configuration 200. For example, the apparatus 300, which will be described in detail below with reference to fig. 3, the apparatus 900, which will be described in detail below with reference to fig. 9, and the apparatus 1000, which will be described in detail below with reference to fig. 10, are used as modules or components of the hardware configuration 200. In a second implementation, the process of tracking object shapes in video according to the present invention is configured by software stored in ROM 230 or hard disk 240 and executed by CPU 210. For example, the process 400 described in detail below with reference to fig. 4 and the process of step S420 shown in fig. 4 described in detail below with reference to fig. 5, 7, and 11 are used as programs stored in the ROM 230 or the hard disk 240.
The CPU 210 is any suitable programmable control device, such as a processor, and can perform various functions to be described hereinafter by executing various application programs stored in the ROM 230 or the hard disk 240, such as a memory. The RAM 220 is used to temporarily store programs or data loaded from the ROM 230 or the hard disk 240, and is also used as a space in which the CPU 210 performs various processes (such as implementing the techniques that will be described in detail below with reference to fig. 4, 5, 7, 8, and 11) and other available functions. The hard disk 240 stores a variety of information such as an Operating System (OS), various application programs, control programs, data pre-stored or predefined by the manufacturer, and models and/or classifiers pre-stored or pre-generated by the manufacturer.
In one implementation, input device 250 is used to allow a user to interact with hardware configuration 200. In one example, a user may input images/video/data through input device 250. In another example, a user may trigger a corresponding process of the present invention through input device 250. Further, the input device 250 may take various forms, such as a button, a keyboard, or a touch screen. In another implementation, the input device 250 is used to receive images/video output from specialized electronic devices such as digital cameras, video cameras, and/or web cameras.
In one implementation, the output device 260 is used to display object tracking results (such as a bounding box of the target object, a shape of the target object, a pose of the target object, etc.) to the user. Also, the output device 260 may take various forms, such as a Cathode Ray Tube (CRT) or a liquid crystal display. In another implementation, the output device 260 is used to output the object tracking results for subsequent processes of video/image analysis and recognition, such as face analysis, portrait retrieval, expression recognition, face recognition, facial attribute recognition, and so forth.
Network interface 270 provides an interface for connecting hardware configuration 200 to a network. For example, hardware configuration 200 may be in data communication with other electronic devices connected via a network via network interface 270. Optionally, hardware configuration 200 may be provided with a wireless interface for wireless data communication. The system bus 280 may provide a data transmission path for mutually transmitting data among the CPU 210, the RAM 220, the ROM 230, the hard disk 240, the input device 250, the output device 260, the network interface 270, and the like. Although referred to as a bus, the system bus 280 is not limited to any particular data transfer technique.
The hardware configuration 200 described above is merely illustrative and is in no way intended to limit the present invention, its applications, or uses. Also, for simplicity, only one hardware configuration is shown in FIG. 2. However, multiple hardware configurations may be used as desired.
(object shape tracking apparatus and method)
Next, a process of tracking the shape of an object in a video according to the present invention will be described with reference to fig. 3 to 13.
Fig. 3 is a block diagram illustrating a configuration of an apparatus 300 according to a first embodiment of the present invention. Wherein some or all of the modules shown in figure 3 may be implemented by dedicated hardware. Wherein, in this embodiment, the pose of the target object in one video frame describes the direction-dependent spatial properties of the target object in that video frame.
As shown in fig. 3, the apparatus 300 includes a change information determination unit 310, a posture determination unit 320, a shape determination unit 330, and an update unit 340. First, the input device 250 shown in fig. 2 receives video output from a special electronic device (e.g., a camera) or input by a user. Input device 250 then transmits the received video to apparatus 300 via system bus 280.
Then, as shown in fig. 3, for a target object (e.g., a first object) in a current video frame (such as the t-th video frame) of the received video, the change information determination unit 310 determines a pose change tendency (i.e., pose change information) of the target object from the previous video frame to the current video frame based on the pose of the target object in at least one video frame preceding the current video frame. In other words, the change information determination unit 310 predicts the pose change tendency (i.e., the pose motion tendency) of the target object from the (t-1)-th video frame to the t-th video frame based on the pose of the target object passed on from the previous video frame. Here, t is a natural number with 2 ≤ t ≤ T, where T is the total number of video frames of the received video.
The pose determination unit 320 determines the pose of the target object in the current video frame based on the pose of the target object in the previous video frame and the pose change information determined by the change information determination unit 310. In other words, the pose determination unit 320 determines the initial pose of the target object in the t-th video frame by improving the pose of the target object in the (t-1) th video frame based on the predicted pose change tendency of the target object from the (t-1) th video frame to the t-th video frame.
The shape determination unit 330 determines the shape of the target object in the current video frame based on the pose of the target object determined by the pose determination unit 320. In other words, the shape determination unit 330 determines the initial shape of the target object in the t-th video frame based on the initial pose of the target object in the t-th video frame.
The updating unit 340 updates the shape of the target object determined by the shape determining unit 330, and updates the posture of the target object determined by the posture determining unit 320 based on the updated shape of the target object. In other words, first, the updating unit 340 (e.g., a shape updating subunit (not shown)) determines the final shape of the target object in the t-th video frame by updating the initial shape of the target object in the t-th video frame. Then, the updating unit 340 (e.g., a pose updating subunit (not shown)) determines the final pose of the target object in the t-th video frame by updating the initial pose of the target object in the t-th video frame based on the determined final shape of the target object, so that a more accurate pose of the target object can be transferred to the subsequent video frame.
That is, for the t-th video frame of the received video (where t ≥ 2), the apparatus 300 determines the initial pose of the target object in the t-th video frame using the final pose of the target object in the (t-1)-th video frame and the pose change trend of the target object from the (t-1)-th video frame to the t-th video frame caused by the motion (especially rotation) of the target object in the video. Further, as described above, in order to obtain a more accurate initial pose of the target object in the t-th video frame, and thus a more accurate shape of the target object, a pose change constraint of the target object due to the pose of the associated main object of the target object in the video may also be used to determine the initial pose of the target object in the t-th video frame. Therefore, the apparatus 300 further comprises a change constraint determination unit 350.
As shown in fig. 3, for a target object in a current video frame (such as the t-th video frame), the change constraint determination unit 350 determines a pose change constraint of the target object in the current video frame based on a pose of a second object (i.e., an associated master object of the target object) in the current video frame and a constraint relationship between the target object and the associated master object in the current video frame. For example, the change constraint determining unit 350 determines a corresponding posture change constraint by using the above formula (1).
The pose determination unit 320 determines an initial pose of the target object in the current video frame based on the pose of the target object in the previous video frame, the pose change information determined by the change information determination unit 310, and the pose change constraint determined by the change constraint determination unit 350. In other words, the pose determination unit 320 first adjusts the predicted pose change tendency of the target object from the (t-1)-th video frame to the t-th video frame according to the pose change constraint determined for the t-th video frame, and then determines the initial pose of the target object in the t-th video frame by improving the pose of the target object in the (t-1)-th video frame based on the adjusted pose change tendency.
Furthermore, in order to trigger the object shape tracking process and to determine the corresponding shape and the corresponding pose in the first video frame (i.e. the 1 st video frame) of the received video, the apparatus 300 further comprises a detection unit 360. As shown in fig. 3, for the 1 st video frame of the received video, the detection unit 360 detects the corresponding pose of the target object in the 1 st video frame and detects the corresponding shape of the target object in the 1 st video frame. Then, taking the 2 nd video frame of the received video as an example, the change information determination unit 310 performs a corresponding operation based on the posture of the target object detected from the 1 st video frame.
The flow chart 400 shown in fig. 4 is a corresponding process of the apparatus 300 shown in fig. 3.
As shown in fig. 4, for one received video, in the detection step S410, the detection unit 360 detects the corresponding pose of the target object in the 1 st video frame of the received video and detects the corresponding shape of the target object in the 1 st video frame.
In one implementation, in one aspect, the detection unit 360 performs a pose detection method on the 1 st video frame to detect a corresponding pose of the target object. The pose detection method is, for example, a model-based pose detection and classification method. The model for detection and classification operations is generated based on a plurality of pre-labeled samples of the target object in different postures by using a general supervised machine learning method (generic supervised machine learning method). On the other hand, the detection unit 360 performs a shape detection method (e.g., a cascade regression method) on the 1 st video frame to detect the corresponding shape of the target object in the 1 st video frame. For example, in the case where the shape of the target object is represented by feature points, the corresponding positions of the feature points of the shape of the target object in the 1 st video frame can be obtained. Wherein, in the case that the target object is a human face, the feature points are, for example, human face feature points; in the case where the target object is a human joint, the feature points are, for example, human joint feature points.
Returning to FIG. 4, in step S420, for the t-th video frame of the received video (where t ≥ 2), the apparatus 300 shown in FIG. 3 determines the corresponding shape and the corresponding pose of the target object in the t-th video frame. In one implementation, the apparatus 300 performs this determination according to the flow described below with reference to fig. 5.
Then, after the apparatus 300 determines the corresponding shape and pose of the target object in the t-th video frame, the apparatus 300 determines whether t is greater than T in step S430. In case t is greater than T (meaning that the entire received video has been processed), the corresponding process of the apparatus 300 stops. Otherwise, in step S440, the apparatus 300 sets t = t + 1 and repeats the process of step S420.
Fig. 5 schematically shows a flowchart of step S420 as shown in fig. 4 according to the present invention.
As shown in fig. 5, in the change information determination step S421, the change information determination unit 310 shown in fig. 3 determines the posture change tendency (i.e., posture change information) of the target object from the (t-1) th video frame to the t-th video frame based on the posture of the target object in at least one video frame before the t-th video frame.
In case that only one video frame is available before the t-th video frame, such as only the (t-1) th video frame, the change information determination unit 310, in one implementation, considers a predefined motion state (e.g., a stationary or predetermined velocity motion state) as the posture change tendency of the target object from the (t-1) th video frame to the t-th video frame.
In the case where two video frames are available before the t-th video frame, such as the (t-2)-th and (t-1)-th video frames, the change information determination unit 310, in one implementation, regards the pose change tendency of the target object from the (t-1)-th video frame to the t-th video frame as a uniform motion and regards the corresponding angular velocity as the pose change tendency. In one example, the angular velocity is determined based on the difference in the corresponding direction-related spatial attributes of the target object between the (t-2)-th and (t-1)-th video frames, such as the difference in the yaw angle of the target object between these two frames. Taking the face (i.e., the target object) shown in fig. 6A to 6C as an example, fig. 6A to 6C respectively show the (t-2)-th to t-th video frames of one video. Assume that the face rotates in the yaw direction in the video, that the pose of the face in the (t-2)-th video frame is a face yaw angle of -30 degrees, and that the pose of the face in the (t-1)-th video frame is a face yaw angle of 0 degrees; the pose change tendency of the face from the (t-1)-th video frame to the t-th video frame can therefore be regarded as a face angular velocity of 30 degrees per video frame.
In the case where more than two video frames are available before the t-th video frame, in one implementation, the change information determination unit 310 treats the posture change tendency of the target object from the (t-1) th video frame to the t-th video frame as a variable speed motion and determines a corresponding posture change tendency based on the angular velocity and the angular acceleration. In one example, angular velocity and angular acceleration are determined by performing a time-based pose change prediction method on corresponding direction-dependent spatial attributes of the target object between available video frames.
More specifically, according to the time-series-based pose change prediction method, a predictor (e.g., a time-domain-based predictor) processes each direction-related spatial attribute (such as the yaw angle) of the target object across the available video frames to perform the corresponding variable prediction. For example, the predictor may be a linear predictor and may be represented by the following equation (2):
X_t = a_0 + a_1*X_(t-1) + a_2*X_(t-2) + … + a_N*X_(t-N) … (2)
where a_i (i ∈ {0, 1, 2, …, N}) represents the parameters of the predictor, obtained when the predictor is generated using multiple samples, N represents the length of the predictor, and X_t represents the variable value at time t (e.g., the yaw angle change between adjacent video frames). In addition, the predictor may also be a polynomial-based predictor, a Kalman-filter-based predictor, or another predictor.
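As an illustration, the following is a minimal sketch of the linear predictor of equation (2), applied to the yaw-angle change between adjacent video frames. The coefficients a_0, …, a_N would normally be obtained by training the predictor on multiple samples; the values below are placeholders.

```python
import numpy as np

def linear_predict(history, coeffs):
    """Equation (2): X_t = a_0 + a_1*X_(t-1) + ... + a_N*X_(t-N).

    history : sequence [X_(t-1), X_(t-2), ..., X_(t-N)] of past variable values
              (e.g., yaw-angle change between adjacent video frames), most recent first.
    coeffs  : [a_0, a_1, ..., a_N], learned offline from training samples.
    """
    a0, a_rest = coeffs[0], np.asarray(coeffs[1:])
    return a0 + float(a_rest @ np.asarray(history))

# Placeholder coefficients (N = 3); real values come from predictor training.
coeffs = [0.0, 1.2, -0.3, 0.1]
yaw_deltas = [28.0, 25.0, 20.0]          # most recent first
predicted_delta = linear_predict(yaw_deltas, coeffs)
```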
Returning to fig. 5, in the pose determination step S422, the pose determination unit 320 determines the initial pose of the target object in the t-th video frame based on the pose of the target object in the (t-1) th video frame and the pose change tendency of the target object from the (t-1) th video frame to the t-th video frame.
In one implementation, the pose determination unit 320 determines the initial pose of the target object in the t-th video frame by compensating the pose of the target object in the (t-1)-th video frame with an offset, where the offset is determined based on the pose change tendency of the target object from the (t-1)-th video frame to the t-th video frame. More specifically, the compensation operation adds the offset to the pose of the target object in the (t-1)-th video frame. Taking the face (i.e., the target object) shown in fig. 6A to 6C as an example, as described above, the pose of the face in the (t-1)-th video frame is a face yaw angle of 0 degrees, and the pose change tendency of the face from the (t-1)-th video frame to the t-th video frame is a face angular velocity of 30 degrees per video frame. Therefore, the corresponding offset is regarded as 30 degrees, and the initial pose of the face in the t-th video frame is regarded as a face yaw angle of 30 degrees.
Returning to fig. 5, in the shape determining step S423, the shape determining unit 330 determines the initial shape of the target object in the t-th video frame based on the initial pose of the target object in the t-th video frame. In one implementation, the shape determination unit 330 determines the initial shape of the target object in the tth video frame by transforming the predetermined shape using the initial pose of the target object in the tth video frame. More specifically, the initial shape of the target object in the t-th video frame is obtained by performing a transformation operation on a predetermined shape using a transformation matrix, which is generated based on the initial pose of the target object in the t-th video frame.
Wherein, in one example, the predetermined shape is a pre-generated average shape of the target object, the pre-generated average shape of the target object being generated based on a plurality of image samples of the target object, wherein the shape of the target object is marked in each image sample. To obtain a more accurate initial shape of the target object, in another example, the predetermined shape is a shape selected from a plurality of pre-generated average shapes of the target object, wherein a pose of the target object corresponding to the selected shape best matches the initial pose of the target object in the t-th video frame. In another example, the predetermined shape is a shape determined based on a shape of the target object in at least one video frame before the t-th video frame.
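The following is a minimal sketch of step S423 under simplifying assumptions: the predetermined shape is taken to be a 3D mean landmark shape, the transformation matrix is the rotation built from the initial pitch/yaw/roll, and the rotated shape is orthographically projected around an expected object center. Any scale or translation handling beyond this is omitted.

```python
import numpy as np

def rotation_matrix(pitch, yaw, roll):
    """3x3 rotation built from the initial pose angles (degrees)."""
    p, y, r = np.radians([pitch, yaw, roll])
    Rx = np.array([[1, 0, 0], [0, np.cos(p), -np.sin(p)], [0, np.sin(p), np.cos(p)]])
    Ry = np.array([[np.cos(y), 0, np.sin(y)], [0, 1, 0], [-np.sin(y), 0, np.cos(y)]])
    Rz = np.array([[np.cos(r), -np.sin(r), 0], [np.sin(r), np.cos(r), 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def initial_shape(mean_shape_3d, pose, center_2d, scale=1.0):
    """Rotate the predetermined mean shape by the initial pose and project it
    (orthographically) to image coordinates around the expected object center."""
    rotated = mean_shape_3d @ rotation_matrix(*pose).T
    return scale * rotated[:, :2] + np.asarray(center_2d)

# mean_shape_3d: (N, 3) landmark model; pose = (pitch, yaw, roll) from step S422.
```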
Returning to fig. 5, in the updating step S424, the updating unit 340 determines the final shape of the target object in the t-th video frame by updating the initial shape of the target object in the t-th video frame using a shape detection method (e.g., a cascade regression method). Then, based on the determined final shape of the target object, the updating unit 340 determines the final pose of the target object in the t-th video frame by updating the initial pose of the target object in the t-th video frame using a pose estimation method (e.g., a POSIT algorithm), so that a more accurate pose of the target object can be transferred to a subsequent video frame.
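The following sketch illustrates the updating step S424. The cascade-regression refiner is represented by a hypothetical cascade_regressor callable, and OpenCV's solvePnP is used as a stand-in for the POSIT algorithm mentioned above; both choices are assumptions for illustration only.

```python
import numpy as np
import cv2

def update_step(frame, initial_shape_2d, model_points_3d, camera_matrix, cascade_regressor):
    """Step S424 (sketch): refine the initial shape, then re-estimate the pose
    from the refined landmarks so that a more accurate pose can be passed on
    to the subsequent video frame."""
    # Shape update: `cascade_regressor` is a hypothetical, pre-trained cascade
    # regression refiner (not a library API).
    final_shape_2d = cascade_regressor(frame, initial_shape_2d)

    # Pose update: solvePnP is used here as a stand-in for the POSIT algorithm;
    # it recovers the rotation/translation mapping the 3D landmark model onto
    # the refined 2D landmarks.
    ok, rvec, tvec = cv2.solvePnP(
        np.asarray(model_points_3d, dtype=np.float64),
        np.asarray(final_shape_2d, dtype=np.float64),
        camera_matrix, None, flags=cv2.SOLVEPNP_ITERATIVE)
    rot_mat, _ = cv2.Rodrigues(rvec)  # pitch/yaw/roll can be derived from rot_mat
    return final_shape_2d, rot_mat, tvec
```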
As described above, the pose change constraint of the target object due to the pose of the associated master object of the target object in the video may also be used to determine the initial pose of the target object in the t-th video frame. Therefore, another flowchart of step S420 shown in fig. 4 is shown in fig. 7. Comparing fig. 7 with fig. 5, the main difference in the flowchart shown in fig. 7 is that step S420 further includes a change constraint determining step S425. That is, the pose change constraint of the target object in the t-th video frame will also be determined according to the flowchart shown in FIG. 7.
As shown in fig. 7, after the change information determination unit 310 determines the posture change tendency of the target object from the (t-1) th video frame to the t-th video frame in the change information determination step S421, the change constraint determination unit 350 determines the posture change constraint of the target object in the t-th video frame based on the posture of the associated main object in the t-th video frame and the constraint relationship between the target object and the associated main object in the t-th video frame in the change constraint determination step S425.
In one implementation, to obtain a more accurate pose of the associated main object, the pose of the associated main object in the t-th video frame is determined by using a pose detector (e.g., an Adaboost-based detector) for the associated main object in the t-th video frame. More specifically, first, an associated main object is detected from a t-th video frame using a gesture detector, and then a gesture of the associated main object in the t-th video frame is recognized using the gesture detector. Wherein the gesture detector is generated based on a plurality of pre-labeled samples of the associated master object under different gesture conditions using a general supervised machine learning approach.
To determine the pose of the associated main object with less computational effort, and/or to ensure that the pose of the associated main object can be determined regardless of whether it is occluded by other objects or is moving, in another implementation the pose of the associated main object in the t-th video frame is determined according to the flowchart shown in fig. 8.
As shown in fig. 8, in step S4251, based on the pose of the associated main object in at least one video frame before the t-th video frame, pose change information (i.e., pose change tendency) of the associated main object from the (t-1) th video frame to the t-th video frame is determined. Here, the process of determining the posture change tendency of the associated main object is similar to the above-described process of determining the posture change tendency of the target object, and a detailed description will not be repeated here.
In one implementation, the pose of the associated master object in the corresponding video frame is determined by using the above-described pose detector for the associated master object in each video frame preceding the t-th video frame. In another implementation, the pose of the associated main object in the corresponding video frame is determined based on the final shape of the target object and the final pose of the target object determined in each video frame before the t-th video frame and the constraint relationship between the target object and the associated main object in the corresponding video frame. In addition, the pose of the associated master object in each video frame prior to the t-th video frame is determined using the pose estimation method described above.
Then, in step S4252, the pose of the associated main object in the t-th video frame is determined based on the pose of the associated main object in the (t-1) -th video frame and the determined pose change trend of the associated main object from the (t-1) -th video frame to the t-th video frame. Wherein the process of determining the pose of the associated host object is similar to the process of determining the initial pose of the target object described above, and a detailed description will not be repeated here.
Then, after determining the pose of the associated main object in the t-th video frame, the pose change constraint of the target object in the t-th video frame may be determined accordingly. For example, as described above, assuming that the pose change constraint of the target object is described by the predetermined rule and assuming that the normally available range over which the pose of the target object may vary or move with respect to its associated main object is (-40 degrees, 40 degrees), in the case where the pose of the associated main object in the t-th video frame is 0 degrees, the pose change constraint of the target object in the t-th video frame is (-40 degrees, 40 degrees). In the case where the pose of the associated main object in the t-th video frame is 20 degrees, the pose change constraint of the target object in the t-th video frame is (-20 degrees, 60 degrees).
Returning to fig. 7, after determining the pose change constraint of the target object in the t-th video frame in the change constraint determination step S425, the pose determination unit 320 determines the initial pose of the target object in the t-th video frame by compensating the pose of the target object in the (t-1) -th video frame by the offset amount in the pose determination step S422. Wherein the offset is determined based on the attitude change trend of the target object from the (t-1) th video frame to the t-th video frame and the attitude change constraint of the target object in the t-th video frame. And in the case that the posture change trend of the target object from the (t-1) th video frame to the t-th video frame exceeds the posture change constraint in the t-th video frame, using the posture change constraint in the t-th video frame to adjust the posture change trend of the target object from the (t-1) th video frame to the t-th video frame. Further, taking the face (i.e., the target object) shown in fig. 6A to 6C as an example, assuming that the pose change of the face in the t-th video frame is constrained to (-20 degrees, 20 degrees), as described above, the pose change trend of the face from the (t-1) th video frame to the t-th video frame is that the angular velocity of the face is 30 degrees/video frame. Since the trend of the change in the pose of the face from the (t-1) th video frame to the t-th video frame is greater than the maximum value of the determined pose change constraint of the face, the maximum value of the determined pose change constraint of the face (i.e., 20 degrees) is considered as the corresponding offset. As described above, the pose of the face in the (t-1) th video frame is such that the face yaw angle is 0 degrees, and therefore the initial pose of the face in the t-th video frame is considered to be 20 degrees.
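The following is a minimal sketch of the pose determination step S422 with the pose change constraint applied, reproducing the worked numbers above (previous yaw 0 degrees, trend 30 degrees per frame, constraint (-20 degrees, 20 degrees)).

```python
def initial_pose(prev_pose_deg, trend_deg_per_frame, constraint_range_deg):
    """Step S422 (sketch): compensate the previous pose by an offset derived from
    the pose-change trend, adjusted (clipped) by the pose-change constraint."""
    low, high = constraint_range_deg
    offset = min(max(trend_deg_per_frame, low), high)   # adjust trend by the constraint
    return prev_pose_deg + offset

# Worked example from the description: previous yaw 0 deg, trend 30 deg/frame,
# constraint (-20 deg, 20 deg) -> initial yaw 20 deg in the t-th video frame.
print(initial_pose(0.0, 30.0, (-20.0, 20.0)))   # 20.0
```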
In addition, since the change information determining step S421, the shape determining step S423, and the updating step S424 shown in fig. 7 are the same as the corresponding steps S421, S423, and S424 shown in fig. 5, detailed description will not be repeated here.
As described above, the shape of a target object in one video frame can be represented by corresponding feature points. Therefore, in a case where all feature points of the target object cannot be obtained from the corresponding video frame, such as in a case where the target object is occluded by other objects in the video frame, a part of the feature points of the shape of the target object may be determined to assist in tracking the shape of the target object in the video. Fig. 9 is a block diagram illustrating a configuration of an apparatus 900 according to a second embodiment of the present invention. Wherein some or all of the modules shown in figure 9 may be implemented by dedicated hardware. In addition, in the embodiment, the posture change constraint of the target object caused by the posture of the associated main object of the target object in the video is also considered. However, it is clear that it is not necessarily limited thereto.
Comparing fig. 9 with fig. 3, the main differences of the device 900 shown in fig. 9 are as follows:
first, the apparatus 900 further includes a feature point determining unit 910, wherein a detailed description of the feature point determining unit 910 will be described below.
Next, in this embodiment, the pose of the target object in one video frame describes the position-related spatial attribute, rather than the direction-related spatial attribute, of the target object in that video frame.
As shown in fig. 9, for a target object in a current video frame (such as the t-th video frame) of a received video, first, the feature point determining unit 910 determines a feature point of the shape of the target object in the t-th video frame. In other words, the feature point determining unit 910 first determines corresponding key feature points from all feature points of the shape of the target object in the t-th video frame.
In one implementation, the key feature points determined by the feature point determination unit 910 are feature points whose geometric relationship is stable with respect to a change in the pose of a target object in the video. Wherein feature points for which the geometric relationship is stable with respect to changes in the pose of the target object may be predefined based on predefined rules generated from prior knowledge or from statistical processes. For example, in the case where the target object is a human face, since the positions of and the distances between feature points at the corners of the eyes and at the nose tip are stable with respect to the posture change of the human face, these feature points can be regarded as key feature points.
In another implementation, the key feature points determined by the feature point determining unit 910 are non-occlusion feature points represented by occlusion information of the shape of the target object in the t-th video frame. Wherein, for any one video frame, the occlusion information of the shape of the target object in the video frame represents the feature points of the shape of the target object as occlusion feature points and non-occlusion feature points. More specifically, occlusion information of the shape of the target object in the t-th video frame is determined by the following procedure.
First, the occlusion information of the shape of the target object in the (t-1) th video frame is updated by judging the occlusion information of each feature point of the final shape of the target object in the (t-1) th video frame determined by the updating unit 340 shown in fig. 9 based on a pre-generated occlusion classifier or other occlusion judging method. In one implementation, the pre-generated occlusion classifier is a binary classifier generated from positive and negative examples using a learning method such as a Support Vector Machine (SVM) algorithm, an Adaboost algorithm, or the like. Wherein the positive samples are generated by sampling the corresponding image around the occluded feature point, and the negative samples are generated by sampling the corresponding image around the non-occluded feature point.
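The following is a minimal sketch of how such a pre-generated occlusion classifier could be built, assuming grayscale patches of a fixed size sampled around occluded (positive) and non-occluded (negative) feature points and a linear SVM as the learner; the patch size, features, and library choice are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

PATCH = 16  # assumed half-size of the patch sampled around each feature point

def sample_patch(gray_image, point):
    """Crop and flatten the image patch centered on a feature point
    (border handling omitted for brevity)."""
    x, y = int(point[0]), int(point[1])
    patch = gray_image[y - PATCH:y + PATCH, x - PATCH:x + PATCH]
    return patch.reshape(-1).astype(np.float32) / 255.0

def train_occlusion_classifier(patches, labels):
    """Binary classifier: 1 = occluded feature point, 0 = non-occluded."""
    clf = SVC(kernel="linear")
    clf.fit(np.stack(patches), np.asarray(labels))
    return clf

def is_occluded(clf, gray_image, point):
    return bool(clf.predict(sample_patch(gray_image, point)[None, :])[0])
```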
Wherein, for example, the final shape of the target object in the (t-1) th video frame is determined by the following procedure: first, the updating unit 340 updates the position of the non-occlusion feature point of the initial shape of the target object determined by the shape determining unit 330 using a shape detection method (e.g., a cascade regression method); then, the updating unit 340 updates the position of the occlusion feature point of the initial shape of the target object determined by the shape determining unit 330 based on the updated position of the non-occlusion feature point and the geometric relationship between the non-occlusion feature point and the occlusion feature point with respect to the target object region. Wherein the target object region may be estimated based on object tracking results (e.g., object shapes) obtained from previous video frames or may be detected in the corresponding video frame using existing detection methods.
Wherein, for the 1 st video frame of the received video, the detection unit 360 is further operable to detect corresponding occlusion information of the shape of the detected target object in the 1 st video frame.
Then, after updating occlusion information of the shape of the target object in the (t-1) th video frame, occlusion information of the shape of the target object in the t-th video frame is determined based on the occlusion information of the shape of the target object in at least one video frame before the t-th video frame. In one example, the updated occlusion information of the shape of the target object in the (t-1) th video frame is directly regarded as the occlusion information of the shape of the target object in the t-th video frame. In another example, occlusion information of a shape of the target object in the t-th video frame is determined using a statistics-based method based on occlusion information of a shape of the target object determined in a plurality of video frames preceding the t-th video frame.
Wherein the posture change tendency of the target object determined by the change information determination unit 310 from the (t-1) th video frame to the tth video frame is the position change information (i.e., the position change tendency) of the determined key feature point from the (t-1) th video frame to the tth video frame. Wherein for any one of the determined key feature points, the corresponding position change information is determined based on the position of the key feature point in at least one video frame before the t-th video frame. For example, for a determined key feature point from the (t-1) th video frame to the tth video frame, in case only one video frame is available before the tth video frame, a predefined motion state (e.g., a stationary or predetermined velocity motion state) is considered as a corresponding position variation trend of the key feature point. In the case where two video frames are available before the t-th video frame, a uniform velocity will be considered as the corresponding position variation trend of the key feature point. In the case where more than two video frames are available before the t-th video frame, the corresponding trend of change in position of the key feature point will be determined based on the velocity and acceleration determined by performing a time-based method of prediction of change in position of the key feature point on the corresponding position-related spatial attributes between the available video frames.
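The following is a minimal sketch of this per-feature-point position change prediction: stationary with one previous frame, constant velocity with two, and constant acceleration with three or more. The exact prediction model used in practice (e.g., a trained time-series predictor) may differ.

```python
import numpy as np

def predict_point_positions(history):
    """Predict key-feature-point positions in the t-th frame from their positions
    in previous frames. `history` is a list of (N, 2) arrays ordered oldest first.

    - 1 previous frame : stationary (predefined motion state)
    - 2 previous frames: constant velocity
    - 3+ previous frames: constant acceleration
    """
    if len(history) == 1:
        return history[-1]
    velocity = history[-1] - history[-2]
    if len(history) == 2:
        return history[-1] + velocity
    acceleration = (history[-1] - history[-2]) - (history[-2] - history[-3])
    return history[-1] + velocity + 0.5 * acceleration
```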
Wherein the final pose of the target object in the t-th video frame determined by the updating unit 340 is the positions of the determined key feature points of the final shape of the target object in the t-th video frame.
In addition, since the change information determining unit 310, the posture determining unit 320, the shape determining unit 330, the updating unit 340, the change constraint determining unit 350, and the detecting unit 360 shown in fig. 9 are the same as the corresponding units shown in fig. 3, detailed description will not be repeated here.
As described above, in the present invention, for a target object in a video under motion, when determining an initial shape of the target object in one video frame, a pose change trend of the target object from a previous video frame to the video frame due to motion (especially rotation) of the target object in the video and a pose change constraint of the target object due to a pose of an associated main object of the target object in the video are also considered. Therefore, the accuracy of the initial shape of the target object determined for one video frame will be improved. Thereby, the accuracy of the final shape of the target object determined for the corresponding video frame will also be improved. Therefore, according to the present invention, the accuracy of the shape of the object will be improved.
In the process of tracking the shape of a target object in a moving situation in a video, in practical application, the target object is often not visible between some video frames of the video due to the target object rotating itself or being occluded by other objects. In general, in such a case, the prior art will not perform any operation on the corresponding video frame in which the target object is not visible, and will re-perform the tracking process on the subsequent video frame in which the target object is visible again, which will cause the process of tracking the shape of the target object in the entire video to be interrupted. Therefore, the tracking Identification (ID) of the target object in the entire video will be switched or the tracking of the target object in the entire video will be lost, i.e., the accuracy of object tracking will be affected. For example, switching the tracking ID of the target object includes giving the target object a new tracking ID or exchanging the tracking ID of the target object with another object being tracked. In particular, in the people counting application that counts the number of people in or through a specific space, in the case where the tracking ID of a person is switched in the process of tracking the shape of the person in a video, an erroneous people counting result is output. Thus, in order to enable the process of tracking the shape of a target object throughout the video to be performed continuously, whether or not the target object is visible, the present invention will continue to predict the pose and shape of the target object in video frames in which the target object is not visible.
Fig. 10 is a block diagram illustrating the configuration of an apparatus 1000 according to the third embodiment of the present invention. Some or all of the modules shown in fig. 10 may be implemented by dedicated hardware. In this embodiment, the pose change constraint of the target object caused by the pose of its associated main object in the video is also considered; however, the embodiment is clearly not limited thereto. In this embodiment the pose of the target object in a video frame describes the direction-related spatial attributes of the target object in that video frame, but it may instead describe the position-related spatial attributes of the target object; that is, as an alternative, key feature points may also be determined for the shape of the target object in the video frame.
Comparing fig. 10 with fig. 3, the main difference is that the apparatus 1000 shown in fig. 10 further comprises a first judging unit 1010.
As shown in fig. 10, for a target object in a current video frame (e.g., the t-th video frame) of the received video, after the shape determination unit 330 determines the initial shape of the target object in the t-th video frame, the first judgment unit 1010 judges whether the target object is visible in the t-th video frame based on the confidence of the target object in that frame. In the case where the first judgment unit 1010 judges that the target object is visible in the t-th video frame, the updating unit 340 updates the initial shape determined by the shape determining unit 330 and updates the initial pose determined by the pose determining unit 320 based on the updated shape. Conversely, in the case where the target object is not visible in a video frame, only the pose determination unit 320 and the shape determination unit 330 are used, i.e., only the initial pose and the initial shape of the target object are determined. These are then passed to the subsequent video frames for the subsequent operations, so the process of tracking the shape of the target object throughout the video remains continuous.
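For clarity, the following sketch outlines one possible per-frame flow around the first judging unit 1010: the predicted pose and shape are refined only when the object is judged visible, and otherwise the predictions are propagated unchanged. All callables and the dictionary pose representation are placeholders assumed for illustration; they stand in for the corresponding units rather than reproducing them.

```python
def track_one_frame(prev_pose, pose_change, frame,
                    predict_shape, is_visible, refine_shape, pose_from_shape):
    """One tracking step for the t-th frame (sketch, with assumed callables)."""
    # Pose determination (cf. unit 320): previous pose plus the predicted change.
    initial_pose = {k: prev_pose[k] + pose_change.get(k, 0.0) for k in prev_pose}

    # Shape determination (cf. unit 330): initial shape derived from the initial pose.
    initial_shape = predict_shape(initial_pose)

    if not is_visible(frame):
        # Object not visible: keep the predicted pose and shape and pass them
        # on to the next frame, so tracking is not interrupted.
        return initial_pose, initial_shape

    # Object visible: the update stage (cf. unit 340) refines the shape against
    # the image and re-derives the pose from the refined shape.
    final_shape = refine_shape(frame, initial_shape)
    final_pose = pose_from_shape(final_shape)
    return final_pose, final_shape


# Minimal usage with dummy stand-ins for the units:
pose, shape = track_one_frame(
    prev_pose={"yaw": 10.0}, pose_change={"yaw": 2.0}, frame=None,
    predict_shape=lambda p: [(0.0, 0.0)],
    is_visible=lambda f: False,
    refine_shape=lambda f, s: s,
    pose_from_shape=lambda s: {"yaw": 12.0})
print(pose, shape)  # -> {'yaw': 12.0} [(0.0, 0.0)]
```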
In addition, to ensure the accuracy of object tracking, in the case where the target object is judged to be invisible in the (t-1)-th video frame but visible in the t-th video frame, i.e., where the visibility of the target object is recovered in the t-th video frame, the present invention further judges whether the target object in the (t-1)-th video frame and the target object in the t-th video frame belong to the same object, so as to avoid erroneously switching the tracking ID. Therefore, comparing fig. 10 with fig. 3, the apparatus 1000 shown in fig. 10 further includes a second determination unit 1020.
As shown in fig. 10, in the case where the first determination unit 1010 determines that the target object is not visible in the (t-1)-th video frame but is visible in the t-th video frame, the detection unit 360 first detects the corresponding shape and pose of the target object in the t-th video frame. The second determination unit 1020 then determines, based on a similarity metric, whether the shape of the target object in the t-th video frame detected by the detection unit 360 and the final shape of the target object in the t-th video frame determined by the update unit 340 belong to the same object. In the case where they are determined to belong to the same object, the target object in the (t-1)-th video frame and the target object in the t-th video frame are regarded as the same object, and the same object label (e.g., the same tracking ID) is assigned to the shapes of the target object in the (t-1)-th and t-th video frames.
In addition, since the change information determining unit 310, the posture determining unit 320, the shape determining unit 330, the updating unit 340, the change constraint determining unit 350, and the detecting unit 360 shown in fig. 10 are the same as the corresponding units shown in fig. 3, detailed description will not be repeated here.
Fig. 11 schematically shows a flowchart of step S420 shown in fig. 4 according to a third embodiment of the present invention.
As shown in fig. 11, after the shape determining unit 330 determines the initial shape of the target object in the t-th video frame in the shape determining step S423, the first judging unit 1010 judges whether the target object is visible in the t-th video frame in the first judging step S1110.
In one implementation, the first judgment unit 1010 applies an object detection method (e.g., a face detection method) to the t-th video frame and judges that the target object is visible in the t-th video frame in the case where the confidence obtained by the object detection method is greater than or equal to a predefined threshold (e.g., TH1).
In another implementation, in the case where the target object is occluded by another object (hereinafter referred to as a third object) in the t-th video frame, the first judgment unit 1010 detects the region of the target object and the region of the third object in the t-th video frame using an existing detection method, calculates the overlap ratio between the region area of the target object and the region area of the third object, and judges that the target object is visible in the t-th video frame in the case where the calculated overlap ratio is less than a predefined threshold (e.g., TH2).
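The two visibility tests can be sketched as follows. The thresholds TH1 and TH2 are not fixed by the text, so the values below are arbitrary placeholders, and the overlap ratio is computed as intersection area over the target's own region area, which is one plausible reading of the description above.

```python
def visible_by_confidence(detector_confidence, th1=0.5):
    """First implementation: visible when the object (e.g. face) detector's
    confidence reaches the threshold TH1 (placeholder value)."""
    return detector_confidence >= th1


def visible_by_overlap(target_box, occluder_box, th2=0.6):
    """Second implementation: visible when the overlap ratio between the
    target's region and the occluding (third) object's region is below TH2.
    Boxes are (x1, y1, x2, y2); the ratio used here is intersection area
    over the target's own area (an assumption of this sketch)."""
    ix1, iy1 = max(target_box[0], occluder_box[0]), max(target_box[1], occluder_box[1])
    ix2, iy2 = min(target_box[2], occluder_box[2]), min(target_box[3], occluder_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    target_area = (target_box[2] - target_box[0]) * (target_box[3] - target_box[1])
    overlap_ratio = inter / target_area if target_area > 0 else 1.0
    return overlap_ratio < th2


print(visible_by_confidence(0.82))                             # True
print(visible_by_overlap((10, 10, 60, 60), (40, 40, 90, 90)))  # True (small overlap)
```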
Returning to fig. 11, in the case where it is determined in the first determination step S1110 that the target object is not visible in the t-th video frame, only the initial pose and the initial shape of the target object are determined. Otherwise, the final pose and the final shape of the target object are determined in the updating step S424. As shown in fig. 12, because the face (i.e., the target object) itself rotates, it is not visible from the (i+3)-th video frame to the (i+5)-th video frame; according to the present invention, only the initial pose and the initial shape are therefore determined for the face in those frames.
Then, in step S1120, the second determination unit 1020 determines whether the visibility of the target object changes from invisible to visible between the (t-1)-th video frame and the t-th video frame. If not, meaning that the target object has remained visible in the previous video frames, the second determination unit 1020 performs no further operation. Otherwise, as for the target object in the (i+6)-th video frame shown in fig. 12, the detection unit 360 shown in fig. 10 detects the corresponding shape and pose of the target object in the t-th video frame in step S1130.
Then, in step S1140, the second judgment unit 1020 judges, based on a similarity metric, whether the shape of the target object in the t-th video frame detected by the detection unit 360 and the final shape of the target object in the t-th video frame determined by the update unit 340 belong to the same object. In one implementation, the similarity metric is calculated using, for example, the cosine distance or the Euclidean distance between the pose of the target object in the t-th video frame detected by the detection unit 360 and the final pose of the target object in the t-th video frame determined by the update unit 340. In the case where the calculated similarity metric is greater than or equal to a predefined threshold (e.g., TH3), the second judging unit 1020 judges that the two shapes belong to the same object.
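A minimal sketch of the same-object judgment in step S1140 follows. The pose vectors, the mapping of Euclidean distance to a similarity value, and the TH3 value are illustrative assumptions, not details fixed by the text.

```python
import numpy as np

def same_object(detected_pose, tracked_pose, th3=0.9, metric="cosine"):
    """Decide whether the detected shape and the tracked (updated) shape
    belong to the same object by comparing their poses."""
    a = np.asarray(detected_pose, dtype=float)
    b = np.asarray(tracked_pose, dtype=float)
    if metric == "cosine":
        # Cosine similarity between the two pose vectors.
        similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    else:
        # Euclidean distance mapped to a similarity in (0, 1] (assumed mapping).
        similarity = 1.0 / (1.0 + float(np.linalg.norm(a - b)))
    return similarity >= th3


# Poses given e.g. as (yaw, pitch, roll) direction attributes:
print(same_object((15.0, -3.0, 1.0), (14.0, -2.5, 0.8)))  # True
```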
In addition, since the change information determining step S421, the posture determining step S422, the shape determining step S423, the updating step S424, and the change constraint determining step S425 shown in fig. 11 are the same as the corresponding steps shown in fig. 5, detailed description will not be repeated here.
As an exemplary application of the process described above with reference to figs. 10 and 11, an exemplary image processing system for tracking the shape of a person in a video will next be described with reference to fig. 13. As shown in fig. 13, the image processing system 1300 includes a first image processing apparatus 1310, the apparatus 1000 (i.e., a second image processing apparatus), and a third image processing apparatus 1320. In one implementation, the apparatus 1000, the first image processing apparatus 1310, and the third image processing apparatus 1320 are connected to one another via a system bus; in another implementation, they are connected to one another via a network. In addition, the apparatus 1000, the first image processing apparatus 1310, and the third image processing apparatus 1320 may be implemented by the same electronic device (e.g., a computer, a PDA, a mobile phone, or a camera), or by different electronic devices.
As shown in fig. 13, first, the apparatus 1000 and the first image processing apparatus 1310 receive a video output from a dedicated electronic device (e.g., a camera) or input by a user.
For a person (i.e., a target object) to be tracked in the input video, the first image processing apparatus 1310 determines first tracking information of the person in each video frame of the input video. In one implementation, the first image processing apparatus 1310 applies, for example, a general tracking method to each video frame of the input video to determine the corresponding first tracking information. The tracking information of the shape of a person in a video frame includes, for example, the tracking ID of the person, the track of the shape of the person (e.g., the track of each feature point), and the like.
For the person to be tracked in the input video, the apparatus 1000 determines second tracking information of the person in each video frame of the input video based on the shape and the pose of the person in each video frame, determined as described with reference to figs. 10 and 11. As described in the third embodiment, when the visibility of the person is restored in a certain video frame during tracking, the apparatus 1000 determines whether the shape determined for the preceding video frame, in which the person is not visible, and the shape determined for the video frame in which the person is visible again belong to the same object (i.e., the person to be tracked). In the case where the two shapes are determined to belong to the same object, they are labeled with the same tracking ID.
Then, in the case where the first tracking information is different from the second tracking information for the person to be tracked in the input video, the third image processing apparatus 1320 updates the first tracking information determined by the first image processing apparatus 1310 based on the second tracking information determined by the apparatus 1000.
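The reconciliation performed by the third image processing apparatus 1320 could, for example, look like the following sketch; the per-frame dictionary layout of the tracking information is an assumption made purely for illustration.

```python
def reconcile_tracking(first_info, second_info):
    """Where the first tracking information differs from the second for the
    tracked person, replace it with the second.  Both arguments map
    frame index -> {'track_id': ...}; this layout is assumed, not specified."""
    updated = dict(first_info)
    for frame_idx, second in second_info.items():
        if updated.get(frame_idx) != second:
            updated[frame_idx] = second
    return updated


first = {0: {"track_id": 1}, 5: {"track_id": 2}}   # ID switched after occlusion
second = {0: {"track_id": 1}, 5: {"track_id": 1}}  # same person re-identified
print(reconcile_tracking(first, second))           # frame 5 restored to ID 1
```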
As described above, the present invention continues to predict the pose and shape of the target object in the video frames in which it is not visible, and performs the corresponding matching process once the visibility of the target object is restored. Therefore, the process of tracking the shape of the target object throughout the video can be performed continuously, multiple tracking IDs belonging to the same object can be restored to a single tracking ID, and the accuracy of object tracking is improved.
All of the units described above are exemplary and/or preferred modules for implementing the processes described in this disclosure. These units may be hardware units (such as field programmable gate arrays (FPGAs), digital signal processors, application-specific integrated circuits, etc.) and/or software modules (such as computer-readable programs). Units for carrying out each step have not all been described in detail above; however, where there is a step that performs a specific process, there may be a corresponding functional module or unit (implemented by hardware and/or software) to implement that process. Technical solutions formed by all combinations of the described steps and the units corresponding to those steps are included in the disclosure of the present application, as long as the technical solutions they form are complete and applicable.
The method and apparatus of the present invention may be implemented in various ways. For example, the methods and apparatus of the present invention may be implemented in software, hardware, firmware, or any combination thereof. The above-described order of the steps of the method is intended to be illustrative only and the steps of the method of the present invention are not limited to the order specifically described above unless specifically indicated otherwise. Furthermore, in some embodiments, the present invention may also be embodied as a program recorded in a recording medium, which includes machine-readable instructions for implementing a method according to the present invention. Accordingly, the present invention also covers a recording medium storing a program for implementing the method according to the present invention.
While some specific embodiments of the present invention have been shown in detail by way of example, it should be understood by those skilled in the art that the foregoing examples are intended to be illustrative only and are not limiting upon the scope of the invention. It will be appreciated by those skilled in the art that the above-described embodiments may be modified without departing from the scope and spirit of the invention. The scope of the invention is to be limited only by the following claims.

Claims (22)

1. An apparatus for tracking object shapes in a video, the apparatus comprising:
a change information determination unit configured to determine posture change information of a first object from a previous video frame to a current video frame based on a posture of the first object in at least one video frame prior to the current video frame;
a pose determination unit configured to determine a pose of the first object in the current video frame based on the pose of the first object in the previous video frame and the pose change information determined by the change information determination unit;
a shape determination unit configured to determine a shape of the first object in the current video frame based on the pose of the first object determined by the pose determination unit; and
an updating unit configured to update the shape of the first object determined by the shape determining unit and update the posture of the first object determined by the posture determining unit based on the updated shape of the first object.
2. The apparatus of claim 1, the apparatus further comprising:
a change constraint determining unit configured to determine a pose change constraint of the first object in the current video frame based on a pose of a second object in the current video frame and a constraint relationship between the first object and the second object in the current video frame;
wherein the second object is an object that constrains a range of variation in the pose of the first object in the current video frame;
wherein the pose determination unit determines the pose of the first object in the current video frame based on the pose of the first object in the previous video frame, the pose change information determined by the change information determination unit, and the pose change constraint determined by the change constraint determination unit.
3. The apparatus of claim 2, wherein the pose of the second object in the current video frame is determined by using a pose detector for the second object in the current video frame.
4. The apparatus of claim 2, wherein the pose of the second object in the current video frame is determined based on the pose of the second object in the previous video frame and pose change information of the second object from the previous video frame to the current video frame determined based on the pose of the second object in the at least one video frame prior to the current video frame.
5. The apparatus of claim 1, the apparatus further comprising:
a feature point determination unit configured to determine a feature point of a shape of the first object in the current video frame;
wherein the feature points determined by the feature point determination unit are feature points whose geometric relationship is stable with respect to a change in the posture of the first object in the video; alternatively, the feature point determined by the feature point determination unit is a non-occlusion feature point represented by occlusion information of the shape of the first object in the current video frame.
6. The apparatus of claim 2, the apparatus further comprising:
a feature point determination unit configured to determine a feature point of a shape of the first object in the current video frame;
wherein the feature points determined by the feature point determination unit are feature points whose geometric relationship is stable with respect to a change in the posture of the first object in the video; alternatively, the feature point determined by the feature point determination unit is a non-occlusion feature point represented by occlusion information of the shape of the first object in the current video frame.
7. The apparatus of claim 5 or claim 6,
wherein the posture change information of the first object from the previous video frame to the current video frame determined by the change information determination unit is position change information of the feature point from the previous video frame to the current video frame determined by the feature point determination unit;
wherein the posture of the first object updated by the updating unit is the position of the feature point determined by the feature point determining unit of the shape of the first object updated by the updating unit.
8. The apparatus of claim 5 or claim 6,
wherein occlusion information of the shape of the first object in the current video frame is determined based on occlusion information of the shape of the first object in the at least one previous video frame prior to the current video frame;
wherein, for any one video frame, the occlusion information of the shape of the first object in that video frame represents feature points of the shape of the first object as occlusion feature points and non-occlusion feature points.
9. The apparatus of claim 8, wherein,
the updating unit updates the position of the non-occlusion feature point of the shape of the first object determined by the shape determining unit using a shape detection method; and is
The updating unit updates the position of the occluding feature point of the shape of the first object determined by the shape determining unit based on the position of the non-occluding feature point updated by the updating unit and the geometric relationship between the non-occluding feature point and the occluding feature point with respect to the region of the first object.
10. The apparatus according to claim 9, wherein the occlusion information of the shape of the first object in the current video frame is updated by judging the occlusion information of each feature point of the shape of the first object updated by the updating unit based on a pre-generated occlusion classifier.
11. The apparatus of claim 1 or claim 2 or claim 5 or claim 6, further comprising:
a first judging unit configured to judge whether the first object is visible in the current video frame based on a confidence of the first object in the current video frame;
wherein, in a case where the first judgment unit judges that the first object is visible in the current video frame, the update unit updates the shape of the first object determined by the shape determination unit and updates the posture of the first object determined by the posture determination unit based on the updated shape of the first object.
12. The apparatus of claim 11, the apparatus further comprising:
a detection unit configured to detect, for a first video frame of the video or for a current video frame in which the first object is judged to be visible by the first judgment unit, a pose of the first object in the video frame and a shape of the first object in the video frame;
wherein the first object is determined by the first determination unit to be invisible in a previous video frame of the current video frame.
13. The apparatus of claim 12, the apparatus further comprising:
a second determination unit configured to determine whether the shape of the first object in the current video frame detected by the detection unit and the shape of the first object in the current video frame determined by the update unit belong to the same object based on a similarity metric in a case where the first object is determined by the first determination unit to be invisible in the previous video frame and visible in the current video frame;
wherein the similarity metric is determined based on the pose of the first object in the current video frame detected by the detection unit and the pose of the first object in the current video frame determined by the update unit.
14. The apparatus according to claim 7, wherein the change information determination unit determines the posture change information or the position change information by using a time-series-based change prediction method.
15. The apparatus of claim 2,
wherein the pose determination unit determines the pose of the first object in the current video frame by compensating the pose of the first object in the previous video frame by an offset;
wherein the offset amount is determined based on the posture change information determined by the change information determination unit; alternatively, the offset amount is determined based on the posture change information determined by the change information determination unit and the posture change constraint determined by the change constraint determination unit.
16. The apparatus of claim 1,
wherein the shape determination unit determines the shape of the first object in the current video frame by transforming a predetermined shape using the pose of the first object determined by the pose determination unit;
wherein the updating unit updates the shape of the first object determined by the shape determining unit using a shape detection method.
17. A method for tracking object shapes in a video, the method comprising:
a change information determination step of determining posture change information of a first object from a previous video frame to a current video frame based on a posture of the first object in at least one video frame before the current video frame;
a pose determining step of determining a pose of the first object in the current video frame based on the pose of the first object in the previous video frame and the pose change information determined by the change information determining step;
a shape determining step of determining a shape of the first object in the current video frame based on the pose of the first object determined by the pose determining step; and
an updating step of updating the shape of the first object determined by the shape determining step and updating the posture of the first object determined by the posture determining step based on the updated shape of the first object.
18. The method of claim 17, further comprising:
a change constraint determining step of determining a pose change constraint of the first object in the current video frame based on the pose of a second object in the current video frame and a constraint relationship between the first object and the second object in the current video frame;
wherein the second object is an object that constrains a range of variation in the pose of the first object in the current video frame;
wherein, in the pose determination step, the pose of the first object in the current video frame is determined based on the pose of the first object in the previous video frame, the pose change information determined by the change information determination step, and the pose change constraint determined by the change constraint determination step.
19. The method of claim 17 or claim 18, further comprising:
a first judging step of judging whether the first object is visible in the current video frame based on a confidence of the first object in the current video frame;
wherein, in the updating step, in a case where the first judging step judges that the first object is visible in the current video frame, the shape of the first object determined in the shape determining step is updated and the pose of the first object determined in the pose determining step is updated based on the updated shape of the first object.
20. The method of claim 19, further comprising:
a detecting step of detecting, for a first video frame of the video or for a current video frame in which the first object is judged to be visible by the first judging step, a pose of the first object in the video frame and a shape of the first object in the video frame;
wherein the first object is determined by the first determining step to be invisible in a previous video frame of the current video frame.
21. The method of claim 20, further comprising:
a second determination step of, in a case where it is determined in the first determination step that the first object is not visible in the previous video frame but is visible in the current video frame, determining, based on a similarity metric, whether the shape of the first object in the current video frame detected by the detection step and the shape of the first object in the current video frame determined by the update step belong to the same object;
wherein the similarity metric is determined based on the pose of the first object in the current video frame detected by the detecting step and the pose of the first object in the current video frame determined by the updating step.
22. An image processing system, the system comprising:
a first image processing device configured to determine, for a person to be tracked in an input video, first tracking information of the person in each video frame of the input video;
a second image processing device configured to determine, for the person to be tracked in the input video, second tracking information of the person in each video frame of the input video based on the shape of the person in each video frame of the input video and the pose of the person in each video frame of the input video, determined according to any one of claims 1 to 16; and
a third image processing device configured to update the first tracking information determined by the first image processing device based on the second tracking information determined by the second image processing device in a case where the first tracking information is different from the second tracking information for the person to be tracked in the input video.
CN201710249742.1A 2017-04-17 2017-04-17 Object shape tracking device and method, and image processing system Active CN108734735B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710249742.1A CN108734735B (en) 2017-04-17 2017-04-17 Object shape tracking device and method, and image processing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710249742.1A CN108734735B (en) 2017-04-17 2017-04-17 Object shape tracking device and method, and image processing system

Publications (2)

Publication Number Publication Date
CN108734735A CN108734735A (en) 2018-11-02
CN108734735B true CN108734735B (en) 2022-05-31

Family

ID=63924951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710249742.1A Active CN108734735B (en) 2017-04-17 2017-04-17 Object shape tracking device and method, and image processing system

Country Status (1)

Country Link
CN (1) CN108734735B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564014B (en) * 2017-04-17 2022-08-09 佳能株式会社 Object shape tracking device and method, and image processing system
CN110309735A (en) * 2019-06-14 2019-10-08 平安科技(深圳)有限公司 Exception detecting method, device, server and storage medium
CN112330714B (en) * 2020-09-29 2024-01-09 深圳大学 Pedestrian tracking method and device, electronic equipment and storage medium
CN112907652B (en) * 2021-01-25 2024-02-02 脸萌有限公司 Camera pose acquisition method, video processing method, display device, and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101763636A (en) * 2009-09-23 2010-06-30 中国科学院自动化研究所 Method for tracing position and pose of 3D human face in video sequence
CN102640185A (en) * 2009-10-20 2012-08-15 全浸公司 Method, computer program, and device for hybrid tracking of real-time representations of objects in image sequence
CN103310204A (en) * 2013-06-28 2013-09-18 中国科学院自动化研究所 Feature and model mutual matching face tracking method based on increment principal component analysis
CN103530900A (en) * 2012-07-05 2014-01-22 北京三星通信技术研究有限公司 Three-dimensional face model modeling method, face tracking method and equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8373654B2 (en) * 2010-04-29 2013-02-12 Acer Incorporated Image based motion gesture recognition method and system thereof

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101763636A (en) * 2009-09-23 2010-06-30 中国科学院自动化研究所 Method for tracing position and pose of 3D human face in video sequence
CN102640185A (en) * 2009-10-20 2012-08-15 全浸公司 Method, computer program, and device for hybrid tracking of real-time representations of objects in image sequence
CN103530900A (en) * 2012-07-05 2014-01-22 北京三星通信技术研究有限公司 Three-dimensional face model modeling method, face tracking method and equipment
CN103310204A (en) * 2013-06-28 2013-09-18 中国科学院自动化研究所 Feature and model mutual matching face tracking method based on increment principal component analysis

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
3-D Hand Posture Recognition by Training Contour Variation;Akihiro IMAI 等;《Sixth IEEE International Conference on Automatic Face and Gesture Recognition, 2004. Proceedings.》;20040607;1-6 *
Automatic Facial Makeup Detection with Application in Face Recognition;Cunjian Chen 等;《2013 International Conference on Biometrics (ICB)》;20130930;1-8 *
Posture and Gesture Recognition for Human-Computer Interaction;Mahmoud Elmezain 等;《Advanced Technologies》;20091031;415-440 *
Human action recognition based on 3D spatio-temporal histogram features; Cao Lin, Zhu Guogang; Computer Engineering and Design; 20160430; vol. 37, no. 4; 1011-1016, 1041 *
Facial feature point localization algorithm based on mouth state constraints; Shi Yating et al.; CAAI Transactions on Intelligent Systems; 20161031; vol. 11, no. 5; 578-585 *
Pose estimation of multi-pose face photographs; Zhang Xiaoping et al.; Computer Simulation; 20050430; vol. 22, no. 4; 202-205 *

Also Published As

Publication number Publication date
CN108734735A (en) 2018-11-02

Similar Documents

Publication Publication Date Title
Zhou et al. Deep continuous conditional random fields with asymmetric inter-object constraints for online multi-object tracking
CN108734735B (en) Object shape tracking device and method, and image processing system
US10572072B2 (en) Depth-based touch detection
US9811721B2 (en) Three-dimensional hand tracking using depth sequences
Kang et al. Development of head detection and tracking systems for visual surveillance
Dong et al. Adaptive cascade deep convolutional neural networks for face alignment
CN112949512B (en) Dynamic gesture recognition method, gesture interaction method and interaction system
US20150310264A1 (en) Dynamic Gesture Recognition Using Features Extracted from Multiple Intervals
Zhou et al. Adaptive fusion of particle filtering and spatio-temporal motion energy for human tracking
Yang et al. Online multi-object tracking combining optical flow and compressive tracking in Markov decision process
Mohd Asaari et al. Adaptive Kalman Filter Incorporated Eigenhand (AKFIE) for real-time hand tracking system
WO2004042539A9 (en) Video-based face recognition using probabilistic appearance manifolds
Wang et al. Multi-Target Video Tracking Based on Improved Data Association and Mixed Kalman/$H_{\infty}$ Filtering
Camplani et al. Multi-sensor background subtraction by fusing multiple region-based probabilistic classifiers
Saikia et al. Head gesture recognition using optical flow based classification with reinforcement of GMM based background subtraction
CN117581275A (en) Eye gaze classification
CN109753859B (en) Device and method for detecting human body component in image and image processing system
CN107665495B (en) Object tracking method and object tracking device
Shiravandi et al. Hand gestures recognition using dynamic Bayesian networks
Zhu et al. An adaptive superpixel based hand gesture tracking and recognition system
Tejero-de-Pablos et al. Flexible human action recognition in depth video sequences using masked joint trajectories
Keskin et al. STARS: Sign tracking and recognition system using input–output HMMs
Oikonomopoulos et al. Trajectory-based representation of human actions
JP2007510994A (en) Object tracking in video images
WO2013128839A1 (en) Image recognition system, image recognition method and computer program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant