CN108564014B - Object shape tracking device and method, and image processing system - Google Patents

Object shape tracking device and method, and image processing system

Info

Publication number
CN108564014B
Authority
CN
China
Prior art keywords
shape
video frame
occlusion
information
video
Prior art date
Legal status
Expired - Fee Related
Application number
CN201810288618.0A
Other languages
Chinese (zh)
Other versions
CN108564014A (en)
Inventor
陈存建
黄耀海
赵东悦
金浩
Current Assignee
Canon Inc
Original Assignee
Canon Inc
Priority date
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Publication of CN108564014A
Application granted
Publication of CN108564014B

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract


The invention discloses an object shape tracking device and method, and an image processing system. The apparatus for tracking the shape of an object includes: a unit configured to determine an object shape in a current video frame based on an object shape in at least one previous video frame; a unit configured to determine occlusion information of the determined object shape based on occlusion information of the object shape in the at least one previous video frame; a unit configured to update the determined object shape based on the determined occlusion information; and a unit configured to update the determined occlusion information based on the updated object shape. According to the present invention, in the process of tracking the shape of an object in a video, when the object is occluded by other objects, the accuracy of the object shape and the accuracy of object tracking will be improved.


Description

Object shape tracking device and method, and image processing system
Technical Field
The present invention relates to image processing, and more particularly, to an apparatus and method for tracking the shape of an object and an image processing system.
Background
When tracking objects, particularly object shapes such as a human face or human body joints, in a video, the initial object shape in a video frame (e.g., the current video frame) is typically initialized using the object shape determined in a previous video frame, so that the object shape in that frame can be obtained more accurately. The final object shape in the frame is then determined based on the initialized initial shape.
An exemplary technique is disclosed in "Facial Shape Tracking via Spatio-Temporal Cascade Shape Regression" (J. Yang, J. Deng, K. Zhang, and Q. Liu, The IEEE International Conference on Computer Vision (ICCV) Workshops, 2015, pp. 41-49). This exemplary technique mainly discloses the following process: for a current video frame of a video, firstly, the object shape determined in the previous video frame is regarded as the object initial shape in the current video frame; then, a shape regression method (e.g., a Cascaded Shape Regression (CSR) method) is performed on the object initial shape to determine the object final shape in the current video frame. This process is repeated until the end of the video is reached.
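For illustration only (this code is not from the cited paper), the prior-art tracking loop described above can be sketched in Python as follows; cascaded_shape_regression() is a hypothetical stand-in for any trained CSR model:

    def cascaded_shape_regression(frame, initial_shape):
        # Stand-in for a trained CSR model, which would iteratively regress
        # shape increments from shape-indexed image features of the frame.
        return initial_shape  # identity "refinement", for illustration only

    def track_shapes(frames, first_frame_shape):
        # The prior-art loop: the final shape of each frame seeds the
        # regression of the next frame as its initial shape.
        shapes = [first_frame_shape]
        for frame in frames[1:]:
            initial_shape = shapes[-1]
            shapes.append(cascaded_shape_regression(frame, initial_shape))
        return shapes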
In other words, when tracking object shapes in a video, the object shape determined in a previous video frame is passed to the subsequent video frame to determine the corresponding object initial shape. That is, the accuracy of the object shape determined in a previous video frame directly affects the accuracy of the object shape to be determined in the subsequent video frame, or even in the entire video. However, when determining the object shape in a video frame according to the above-described technique, only the object shape determined in the previous video frame is considered, and no other information is considered. Therefore, in the case where an object in a video is occluded by other objects (such as a mask, sunglasses, a scarf, a microphone, a hand, or a person), failing to account for the occlusion when determining the object shape in a video frame will make the final object shape obtained for that frame inaccurate. In other words, when tracking an object (especially, an object shape) in a video according to the above-mentioned technique, if the object is occluded by other objects, the occlusion will affect the accuracy of the object tracking result for one video frame or even the whole video.
Disclosure of Invention
Accordingly, the present disclosure is directed to solving the above-described problems in view of the above description in the background art.
According to an aspect of the present invention, there is provided an apparatus for tracking a shape of an object in a video, the apparatus comprising: a shape determination unit configured to determine an object shape in a current video frame of an input video based on object shapes in at least one previous video frame of the current video frame; an information determination unit configured to determine occlusion information of the object shape determined by the shape determination unit based on occlusion information of the object shape in the at least one previous video frame; a shape updating unit configured to update the shape of the object determined by the shape determining unit based on the occlusion information determined by the information determining unit; and an information updating unit configured to update the occlusion information determined by the information determining unit based on the object shape updated by the shape updating unit. Wherein, for any one video frame of the input video, the occlusion information of the object shape in the video frame represents the feature points of the object shape as occlusion feature points and non-occlusion feature points.
With the present invention, in the process of tracking an object (especially, an object shape) in a video, in the case where the object in the video is occluded by other objects (such as a mask, sunglasses, a scarf, a microphone, a hand, or a person), the accuracy of the object shape and the accuracy of object tracking will be improved.
Other characteristic features and advantages of the present invention will be apparent from the following description with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
Fig. 1A to 1B schematically show exemplary objects in a video that are occluded by other objects.
Fig. 2 is a block diagram schematically showing a hardware configuration in which a technique according to an embodiment of the present invention can be implemented.
Fig. 3 is a block diagram illustrating a configuration of an object shape tracking apparatus according to an embodiment of the present invention.
FIG. 4 schematically shows a flow diagram of object shape tracking according to an embodiment of the invention.
Fig. 5A to 5B schematically show an occlusion region detected in an object region.
Fig. 6 schematically shows a flowchart of step S420 as shown in fig. 4 according to the present invention.
Fig. 7A to 7D schematically show an example of determining occlusion information of an object shape in the t-th video frame by step S420 shown in fig. 6.
Fig. 8 illustrates an arrangement of an exemplary image processing system according to the present invention.
Fig. 9A to 9B schematically show exemplary two persons in a crowded entry scene.
Fig. 10A to 10C schematically show exemplary two persons in another crowded entry scenario.
Detailed Description
Exemplary embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that the following description is merely illustrative and exemplary in nature and is in no way intended to limit the invention, its application, or uses. The relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in the embodiments do not limit the scope of the present invention unless it is specifically stated otherwise. Additionally, techniques, methods, and apparatus known to those skilled in the art may not be discussed in detail, but are intended to be part of the present specification where appropriate.
Note that like reference numerals and letters refer to like items in the drawings, and thus, once an item is defined in a drawing, it is not necessary to discuss it in the following drawings.
As described above, when tracking an object (especially, the shape of an object) in a video according to the related art, if the object is occluded by other objects (such as a mask, sunglasses, a scarf, a microphone, a hand, or a person), the influence of the occlusion is not considered when determining the corresponding object shape in a video frame of the video. For example, fig. 1A schematically shows an exemplary human face occluded by a mask, i.e., the shape of the human face is occluded by the mask. Fig. 1B schematically shows a person occluded by another person in a crowded entry scene (a crowded and walk-through scenario, i.e., a scene in which several persons enter one entrance together along the shooting direction of the camera); that is, the shape of one person is occluded by the shape of another person. In general, when tracking an object in a video according to the prior art, an existing occlusion in the video will cause an inaccurate object shape to be output (i.e., it will affect the accuracy of the object shape) and will also cause the tracking identification (ID) of the tracked object to be lost or switched (i.e., it will affect the accuracy of the object tracking). Therefore, to reduce these effects when an occlusion occurs in the video, a person skilled in the art will usually consider how to remove the existing occlusion as much as possible during object tracking.
However, the inventors have found that in tracking the shape of objects in a video, occlusion information present in the video may also be used as a good reference when determining the shape of objects in corresponding video frames. Therefore, in tracking the shape of an object in a video, in the case where the object in the video is occluded by other objects, the present invention does not consider how to remove the existing occlusion, but considers how to use the existing occlusion to assist in the tracking of the shape of the object.
Thus, in tracking object shapes in a video, when determining the corresponding object shape in a video frame, in addition to passing the object shape determined in a previous video frame to that video frame, the present invention also passes the occlusion information determined in the previous video frame. The passed object shape is used to determine the object initial shape in the video frame, and the passed occlusion information is used to determine the occlusion information of the determined object initial shape. For any one video frame of the video, the occlusion information of the object shape in the video frame represents the feature points of the object shape as occlusion feature points and non-occlusion feature points. In addition, the feature points of the object shape are also the landmark points of the object shape; the feature points are, for example, human face feature points or human body joint feature points.
To determine the object final shape in one video frame of a video, the positions of the occluded part (i.e., the occlusion feature points) and of the non-occluded part (i.e., the non-occlusion feature points) of the object initial shape may be updated using different methods, based on the corresponding occlusion information of the object initial shape in the video frame. Furthermore, after the object final shape in the video frame is determined, the corresponding occlusion information of the object initial shape is updated based on that final shape, so that the occlusion information passed to subsequent video frames is more accurate. Thus, on the one hand, shielding the determination of the final positions of the non-occluded part from the influence of the existing occlusion makes those positions more accurate. On the other hand, using the more accurate positions of the non-occluded part to determine the final positions of the occluded part minimizes the impact of the existing occlusion on the accuracy of those positions. Therefore, according to the invention, in the process of tracking object shapes in a video, in the case where the object is occluded by other objects, the accuracy of the object shape and of the object tracking result for one video frame, or even the whole video, can be improved.
(hardware configuration)
A hardware configuration that can implement the technology described hereinafter will be described first with reference to fig. 2.
Hardware configuration 200 includes, for example, a Central Processing Unit (CPU) 210, a Random Access Memory (RAM) 220, a Read Only Memory (ROM) 230, a hard disk 240, an input device 250, an output device 260, a network interface 270, and a system bus 280. Further, the hardware configuration 200 may be implemented by a device such as a camera, a Personal Digital Assistant (PDA), a mobile phone, a tablet computer, a notebook computer, a desktop computer, or other suitable electronic devices.
In a first implementation, the process of tracking object shapes in video according to the present invention is configured by hardware or firmware and used as a module or component of hardware configuration 200. For example, the apparatus 300, which will be described in detail below with reference to FIG. 3, serves as a module or component of the hardware configuration 200. In a second implementation, the process of tracking object shapes in video according to the present invention is configured by software stored in ROM 230 or hard disk 240 and executed by CPU 210. For example, a process 400, which will be described in detail below with reference to fig. 4, is used as a program stored in the ROM 230 or the hard disk 240.
The CPU 210 is any suitable programmable control device (such as a processor) and can perform various functions to be described hereinafter by executing various application programs stored in the ROM 230 or the hard disk 240. The RAM 220 is used to temporarily store programs or data loaded from the ROM 230 or the hard disk 240, and is also used as the space in which the CPU 210 performs various processes (such as implementing the technique which will be described in detail below with reference to figs. 4 and 6) and other available functions. The hard disk 240 stores a variety of information, such as an Operating System (OS), various applications, control programs, data pre-stored or predefined by the manufacturer, and models and/or classifiers pre-stored or pre-generated by the manufacturer.
In one implementation, input device 250 is used to allow a user to interact with hardware configuration 200. In one example, a user may input images/video/data through input device 250. In another example, a user may trigger a corresponding process of the present invention through input device 250. Further, the input device 250 may take various forms, such as a button, a keyboard, or a touch screen. In another implementation, the input device 250 is used to receive images/video output from specialized electronic devices such as digital cameras, video cameras, and/or web cameras.
In one implementation, the output device 260 is used to display the object tracking results (such as a bounding box of the tracked object, a shape of the tracked object, an occlusion relationship between two tracked objects, etc.) to the user. Also, the output device 260 may take various forms, such as a Cathode Ray Tube (CRT) or a liquid crystal display. In another implementation, the output device 260 is used to output the object tracking results to subsequent processes of video/image analysis and recognition, such as face analysis, portrait retrieval, expression recognition, face recognition, facial attribute recognition, and the like.
Network interface 270 provides an interface for connecting hardware configuration 200 to a network. For example, hardware configuration 200 may perform data communication, via network interface 270, with other electronic devices connected through a network. Optionally, hardware configuration 200 may be provided with a wireless interface for wireless data communication. The system bus 280 may provide a data transmission path for mutually transmitting data among the CPU 210, the RAM 220, the ROM 230, the hard disk 240, the input device 250, the output device 260, the network interface 270, and the like. Although referred to as a bus, the system bus 280 is not limited to any particular data transfer technique.
The hardware configuration 200 described above is merely illustrative and is in no way intended to limit the present invention, its applications, or uses. Also, for simplicity, only one hardware configuration is shown in FIG. 2. However, multiple hardware configurations may be used as desired.
(object shape tracking apparatus and method)
Next, a process of tracking the shape of an object in a video according to the present invention will be described with reference to fig. 3 to 7D.
Fig. 3 is a block diagram illustrating a configuration of an apparatus 300 according to an embodiment of the present invention. Wherein some or all of the modules shown in figure 3 may be implemented by dedicated hardware. As shown in fig. 3, the apparatus 300 includes a shape determining unit 310, an information determining unit 320, a shape updating unit 330, and an information updating unit 340.
First, the input device 250 shown in fig. 2 receives video output from a special electronic device (e.g., a camera) or input by a user. The input device 250 then transmits the received video to the apparatus 300 via the system bus 280.
Then, as shown in fig. 3, for an object in a current video frame (such as the t-th video frame) of the received video (i.e., the input video), the shape determination unit 310 determines an object shape in the current video frame based on an object shape in at least one previous video frame of the current video frame, where t is a natural number and 2 ≦ t ≦ T, with T being the total number of video frames of the input video. In other words, the shape determining unit 310 determines an object initial shape in the current video frame based on at least one object shape delivered from the previous video frame. The shape of the object to be tracked is, for example, a human face shape or a human body joint shape.
The information determining unit 320 determines occlusion information of the object shape determined by the shape determining unit 310 based on the occlusion information of the object shape in at least one previous video frame. In other words, the information determining unit 320 determines occlusion information of the initial shape of the object in the current video frame based on the occlusion information transferred from the previous video frame.
The shape updating unit 330 updates the shape of the object determined by the shape determining unit 310 based on the occlusion information determined by the information determining unit 320. In other words, the shape updating unit 330 determines the object final shape in the current video frame by updating the object initial shape in the current video frame based on the occlusion information of the object initial shape.
The information updating unit 340 updates the occlusion information determined by the information determining unit 320 based on the object shape updated by the shape updating unit 330. In other words, the information updating unit 340 updates the occlusion information of the object initial shape based on the object final shape in the current video frame.
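As a minimal, illustrative sketch of the apparatus 300 (not the actual implementation), the four units and their per-frame calling order can be written in Python as follows; each placeholder method body corresponds to one of the options described later in this description:

    class ObjectShapeTracker:
        # Illustrative skeleton of apparatus 300: one method per unit,
        # wired together for each frame in track_frame().

        def determine_shape(self, prev_shapes):
            # Shape determination unit 310: here, simply reuse the shape
            # from the closest previous video frame as the initial shape.
            return prev_shapes[-1].copy()

        def determine_occlusion(self, prev_occlusions):
            # Information determination unit 320: here, simply reuse the
            # occlusion information from the closest previous video frame.
            return prev_occlusions[-1].copy()

        def update_shape(self, frame, initial_shape, occlusion):
            # Shape updating unit 330: refine non-occluded feature points
            # with a shape detector and infer occluded ones from geometry
            # (see step S423 below); placeholder here.
            return initial_shape

        def update_occlusion(self, frame, final_shape, occlusion):
            # Information updating unit 340: re-classify each feature point
            # of the final shape as occluded or not (see step S424 below);
            # placeholder here.
            return occlusion

        def track_frame(self, frame, prev_shapes, prev_occlusions):
            shape = self.determine_shape(prev_shapes)
            occlusion = self.determine_occlusion(prev_occlusions)
            shape = self.update_shape(frame, shape, occlusion)
            occlusion = self.update_occlusion(frame, shape, occlusion)
            return shape, occlusion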
That is, for the t-th video frame of the input video (where t ≧ 2), the apparatus 300 will determine the corresponding shape information and the corresponding occlusion information in the t-th video frame using the shape information and the occlusion information determined from the previous video frame of the t-th video frame. Furthermore, in order to trigger the object shape tracking process and to determine the corresponding shape information and the corresponding occlusion information in the first video frame (i.e. the 1 st video frame) of the input video, the apparatus 300 further comprises a detection unit 350.
As shown in fig. 3, for the 1 st video frame of the input video, the detection unit 350 detects a corresponding object shape in the 1 st video frame, and detects corresponding occlusion information of the object shape detected in the 1 st video frame. Then, taking the 2 nd video frame of the input video as an example, the shape determining unit 310 and the information determining unit 320 perform corresponding operations based on the shape information and the occlusion information detected from the 1 st video frame.
As described above, for one input video, the detection unit 350 will detect the corresponding shape information and the corresponding occlusion information only from the 1 st video frame of the input video. In addition, in order to prevent accumulation of tracking errors due to positional deviation of the object shape in the entire input video and to improve accuracy of object shape tracking, a number of video frame sequences may be first acquired from the entire input video. Then, for the 1 st video frame of each video frame sequence, the detection unit 350 will perform the corresponding operation. For the tth video frame of each video frame sequence (where t ≧ 2), the shape determining unit 310, the information determining unit 320, the shape updating unit 330, and the information updating unit 340 will perform corresponding operations. Additionally, in one example, detection unit 350 will be used to obtain a corresponding sequence of video frames from the entire input video. In another example, other units (e.g., a sequence acquisition unit not shown in fig. 3) may also be used to acquire the corresponding sequence of video frames.
The flow chart 400 shown in fig. 4 is a corresponding process of the apparatus 300 shown in fig. 3.
As shown in fig. 4, for one input video, in the detection step S410, the detection unit 350 detects a corresponding object shape in the 1 st video frame of the input video, and detects corresponding occlusion information of the detected object shape in the 1 st video frame. As described above, as an alternative solution, the detection unit 350 detects a corresponding object shape and corresponding occlusion information from the 1 st video frame of one video frame sequence acquired from the input video. In one implementation, the detection unit 350 detects the corresponding shape information and the corresponding occlusion information in the 1 st video frame by the following process.
In one aspect, the detection unit 350 performs a shape detection method (e.g., a cascaded regression method) on the 1 st video frame to detect a corresponding object shape in the 1 st video frame, so that a corresponding position of a feature point of the object shape in the 1 st video frame can be obtained. For example, in the case where the object to be tracked is a human face, the feature points are human face feature points; and in the case where the object to be tracked is a human joint, the feature points are human joint feature points.
On the other hand, the detection unit 350 performs an occlusion detection method on the 1 st video frame to detect corresponding occlusion information of the detected object shape. In one example, the occlusion detection method is a template-based matching method. The template for matching operation includes, for example, a mask template, a scarf template, a sunglasses template, and the like. In another example, the occlusion detection method is a model-based object detection and classification method. Wherein a model for the detection and classification operation is generated, for example, using a deep learning method based on occlusion samples, and the model is used, for example, to detect the location of occlusion regions in the video frame and to identify the class of occlusion in the video frame.
In one implementation, the detection unit 350 detects occlusion information of the detected object shape in the 1 st video frame by detecting an occlusion region (e.g., a mask region 520 shown in fig. 5A) in an object region (e.g., a rectangular region 510 shown in fig. 5A). Wherein, in one example, the object region may be estimated based on object tracking results (e.g., object shape) obtained from previous video frames. In another example, the object region may be detected in the corresponding video frame by using an existing detection method. Further, the feature points of the detected object shape located inside the occlusion region are regarded as occlusion feature points, and the feature points of the detected object shape located outside the occlusion region are regarded as non-occlusion feature points. In other words, for any one video frame of the input video, the occlusion information of the object shape in the video frame represents the feature points of the object shape as occlusion feature points and non-occlusion feature points. That is, the occlusion information indicates the occlusion state of each feature point of the object shape.
In addition, occlusion information of an object shape in one video frame is represented using a binary representation or a probability representation. Here, the binary representation means that the occlusion state of each occlusion feature point is represented as "1", and the occlusion state of each non-occlusion feature point is represented as "0". The probability representation means that the occlusion state of each feature point is described using a probability value. For example, in the case where the probability value of a feature point is greater than or equal to a predetermined threshold (e.g., TH1), the feature point will be regarded as an occlusion feature point.
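A small sketch of these two representations, under the simplifying assumptions that the occlusion region is an axis-aligned rectangle (x, y, width, height) and that TH1 = 0.5 (both hypothetical values chosen for illustration):

    import numpy as np

    def occlusion_flags(shape_points, occlusion_region):
        # Binary representation: flag a feature point as occluded ("1") when
        # it lies inside the detected occlusion region, modeled here as an
        # axis-aligned rectangle (x, y, w, h); real regions need not be boxes.
        x, y, w, h = occlusion_region
        inside_x = (shape_points[:, 0] >= x) & (shape_points[:, 0] < x + w)
        inside_y = (shape_points[:, 1] >= y) & (shape_points[:, 1] < y + h)
        return (inside_x & inside_y).astype(int)

    def binarize(occlusion_probs, th1=0.5):
        # Probability representation -> binary representation: a feature
        # point whose probability reaches TH1 is treated as occluded.
        return (np.asarray(occlusion_probs) >= th1).astype(int)

    points = np.array([[30.0, 40.0], [120.0, 150.0]])   # two feature points
    print(occlusion_flags(points, (100, 130, 80, 60)))  # -> [0 1]
    print(binarize([0.9, 0.2]))                         # -> [1 0]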
Further, in order to obtain a more accurate occlusion region in the object region so that more accurate occlusion information can be obtained, the detection unit 350 performs an image segmentation method on the detected occlusion region (e.g., the mask region 520 shown in fig. 5A) to obtain a more accurate occlusion region (e.g., the mask region 530 shown in fig. 5B). In one implementation, taking mask region 520 shown in fig. 5A as an example, the image segmentation method is implemented by performing a Convolutional Neural Network (CNN) algorithm on each pixel within mask region 520. By comparing the mask region 520 shown in fig. 5A with the updated mask region 530 shown in fig. 5B, it can be seen that by updating the mask region 520, the occlusion state of the feature points around the nose region will be updated from the occluded feature points to the non-occluded feature points.
Returning to FIG. 4, in step S420, for the t-th video frame of the input video (where t ≧ 2), the apparatus 300 shown in FIG. 3 determines the corresponding object shape and the corresponding occlusion information in the t-th video frame. In one implementation, the apparatus 300 determines the correspondence information with reference to fig. 6.
Then, after the apparatus 300 determines the corresponding object shape and the corresponding occlusion information in the t-th video frame, the apparatus 300 determines whether t is greater than T in step S430. In case t is greater than T (meaning that the entire input video has been processed), the corresponding process of the apparatus 300 will stop. Otherwise, in step S440, the apparatus 300 sets t = t + 1 and repeats the corresponding process of step S420.
Fig. 6 schematically shows a flowchart of step S420 as shown in fig. 4 according to the present invention. As shown in fig. 6, in the shape determining step S421, the shape determining unit 310 shown in fig. 3 determines an object shape (i.e., an object initial shape) in the t-th video frame based on the object shape in at least one previous video frame of the t-th video frame.
In one implementation, the shape determination unit 310 directly considers the object shape determined in one previous video frame closest to the tth video frame as the object initial shape in the tth video frame. In another implementation, the shape determination unit 310 determines the initial shape of the object in the tth video frame by calculating an average or weighted sum of the shapes of the object determined in a plurality of previous video frames of the tth video frame.
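Both options can be sketched as follows, where each shape is an array of feature-point coordinates, oldest frame first; whether the weights are normalized is an assumption made here:

    import numpy as np

    def initial_shape(prev_shapes, weights=None):
        # prev_shapes: list of (num_points, 2) arrays, oldest first.
        stack = np.stack([np.asarray(s, dtype=float) for s in prev_shapes])
        if len(prev_shapes) == 1:
            return stack[0]                    # option 1: reuse closest shape
        if weights is None:
            return stack.mean(axis=0)          # option 2a: plain average
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()                        # normalization is an assumption
        return np.tensordot(w, stack, axes=1)  # option 2b: weighted sum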
Returning to fig. 6, in the information determining step S422, the information determining unit 320 determines occlusion information of an object initial shape in the t-th video frame based on the occlusion information of the object shape in at least one previous video frame.
In one implementation, the information determining unit 320 directly treats the occlusion information of the object shape determined in one previous video frame closest to the tth video frame as the occlusion information of the object initial shape in the tth video frame.
In another implementation, in order to obtain accurate occlusion information of the initial shape of the object in the tth video frame, the information determining unit 320 determines occlusion information of the initial shape of the object in the tth video frame using a statistical-based method based on occlusion information of the shape of the object determined in a plurality of previous video frames of the tth video frame.
In one example, the information determining unit 320 determines the occlusion information of the object initial shape in the t-th video frame by calculating an average or a weighted sum of the occlusion information of the object shapes determined in a plurality of previous video frames of the t-th video frame. In other words, occlusion information of the object initial shape in the t-th video frame is determined using occlusion information of the object shapes determined from the (t-n)-th video frame to the (t-1)-th video frame, where n is a natural number and 2 ≦ n < t.
Taking the calculation of the weighted sum as an example: since the occlusion information of the object shape determined in a previous video frame closer to the t-th video frame can better describe the occlusion information of the object initial shape in the t-th video frame, in order to obtain more accurate occlusion information of the object initial shape in the t-th video frame, a previous video frame closer to the t-th video frame is given a larger weight value, and a previous video frame farther from the t-th video frame is given a smaller weight value. For example, assuming that n is 6, a weight value of 0.8 may be assigned to the (t-1)-th to (t-3)-th video frames, and a weight value of 0.2 may be assigned to the (t-4)-th to (t-6)-th video frames. Those skilled in the art will appreciate that the foregoing examples are illustrative only, and not limiting. After each previous video frame is assigned a corresponding weight value, the corresponding weighted sum is calculated. In the case where the occlusion information of the object shape in a video frame is represented using the binary representation, as described above, the occlusion state of each occlusion feature point is represented as "1", and the occlusion state of each non-occlusion feature point is represented as "0". Therefore, the occlusion state of a feature point in the t-th video frame whose weighted sum is greater than or equal to a predetermined threshold (e.g., TH2) is represented as "1", and the occlusion state of a feature point in the t-th video frame whose weighted sum is less than the predetermined threshold (e.g., TH2) is represented as "0". In the case where occlusion information of an object shape in a video frame is represented using the probability representation, as described above, the occlusion state of each feature point will be described using a corresponding probability value. Thus, the corresponding weighted sum will be used to represent the corresponding probability values of the feature points in the t-th video frame.
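A numerical sketch of this weighted-sum example, using the hypothetical weights above (0.8 and 0.2) and an assumed threshold TH2 = 0.5; normalizing the weighted sum by the sum of the weights is an assumption made here for illustration:

    import numpy as np

    # Binary occlusion vectors of the object shape in frames (t-6)..(t-1),
    # oldest first; one column per feature point.
    occ_hist = np.array([
        [1, 0, 0],   # (t-6)-th frame
        [1, 0, 0],   # (t-5)-th frame
        [1, 1, 0],   # (t-4)-th frame
        [1, 1, 0],   # (t-3)-th frame
        [1, 1, 0],   # (t-2)-th frame
        [0, 1, 0],   # (t-1)-th frame
    ], dtype=float)

    # Weights as in the example: 0.2 for (t-6)..(t-4), 0.8 for (t-3)..(t-1).
    weights = np.array([0.2, 0.2, 0.2, 0.8, 0.8, 0.8])
    weighted = weights @ occ_hist / weights.sum()   # normalized weighted sum

    TH2 = 0.5                                       # hypothetical threshold
    binary = (weighted >= TH2).astype(int)
    print(weighted.round(2), binary)                # -> [0.73 0.87 0.] [1 1 0]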
In another example, in case that occlusion information of an object shape in a video frame is represented using a probability representation, in order to obtain more accurate occlusion information of an object initial shape in a tth video frame, the information determining unit 320 determines occlusion information of the object initial shape in the tth video frame by performing a machine learning method (e.g., Hidden Markov Model (HMM)) on the occlusion information of the object shape determined in a plurality of previous video frames of the tth video frame. Then, the corresponding probability values of the feature points in the t-th video frame will be represented using the corresponding values obtained based on the machine learning method.
In another implementation, to reduce the amount of computation, the information determining unit 320 determines occlusion information of an object initial shape in the tth video frame based on the stability of the occlusion information of the object shape determined in a plurality of previous video frames of the tth video frame. More specifically, in the case where the occlusion information of the object shape in a plurality of previous video frames is stable, the information determining unit 320 regards the occlusion information of the object shape in any one of the previous video frames as the occlusion information of the object initial shape in the t-th video frame. In other words, in the case where the occlusion information of the object shape in a plurality of previous video frames is stable (meaning that the occlusion occurring in the input video is a synchronous occlusion), the occlusion information of the object shape in each of the previous video frames is the same. Therefore, instead of performing the above-described statistics-based method, occlusion information of an object shape in any one of the previous video frames may be used as occlusion information of an object initial shape in the t-th video frame.
Further, on the one hand, the determination operation may be performed by the information determining unit 320 or a dedicated unit (not shown in fig. 3) to determine whether the occlusion information of the object shape in a plurality of previous video frames is stable.
On the other hand, in one example, it is determined whether occlusion information for an object shape in a plurality of previous video frames is stable based on an empirical setting. For example, in case an object to be tracked in an input video is occluded by a non-moving object (e.g. mask, sunglasses, scarf), the occlusion information of the object shape in a number of previous video frames will be considered stable.
In another example, to reduce the amount of computation and obtain more accurate occlusion information of the object initial shape in the t-th video frame, whether the occlusion information of the object shape in a plurality of previous video frames is stable is determined based on the change frequency of the occlusion information of each feature point of the object shape between those previous video frames. In case the change frequency of the occlusion information of a feature point is less than a predetermined threshold (e.g., TH3), the occlusion information of that feature point between the previous video frames will be considered stable. Further, in the case where the occlusion information of all the feature points between the plurality of previous video frames is stable, the occlusion information of the object shape in the plurality of previous video frames will be regarded as stable.
More specifically, for a feature point of an object shape between a plurality of previous video frames, the change frequency of the occlusion information of the feature point is obtained by calculating the number of changes of the edit distance of the feature point between every two adjacent previous video frames. For example, for a feature point of an object shape from the (t-n) th video frame to the (t-1) th video frame, (n-1) edit distances of the feature point between every two adjacent previous video frames will be calculated first. Then, the number of changes of these (n-1) edit distances will be calculated, and the change frequency of the occlusion information of this feature point will be calculated by, for example, the following formula:
change frequency = (number of changes among the (n-1) edit distances) / (n-1)
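A sketch of this stability test, under one plausible reading of the formula above and an assumed threshold TH3 = 0.3 (hypothetical); with binary occlusion states, the edit distance between two adjacent frames reduces to an absolute difference:

    import numpy as np

    def change_frequency(occlusion_series):
        # Occlusion states of ONE feature point over frames (t-n)..(t-1),
        # oldest first. The number of changes is taken as the number of
        # non-zero adjacent-frame differences, divided by (n-1).
        states = np.asarray(occlusion_series)
        changes = np.abs(np.diff(states))
        return changes.sum() / len(changes)

    TH3 = 0.3                                   # hypothetical threshold
    series = [1, 1, 1, 0, 0, 0, 0]              # one occlusion change
    freq = change_frequency(series)             # -> 1/6, approx 0.167
    print(freq, freq < TH3)                     # considered stable: True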
returning to fig. 6, after determining the object initial shape in the t-th video frame and the occlusion information of the object initial shape in the t-th video frame, the shape updating unit 330 updates the object initial shape in the t-th video frame based on the occlusion information of the object initial shape in the t-th video frame in the shape updating step S423. Based on occlusion information of the initial shape of the object in the tth video frame, it can be determined which feature points of the initial shape of the object in the tth video frame are occlusion feature points and which feature points of the initial shape of the object in the tth video frame are non-occlusion feature points.
Accordingly, the shape updating unit 330 updates the position of the non-occlusion feature point by using a shape detection method (such as a CSR method and a deep learning-based shape detection method) with respect to the non-occlusion feature point of the object initial shape in the t-th video frame. That is, the final position of the non-occluded feature points will be determined using a shape detection method.
In view of the stability of the geometric relationship between the non-occlusion feature point and the occlusion feature point with respect to the object region, for the occlusion feature point of the object initial shape in the t-th video frame, the shape updating unit 330 updates the position of the occlusion feature point based on the final position of the non-occlusion feature point and the geometric relationship between the non-occlusion feature point and the occlusion feature point with respect to the object region. That is, the final position of the occluding feature point will be determined based on the final position of the non-occluding feature point and the particular geometric relationship.
For example, in case the object to be tracked is a human face, the corresponding geometrical relationship may comprise the following relationship:
relation 1: the distance between the centers of the two eyes is about one third of the width of the face area; and/or
Relation 2: the distance between the center of the mouth and the center of the left eye, the distance between the center of the mouth and the center of the right eye, and the distance between the center of the left eye and the center of the right eye are substantially the same; and/or
Relation 3: the distance between the center of the nose and the center of the mouth is approximately one-quarter of the height of the face region.
In addition, take a mouth occluded by a mask as an example: at least the feature points around the left-eye region and the right-eye region are non-occlusion feature points, and at least the feature points around the mouth region are occlusion feature points. Therefore, after determining the final positions of the feature points around the left-eye and right-eye regions, the shape updating unit 330 may determine the final positions of the feature points around the mouth region based on relation 2 and the final positions of the feature points around the left-eye and right-eye regions.
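For instance, a sketch under the stated relations (not the patent's exact computation): reading relation 2 as saying that the two eye centers and the mouth center form a roughly equilateral triangle gives a simple way to place an occluded mouth center from the two non-occluded eye centers:

    import numpy as np

    def infer_mouth_center(left_eye, right_eye):
        # Relation 2 read geometrically: the three pairwise distances are
        # roughly equal, so the mouth center sits at the apex of a near-
        # equilateral triangle, below the eye line (image y grows downward).
        left_eye, right_eye = map(np.asarray, (left_eye, right_eye))
        mid = (left_eye + right_eye) / 2.0
        d = np.linalg.norm(right_eye - left_eye)   # inter-eye distance
        eye_dir = (right_eye - left_eye) / d
        down = np.array([-eye_dir[1], eye_dir[0]]) # perpendicular, image-down
        return mid + down * (np.sqrt(3.0) / 2.0) * d

    print(infer_mouth_center((40.0, 50.0), (100.0, 50.0)))  # approx (70, 102)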
Returning to fig. 6, after determining the final position of the non-occlusion feature point and the final position of the occlusion feature point, that is, after determining the final shape of the object in the t-th video frame, in order to transfer more accurate occlusion information of the object shape to the subsequent video frame, in the information updating step S424, the information updating unit 340 updates the occlusion information of the initial shape of the object in the t-th video frame based on the final shape of the object in the t-th video frame. The updated occlusion information will then be considered as corresponding occlusion information for the object shape in the tth video frame.
In one implementation, the information updating unit 340 updates the occlusion information of the initial shape of the object in the t-th video frame by determining the occlusion information of each feature point of the final shape of the object in the t-th video frame based on a pre-generated occlusion classifier or other occlusion determination method. In one implementation, the pre-generated occlusion classifier is a binary classifier generated from positive and negative examples using a learning method such as a Support Vector Machine (SVM) algorithm, an Adaboost algorithm, or the like. Wherein the positive samples are generated by sampling the corresponding image around the occluded feature point, and the negative samples are generated by sampling the corresponding image around the non-occluded feature point.
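A minimal training sketch for such a binary occlusion classifier, assuming scikit-learn's SVC is available and that raw gray-value patches serve as the sample features (both assumptions made for illustration):

    import numpy as np
    from sklearn.svm import SVC

    def patch_feature(frame, point, size=16):
        # Crop a square patch around a feature point and flatten it; a
        # stand-in for whatever local descriptor is actually sampled
        # around occluded (positive) and non-occluded (negative) points.
        x, y = int(point[0]), int(point[1])
        half = size // 2
        patch = frame[max(y - half, 0):y + half, max(x - half, 0):x + half]
        return np.resize(patch.astype(float), (size * size,))

    def train_occlusion_classifier(positive_feats, negative_feats):
        # Positive samples: patches around occluded feature points;
        # negative samples: patches around non-occluded feature points.
        X = np.vstack([positive_feats, negative_feats])
        y = np.hstack([np.ones(len(positive_feats)),
                       np.zeros(len(negative_feats))])
        return SVC(kernel="linear").fit(X, y)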
Fig. 7A to 7D schematically show an example of determining occlusion information of an object shape in the t-th video frame by step S420 shown in fig. 6. Fig. 7A shows the corresponding occlusion information for object shapes from the 0th video frame to the (t-1)-th video frame. The horizontal direction represents the number of the video frame, the vertical direction represents the number of the feature point of the object shape, the symbol "O" represents that the corresponding feature point is a non-occlusion feature point, and the symbol "X" represents that the corresponding feature point is an occlusion feature point. After performing step S422 shown in fig. 6, the corresponding occlusion information of the object initial shape in the t-th video frame is as shown in fig. 7B. Fig. 7C shows the corresponding occlusion information of the object shape in the t-th video frame after performing step S424. Fig. 7D shows the corresponding occlusion information of the object shapes from the 0th video frame to the t-th video frame.
As described above, in the present invention, in one aspect, occlusion information determined from a previous video frame will be used to assist in determining the final shape of an object in a current video frame. Therefore, in determining the final shape of the object in the current video frame, different methods can be used to determine the final position of the occlusion feature point and the final position of the non-occlusion feature point, so that the influence of occlusion existing in the input video can be prevented. On the other hand, after the final shape of the object in the current video frame is determined, the final shape of the object in the current video frame is used for assisting in updating the corresponding occlusion information of the initial shape of the object in the current video frame, so that the occlusion information to be transferred to the subsequent video frame is more accurate. Therefore, according to the invention, in the process of tracking the object shape in the video, under the condition that the object in the video is shielded by other objects, the accuracy of the object shape of one video frame or even the whole video and the accuracy of the object tracking result can be improved.
(image processing System)
In crowded entrance scenarios (e.g., on the street, at a shopping center, at a supermarket, etc.), it is common for one person to be occluded by another person (as shown in fig. 1B), that is, there is often a corresponding occlusion between the shapes of the people. Therefore, in tracking a particular person in a video, occlusion caused by others in the video will typically affect the accuracy of person tracking. For example, occlusion by others in the video will typically result in the loss of the person being tracked, or in switching the tracking ID of the person being tracked. Wherein switching the tracking ID of the person being tracked includes, for example, assigning a new tracking ID to the person being tracked or exchanging the tracking IDs of the two persons being tracked.
The inventors have found that in tracking the shape of a particular person in a video, as long as the person is occluded only by non-moving objects (e.g., a mask, sunglasses, or a scarf) and is not occluded by any other person, the occlusion information of that person remains unchanged across all video frames of the video. In case the particular person crosses another person over a certain period of time, the occlusion information of the occluded person will change between the corresponding video frames, while the occlusion information of the non-occluded person will remain unchanged. Therefore, the inventors considered that the occlusion information of a person between video frames can be used to assist in tracking the shape of that person, so that the influence of occlusion by other people in the video can be prevented and the accuracy of person tracking can be improved.
As an exemplary application of the above-described process with reference to fig. 3 to 7D, next, an exemplary image processing system will be described with reference to fig. 8. As shown in fig. 8, the image processing system 800 includes the apparatus 300 (i.e., a first image processing apparatus), a second image processing apparatus 810, and a third image processing apparatus 820. In one implementation, the device 300, the second image processing device 810, and the third image processing device 820 are connected to each other via a system bus. In another implementation, the apparatus 300, the second image processing apparatus 810, and the third image processing apparatus 820 are connected to each other via a network. In addition, the apparatus 300, the second image processing apparatus 810, and the third image processing apparatus 820 may be implemented via the same electronic device (e.g., a computer, a PDA, a mobile phone, a camera). Alternatively, the apparatus 300, the second image processing apparatus 810, and the third image processing apparatus 820 may also be implemented via different electronic devices.
As shown in fig. 8, first, the apparatus 300 and the second image processing apparatus 810 receive video output from a dedicated electronic device (e.g., a camera) or input by a user.
Then, for any two persons in the input video, the apparatus 300 determines the shape of each person in each video frame of the input video and the occlusion information of the shape of each person in each video frame of the input video with reference to fig. 3 to 7D.
Further, the second image processing device 810 determines, for any two persons in the input video, tracking information of the shape of each person in each video frame of the input video. In one implementation, the second image processing device 810 performs a general tracking method, for example, on each video frame of the input video to determine corresponding tracking information. The tracking information of the shape of a person in each video frame includes, for example, a tracking ID of the person, a track of each feature point of the shape of the person, and the like.
Then, for any two persons in any one video frame of the input video, the third image processing device 820 determines an occlusion relationship between the two persons based on occlusion information of the shape of each person in at least one previous video frame of the video frame and tracking information of the shape of each person in at least one previous video frame of the video frame. The occlusion relationship between two persons is, in particular, the positional relationship of an occlusion occurring between the two persons. For example, the occlusion relationship between the person a and the person B indicates that the person a is occluded by the person B, or the person B is occluded by the person a, or the person a and the person B are not occluded with each other.
To reduce the amount of computation, in one implementation, the third image processing device 820 determines the occlusion relationship between two people in any one of the video frames of the input video based on the amount of change in the non-occlusion feature points of the shape of each person between the previous video frames of the video frame and the relative position between the two people in the video frame.
More specifically, on the one hand, for two persons in a particular video frame of the input video, the third image processing device 820 determines the corresponding relative position between the two persons in that video frame based on the tracking information of the two persons determined by the second image processing device 810 in at least one video frame preceding the particular video frame, in particular based on the corresponding trajectory of each feature point of the shapes of the two persons in those preceding video frames. In one example, the relative position between the two persons is calculated as the Euclidean distance between the two persons.
On the other hand, for two persons in a particular video frame of the input video, the third image processing device 820 first determines the number of non-occlusion feature points of the shape of each person in each previous video frame based on the occlusion information of the two persons determined by the device 300 in at least one video frame preceding the particular video frame. Then, based on the determined number of non-occlusion feature points, the third image processing device 820 determines the amount of change in the non-occlusion feature points of the shape of each person between the previous video frames.
Then, the third image processing device 820 determines the occlusion relationship between the two persons in the specific video frame based on the determined relative position between the two persons and the determined amount of change in the non-occlusion feature points of the shape of each person.
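One way to sketch this decision rule; the thresholds near_dist and min_drop are hypothetical values introduced for illustration, not taken from the patent:

    import numpy as np

    def occlusion_relationship(occ_a, occ_b, pos_a, pos_b,
                               near_dist=80.0, min_drop=2):
        # occ_a / occ_b: binary occlusion vectors of person A / B over the
        # previous frames, oldest first; pos_a / pos_b: current positions.
        if np.linalg.norm(np.asarray(pos_a) - np.asarray(pos_b)) > near_dist:
            return "A and B do not occlude each other"
        # Count non-occluded feature points per frame for each person.
        visible_a = [int(len(v) - np.asarray(v).sum()) for v in occ_a]
        visible_b = [int(len(v) - np.asarray(v).sum()) for v in occ_b]
        # The occluded person is the one whose visible-point count dropped.
        if visible_b[0] - visible_b[-1] >= min_drop:
            return "B is occluded by A"
        if visible_a[0] - visible_a[-1] >= min_drop:
            return "A is occluded by B"
        return "A and B do not occlude each other"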
Fig. 9A and 9B schematically illustrate two persons (e.g., person a and person B) in a crowded entrance scene. In which fig. 9A shows the relative position between person a and person B in one previous video frame, such as the (t-m) -th video frame. Fig. 9B shows the relative position between person a and person B in the current video frame (such as the t-th video frame). For person a, it can be seen that the number of non-occluded feature points of the shape of person a remains unchanged from the (t-m) th video frame to the t-th video frame. For the person B, it can be seen that the number of non-occlusion feature points of the shape of the person B gradually decreases between the video frames around the t-th video frame. Therefore, it can be determined that the person B is occluded by the person a at a period near the t-th video frame. In other words, at a time period near the t-th video frame, the occlusion relationship between the person a and the person B is that the person B is occluded by the person a.
Further, as described above, occlusion by another person in a video often results in switching the tracking ID of the person being tracked, so that an erroneous tracking result is output during person tracking. In particular, for a people-counting application that counts the number of people in or passing through a specific space, if the tracking ID of a person is switched during tracking, an erroneous counting result is output. Therefore, in the process of tracking people in a video, in the case where there is a corresponding occlusion between people, the switching of tracking IDs before and after the position where a specific occlusion occurs should be reduced so that the accuracy of person tracking can be improved. To this end, for any two people in the input video, after determining the occlusion relationship between the two people, the third image processing apparatus 820 further updates the tracking information of the two people determined by the second image processing apparatus 810, based on the occlusion relationship between the two people in each video frame of the input video. For example, in a case where the third image processing apparatus 820 finds that two tracking IDs before and after the position where the specific occlusion occurs actually belong to the same person, the third image processing apparatus 820 will correct the erroneous tracking ID.
In one implementation, the third image processing device 820 determines whether two tracking IDs before and after the position where the specific occlusion occurs belong to the same person by the following operation. Taking the person D shown in fig. 10B as an example, fig. 10B shows a t-th video frame in which occlusion occurs on the person D, fig. 10A shows a (t-1) -th video frame before the t-th video frame, and fig. 10C shows a (t +1) -th video frame after the t-th video frame. Further, the occlusion relationship between the person C and the person D is that the person D is occluded by the person C from the (t-1) th video frame to the (t +1) th video frame.
For the person D in the (t-1) th video frame, the third image processing device 820 extracts a corresponding appearance feature vector from the non-occlusion feature points of the shape of the person D in the (t-1) th video frame, which are determined based on the occlusion relationship between the person C and the person D in the (t-1) th video frame. For the person D in the (t +1) th video frame, the third image processing apparatus 820 extracts a corresponding appearance feature vector from the non-occlusion feature points of the shape of the person D in the (t +1) th video frame, which are determined based on the occlusion relationship between the person C and the person D in the (t +1) th video frame.
Then, in a case where the similarity measure between the two appearance feature vectors is less than or equal to a predetermined threshold (e.g., TH4), the third image processing apparatus 820 determines that person D in the (t-1)-th video frame and person D in the (t+1)-th video frame are actually the same person. That is, the tracking ID of person D in the (t-1)-th video frame should be the same as the tracking ID of person D in the (t+1)-th video frame. In a case where these two tracking IDs are not the same, the third image processing apparatus 820 corrects the erroneous tracking ID, so that it can be ensured that person D, who is occluded by person C, has the same tracking ID both before and after the t-th video frame. Here, the similarity measure between two appearance feature vectors is obtained by, for example, calculating the distance between the two appearance feature vectors, where a smaller distance indicates a higher similarity.
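The following Python sketch illustrates this ID-repair step. It is a hedged illustration only: the Euclidean distance is one possible choice for the distance mentioned above, the value assigned to TH4 is invented for the example, and the way the appearance feature vectors are extracted from the non-occlusion feature points is left outside the sketch.

```python
import math

# TH4 is named in the description above; its value here is invented for the
# example, as is the structure of the tracking records.
TH4 = 0.5

def feature_distance(vec_a, vec_b):
    """Euclidean distance between two appearance feature vectors
    (a smaller distance indicates a higher similarity)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))

def repair_tracking_id(feat_before, feat_after, id_before, id_after, tracks):
    """Unify two tracking IDs if the appearance features extracted from the
    non-occlusion feature points before and after the occlusion match.

    tracks is assumed to map frame indices to {tracking_id: shape} records.
    Returns the tracking ID that survives the comparison.
    """
    if feature_distance(feat_before, feat_after) <= TH4:
        if id_before != id_after:
            # Same person before and after the occlusion: rewrite the later,
            # erroneous ID to the earlier one in every frame record.
            for frame_record in tracks.values():
                if id_after in frame_record:
                    frame_record[id_before] = frame_record.pop(id_after)
        return id_before
    return id_after  # genuinely different persons; keep both IDs
```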
As described above, in the present invention, the image processing system 800 shown in Fig. 8 can determine the occlusion relationship between any two persons in any video frame of an input video. Therefore, in a case where a specific occlusion exists between persons in the input video, that is, between the shapes of the persons in the input video, the image processing system 800 can correct an erroneous tracking ID based on the occlusion relationship. The present invention can therefore reduce switching of tracking IDs before and after the position where a specific occlusion occurs, and thus improve the accuracy of person tracking.
All of the units described above are exemplary and/or preferred modules for implementing the processes described in this disclosure. These units may be hardware units (such as field programmable gate arrays (FPGAs), digital signal processors, application specific integrated circuits, etc.) and/or software modules (such as computer-readable programs). The units for carrying out each step have not all been described in detail above. However, wherever there is a step that performs a specific procedure, there may be a corresponding functional module or unit (implemented by hardware and/or software) that implements that procedure. Technical solutions formed by all combinations of the described steps and the units corresponding to these steps are included in the disclosure of the present application, as long as the technical solutions they constitute are complete and applicable.
The method and apparatus of the present invention may be implemented in various ways. For example, the method and apparatus of the present invention may be implemented in software, hardware, firmware, or any combination thereof. The above-described order of the steps of the method is intended to be illustrative only, and the steps of the method of the present invention are not limited to the order specifically described above unless otherwise specifically indicated. Furthermore, in some embodiments, the present invention may also be embodied as a program recorded in a recording medium, including machine-readable instructions for implementing the method according to the present invention. Accordingly, the present invention also covers a recording medium storing a program for implementing the method according to the present invention.
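As one illustration of how the shape determining, information determining, shape updating, and information updating steps described in this disclosure could be composed in software, the Python sketch below wires them into a per-frame tracking loop. All callables are injected parameters and are assumptions of this sketch; none of the names comes from the present disclosure.

```python
def track_object_shapes(frames, detect, determine_shape,
                        determine_occ_info, update_shape, update_occ_info):
    """Per-frame loop composing the steps of the disclosed method.

    All callables are injected so the sketch stays self-contained:
    detect handles the first video frame (cf. the detecting step), and the
    remaining four stand in for the shape determining, information
    determining, shape updating, and information updating steps.
    """
    shape, occ_info = detect(frames[0])  # detection on the first frame only
    results = [(shape, occ_info)]
    for frame in frames[1:]:
        shape = determine_shape(frame, results)            # shape determining step
        occ_info = determine_occ_info(occ_info, results)   # information determining step
        shape = update_shape(frame, shape, occ_info)       # shape updating step
        occ_info = update_occ_info(frame, shape)           # information updating step
        results.append((shape, occ_info))
    return results
```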
While some specific embodiments of the present invention have been shown in detail by way of example, it should be understood by those skilled in the art that the foregoing examples are intended to be illustrative only and do not limit the scope of the invention. It will be appreciated by those skilled in the art that the above-described embodiments may be modified without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims (15)

1. An apparatus for tracking object shapes in a video, the apparatus comprising:
a shape determination unit configured to determine an object shape in a current video frame of an input video based on object shapes in at least one previous video frame of the current video frame;
an information determination unit configured to determine occlusion information of the object shape determined by the shape determination unit based on occlusion information of the object shape in the at least one previous video frame;
a shape updating unit configured to update the object shape determined by the shape determining unit based on the occlusion information determined by the information determining unit; and
an information updating unit configured to update the occlusion information determined by the information determining unit based on the object shape updated by the shape updating unit,
wherein, in a case where the occlusion information of the object shape in the previous video frames is stable, the information determination unit regards the occlusion information of the object shape in any one of the previous video frames as the occlusion information of the object shape determined by the shape determination unit.
2. The apparatus according to claim 1, wherein the information determining unit determines the occlusion information of the object shape determined by the shape determining unit using a statistics-based method, based on the occlusion information of the object shape in the previous video frames.
3. The apparatus of claim 1, wherein determining whether the occlusion information of the object shape in the previous video frame is stable is based on a frequency of change of occlusion information of each feature point of the object shape between the previous video frames.
4. The apparatus according to claim 1, wherein the occlusion information of the object shape in any one video frame represents feature points of the object shape as occlusion feature points and non-occlusion feature points.
5. The apparatus according to claim 4, wherein the shape updating unit updates the position of the non-occlusion feature point of the object shape determined by the shape determining unit using a shape detection method.
6. The apparatus according to claim 5, wherein the shape updating unit updates the position of the occluding feature point of the object shape determined by the shape determining unit based on the position of the non-occluding feature point and a geometric relationship between the non-occluding feature point and the occluding feature point with respect to the object region updated by the shape updating unit.
7. The apparatus according to claim 1, wherein the information updating unit updates the occlusion information determined by the information determining unit by judging occlusion information of each feature point of the object shape updated by the shape updating unit based on a pre-generated occlusion classifier.
8. The apparatus of claim 1, the apparatus further comprising:
a detection unit configured to detect, for a first video frame of the input video or for a first video frame of a sequence of video frames obtained from the input video, an object shape in the first video frame and to detect occlusion information of the detected object shape in the first video frame.
9. The apparatus according to claim 8, wherein the detection unit detects the occlusion information of the detected object shape in the first video frame by detecting an occlusion region in an object region.
10. The apparatus according to claim 9, wherein the detection unit updates the detected occlusion region by using an image segmentation method on the detected occlusion region.
11. A method for tracking object shapes in a video, the method comprising:
a shape determining step of determining an object shape in a current video frame of an input video based on object shapes in at least one previous video frame of the current video frame;
an information determining step of determining occlusion information of the object shape determined by the shape determining step based on occlusion information of the object shape in the at least one previous video frame;
a shape updating step of updating the shape of the object determined by the shape determining step based on the occlusion information determined by the information determining step; and
an information updating step of updating the occlusion information determined by the information determining step based on the object shape updated by the shape updating step,
wherein, in a case where the occlusion information of the object shape in the previous video frames is stable, the occlusion information of the object shape in any one of the previous video frames is regarded as the occlusion information of the object shape determined by the shape determining step.
12. The method of claim 11, further comprising:
a detecting step of detecting, for a first video frame of the input video or for a first video frame of a sequence of video frames obtained from the input video, an object shape in the first video frame and occlusion information of the detected object shape in the first video frame.
13. An image processing system, the system comprising:
a first image processing apparatus configured to determine, for any two persons in an input video, a shape of each person in each video frame of the input video and occlusion information of the shape of each person in each video frame of the input video according to any one of claims 1 to 11;
a second image processing device configured to determine, for the any two persons in the input video, tracking information of the shape of each person in each video frame of the input video; and
a third image processing device configured to determine, for the any two persons in any one video frame of the input video, an occlusion relationship between the two persons, based on the occlusion information of the shape of each person and the tracking information of the shape of each person in at least one previous video frame of that video frame.
14. The system of claim 13, wherein the third image processing device updates the tracking information determined by the second image processing device based on the occlusion relationship between the two persons determined by the third image processing device in each video frame of the input video.
15. The system according to claim 13, wherein, for the any two persons in any one video frame of the input video, the third image processing device determines the occlusion relationship between the two persons based on the amount of change in the non-occlusion feature points of the shape of each person between the previous video frames of that video frame and on the relative position between the two persons in that video frame.
CN201810288618.0A 2017-04-17 2018-04-03 Object shape tracking device and method, and image processing system Expired - Fee Related CN108564014B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2017102488969 2017-04-17
CN201710248896 2017-04-17

Publications (2)

Publication Number Publication Date
CN108564014A CN108564014A (en) 2018-09-21
CN108564014B true CN108564014B (en) 2022-08-09

Family

ID=63533906

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810288618.0A Expired - Fee Related CN108564014B (en) 2017-04-17 2018-04-03 Object shape tracking device and method, and image processing system

Country Status (1)

Country Link
CN (1) CN108564014B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1959701A (en) * 2005-11-03 2007-05-09 中国科学院自动化研究所 Method for tracking multiple human faces from video in real time
CN102385690A (en) * 2010-09-01 2012-03-21 汉王科技股份有限公司 Target tracking method and system based on video image
CN104573614A (en) * 2013-10-22 2015-04-29 北京三星通信技术研究有限公司 Equipment and method for tracking face
CN106462976A (en) * 2014-04-30 2017-02-22 国家科学研究中心 Method of tracking shape in a scene observed by an asynchronous light sensor
JP2016192132A (en) * 2015-03-31 2016-11-10 Kddi株式会社 Image recognition AR device, posture estimation device, and posture tracking device
CN106447701A (en) * 2015-08-05 2017-02-22 佳能株式会社 Methods and devices for image similarity determining, object detecting and object tracking
CN108734735A (en) * 2017-04-17 2018-11-02 佳能株式会社 Object shapes tracks of device and method and image processing system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Tracking by Using Dynamic Shape Model Learning in the Presence of Occlusion; M. Asadi; 2007 IEEE Conference on Advanced Video and Signal Based Surveillance; 2008-01-07; pp. 616-619 *
Wavelet Energy Entropy as a New Feature Extractor for Face Recognition; Cunjian Chen, Jiashu Zhang; Fourth International Conference on Image and Graphics (ICIG 2007); 2007-09-04; pp. 230-235 *
Removal of Face Occlusions Based on Sparse Representation; Wu Congzhong et al.; Journal of Hefei University of Technology (Natural Science); 2015-05-31; Vol. 38, No. 5; pp. 615-617, 680 *
A Robust Local Feature Point Extraction Method for Partially Occluded Targets; Qiu Peng et al.; Modern Electronics Technique; 2013-11-15; Vol. 36, No. 22; pp. 76-80 *

Also Published As

Publication number Publication date
CN108564014A (en) 2018-09-21

Similar Documents

Publication Publication Date Title
US9495600B2 (en) People detection apparatus and method and people counting apparatus and method
JP7230939B2 (en) Information processing device, information processing method and information processing program
CN109934065B (en) Method and device for gesture recognition
Rao et al. Crowd event detection on optical flow manifolds
US10915735B2 (en) Feature point detection method and apparatus, image processing system, and monitoring system
Kang et al. Development of head detection and tracking systems for visual surveillance
CN107832741A (en) The method, apparatus and computer-readable recording medium of facial modeling
JP6280020B2 (en) Moving object tracking device
US10496874B2 (en) Facial detection device, facial detection system provided with same, and facial detection method
CN117216313B (en) Attitude evaluation audio output method, attitude evaluation audio output device, electronic equipment and readable medium
CN1892702B (en) Tracking apparatus
CN111444817A (en) A person image recognition method, device, electronic device and storage medium
CN108734735B (en) Object shape tracking device and method, and image processing system
Yun et al. Unsupervised moving object detection through background models for ptz camera
JP2007510994A (en) Object tracking in video images
CN110348272A (en) Method, apparatus, system and the medium of dynamic human face identification
US20240071028A1 (en) Information processing device and information processing method
CN108564014B (en) Object shape tracking device and method, and image processing system
US10636153B2 (en) Image processing system, image processing apparatus, and image processing method for object tracking
CN113971671B (en) Instance segmentation method, device, electronic device and storage medium
CN107665495B (en) Object tracking method and object tracking device
CN110738082A (en) Method, device, equipment and medium for locating key points of face
CN111986230A (en) Method and device for tracking posture of target object in video
CN107154052B (en) Object state estimation method and device
JP6555940B2 (en) Subject tracking device, imaging device, and method for controlling subject tracking device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220809