WO2014092553A2 - Method and system for splitting and combining images from steerable camera - Google Patents

Method and system for splitting and combining images from steerable camera Download PDF

Info

Publication number
WO2014092553A2
WO2014092553A2 PCT/MY2013/000267
Authority
WO
WIPO (PCT)
Prior art keywords
image
motion
foreground
count
global map
Prior art date
Application number
PCT/MY2013/000267
Other languages
French (fr)
Other versions
WO2014092553A3 (en)
Inventor
Yuen Shang LI
Chan Ching Hau
Choong Teck LIONG
Original Assignee
Mimos Berhad
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mimos Berhad
Publication of WO2014092553A2
Publication of WO2014092553A3

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4038Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images

Abstract

The present invention provides a method for processing a video stream taken by a steerable camera to segment the same. The method comprises converting the video stream into an image sequence; segmenting each image of the image sequence to extract a foreground image and a background image; combining background images of the image sequence to form a global map; determining the motion frequency of the foreground image through a motion table, wherein the motion table has the same size as the resolution of the global map and each cell of the motion table records a count of the number of foreground images; augmenting the count on the motion table based on the corresponding position of the foreground image over the global map to indicate a high motion area and a low motion area through the count; generating a set of local maps of the same size for each motion area over the global map, wherein each local map is incorporated with the corresponding foreground image and its timestamp; and rendering a new segmented video by combining each set of local maps.

Description

METHOD AND SYSTEM FOR SPLITTING AND COMBINING IMAGES FROM STEERABLE CAMERA
Field of the Invention
[0001] The present invention generally relates to image processing, and in particular to a system and method for splitting and combining images from a steerable camera.
Background
[0002] A significant number of Pan-Tilt-Zoom (PTZ) cameras have been deployed nowadays to monitor scenes of areas under surveillance from different directions. Some PTZ cameras further have built-in auto-tracking features that can steer or navigate the camera to focus on moving objects automatically. Therefore, tracking and surveillance information from different locations can be recorded. The recorded surveillance video is stored at a digital storage device, such as a Network Video Recorder (NVR), which allows the recorded videos to be accessed remotely. To view a specific video of interest, users are required to specify the start time and end time for playback. However, in order to review and locate a specific scene from the selected video stream, users have to review the entire selected video stream to visually identify the scene at the specific location. Such an approach is time consuming and inefficient. [0003] In the prior art, when reviewing the recorded video streams from a moving camera (e.g. a Pan-Tilt-Zoom (PTZ) camera), the user needs to manually play back the entire video length to locate the video of interest. This manual method is inefficient, demands high labour resources and consumes much time. The invention solves these problems by converting the video stream from a moving camera (PTZ camera) into multiple static spatial-based review videos automatically. In other words, the generated output consists of a series of review videos where each video has the same static background but different moving objects.
[0004] US patent publication no. US2007/0058717A1 discloses a method of video processing with a video registration process. It produces composite images for visualization. It relies on sensing units to project frames into a common reference in order to obtain registered frames of the input video. [0005] US patent no. US6522787B1 discloses an image processing system for imaging a scene to a mosaic. It selects a new viewpoint of the scene and renders a synthetic image of the scene from the mosaic from that new viewpoint. The synthesized image is then used to form a composite image.
Summary [0006] In one aspect of the present invention, there is provided a method for processing a video stream taken by a steerable camera to segment the same. The method comprises converting the video stream into an image sequence; segmenting each image of the image sequence to extract a foreground image and a background image; combining background images of the image sequence to form a global map; determining the motion frequency of the foreground image through a motion table, wherein the motion table has the same size as the resolution of the global map and each cell of the motion table records a count of the number of foreground images; augmenting the count on the motion table based on the corresponding position of the foreground image over the global map to indicate a high motion area and a low motion area through the count; generating a set of local maps of the same size for each motion area over the global map, wherein each local map is incorporated with the corresponding foreground image and its timestamp; and rendering a new segmented video by combining each set of local maps.
[0007] In one embodiment, the global map is formed based on transformations computed from correspondences between each pair of successive background images.
[0008] In another embodiment, each high and low motion area is determined according to a motion area level pre-defined by the user. It is also possible that each high and low motion area is determined through a mean value, wherein the mean value is the average of a maximum count and a minimum count.
[0009] In yet a further embodiment, a motion area is considered high when its count is greater than the mean value and is considered low when its count is smaller than the mean value. The same-size local maps may further be generated by interpolating the size of each local map according to a user-defined value. A resolution value may be determined by finding the maximum size among all the local maps. The local maps may further be clustered according to each high and low motion area.
[0010] In a further embodiment, the step of generating a set of local maps further comprises generating a new image for each local map with the foreground image and its timestamp inserted thereto. Brief Description of the Drawings
[0011] Preferred embodiments according to the present invention will now be described with reference to the figures accompanied herein, in which like reference numerals denote like elements;
[0012] FIG. 1 illustrates a schematic diagram of a surveillance system in accordance with one embodiment of the present invention;
[0013] FIG. 2 is a flow chart illustrating a video compilation process in accordance with one embodiment of the present invention;
[0014] FIG. 3A and FIG. 3B illustrate the video input process of FIG. 2 in accordance with one embodiment of the present invention;
[0015] FIG. 4A and FIG. 4B illustrate the foreground-background segmentation process of FIG. 2 in accordance with one embodiment of the present invention;
[0016] FIG. 5A and FIG. 5B illustrate the global map generation process of FIG. 2 in accordance with one embodiment of the present invention;
[0017] FIG. 6A and FIG. 6B illustrate the motion frequency determination process of FIG. 2 in accordance with one embodiment of the present invention;
[0018] FIG. 6C and FIG. 6D exemplify the live-captured videos illustrating the motion frequency determination process;
[0019] FIG. 7A and FIG. 7B illustrate the local map clustering process of FIG. 2 in accordance with one embodiment of the present invention;
[0020] FIG. 8A and FIG. 8B illustrate the foreground insertion process of FIG. 2 in accordance with one embodiment of the present invention; and
[0021] FIG. 9A and FIG. 9B illustrate the review video creation process of FIG. 2 in accordance with one embodiment of the present invention. Detailed Description
[0022] Embodiments of the present invention shall now be described in detail with reference to the attached drawings. It is to be understood that no limitation of the scope of the invention is thereby intended; such alterations and further modifications in the illustrated device, and such further applications of the principles of the invention as illustrated therein, are contemplated as would normally occur to one skilled in the art to which the invention relates.
[0023] FIG. 1 illustrates a schematic diagram of a surveillance system 100 in accordance with one embodiment of the present invention. The surveillance system 100 comprises a plurality of video cameras 102, a network video recorder (NVR) 104, a video processing system 110 and a video output unit 120. Typically, depending on the area under surveillance, a surveillance system adopts more than one surveillance (video) camera to monitor the area. To provide broad coverage, Pan-Tilt-Zoom (PTZ) cameras have become more common. Accordingly, one continuous video stream 106 may comprise a combination of moving scenes and scenes from distinct areas. The video streams captured by the video cameras 102 are channeled into the NVR 104 for storage, and the administrator of the surveillance system may, at their desire, access the NVR 104 for reviewing. These video streams 106 are further fed into a video processing system 110 for compiling video streams for the administrator's review. The compilation of the video streams includes splitting the video streams into multiple video streams, and grouping and combining the related video streams. Preferably, the video streams are converted into multiple static spatial-based videos automatically. Once the video streams are processed and compiled, they can be readily reviewed through the video output unit 120. Operationally, the video streams are subjected to image sequence generation 112, foreground-background segmentation 113, global map generation 114, motion frequency determination 115, local maps clustering 116, foreground insertion 117, and review video generation 118.
[0024] FIG. 2 is a flow chart illustrating a video compilation process 200 in accordance with one embodiment of the present invention. The process 200 comprises inputting a recorded video stream in step 202, segmenting foreground and background in step 204, generating a global map in step 206, determining motion frequency in step 208, clustering local maps in step 210, inserting foreground in step 212, and rendering review videos in step 214. Briefly, in the step 202, the captured video stream is input into the video processing system 110 and decoded into a sequence of images. Each image in the sequence is processed to segment the foreground and background information in the step 204. The foreground and background information are presented as distinct images. The background images are then stitched up to generate a global map in the step 206. Further, in step 208, the system determines high and low motion areas of the foreground image(s) over the global map. The high and low motions are determined through the positions of the foreground image(s). Once the motion areas are identified, the system clusters and generates local maps from the global map in step 210. Thereafter, the corresponding foreground images as well as the corresponding time stamps are inserted back into the corresponding local maps in the step 212. Accordingly, the review videos having different locations and motion levels will be generated in the step 214.
[0025] FIG. 3A and FIG. 3B illustrate the video input process 202 of FIG. 2 in accordance with one embodiment of the present invention. The video input process 202 involves connecting to the NVR in step 302, receiving a recorded PTZ video stream 312 in step 304, decoding the video stream 312 into an image sequence 314 in step 306, and storing the image sequence 314 into a database for later processing.
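By way of illustration only (the following code is not part of the patent disclosure), a minimal Python sketch of the decoding in step 306, assuming the NVR export is an ordinary video file readable by OpenCV; the file name is hypothetical:

```python
import cv2

def decode_to_image_sequence(video_path):
    """Decode a recorded PTZ video stream into a list of frames (step 306)."""
    capture = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = capture.read()
        if not ok:          # end of stream
            break
        frames.append(frame)
    capture.release()
    return frames

# hypothetical file name for the recorded stream 312
frames = decode_to_image_sequence("recorded_ptz_stream.mp4")
print(f"decoded {len(frames)} frames")
```

In practice the decoded frames would then be written to the database mentioned in the text rather than held in memory.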
[0026] It will be understood by a skilled person that the video processing may be carried out locally at the user side or at the server side. When it occurs at the user side, the user is required to download the video or the image sequence from the NVR for processing, and once the processing is done, the user may view the outcome locally. On the other hand, when it occurs at the server side where the NVR is located, the video stream processing may be carried out at the server, and once the processing is done, the user may stream the result directly over the network. [0027] FIG. 4A and FIG. 4B illustrate the foreground-background segmentation process 204 of FIG. 2 in accordance with one embodiment of the present invention. The foreground-background segmentation process 204 includes connecting to the database and downloading the image sequence in step 402, extracting image features of all images of the image sequence in step 404, computing the optical flow 412 between two consecutive images, k and k+1, of the image sequence in step 406, extracting foreground (i.e. moving object) and background (i.e. static environment) information for every image k+1 until the end of the selected scene in step 408, and storing segmented background images 416 and foreground images 414 into the database in step 409.
[0028] Referring back to the step 406, optical flow is a known algorithm to identify the image flow 412. Through the information from the image flow 412, foreground and background images 414, 416 can be acquired for storing into the database. For each foreground image 414, the bottom-center position with respect to the corresponding background image 416 is identified and recorded into the database for later processing.
[0029] It will also be understood by a skilled person that the above foreground-background segmentation is provided for illustration only, not limitation. Many other foreground-background segmentation methods and processes may also be adapted, fully or partially, for the purpose of the present invention.
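In that spirit, the following is one hedged Python/OpenCV sketch of an optical-flow split along the lines of steps 406-408. The flow threshold and the assumption that camera-induced motion has already been compensated (or that the camera is momentarily static) are illustrative choices, not the patent's prescription:

```python
import cv2
import numpy as np

def segment_foreground_background(img_k, img_k1, flow_thresh=1.0):
    """Split frame k+1 into foreground (moving) and background (static)."""
    gray_k = cv2.cvtColor(img_k, cv2.COLOR_BGR2GRAY)
    gray_k1 = cv2.cvtColor(img_k1, cv2.COLOR_BGR2GRAY)
    # Dense optical flow between consecutive frames k and k+1 (step 406).
    flow = cv2.calcOpticalFlowFarneback(gray_k, gray_k1, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Pixels with large residual motion are treated as foreground (step 408).
    # For a moving PTZ camera the dominant camera motion would first have to
    # be compensated; that step is omitted here for brevity.
    magnitude = np.linalg.norm(flow, axis=2)
    fg_mask = (magnitude > flow_thresh).astype(np.uint8)
    foreground = cv2.bitwise_and(img_k1, img_k1, mask=fg_mask)
    background = cv2.bitwise_and(img_k1, img_k1, mask=1 - fg_mask)
    return foreground, background, fg_mask

def bottom_center(fg_mask):
    """Bottom-center of the foreground's bounding box, the position the
    description records for each foreground image."""
    ys, xs = np.nonzero(fg_mask)
    if xs.size == 0:
        return None
    return (int((xs.min() + xs.max()) / 2), int(ys.max()))
```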
[0030] FIG. 5A and FIG. 5B illustrate the global map generation process 206 of FIG. 2 in accordance with one embodiment of the present invention. The global map generation process 206 comprises loading all the background images k, k+1, k+2, ... in step 501, extracting features from all background images k in step 503, computing the transformation for each successive pair of background images in step 505, stitching up all the background images through the transformation information to form a global map 514 in step 507, and storing the global map 514 as well as the global position of each background image into the database in step 509.
[0031] In the steps 503 and 505, by processing all the background images k, the transformation between one image and its successive one can be determined by exploiting extracted landmarks from the background images. Accordingly, in the step 507, the background images can be stitched up using these transformations.
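A non-limiting Python/OpenCV sketch of steps 503-507 follows, substituting ORB keypoints for the unspecified landmark features and a plain homography chain for the unspecified transformation model; the fixed canvas size and overwrite compositing are simplifications:

```python
import cv2
import numpy as np

def pairwise_homography(img_a, img_b):
    """Estimate the transformation mapping background image b into the
    coordinates of image a, using ORB keypoints as the landmarks."""
    orb = cv2.ORB_create(2000)
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_b, des_a)
    src = np.float32([kp_b[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_a[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H  # may be None when too few landmarks match

def build_global_map(backgrounds, canvas_size=(4000, 2000)):
    """Chain the pairwise transformations and paste every background image
    onto one canvas (the stitching of step 507, much simplified)."""
    canvas = np.zeros((canvas_size[1], canvas_size[0], 3), np.uint8)
    offset = np.array([[1, 0, canvas_size[0] // 4],
                       [0, 1, canvas_size[1] // 4],
                       [0, 0, 1]], np.float64)  # keep warps inside the canvas
    H_total = np.eye(3)
    for i, bg in enumerate(backgrounds):
        if i > 0:
            H_total = H_total @ pairwise_homography(backgrounds[i - 1], bg)
        warped = cv2.warpPerspective(bg, offset @ H_total, canvas_size)
        mask = warped.sum(axis=2) > 0
        canvas[mask] = warped[mask]
    return canvas
```

A production stitcher would compute tight canvas bounds and blend overlaps rather than overwrite them; the homography chain above only shows how successive transformations compose into global positions.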
[0032] FIG. 6A and FIG. 6B illustrate the motion frequency determination process 208 of FIG. 2 in accordance with one embodiment of the present invention. The motion frequency determination process 208 comprises creating a motion table 616 of zero (0) score having the same size as the global map in step 601, loading foreground information 612 from the database in step 602, determining the size of the loaded foreground images 612 in step 603, determining if the size of the foreground image is larger than a pre-defined margin T in step 604, computing the bottom-center position of the foreground image in the global map in step 605, and increasing by one the score for the foreground image 612 in the corresponding cell according to the bottom-center position of the foreground image 612 over the global map 614 in step 606.
[0033] In the motion frequency determination process 208, a scoring scheme is adopted to determine the motion areas, i.e. areas where motion occurs. In the scoring scheme, the motion table 616 is first created. The motion table 616 is a table matrix of the same size as the resolution, or the pixel size, of the global map 614. All the values in the motion table 616 are preset to zero (0). After the foreground images 612 and their corresponding position information are loaded in the step 602, the positions of these foreground images 612 are marked on the motion table 616. When foreground images 612 appear at the same position over the global map 614, the corresponding cell of the motion table increments its scoring count. Accordingly, the system is able to identify which areas on the global map have high-frequency movements. [0034] The system also provides pre-processing steps (i.e. steps 603 and 604) to filter out noise based on the size of the images. Images whose sizes are smaller than T shall be regarded as noise, and will therefore be ignored in the scoring scheme.
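A minimal sketch of the scoring scheme of steps 601-606, assuming each foreground record already carries its pixel area and its bottom-center position expressed in global-map coordinates (the dictionary keys used here are hypothetical):

```python
import numpy as np

def build_motion_table(global_map_shape, foregrounds, min_area_T=50):
    """One cell per global-map pixel, incremented at each foreground's
    bottom-center position; foregrounds smaller than the margin T are
    treated as noise and skipped."""
    motion_table = np.zeros(global_map_shape[:2], dtype=np.int32)  # step 601
    for fg in foregrounds:
        if fg["area"] < min_area_T:      # steps 603-604: noise filter
            continue
        x, y = fg["bottom_center"]       # step 605: position in global map
        motion_table[y, x] += 1          # step 606: one vote per sighting
    return motion_table
```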
[0035] In the embodiments illustrated above, the scoring scheme is adopted for determining the movement frequency of the detected foreground. The scoring scheme uses a simple counter to determine how often foregrounds are located at a location (or pixel co-ordinates). It is well understood that other methods, such as a voting scheme or the like, may be adopted for the present invention without departing from the scope of the present invention. [0036] FIG. 6C illustrates a sequence of three live-captured images with their respective foregrounds extracted. FIG. 6D provides a visual illustration of how the scoring scheme is implemented on the images of FIG. 6C. It shows that the positions (i.e. bottom-center) of the respective foreground images are marked on the corresponding cells of the motion table.
[0037] FIG. 7A and FIG. 7B illustrate the local map clustering process 210 of FIG. 2 in accordance with one embodiment of the present invention. The local map clustering process 210 includes finding the mean value by averaging the maximum and minimum values from the motion table 712 in step 702, determining the number of peak points and valley points in step 704, determining the number of clusters in step 706, clustering the motion table based on k in step 704, segmenting the high and low motion regions in step 705, increasing a region to the minimum resolution of the camera if the region is smaller than the foreground image and expanding the region centered from the peak point or valley point in step 706, and storing the local maps into the database in step 707. [0038] In this process, the system identifies the maximum and minimum values from the motion table 712 in the step 702. The minimum value taken herewith is a non-zero value. In the example illustrated in FIG. 7B, the maximum value is four (4) and the minimum value is one (1), and there are three minimum values altogether. The mean value in this case shall be 2.5. Using the mean value, the high and low motion regions of the global map can be determined through the motion table. In the step 704, the peak points are those values that are above the mean, and the valley points are values that are below the mean. High motion areas are the peak points, whereas the low motion areas are the valley points. Accordingly, the number of clusters is computed as the sum of the numbers of high and low motion areas. The local maps can then be created according to these high and low motion areas. In the step 706, the region of a local map is expanded to the minimum resolution of the camera if the corresponding region is smaller than that minimum resolution.
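A hedged sketch of the mean split of steps 702-705 follows, using connected components as one possible way to turn peak and valley cells into clusters (the patent does not prescribe a specific clustering algorithm; the handling of counts equal to the mean is an assumption):

```python
import cv2
import numpy as np

def cluster_motion_regions(motion_table):
    """Mean of the maximum count and the smallest non-zero count separates
    peak (high-motion) cells from valley (low-motion) cells; connected
    components then yield one cluster per contiguous region."""
    counts = motion_table[motion_table > 0]
    mean_value = (counts.max() + counts.min()) / 2.0          # step 702
    high = (motion_table > mean_value).astype(np.uint8)       # peak cells
    low = ((motion_table > 0)
           & (motion_table <= mean_value)).astype(np.uint8)   # valley cells
    n_high, high_labels = cv2.connectedComponents(high)       # step 704
    n_low, low_labels = cv2.connectedComponents(low)
    num_clusters = (n_high - 1) + (n_low - 1)                 # step 706
    return mean_value, num_clusters, high_labels, low_labels
```

The bounding box of each labelled region would become one local map, enlarged to the camera's minimum resolution when it is smaller, as described above.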
[0039] FIG. 8A and FIG. 8B illustrate the foreground insertion process 212 of FIG. 2 in accordance with one embodiment of the present invention. The foreground insertion process 212 includes loading local maps and foreground images in step 801, generating a stack memory allocation for each local map in step 802, finding all related local maps for each foreground image based on its global position in step 804, generating a new image of the local map in step 805, computing the foreground image position on this local map image in step 806, inserting the foreground image on this local map image, determining if a time stamp is needed in step 808, creating the time stamp of the foreground image based on its time in the global video in step 809, inserting the time stamp on the local map image in step 810, and storing the foreground image on its stack in step 811. [0040] During the foreground insertion process, each local map is given a stack of memory allocation in the step 802. The number of stacks generated is based on the cluster number. The foreground images and their corresponding time stamps are associated with their local maps based on their positions on the global map in the step 804. New images for the local map, with the corresponding foreground image inserted therein, are generated at the step 805, and the new images are stored in the corresponding memory stack at the step 811. Before the new images are stored in the memory stacks, the system may further tag each image with its timestamp.
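An illustrative sketch of the insertion and time-stamping of steps 805-810, assuming the foreground patch, its mask and its position in local-map coordinates are already known; the text placement and font are arbitrary choices:

```python
import cv2

def insert_foreground(local_map, fg_patch, fg_mask, top_left, timestamp=None):
    """Paste a foreground patch onto a copy of its local map and optionally
    burn in its timestamp. `top_left` is the patch position in local-map
    coordinates (an assumed convention)."""
    frame = local_map.copy()                     # step 805: new image
    x, y = top_left
    h, w = fg_patch.shape[:2]
    roi = frame[y:y + h, x:x + w]                # view into the new image
    roi[fg_mask > 0] = fg_patch[fg_mask > 0]     # insertion of the foreground
    if timestamp is not None:                    # steps 808-810
        cv2.putText(frame, timestamp, (10, 25), cv2.FONT_HERSHEY_SIMPLEX,
                    0.7, (255, 255, 255), 2)
    return frame
```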
[0041] FIG. 9A and FIG. 9B illustrate the review video creation process 214 of FIG. 2 in accordance with one embodiment of the present invention. The review video creation process 214 first loads all the map stacks in step 902. Once they are loaded, the images from the respective stacks are encoded to form a new video stream in step 904. Each new video stream is identified according to its activity state; for example, the streams can be named "high activity" and "low activity". These videos are then stored into the database in step 906.
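A minimal sketch of the encoding in step 904, assuming OpenCV's video writer; the codec, frame rate and output file names are illustrative, not specified by the patent:

```python
import cv2

def encode_review_video(image_stack, out_path, fps=25.0):
    """Encode one local-map stack into a review video (step 904)."""
    h, w = image_stack[0].shape[:2]
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(out_path, fourcc, fps, (w, h))
    for frame in image_stack:
        writer.write(frame)
    writer.release()

# e.g. one video per activity state, following the naming in step 906
# encode_review_video(high_stack, "high_activity.mp4")
# encode_review_video(low_stack, "low_activity.mp4")
```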
[0042] While specific embodiments have been described and illustrated, it is understood that many changes, modifications, variations, and combinations thereof could be made to the present invention without departing from the scope of the invention.

Claims

Claims
1. A method for processing a video stream taken by a steerable camera to segment the same, the method comprising: converting the video stream into an image sequence; segmenting each image of the image sequence to extract a foreground image and a background image; combining background images of the image sequence to form a global map; determining motion frequency of the foreground image through a motion table, wherein the motion table has a same size as a resolution of the global map, each cell of the motion table records a count of a number of foreground images; augmenting the count on the motion table based on the corresponding position of the foreground image over the global map to indicate a high motion area and a low motion area through the count; generating a set of local maps of the same size for each motion area over the global map, each local map is incorporated with the corresponding foreground image and its timestamp; and rendering a new segmented video by combining each set of local maps.
2. The method of claim 1, wherein the global map is formed based on transformation, which is computed from correspondences between each successive background image.
3. The method of claim 1, wherein each high and low motion area is determined according to a motion area level pre-defined by a user.
4. The method of claim 1, wherein each high and low motion area is determined through a mean value, wherein the mean value is an average value of a maximum count and a minimum count.
5. The method of claim 4, wherein the motion area is considered high when its count is greater than the mean value and is considered low when its count is smaller than the mean value.
6. The method of claim 1, wherein same-size local maps are generated by interpolating the size of each local map according to a user-defined value.
7. The method of claim 1, wherein a resolution value is determined by finding the maximum size among all the local maps.
8. The method of claim 6, wherein local maps are clustered according to each high and low motion area.
9. The method of claim 1, wherein the step of generating a set of local maps further comprises generating a new image for each local map with the foreground image and its timestamp inserted thereto.
PCT/MY2013/000267 2012-12-13 2013-12-13 Method and system for splitting and combining images from steerable camera WO2014092553A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
MYPI2012005405 2012-12-13
MYPI2012005405 2012-12-13

Publications (2)

Publication Number Publication Date
WO2014092553A2 true WO2014092553A2 (en) 2014-06-19
WO2014092553A3 WO2014092553A3 (en) 2014-10-23

Family

ID=50179901

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/MY2013/000267 WO2014092553A2 (en) 2012-12-13 2013-12-13 Method and system for splitting and combining images from steerable camera

Country Status (1)

Country Link
WO (1) WO2014092553A2 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020191846A1 (en) * 1997-03-31 2002-12-19 Crinon Regis J. Mosaic generation and sprite-based coding with automatic foreground and background separation
US20120219174A1 (en) * 2011-02-24 2012-08-30 Hao Wu Extracting motion information from digital video sequences

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104881880A (en) * 2015-06-18 2015-09-02 福建师范大学 Shot segmentation method based on sequence characteristic and subspace clustering
CN111355926A (en) * 2020-01-17 2020-06-30 高新兴科技集团股份有限公司 Linkage method of panoramic camera and PTZ camera, storage medium and equipment
CN111355926B (en) * 2020-01-17 2022-01-11 高新兴科技集团股份有限公司 Linkage method of panoramic camera and PTZ camera, storage medium and equipment

Also Published As

Publication number Publication date
WO2014092553A3 (en) 2014-10-23

Legal Events

Date Code Title Description
122 Ep: pct application non-entry in european phase

Ref document number: 13831953

Country of ref document: EP

Kind code of ref document: A2