AU2009243430A1 - Multiple frame-rate summary video using object detection - Google Patents

Multiple frame-rate summary video using object detection

Info

Publication number
AU2009243430A1
Authority
AU
Australia
Prior art keywords
frame
video
pixel value
pixel
video frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
AU2009243430A
Inventor
David Grant Mcleish
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Priority to AU2009243430A priority Critical patent/AU2009243430A1/en
Publication of AU2009243430A1 publication Critical patent/AU2009243430A1/en
Legal status: Abandoned (current)

Links

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440227Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by decomposing into layers, e.g. base layer and one or more enhancement layers
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/85Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440281Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by altering the temporal resolution, e.g. by frame skipping
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/462Content or additional data management, e.g. creating a master electronic program guide from data received from the Internet and a Head-end, controlling the complexity of a video stream by scaling the resolution or bit-rate based on the client capabilities
    • H04N21/4621Controlling the complexity of the content stream or additional data, e.g. lowering the resolution or bit-rate of the video stream for a mobile client with a small screen
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8549Creating video summaries, e.g. movie trailer
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/61Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding

Description

S&F Ref: 921398 AUSTRALIA PATENTS ACT 1990 COMPLETE SPECIFICATION FOR A STANDARD PATENT Name and Address Canon Kabushiki Kaisha, of 30-2, Shimomaruko 3 of Applicant : chome, Ohta-ku, Tokyo, 146, Japan Actual Inventor(s): David Grant McLeish Address for Service: Spruson & Ferguson St Martins Tower Level 35 31 Market Street Sydney NSW 2000 (CCN 3710000177) Invention Title: Multiple frame-rate summary video using object detection The following statement is a full description of this invention, including the best method of performing it known to me/us: 5845c(2417409_1) - 1 MULTIPLE FRAME-RATE SUMMARY VIDEO USING OBJECT DETECTION TECHNICAL FIELD The present invention relates to video playback and, in particular, to displaying a large quantity of video in a short period of time. 5 BACKGROUND Recent years have seen a dramatic increase in the use of digital video cameras for surveillance. It is increasingly common for a surveillance system to record more video than it is possible for security personnel to review. A need exists for a system that can display video in a summarised form. A typical to scenario involves a surveillance system that records video footage throughout the night, with the intention being that the recorded video footage is to be reviewed by security personnel in the morning. In this scenario, several hours of footage throughout the night may need to be compressed into a few minutes, so that security personnel viewing the footage in the morning are able to determine in a timely manner whether any security is breaches have occurred or whether there are any other incidents or anomalies that require investigation. A simple approach to this problem might be simply to play the recorded video footage at a faster rate. However, the problem quickly arises that it is not possible to play video at an infinitely high frame rate. Typical display devices have a refresh rate of around 20 60Hz, meaning that video recorded at, say, 30 frames per second can not be played back at more than double speed without skipping frames. This maximum rate is often further limited by other considerations of the system. For example, stored video frames need to be read from storage, decoded, and transferred to graphics hardware for display. If the video is being transferred over a network, or played on a device with low computing power such 25 as a mobile phone or PDA, the maximum possible rate of playback is reduced dramatically. The problem of skipping frames during playback is not a major obstacle at relatively low rates. However, as the playing speed increases and the number of skipped frames grows, the viewer will likely miss activity occurring in a scene falling within the field of view of the camera that is recording the frames. At very high playback rates, events lasting 2417027 LDOC IRN: 921398 -2 several seconds may pass entirely unseen. Additionally, the viewer may find it difficult to follow the movement of any particular object, such as a moving person, as the objects appear to jump around the scene erratically. One solution that has been proposed is to rescale temporally a stream of video 5 footage to produce a summary video, such that each frame in the summary video is derived from a series of frames from the original video which have been combined in some way. The result is a video with fewer frames than the original video footage that nevertheless retains some of the appearance of each of the intermediate frames from the original video. 
Several methods of combining the frames of the original video have been proposed. 10 Perhaps the simplest method is to take the average colour value at each pixel location of a consecutive series of video frames. This produces a "motion blurred" effect in the summary video, where a moving object appears as a trail. This is advantageous over skipping frames, because each original video frame is at least partially visible in the summary video. However, averaging a colour value at each pixel location has the is drawback that moving objects result in a trail with low opacity. A quickly-moving object, or any activity in a greatly rescaled video, can become almost completely transparent and evade the notice of the viewer. A more advanced approach is to detect objects of interest in the original video, and give visual priority to the detected objects of interest in the summarised video. One 20 approach that has been proposed, for example, is to superimpose detected objects from each of a series of sequential frames over one of the original frames, typically using the first frame in each sequence. This produces a summary video in which the trail of a moving object is much more prominent, even at high temporal rescaling factors. A disadvantage of this approach, however, is that activity in the video is obscured by objects 25 that move through the same region of the frame. In a single frame of the summary video, only objects in the final frame of the corresponding sequence of the original video are visible, as the object most recently detected at a pixel will be superimposed and thus replace any previously detected object at that pixel. This effect is especially pronounced if the object boundaries are detected imperfectly. Any portion of the background that is 30 incorrectly marked as being part of an object - for example, a small "fringe" that is visible around foreground objects in many object detection systems - is drawn over objects detected from earlier frames and completely obscures the earlier objects. 2417A27 1 nor IRN: 921398 -3 Thus, a need exists to provide an improved method for displaying a large quantity of video in a short period of time. SUMMARY 5 It is an object of the present invention to overcome substantially, or at least ameliorate, one or more disadvantages of existing arrangements. According to a first aspect of the present disclosure, there is provided a method of forming a summary frame derived from a first video frame and a second video frame, wherein the summary frame includes a plurality of summary pixels. For each summary to pixel in the summary frame, the method performs each of the following steps. The method determines a first frame pixel value for a location corresponding to the summary pixel in the first video frame. The method also determines a second frame pixel value for a location corresponding to the summary pixel in the second video frame. The method utilises the first frame pixel value and the second frame pixel value to determine a is summary pixel value. The value of the summary pixel value is dependent upon the corresponding first frame pixel in the first video frame being associated with a foreground object of the first video frame and the corresponding second frame pixel in the second video frame being associated with a foreground object of the second video frame. The method then sets the summary pixel in the summary frame to the summary pixel value. 
20 According to a second aspect of the present disclosure, there is provided a computer readable storage medium having recorded thereon a computer program for directing a processor to execute a method of forming a summary frame derived from a first video frame and a second video frame, wherein the summary frame includes a plurality of summary pixels. The computer program product includes code for performing the 25 following method steps. For each summary pixel in the summary frame, the method performs each of the following steps. The method determines a first frame pixel value for a corresponding location in the first video frame. The method also determines a second frame pixel value for a corresponding location in the second video frame. The method utilises the first frame pixel value and the second frame pixel value to determine a 30 summary pixel value. The value of the summary pixel value is dependent upon the corresponding first frame pixel in the first video frame being associated with a foreground object of the first video frame and the corresponding second frame pixel in the second 2417027 1 DOC IRN- 92139R -4 video frame being associated with a foreground object of the second video frame. The method then sets the summary pixel in the summary frame to the summary pixel value. According to a third aspect of the present disclosure, there is provided a system for forming a summary frame derived from a first video frame and a second video frame, s wherein the summary frame includes a plurality of summary pixels. The imaging system includes a storage device for storing a computer program and a processor for executing the program. The program includes code for performing the aforementioned method. According to another aspect of the present disclosure, there is provided an apparatus for implementing any one of the aforementioned methods. 10 According to another aspect of the present disclosure, there is provided a computer program product including a computer readable medium having recorded thereon a computer program for implementing any one of the aforementioned methods. Other aspects of the invention are also disclosed. 15 BRIEF DESCRIPTION OF THE DRAWINGS One or more embodiments of the invention will now be described with reference to the following drawings, in which: Figs I a and lb collectively form a schematic block diagram of a general purpose computing system in which the arrangements to be described may be implemented; 20 Fig. 2 is a schematic block diagram of an imaging system, illustrating a camera, object detection and encoding modules, a decoder, and a display device; Fig. 3a-e illustrate a process of forming a summary frame derived from two video frames; Fig. 4 illustrates a process of creating higher-level summary frames; 25 Fig. 5 is a schematic flow diagram illustrating a method of capturing video, creating summary video, and writing the video and summary video to storage; Fig. 6 is a diagram illustrating a structure of stored video, including multiple summary levels, according to an embodiment of the present disclosure; and Fig. 7 is a diagram illustrating a structure of stored video, according to another 30 embodiment of the present disclosure. IA17AY2 IR Q210 R -5 DETAILED DESCRIPTION Where reference is made in any one or more of the accompanying drawings to steps and/or features that have the same reference numerals, those steps and/or features have for the purposes of this description the same function(s) or operation(s), unless the contrary 5 intention appears. 
A need exists for a system capable of playing recorded video at high effective frame rates, without reducing the visibility of foreground objects that are likely to be of interest to a viewer. The following terms are defined: 10 A videoframe is an image captured from a camera. A video stream is a series of video frames that are sequential in time. A video stream is alternatively referred to as a video sequence. Whilst the frames in a video stream are sequential in time, the individual frames can be derived from one camera or multiple cameras. Depending on the application, the individual frames of a video stream may have is been captured at regular intervals. Alternatively the individual frames may have been captured at irregular intervals, especially when the individual frames are derived from multiple cameras. A summaryframe is an image derived from a plurality of video frames selected from a video stream. The selected video frames are successive frames in the video stream and 20 may be consecutive frames or frames separated by one or more intervening frames. There may be a different number of intervening frames between the selected video frames, depending on the application. Alternatively, the number of intervening frames can be consistent between selected frames. A summary frame combines information from the selected video frames in such a way that the visibility of the foreground of each of the 25 video frames is given priority over the background. Similarly, a summary stream is a series of summary frames, each of which has the same (or similar) temporal scale. The temporal scale of a summary frame is the number of video frames from which that summary frame is derived. Theforeground of a video frame is defined as the portion of the video frame 30 comprising one or more objects of interest. The foreground may include moving objects, such as people, and stationary objects that have entered the scene, such as luggage. In contrast, the background of the video frame may include stationary elements of the scene, 2417027 1 DOC IRN: 921398 -6 such as the floor and walls, and moving but unimportant objects, such as swaying trees. The distinction between foreground and background in a video frame is determined by an object detection module. A foreground mask is a two-dimensional rectangular array that indicates, for each 5 position in a video frame, whether that position is foreground or background. The resolution of the foreground mask may be the same as the corresponding video frame, or the resolution of the foreground mask may be different from the resolution of the corresponding video frame. The foreground mask may be stored as an alpha channel of the video frame, or the foreground mask may be stored separately. Typically, the foreground 10 mask is a binary mask, wherein each element is a single bit that contains the value I where the frame is foreground, and the value 0 where the frame is background. However, a further embodiment may contain a higher-precision value at each element, representing, for example, the confidence that the element under consideration corresponds to foreground. The present disclosure provides a system for deriving, from a captured video stream, 15 one or more summary streams, each summary stream containing summary frames of a different temporal scale. 
An embodiment of the present disclosure provides an operator reviewing video footage with a choice of different playback rates, allowing the operator to watch video recorded over a large period of time in a relatively short time, while maintaining the visibility of objects and events of interest that occurred in the recorded 20 scene captured in the video footage that is being reviewed. The present disclosure provides a system, apparatus, computer program, and a method of forming a summary frame derived from a first video frame and a second video frame, wherein the summary frame includes a plurality of summary pixels. For each summary pixel in the summary frame, the method determines a summary pixel value by 25 performing the following steps. The method determines a first frame pixel value for a location corresponding to the summary pixel in the first video frame. The method also determines a second frame pixel value for a location corresponding to the summary pixel in the second video frame. The method then utilises the first frame pixel value and the second frame pixel value 30 to determine the summary pixel value. The value of the summary pixel value is dependent on whether the corresponding first frame pixel in the first video frame is associated with a foreground object of the first video frame and whether the corresponding second frame 2417n97 1 nMr IRN: 921398 -7 pixel in the second video frame is associated with a foreground object of the second video frame. The method then sets the value of the summary pixel in the summary frame to the summary pixel value. In one embodiment, if only one of the corresponding first frame pixel in the first 5 video frame and the corresponding second frame pixel in the second video frame is associated with a foreground object in the respective frame, then the summary pixel value is substantially equal to the value of whichever one of the corresponding first frame pixel or the corresponding second frame pixel is associated with a foreground object. In another embodiment, if the corresponding first frame pixel in the first video frame io and the corresponding second frame pixel in the second video frame are both associated with a foreground object in the respective frame, then the first frame pixel value and the second frame pixel value are blended to produce the summary pixel value. Similarly, if neither one of the corresponding first frame pixel in the first video frame and the corresponding second frame pixel in the second video frame is associated with a is foreground object in the respective frame, then the first frame pixel value and the second frame pixel value are blended to produce the summary pixel value. Blending of the first frame pixel value and the second frame pixel value can be implemented in many ways. One implementation averages the first frame pixel value and the second frame pixel value to produce the summary pixel value. An alternative implementation includes the step of 20 applying a weighting to each of the first frame pixel value and the second frame pixel value to produce a weighted first frame pixel value and a weighted second frame pixel value. The weighted first frame pixel value and the weighted second frame pixel value can then be used to produce the summary pixel value. In one embodiment, the first frame and the second frame are selected from a video 25 sequence. The first frame and second frame can be consecutive frames in the video sequence or can be successive frames that are separated by one or more intervening frames. 
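By way of illustration only, the per-pixel rule just described can be expressed as the following sketch. It is not part of the specification: the function names, the use of greyscale intensities, and the Boolean foreground flags are assumptions made for the example.

```python
# Illustrative sketch (not from the specification): the per-pixel summary rule,
# applied to two equally sized greyscale frames held as nested lists.
# Pixel values are intensities; the masks are True where object detection
# has marked the pixel as foreground.

def summary_pixel(p0, p1, fg0, fg1, weight=0.5):
    """Combine two co-located pixel values according to the foreground rule.

    If exactly one pixel belongs to a foreground object, that pixel wins.
    If both or neither do, the values are blended (weighted average).
    """
    if fg0 and not fg1:
        return p0                                   # foreground of the first frame has priority
    if fg1 and not fg0:
        return p1                                   # foreground of the second frame has priority
    return weight * p0 + (1.0 - weight) * p1        # blend when both or neither are foreground


def summary_frame(frame0, frame1, mask0, mask1, weight=0.5):
    """Build a summary frame pixel by pixel from two frames and their masks."""
    return [
        [summary_pixel(frame0[y][x], frame1[y][x], mask0[y][x], mask1[y][x], weight)
         for x in range(len(frame0[0]))]
        for y in range(len(frame0))
    ]


# Example: 1x2 frames where only the first frame's left pixel is foreground.
out = summary_frame([[200, 60]], [[40, 80]], [[True, False]], [[False, False]])
# out == [[200, 70.0]] : the foreground pixel is kept, the background pixels blend.
```

With a weight of 0.5 the blended cases reduce to a simple average, matching the equal-weight behaviour described above; a different weight corresponds to the weighted-blending variant.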
In a further embodiment, the method encodes the summary frame and writes the encoded summary frame to a storage device. A further embodiment appends the encoded summary frame to a summary video sequence. 30 One embodiment utilises one or more summary frames to produce further summary frames. In particular, one embodiment of the present disclosure forms a second summary frame derived from the summary frame obtained in the manner described above and a third 2417027 .DOC IRN- 921 39R -8 video frame, wherein the second summary frame includes a plurality of pixels. For each pixel in the second summary frame, the method performs the following steps. The method determines a summary frame pixel value for a corresponding location in the summary frame; determines a third frame pixel value for a corresponding location in s the third video frame; and utilises the summary frame pixel value and the third frame pixel value to determine a second summary frame pixel value, dependent upon the corresponding summary frame pixel in the summary frame being associated with a foreground object of the summary frame and the corresponding third frame pixel in the third video frame being associated with a foreground object of the third video frame. The method then sets the io pixel in the second summary frame to the second summary frame pixel value. One embodiment includes the step of determining a first frame foreground mask for the first video frame, wherein the first frame foreground mask identifies foreground objects in the first video frame. Another embodiment includes the step of determining a second frame foreground is mask for the second video frame, wherein the second frame foreground mask identifies foreground objects in the second video frame. The present disclosure also provides a computer readable storage medium having recorded thereon a computer program for directing a processor to execute the aforementioned method of forming a summary frame derived from a first video frame and 20 a second video frame. The present disclosure further provides a system for forming a summary frame derived from a first video frame and a second video frame. The imaging system includes a storage device for storing a computer program and a processor for executing the program. The program includes code for performing the aforementioned method. 25 Figs Ia and lb collectively form a schematic block diagram of a general purpose computer system 100, upon which the various arrangements described can be practised. As seen in Fig. I a, the computer system 100 is formed by a computer module 101, input devices such as a keyboard 102, a mouse pointer device 103, a scanner 126, a camera 127, and a microphone 180, and output devices including a printer 115, a display 30 device 114 and loudspeakers 117. An external Modulator-Demodulator (Modem) transceiver device 116 may be used by the computer module 101 for communicating to and from a communications network 120 via a connection 121. The network 120 may be a 2417027 IDOC IRN: 921398 -9 wide-area network (WAN), such as the Internet or a private WAN. Where the connection 121 is a telephone line, the modem 116 may be a traditional "dial-up" modem. Alternatively, where the connection 121 is a high capacity (e.g., cable) connection, the modem 116 may be a broadband modem. A wireless modem may also be used for wireless 5 connection to the network 120. 
The computer module 101 typically includes at least one processor unit 105, and a memory unit 106 for example formed from semiconductor random access memory (RAM) and semiconductor read only memory (ROM). The module 101 also includes a number of input/output ([/0) interfaces including an audio-video interface 107 that couples to the io video display 114, loudspeakers 117 and microphone 180, an I/O interface 113 for the keyboard 102, mouse 103, scanner 126, camera 127 and optionally a joystick (not illustrated), and an interface 108 for the external modem 116 and printer 115. In some implementations, the modem 116 may be incorporated within the computer module 101, for example within the interface 108. The computer module 101 also has a local network is interface I l which, via a connection 123, permits coupling of the computer system 100 to a local computer network 122, known as a Local Area Network (LAN). As also illustrated, the local network 122 may also couple to the wide network 120 via a connection 124, which would typically include a so-called "firewall" device or device of similar functionality. The interface I 11 may be formed by an Ethernet circuit card, a 20 Bluetooth wireless arrangement or an IEEE 802.11 wireless arrangement. The interfaces 108 and 113 may afford either or both of serial and parallel connectivity, the former typically being implemented according to the Universal Serial Bus (USB) standards and having corresponding USB connectors (not illustrated). Storage devices 109 are provided and typically include a hard disk drive (HDD) 110. Other storage 25 devices such as a floppy disk drive and a magnetic tape drive (not illustrated) may also be used. An optical disk drive 112 is typically provided to act as a non-volatile source of data. Portable memory devices, such optical disks (e.g., CD-ROM, DVD), USB-RAM, and floppy disks for example may then be used as appropriate sources of data to the system 100. 30 The components 105 to 113 of the computer module 101 typically communicate via an interconnected bus 104 and in a manner which results in a conventional mode of operation of the computer system 100 known to those in the relevant art. The storage 2417027 .DOC IRN: 921398 -10 devices 109, memory 106 and optical disk drive 112 are connected to other components of the computer system 100 via the connection 119. Examples of computers on which the described arrangements can be practised include IBM-PCs and compatibles, Sun TM Sparcstations, Apple Mac , or alike computer systems evolved therefrom. 5 The method of displaying summarised video data on an output device may be implemented using the computer system 100 wherein the processes of Figs 2 to 7, to be described, may be implemented as one or more software application programs 133 executable within the computer system 100. In particular, the steps of the method of producing such a summarised video data are effected by instructions 131 in the io software 133 that are carried out within the computer system 100. The software instructions 131 may be formed as one or more code modules, each for performing one or more particular tasks. The software may also be divided into two separate parts, in which a first part and the corresponding code modules performs the summarising methods and a second part and the corresponding code modules manage a user interface between the first 15 part and the user. 
The software 133 is generally loaded into the computer system 100 from a computer readable medium, and is then typically stored in the HDD 110, as illustrated in Fig. Ia, or the memory 106, after which the software 133 can be executed by the computer system 100. In some instances, the application programs 133 may be supplied to the user 20 encoded on one or more CD-ROM 125 and read via the corresponding drive 112 prior to storage in the memory 110 or 106. Alternatively the software 133 may be read by the computer system 100 from the networks 120 or 122 or loaded into the computer system 100 from other computer readable media. Computer readable storage media refers to any storage medium that participates in providing instructions and/or data to the 25 computer system 100 for execution and/or processing. Examples of such storage media include floppy disks, magnetic tape, CD-ROM, a hard disk drive, a ROM or integrated circuit, USB memory, a magneto-optical disk, or a computer readable card such as a PCMCIA card and the like, whether or not such devices are internal or external of the computer module 101. Examples of computer readable transmission media that may also 30 participate in the provision of software, application programs, instructions and/or data to the computer module 101 include radio or infra-red transmission channels as well as a 2417027 .DOC IRNY W 2I - 11 network connection to another computer or networked device, and the Internet or Intranets including e-mail transmissions and information recorded on Websites and the like. The second part of the application programs 133 and the corresponding code modules mentioned above may be executed to implement one or more graphical user interfaces s (GUIs) to be rendered or otherwise represented upon the display 114. Through manipulation of typically the keyboard 102 and the mouse 103, a user of the computer system 100 and the application may manipulate the interface in a functionally adaptable manner to provide controlling commands and/or input to the applications associated with the GUI(s). Other forms of functionally adaptable user interfaces may also be 10 implemented, such as an audio interface utilizing speech prompts output via the loudspeakers 117 and user voice commands input via the microphone 180. Fig. lb is a detailed schematic block diagram of the processor 105 and a "memory" 134. The memory 134 represents a logical aggregation of all the memory devices (including the HDD 110 and semiconductor memory 106) that can be accessed by 15 the computer module 101 in Fig. Ia. When the computer module 101 is initially powered up, a power-on self-test (POST) program 150 executes. The POST program 150 is typically stored in a ROM 149 of the semiconductor memory 106. A program permanently stored in a hardware device such as the ROM 149 is sometimes referred to as firmware. The POST program 150 examines 20 hardware within the computer module 101 to ensure proper functioning, and typically checks the processor 105, the memory (109, 106), and a basic input-output systems software (BIOS) module 151, also typically stored in the ROM 149, for correct operation. Once the POST program 150 has run successfully, the BIOS 151 activates the hard disk drive 110. Activation of the hard disk drive 110 causes a bootstrap loader program 152 25 that is resident on the hard disk drive 110 to execute via the processor 105. This loads an operating system 153 into the RAM memory 106 upon which the operating system 153 commences operation. 
The operating system 153 is a system level application, executable by the processor 105, to fulfil various high level functions, including processor management, memory management, device management, storage management, software 30 application interface, and generic user interface. The operating system 153 manages the memory (109, 106) in order to ensure that each process or application running on the computer module 101 has sufficient memory in 9417A27 1 FlOC IRN- 921398 - 12 which to execute without colliding with memory allocated to another process. Furthermore, the different types of memory available in the system 100 must be used properly so that each process can run effectively. Accordingly, the aggregated memory 134 is not intended to illustrate how particular segments of memory are allocated (unless 5 otherwise stated), but rather to provide a general view of the memory accessible by the computer system 100 and how such is used. The processor 105 includes a number of functional modules including a control unit 139, an arithmetic logic unit (ALU) 140, and a local or internal memory 148, sometimes called a cache memory. The cache memory 148 typically includes a number of 10 storage registers 144 - 146 in a register section. One or more internal buses 141 functionally interconnect these functional modules. The processor 105 typically also has one or more interfaces 142 for communicating with external devices via the system bus 104, using a connection 118. The application program 133 includes a sequence of instructions 131 that may is include conditional branch and loop instructions. The program 133 may also include data 132 which is used in execution of the program 133. The instructions 131 and the data 132 are stored in memory locations 128-130 and 135-137 respectively. Depending upon the relative size of the instructions 131 and the memory locations 128-130, a particular instruction may be stored in a single memory location as depicted by the 20 instruction shown in the memory location 130. Alternately, an instruction may be segmented into a number of parts each of which is stored in a separate memory location, as depicted by the instruction segments shown in the memory locations 128-129. In general, the processor 105 is given a set of instructions which are executed therein. The processor 105 then waits for a subsequent input, to which the processor reacts by 25 executing another set of instructions. Each input may be provided from one or more of a number of sources, including data generated by one or more of the input devices 102, 103, data received from an external source across one of the networks 120, 122, data retrieved from one of the storage devices 106, 109 or data retrieved from a storage medium 125 inserted into the corresponding reader 112. The execution of a set of the instructions may 30 in some cases result in output of data. Execution may also involve storing data or variables to the memory 134. 2417027 1 DOC IRN: 921398 - 13 The disclosed video data summarising arrangements can use input variables 154, that are stored in the memory 134 in corresponding memory locations 155-158. The summarising arrangements produce output variables 161, that are stored in the memory 134 in corresponding memory locations 162-165. Intermediate variables may be stored in 5 memory locations 159, 160, 166 and 167. 
The register section 144-146, the arithmetic logic unit (ALU) 140, and the control unit 139 of the processor 105 work together to perform sequences of micro-operations needed to perform "fetch, decode, and execute" cycles for every instruction in the instruction set making up the program 133. Each fetch, decode, and execute cycle 10 comprises: (a) a fetch operation, which fetches or reads an instruction 131 from a memory location 128; (b) a decode operation in which the control unit 139 determines which instruction has been fetched; and is (c) an execute operation in which the control unit 139 and/or the ALU 140 execute the instruction. Thereafter, a further fetch, decode, and execute cycle for the next instruction may be executed. Similarly, a store cycle may be performed by which the control unit 139 stores or writes a value to a memory location 132. 20 Each step or sub-process in the processes of Figs 2 - 14 is associated with one or more segments of the program 133, and is performed by the register section 144-147, the ALU 140, and the control unit 139 in the processor 105 working together to perform the fetch, decode, and execute cycles for every instruction in the instruction set for the noted segments of the program 133. 25 The methods of summarising video data may alternatively be implemented in dedicated hardware such as one or more integrated circuits performing the functions or sub functions to be described. Such dedicated hardware may include graphic processors, digital signal processors, or one or more microprocessors and associated memories. Fig. 2 is a block diagram of one embodiment of a system 200 comprising the 30 computer system 100. In the system 200, a network video camera 201 captures a video frame and transmits the captured video frame 202 to both an object detection module 203 and a video encoding module 204. 2417027 .DOC IRN: 921398 - 14 The object detection module 203 determines which parts of the video frame correspond to foreground. From this, the object detection module 203 produces a foreground mask and passes the foreground mask 205 to the video encoding module 204. The video encoding module 204 writes the video frames to a data store 206. The 5 video encoding module 204 also derives summary frames from the video frames, and writes the summary frames to the data store. Preferably, the video frames and summary frames are indexed so that the stored video and summary frames can be retrieved in chronological order. The video decoding module 207 receives one or both of the video frames and the to summary frames. In the illustrated embodiment, the video decoding module reads directly from the data store 206. In one embodiment, the video decoding module reads directly from the data store 206 concurrently with the video encoding module 204 writing to the data store. In another embodiment, the video decoding module reads from the data store 206 at a later time. In a different embodiment, the video encoding module transmits to the is video decoding module, for example via a communications link or communications network. The video decoding module 207 then renders the video frames and/or summary frames to a display device 208. Each of the modules in Fig. 2 can be implemented using a separate computer system 100, or alternatively can be implemented using dedicated hardware comprising one or more graphic processors, digital signal processors, 20 microprocessors, associated memories, and combinations thereof. 
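As a rough illustration of this dataflow, the following sketch wires stub versions of the Fig. 2 modules together. The class and function names are invented for illustration only; the specification does not define any programming interface, and a real encoder would also derive and store summary frames as described below.

```python
# Hypothetical wiring of the Fig. 2 dataflow: camera -> object detection 203
# -> encoding module 204 -> data store 206. All names are placeholders.
from typing import List


class SimpleObjectDetector:
    """Stands in for the object detection module 203: returns a foreground mask."""
    def detect(self, frame: List[List[float]]) -> List[List[float]]:
        # Trivial stub: mark every pixel as background (0.0).
        return [[0.0 for _ in row] for row in frame]


class RecordingEncoder:
    """Stands in for the video encoding module 204 writing to the data store 206."""
    def __init__(self) -> None:
        self.data_store: List[dict] = []

    def write(self, frame, mask) -> None:
        # A real implementation would also derive and index summary frames here.
        self.data_store.append({"frame": frame, "mask": mask})


def capture_loop(frames, detector: SimpleObjectDetector, encoder: RecordingEncoder) -> None:
    """Camera 201 -> foreground mask 205 -> encoder 204 -> data store 206."""
    for frame in frames:
        mask = detector.detect(frame)
        encoder.write(frame, mask)


encoder = RecordingEncoder()
capture_loop([[[0.0, 0.0], [0.0, 0.0]]], SimpleObjectDetector(), encoder)
```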
In another embodiment, multiple modules or all modules can reside on a single computer system 100, or on an embedded processor on the camera 201.

The process of producing summary frames from video frames is illustrated in Figs 3a-e. Specifically, the illustrated process combines two video frames into one summary frame with a temporal scale equal to 2.

In Fig. 3a, a camera 301 captures two sequential video frames 302, 303 of a scene. In an exemplary embodiment, the video frames 302, 303 are of the same resolution, although video frames of different resolutions can be used equally as well.

In Fig. 3b, an object detection module 304 determines foreground masks 305, 306 for each of the video frames 302, 303. The determination of each of the foreground masks is independent, and can be done immediately after the corresponding frame is captured from the camera 301. For example, the object detection module 304 can compute the foreground mask 305 corresponding to the first frame 302 before the second frame 303 is captured from the camera 301. Object detection can be performed using methods such as pixel-based or shape-matching methods for object detection.

In Fig. 3c, the foreground masks 305, 306 are combined to form a compositing mask 307, using the formula:

α_c = (α_0 + (1 − α_1)) / 2 ... (1)

where α_c represents the compositing mask 307, and α_0 and α_1 represent the foreground masks 305, 306. Each of the foreground masks 305, 306 is a rectangular array of values between 0.0 and 1.0, inclusive. For the foreground masks 305, 306, the value for each location in the mask represents the confidence specified by the object detection module 304 that foreground exists at that location. For example, in an embodiment where the object detection module makes a binary distinction between foreground and background, the value at a location in a foreground mask is 1.0 if the location corresponds to foreground, or 0.0 if the location corresponds to background.

The binary distinction between foreground and background produces a compositing mask that weights the two input video frames equally. A further embodiment could bias the result more towards one of the frames by weighting α_0 and (1 − α_1) differently:

α_c = b·α_0 + (1 − b)·(1 − α_1) ... (2)

where b is the bias towards, say, the first foreground mask 305, between 0.0 and 1.0.

The compositing mask 307 is an alpha mask for compositing the two video frames. Fig. 3d illustrates the process of using the compositing mask to combine the two video frames 302, 303 to form a summary frame 308. The frames are composited using the standard alpha compositing formula (the Porter-Duff "over" operator):

f_s = f_0·α_c + f_1·(1 − α_c)

where f_s represents the summary frame 308, f_0 and f_1 represent the video frames 302, 303, and α_c represents the compositing mask 307. If the frames and the compositing mask are of different resolutions, then the compositing mask is scaled to the resolution of the frames. It is advantageous to use a method of smoothing or interpolation when scaling the mask, so that object boundaries blend smoothly into the background instead of having a "blocky" appearance.

The desirable properties of the summary frame 308 produced by this process are:

• Where the first video frame 302 is foreground and the second video frame 303 is background, the compositing mask 307 contains a value of (or close to) 1.0 (309), which means that the summary frame at that location (310) will resemble the first video frame - that is, the foreground of the first frame.

• Similarly, where the first video frame is background and the second video frame is foreground, the compositing mask contains a value of (or close to) 0.0 (311), which means that the summary frame at that location (312) will resemble the second video frame - again, the foreground, but the foreground of the second frame.

• Where the first video frame and second video frame are both foreground, the compositing mask contains a value of (or close to) 0.5 (313), meaning that the summary frame at that location (314) will be a blend of the foreground in the first and second video frames.

• Similarly, where the first video frame and second video frame are both background, the compositing mask contains a value of (or close to) 0.5 (315), meaning that the summary frame at that location (316) will be a blend of the background in the first and second video frames.

Thus, foreground from either of the two video frames 302, 303 is given priority over background that is visible in the other of the two frames at the same location. Producing a summarised video frame in this way ensures that the "trail" of the foreground object is visible at full opacity in the summary frame 308.
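A minimal vectorised sketch of equations (1) and (2) and the "over" compositing described above is given below. It assumes frames and masks are floating-point NumPy arrays in the range [0, 1]; the array and function names are illustrative only and are not taken from the specification.

```python
# Minimal sketch of equations (1)/(2) and the Porter-Duff "over" compositing.
# Frames are float arrays of shape (H, W, 3) in [0, 1]; foreground masks are
# float arrays of shape (H, W) in [0, 1]. Names are illustrative only.
import numpy as np


def compositing_mask(alpha0, alpha1, bias=0.5):
    """Equation (2); bias=0.5 reduces to the equal-weight form of equation (1)."""
    return bias * alpha0 + (1.0 - bias) * (1.0 - alpha1)


def composite(frame0, frame1, alpha_c):
    """Porter-Duff 'over': f_s = f_0 * alpha_c + f_1 * (1 - alpha_c)."""
    a = alpha_c[..., np.newaxis]          # broadcast the mask over the colour channels
    return frame0 * a + frame1 * (1.0 - a)


# Example: a 2x2 scene where only the top-left pixel is foreground in frame 0.
f0 = np.full((2, 2, 3), 0.8)
f1 = np.full((2, 2, 3), 0.2)
m0 = np.array([[1.0, 0.0], [0.0, 0.0]])
m1 = np.zeros((2, 2))
summary = composite(f0, f1, compositing_mask(m0, m1))
# summary[0, 0] keeps frame 0's foreground colour (0.8); elsewhere the two
# backgrounds blend to 0.5, matching the properties listed above.
```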
This method of compositing two frames also ensures that the background of the summary frame is representative of the background in the video frames. Further, if either of the two foreground masks 305, 306 contains errors or artefacts, the effect of the errors on the summary frame is reduced by being composited with the other frame.

Fig. 3e illustrates the process of combining the foreground masks 305, 306 to form a summary foreground mask 317. In one embodiment, this process involves taking the greater value of the two foreground masks at each location:

α_s = max(α_0, α_1) ... (3)

where α_0 and α_1 represent the foreground masks 305, 306, and α_s represents the summary foreground mask 317. In a further embodiment, the foreground masks are combined to form a summary foreground mask using the same calculation as that used to combine the two video frames 302, 303 into a summary frame 308 using the compositing mask 307. This embodiment may be more efficient if the foreground masks are stored as the alpha channels of the video frames, so that a single optimised compositing operation can perform the processing of Figs 3d and 3e simultaneously. The two embodiments described for performing the processing illustrated in Fig. 3e produce equivalent results if the foreground masks make a binary distinction between foreground and background, i.e., if the value at each location of a foreground mask is either 0.0 or 1.0.

In an alternative embodiment, a first weight is applied to values in the foreground mask 305 and a second weight is applied to values in the foreground mask 306.

The process described so far is used to form a summary frame by combining information derived from two video frames. The process of combining more than two video frames involves repeated application of this process, as illustrated in Fig. 4. Two video frames 401, 402 are combined 403 using the process described in Figs 3a-e to produce a summary frame 404 and a corresponding summary foreground mask (317 - not shown in Fig. 4). The video frames 401, 402 are referred to as level 0 frames, while the summary frame 404 is referred to as a level 1 frame.

The same process described in Figs 3c-e can then be applied 405 to the level 1 frame 404 and a further level 1 frame 406 (produced from two level 0 frames 408, 409) to produce a level 2 frame 407. The level 2 frame is derived from the image data of four video frames 401, 402, 408, 409. Similarly, the same process can be applied 410 to the level 2 frames 407, 411 to produce a level 3 frame 412, which is derived from the image data of eight video frames.

In general, two level n-1 frames can be combined into a level n frame, which is derived from the image data of 2^n video frames - that is, it has a temporal scale of 2^n, or the sum of the temporal scales of the frames from which it is derived. In this way, the system can create a higher-level summary frame derived from any sequence of video containing a number of frames equal to a power of 2. The process can be extended to combine a number of frames not equal to a power of 2 by differently weighting the values associated with the foreground masks when creating the compositing mask 307, in proportion to the temporal scale of each of the source summary frames.

The described approach of generating summary frames with a high temporal scale has a potential drawback that the blended object "trail" does not necessarily retain visible features of the object, even though the trail itself remains visible at high levels.
Thus, a further embodiment superimposes the foreground of the final video frame in a sequence at full opacity over the summary frame. This makes the features of a moving object - for example, a person - visible in each frame. However, superimposing the foreground of the final video frame in a sequence at full opacity over the summary frame also increases the chance that object detection errors or artefacts in the final frame will obscure portions of the object trail from earlier frames.

One embodiment of the process of computing summary frames from live video and writing the summary frames to a storage device 206 is illustrated in the flowchart of Fig. 5. After a Start step 501, control passes to an initialisation step 502, which initialises a list, tempframes, corresponding to temporarily stored frames of different levels. In one embodiment, this list can grow dynamically. In another embodiment, the list is an array with a fixed size, and frames at levels higher than the size of the array are discarded. Initially, each element of the list is empty.

Control then passes to a capture step 503 that captures a video frame from the camera 201 and stores the captured video frame, or a reference to the captured video frame, in a variable referred to as current_frame. The frame includes a foreground mask as determined by the object detection module 203. Control then passes to an initialisation step 504 that sets an index variable, level, indicating the level of current_frame, to zero.

Control then passes to an output step 505 that writes the frame referred to by current_frame to the storage device 206. The frame may be encoded in an image format, for example JPEG, or as a frame of a video coding format, such as MPEG. Additionally, the frame is stored in the data store 206 so that a decoder can quickly locate the frame on the storage device if the decoder has read the previous frame of the same level. In one embodiment, this is achieved by storing, before or after each frame, an offset to the location of the next frame of the same level in the file. In another embodiment, each level is stored in the data store 206 as a separate video stream. A further embodiment maintains an index that records the location of each frame on the disk; for example, if each frame is stored as a separate image file, then the name of each image file indicates the order and level of the associated frame.

Control then passes to a decision step 506, which checks whether the temporary frame at the current level is empty. If the current level is empty, then control passes to a storing step 507, which stores current_frame at that level, then returns control to the capture step 503. If the frame at the specified level is not empty, control passes to the combining step 508, which combines current_frame with the temporary frame at that level using the method illustrated in Figs 3c-e, then sets the current_frame variable to refer to the new combined frame. The combined frame includes a foreground mask as produced by the process illustrated in Fig. 3e.

Control then passes to a flushing step 509 that resets the temporary frame at the current level to empty; then to an incrementing step 510 that adds one to level; and then back to the output step 505.

The results of this process are illustrated graphically in Fig. 6. Frames are shown left-to-right then top-to-bottom in the order that the frames are written to the storage device.
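For reference, the Fig. 5 control flow described above admits the following sketch. It is illustrative only: frames are treated as opaque objects, combine() stands in for the Figs 3c-e compositing, the data store 206 is modelled as an in-memory list of (level, frame) records, and the fixed-size variant of the tempframes list is assumed; none of the names below are defined by the specification.

```python
# Sketch of the Fig. 5 control flow. Frames are opaque objects, `combine`
# stands in for the Figs 3c-e compositing, and storage is an in-memory list
# of (level, frame) records written in the order described above.

def run_summariser(camera_frames, combine, max_level=8):
    storage = []                          # stands in for the data store 206
    temp_frames = [None] * max_level      # the tempframes list, one slot per level (step 502)

    for frame in camera_frames:           # capture step 503
        current, level = frame, 0         # initialisation step 504
        while True:
            storage.append((level, current))                # output step 505
            if level >= max_level:                          # fixed-size variant: higher levels discarded
                break
            if temp_frames[level] is None:                  # decision step 506
                temp_frames[level] = current                # storing step 507
                break
            current = combine(temp_frames[level], current)  # combining step 508
            temp_frames[level] = None                       # flushing step 509
            level += 1                                      # incrementing step 510

    return storage


# Example: with a combine() that simply pairs its inputs, eight captured frames
# produce level 0..3 records interleaved in the write order that Fig. 6 depicts.
records = run_summariser(range(8), lambda a, b: (a, b))
```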
The first stored frame 601 is the encoded form of the first level 0 frame 401. A further embodiment also stores the associated foreground mask. The second stored frame is 602 corresponds to the second level 0 frame 402. The third stored frame 603 is the level 1 summary frame derived from the first two frames. In general, a summary frame is stored immediately after the second of the two frames that were combined to create the summary frame. For example, the first level 1 frame 603 is derived from the first level 0 frame 601 and the second level 0 frame 602, so the first 20 level 1 frame 603 is stored immediately after the second level 0 frame. As another example, the first level 2 frame 604, which is created by combining the first level 1 frame 603 and the second level I frame 605, is stored immediately after the second level 1 frame. The illustrated embodiment stores summary frames at every level at which summary frames are created. An alternative embodiment stores only frames at certain pre-selected 25 levels - for example, level 0, an intermediate level such as level 4 (with a temporal scale of 16), and a high level such as 8 (with a temporal scale of 256), to reduce the storage requirement. In the illustrated embodiment, two types of links are stored between frames. In the embodiment, each link is stored as an offset to the position on the storage device 206 of the 30 frame being linked to. First, links are stored between consecutive frames at the same level, e.g., 606, 607. These links allow the decoding module 207 to play a summary stream at a chosen level by 2417027 .DOC IRN: 921398 - 20 following the link from each frame to the next frame at the same level, and displaying the frames on the display device 208 in order. Second, links are stored between frames derived from the same level 0 frame, e.g., 608, 609, 610. These allow the decoder that is currently playing a video stream or 5 summary stream at one level to change to a stream at a different level, but at the same position in time. For example, a user application to play a video stream could change to a higher level in response to the user clicking a "fast forward" button. The links illustrated allow the decoder to move from a lower level to the next higher level; in a further embodiment, links are stored in both directions, so that the decoder can move from a higher 1o level to the next lower level. Alternative embodiments may structure these links differently. For example, in the illustrated embodiment, to find the first frame in the sequence at a higher level, the decoder 207 must traverse the initial frames beginning with the first level 0 frame 601 to find the first link to level 1 608, then to level 2 610, and so on. A further embodiment may store an 15 index to the first frame at each level, either separately or at the beginning of the file. The amount of processing required by the system executing the process illustrated in Fig. 5 varies greatly as the number of levels increases. In particular, for each frame captured at the capturing step 504, the computationally expensive combining step 508 is executed a variable number of times, from zero up to the maximum level stored so far; and 20 the output step 506 is executed one more time than this. This is visible in Fig. 6 as the variable number of frames stored after each level 0 frame. This can cause the frame rate of the video stored by the system to appear inconsistent, because the time elapsed between consecutive capture steps 504 varies. 
To avoid this issue, a further embodiment places a constraint on the storage structure: each stored video frame is followed by at most one stored summary frame. Where multiple summary frames are derived from the same video frame, the later summary frames are kept in memory until the next gap after a video frame. In general, a level n frame is stored 2^n - 1 frames after the last video frame from which it is derived. Links between frames are stored as described earlier.

This alternative link structure is illustrated in Fig. 7. Frames are again shown left-to-right then top-to-bottom in the order that they are written to the storage device. The first three video frames are handled as described earlier. The fourth video frame 701 is also stored as before, as is the subsequent level 1 frame 702. However, the level 2 frame 703 is not stored next, as it is in the earlier embodiment (frame 604). Instead, the level 2 frame 703 is stored after the next video frame 704.

In this embodiment, after each capture step 503, the output step 505 is executed at most twice, and the combining step 508 is executed at most once. A further embodiment may introduce a delay when the combining step 508 is not called after a capture step, so that the time elapsed between video frames is always consistent.

INDUSTRIAL APPLICABILITY

The arrangements described are applicable to the computer and data processing industries and particularly to the imaging and security industries.

The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.

In the context of this specification, the word "comprising" means "including principally but not necessarily solely" or "having" or "including", and not "consisting only of". Variations of the word "comprising", such as "comprise" and "comprises", have correspondingly varied meanings.
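To make the deferred-write variant of Fig. 7 concrete, the following sketch shows one scheduling that satisfies the stated constraints (at most two executions of the output step and at most one execution of the combining step per captured frame). The dictionaries used here, and the rule of always servicing the lowest due level first, are assumptions about how the described behaviour could be realised; combine and write_frame are the hypothetical helpers used in the earlier sketches.

```python
def summarise_even(capture_frame, combine, write_frame):
    """Deferred-write variant: each video frame is followed by at most one summary frame."""
    temp = {}   # per-level slot holding a frame that is still waiting for its pair
    due = {}    # levels whose pair is complete but whose combine/write has been deferred
    while True:
        frame = capture_frame()
        write_frame(frame, 0)                     # first output: the video frame itself
        if temp.get(0) is None:
            temp[0] = frame                       # wait for the next video frame
        else:
            due[1] = (temp.pop(0), frame)         # a level 1 summary frame is now due
        if due:
            level = min(due)                      # lowest due level takes the free slot
            a, b = due.pop(level)
            summary = combine(a, b)               # combining step: at most once per capture
            write_frame(summary, level)           # second output: at most one summary frame
            if temp.get(level) is None:
                temp[level] = summary
            else:
                due[level + 1] = (temp.pop(level), summary)
```

In a quick hand trace this reproduces the ordering described above (for example, the first level 2 summary frame lands immediately after the following video frame), although the scheduling shown in Fig. 7 itself may differ in detail.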

Claims (20)

1. A method of forming a summary frame derived from a first video frame and a second video frame, wherein said summary frame includes a plurality of summary pixels, said method comprising the steps of:
for each summary pixel in the summary frame:
(a) determining a first frame pixel value for a location corresponding to said summary pixel in said first video frame;
(b) determining a second frame pixel value for a location corresponding to said summary pixel in said second video frame; and
(c) utilising said first frame pixel value and said second frame pixel value to determine a summary pixel value, dependent upon the first frame pixel in the first video frame being associated with a foreground object of said first video frame and the second frame pixel in the second video frame being associated with a foreground object of said second video frame; and
(d) setting the summary pixel in the summary frame to said summary pixel value.
2. The method according to claim 1, wherein said summary pixel value corresponds substantially to the corresponding first frame pixel value, when said corresponding first frame pixel is associated with a foreground object of said first video frame and said corresponding second frame pixel is not associated with a foreground object of said second video frame.

3. The method according to claim 1, wherein said summary pixel value corresponds substantially to the corresponding second frame pixel value, when said corresponding first frame pixel is not associated with a foreground object of said first video frame and said corresponding second frame pixel is associated with a foreground object of said second video frame.
4. The method according to claim 1, wherein said summary pixel value corresponds to a blending of the corresponding first frame pixel value and the corresponding second frame pixel value, when said corresponding first frame pixel is associated with a foreground object of said first video frame and said corresponding second frame pixel is associated with a foreground object of said second video frame.

5. The method according to claim 1, wherein said summary pixel value corresponds to a blending of the corresponding first frame pixel value and the corresponding second frame pixel value, when said corresponding first frame pixel is not associated with a foreground object of said first video frame and said corresponding second frame pixel is not associated with a foreground object of said second video frame.
6. The method according to either one of claims 4 and 5, wherein said blending includes the steps of:
applying a first weight to said corresponding first frame pixel to produce a weighted first frame pixel value;
applying a second weight to said corresponding second frame pixel to produce a weighted second frame pixel value; and
averaging said weighted first frame pixel value and said weighted second frame pixel value to produce said summary pixel value.

7. The method according to any one of claims 1 to 6, wherein each of said first video frame and said second video frame is selected from a video sequence.
8. The method according to claim 7, wherein said first video frame and said second video frame are successive frames in the video sequence.
9. The method according to any one of claims 1 to 8, wherein each one of said first video frame, said second video frame, and said summary frame includes an equal number of pixels.

10. The method according to any one of claims 1 to 9, comprising the further steps of:
encoding the summary frame, and
writing the encoded summary frame to a storage device.
11. The method according to claim 10, wherein the step of writing the encoded summary frame to a storage device includes appending the encoded summary frame to a summary video sequence.
12. The method according to claim 1, comprising the further step of determining a first frame foreground mask for said first video frame, said first frame foreground mask identifying foreground objects in said first video frame.

13. The method according to claim 1, comprising the further step of determining a second frame foreground mask for said second video frame, said second frame foreground mask identifying foreground objects in said second video frame.
14. The method according to claim 1, comprising the further step of forming a second summary frame derived from said summary frame and a third video frame, wherein said second summary frame includes a plurality of pixels, including the steps of:
for each pixel in the second summary frame:
(e) determining a summary frame pixel value for a corresponding location in said summary frame;
(f) determining a third frame pixel value for a corresponding location in said third video frame; and
(g) utilising said summary frame pixel value and said third frame pixel value to determine a second summary frame pixel value, dependent upon said corresponding summary frame pixel in the summary frame being associated with a foreground object of said summary frame and said corresponding third frame pixel in the third video frame being associated with a foreground object of said third video frame; and
(h) setting the pixel in the second summary frame to said second summary frame pixel value.

15. The method according to any one of claims 1 to 14, wherein at least one of the first video frame and the second video frame is captured from a camera.
16. A computer readable storage medium having recorded thereon a computer program for directing a processor to execute a method of forming a summary frame derived from a first video frame and a second video frame, wherein said summary frame includes a plurality of summary pixels, said computer program comprising code for performing the steps of:
for each summary pixel in the summary frame:
(a) determining a first frame pixel value for a corresponding location in said first video frame;
(b) determining a second frame pixel value for a corresponding location in said second video frame; and
(c) utilising said first frame pixel value and said second frame pixel value to determine a summary pixel value, dependent upon said corresponding first frame pixel in the first video frame being associated with a foreground object of said first video frame and said corresponding second frame pixel in the second video frame being associated with a foreground object of said second video frame; and
(d) setting the summary pixel in the summary frame to said summary pixel value.
17. A system for forming a summary frame derived from a first video frame and a second video frame, wherein said summary frame includes a plurality of summary pixels, said system comprising:
a storage device for storing a computer program; and
a processor for executing the program, said program comprising code for performing the method steps of:
for each summary pixel in the summary frame:
(a) determining a first frame pixel value for a corresponding location in said first video frame;
(b) determining a second frame pixel value for a corresponding location in said second video frame; and
(c) utilising said first frame pixel value and said second frame pixel value to determine a summary pixel value, dependent upon said corresponding first frame pixel in the first video frame being associated with a foreground object of said first video frame and said corresponding second frame pixel in the second video frame being associated with a foreground object of said second video frame; and
(d) setting the summary pixel in the summary frame to said summary pixel value.
18. A method of forming a summary frame derived from a first video frame and a second video frame, wherein said summary frame includes a plurality of pixels, said method being substantially as described herein with reference to the accompanying drawings.
19. A computer readable storage medium substantially as described herein with reference to the accompanying drawings.

20. A system for forming a summary frame derived from a first video frame and a second video frame, wherein said summary frame includes a plurality of summary pixels, said system being substantially as described herein with reference to the accompanying drawings.

DATED this Thirtieth Day of November, 2009
Canon Kabushiki Kaisha
Patent Attorneys for the Applicant
AU2009243430A 2009-11-30 2009-11-30 Multiple frame-rate summary video using object detection Abandoned AU2009243430A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2009243430A AU2009243430A1 (en) 2009-11-30 2009-11-30 Multiple frame-rate summary video using object detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2009243430A AU2009243430A1 (en) 2009-11-30 2009-11-30 Multiple frame-rate summary video using object detection

Publications (1)

Publication Number Publication Date
AU2009243430A1 true AU2009243430A1 (en) 2011-06-16

Family

ID=44153213

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2009243430A Abandoned AU2009243430A1 (en) 2009-11-30 2009-11-30 Multiple frame-rate summary video using object detection

Country Status (1)

Country Link
AU (1) AU2009243430A1 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3023987A1 (en) * 2014-11-20 2016-05-25 Axis AB Method and apparatus for visualizing information of a digital video stream
US20160148649A1 (en) * 2014-11-20 2016-05-26 Axis Ab Method and apparatus for visualizing information of a digital video stream
CN105635667A (en) * 2014-11-20 2016-06-01 安讯士有限公司 Method and apparatus for visualizing information of a digital video stream
JP2016105587A (en) * 2014-11-20 2016-06-09 アクシス アーベー Method and apparatus for visualizing information of digital video stream
KR101847590B1 (en) * 2014-11-20 2018-04-10 엑시스 에이비 Method and apparatus for visualizing information of a digital video stream
CN105635667B (en) * 2014-11-20 2018-08-24 安讯士有限公司 Method and apparatus for visualization digital Video stream information
TWI684360B (en) * 2014-11-20 2020-02-01 瑞典商安訊士有限公司 Method and apparatus for visualizing information of a digital video stream
US10878850B2 (en) * 2014-11-20 2020-12-29 Axis Ab Method and apparatus for visualizing information of a digital video stream

Legal Events

Date Code Title Description
MK4 Application lapsed section 142(2)(d) - no continuation fee paid for the application