WO2013056311A1 - Keypoint based keyframe selection - Google Patents

Keypoint based keyframe selection

Info

Publication number
WO2013056311A1
Authority
WO
WIPO (PCT)
Prior art keywords
keypoints
frame
keyframes
selection
keypoint
Prior art date
Application number
PCT/AU2012/001272
Other languages
French (fr)
Inventor
Zhiyong Wang
Dagan FENG
Original Assignee
The University Of Sydney
Priority date
Filing date
Publication date
Priority claimed from AU2011904344A external-priority patent/AU2011904344A0/en
Application filed by The University Of Sydney filed Critical The University Of Sydney
Publication of WO2013056311A1 publication Critical patent/WO2013056311A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/47Detecting features for summarising video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/768Arrangements for image or video recognition or understanding using pattern recognition or machine learning using context analysis, e.g. recognition aided by known co-occurring patterns
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]

Definitions

  • the present disclosure relates generally to processing of video content and, in particular, to a method and system for keyframe selection.
  • Video acquisition devices may be generally referred to as video cameras and include standalone video cameras, such as Pan-Tilt-Zoom (PTZ) cameras, and video cameras integrated into digital cameras, mobile telephone handsets, smartphones, and tablet computing devices.
  • Video cameras capture more data (video content) than human viewers are able to process in a timely manner. Consequently there is a need for automated analysis of video content.
  • a video is structurally composed of a number of stories. Each story is depicted with a number of video shots and each shot comprises a sequence of images or frames. Thus, each frame is an image in an image sequence (video sequence). It is often desirable to reduce the volume of data associated with a video sequence in order to reduce storage requirements or to reduce transmission capacity. For example, it may be advantageous to reduce the volume of data associated with a video sequence in order to stream that video to a mobile telephone handset or smartphone over a telecommunications network, such as a mobile (cellular) telephone network. It may also be advantageous to reduce the volume of data associated with a video sequence in order to provide a quick preview of that video sequence.
  • a reduction in the volume of data associated with a video sequence may be referred to as compression of the video sequence.
  • Compression of a video sequence may be achieved by reducing the content in each frame or by reducing the number of frames in a video sequence, or a combination thereof.
  • Compression of a video sequence to produce a reduced set of frames, through identification of keyframes, may be referred to as summarisation of the video sequence.
  • Lossless compression reduces bits by identifying and eliminating statistical redundancy. No information is lost in lossless compression. Lossy compression reduces bits by identifying and removing less important information within each frame of a video sequence. Consequently, the quality of a video subjected to lossy compression is reduced.
  • a reduction in quality is often acceptable in exchange for a reduction in the size of the data associated with the video sequence. The actual extent of the reduction in quality and the reduction in the size of the data depends on the particular compression technique and the nature of the source video.
  • a frame is chosen as a keyframe if that frame is significantly different from neighbouring frames in terms of visual features.
  • Visual features may include, for example, colour histogram, motion information, and distribution correlation.
  • the present disclosure relates to a method and system for keyframe selection based on the identification of keypoints within each frame of a video sequence.
  • the method determines a coverage value and optionally a redundancy value for each frame in the video sequence, based on the keypoints contained within that frame relative to a global pool of keypoints, and subsequently selects keyframes based on the coverage value.
  • the present disclosure provides a method of keyframe selection, comprising the steps of: identifying unique keypoints in each frame in a video sequence; forming a global keypoint pool, based on the identified unique keypoints; associating each frame with a selection value based on a number of unique keypoints identified in that frame; and selecting keyframes of the video sequence, based upon the selection values.
  • the present disclosure provides a computer readable storage medium having recorded thereon a computer program for keyframe selection.
  • the computer program comprising code for performing the steps of: identifying unique keypoints in each frame in a video sequence; forming a global keypoint pool, based on the identified unique keypoints; associating each frame with a selection value based on a number of unique keypoints identified in that frame; and selecting keyframes of the video sequence, based upon the selection values.
  • the present disclosure provides an apparatus for performing keyframe selection.
  • the apparatus includes a storage device for storing a computer program and a processor for executing a program.
  • the program includes code for performing the method steps of: identifying unique keypoints in each frame in a video sequence; forming a global keypoint pool, based on said identified unique keypoints; associating each frame with a selection value based on a number of unique keypoints identified in that frame; and selecting keyframes of said video sequence, based upon said selection values.
  • the present disclosure provides method of keyframe selection, comprising the steps of: identifying a set of keypoints in each frame of a video sequence, said video sequence having a plurality of frames; forming a global keypoint pool derived from said identified sets of keypoints, said global keypoint pool including mutually exclusive sets of covered keypoints and uncovered keypoints; selecting a frame containing a highest number of keypoints as an initial keyframe in a set of keyframes; amending said set of covered keypoints to include keypoints contained in said initial keyframe and amending said set of uncovered keypoints to exclude keypoints contained in said initial keyframe; and iteratively performing a set of steps until said set of keyframes satisfies a quality criteria.
  • the set of steps includes: determining a selection value for each frame outside said set of keyframes; selecting one of said frames outside said set of keyframes as a keyframe in said set of keyframes, based on said selection value of said frame being higher than selection values associated with other frames outside said set of keyframes; and amending said set of covered keypoints to include keypoints contained in said selected frame and amending said set of uncovered keypoints to exclude keypoints contained in said selected frame.
  • the present disclosure provides a camera system for keyframe selection, said camera system comprising: a lens system for focussing on a scene; a camera module coupled to said lens system to capture a video sequence of said scene; a storage device for storing a computer program; and a processor for executing the program.
  • the program includes: code for identifying unique keypoints in each frame in a video sequence; code for forming a global keypoint pool, based on said identified unique keypoints; code for associating each frame with a selection value based on a number of unique keypoints identified in that frame; and code for selecting keyframes of said video sequence, based upon said selection values.
  • the present disclosure provides an apparatus for implementing any one of the aforementioned methods.
  • the present disclosure provides a computer program product including a computer readable medium having recorded thereon a computer program for implementing any one of the methods described above.
  • Fig. 1 is a flow diagram of a method for keypoint based keyframe selection
  • Fig. 2 is a schematic representation of a system on which one or more embodiments of the present disclosure may be practised;
  • Fig. 3 is a schematic block diagram representation of a system that includes a general purpose computer on which one or more embodiments of the present disclosure may be practised;
  • Fig. 4 illustrates Intra-window keypoint chaining within one window
  • Fig. 5 is a schematic representation of Intra-window chaining of keypoints
  • Fig. 6 illustrates selection of a keyframe based on an influence value
  • Fig. 7 shows four shots on which an assessment of keyframe selection techniques was performed
  • Figs 8A to 8E illustrate the results of applying four techniques to the first shot 710 of Fig. 7, derived from the Foreman video sequence
  • Fig. 9 shows the result of a keypoint based keyframe selection approach, having regard to global features
  • Figs 10A to 10E illustrate the results of applying four techniques to the second shot 720 of Fig. 7, derived from the Coastguard video sequence.
  • Fig. 11 is a plot of the F-score for each of four keyframe selection approaches
  • Fig. 12 is a plot of F-score as a varies
  • Fig. 13 is a plot of the metrics precision, recall and F-score for each of the four keyframe selection approaches described with reference to Figs 7 to 12;
  • Fig. 14 is a schematic representation illustrating contribution of keypoints in each frame of a video sequence to a global keypoint pool
  • Figs 15A to 15D illustrate a keyframe selection process
  • Fig. 16 illustrates SIFT keypoints on a frame of the Foreman video.
  • a selected set of keyframes of a video sequence is preferably representative of video content of that video sequence and contains minimum redundancy.
  • the keyframe selection method and system define a global pool of keypoints identified from the video sequence and utilise a coverage attribute determined for each frame in the video sequence to select keyframes.
  • the coverage attribute corresponds to the number of keypoints in each frame relative to the global keypoint pool or a portion of the global keypoint pool.
  • the keyframe selection method and system optionally also utilises a redundancy attribute in the process of selecting the set of keyframes.
  • the method characterises each frame in a video sequence with a set of keypoints for that frame.
  • a keypoint is an item of interest within a frame.
  • a keypoint may be a SIFT keypoint or other local feature, such as a predefined portion, block, or patch of a frame.
  • a keypoint may correspond to an 8x8 DCT block or a group of such blocks.
  • Fig. 16 illustrates SIFT keypoints identified on a frame of the Foreman video that is widely known in the art.
  • Each keypoint in the set of keypoints is described with visual descriptors. Since a video shot generally captures a scene, being a part of the physical world, the content of each frame corresponds to a fraction of the scene. Thus, all the frames in a video sequence contribute a respective set of keypoints to the depiction of the scene.
  • the process of keyframe selection identifies a number of frames of a video sequence having an associated set of keypoints that are representative of the scene depicted in the video sequence.
  • the keypoint-based keyframe selection method of the present disclosure identifies keypoints in each frame of a video sequence that is being analysed and extracts visual descriptors for each keypoint.
  • the method forms a global keypoint pool of unique keypoints identified from the frames of the video sequence to represent the whole video shot through keypoint matching.
  • the method selects representative frames which best cover the global keypoint pool as keyframes.
  • Selection of the keyframes is based on a coverage measure, a redundancy measure, or a combination thereof.
  • the coverage measure is the extent to which each frame covers the keypoints in the global keypoint pool at a given time.
  • the redundancy measure is an indication of the extent to which the keypoints in a set of keypoints associated with a frame are present in a set of keypoints associated with another frame in the video sequence.
  • a video sequence produces a set of keyframes, which may be used in many applications, including, for example, but not limited to: summarising or compressing a video sequence, previewing large quantities of video data, reviewing video footage (e.g., closed circuit television (CCTV) footage, security camera footage, and medical imaging data), video editing software applications, broadcasting in reduced bandwidth scenarios, and reviewing historical video data.
  • keyframe selection may be utilised to select an I-frame in a step of a video compression application.
  • a system utilising the keypoint based keyframe selection method may be implemented on-camera, in which a software application executing on a processor of the camera processes a window of a video sequence in near real-time or at a later time.
  • a system utilising the keypoint based keyframe selection method may be implemented on a computing device, wherein software executing on a processor of the computing device executes instructions in accordance with the keypoint based keyframe selection to process a stored video sequence and produce a set of keyframes of that video sequence.
  • Other arrangements may equally be practised without departing from the spirit and scope of the present disclosure.
  • Fig. 1 is a flow diagram illustrating a method 100 of keypoint-based keyframe selection in accordance with the present disclosure.
  • the method 100 begins at a Start step 105 and proceeds to step 110, which identifies a set of unique keypoints, or key features, across a window of a video sequence.
  • the window may be all frames in a video sequence or a subset thereof comprising one or more frames.
  • the set of keypoints defines a global keypoint pool for the window.
  • the method analyses each frame in the window and identifies the unique keypoints in each frame. Each frame is associated with a set of keypoints identified in that frame. All of the unique keypoints identified in the frames of the window form the global keypoint pool.
  • Control passes from step 110 to step 115, which assesses each frame in the window for a coverage value, corresponding to the contribution of unique keypoints from that frame to the global keypoint pool, and a redundancy value in relation to the global keypoint pool.
  • Control passes to step 120, which selects keyframes from the window, based on predefined quality criteria.
  • quality criteria is based on a number of frames to be returned, a selected coverage value, a selected redundancy value, or a combination thereof.
  • the quality criteria is an influence value determined from the coverage value and redundancy value of each frame.
  • One arrangement repeats steps 115 and 120 in an iterative fashion to select consecutive keyframes of a set of keyframes.
  • the method determines in step 115 a new coverage value for each frame that has not already been selected as a keyframe.
  • the new coverage value is based on the number of keypoints in each frame that, at the time of that iteration, remain uncovered in the global keypoint pool by the keypoints identified in the already selected keyframes.
  • Step 120 selects a new keyframe to add to the set of keyframes based on the new coverage values determined for that iteration.
  • One implementation determines a redundancy value for each frame in step 115 of each iteration and the selection of a new keyframe in step 120 is based on the coverage values and the redundancy values calculated during a present iteration.
  • Control passes from step 125 to an End step 130 and the method 100 terminates.
  • Fig. 14 is a schematic representation of one example of a method of keyframe selection in accordance with the present disclosure.
  • Fig. 14 shows a video sequence comprising a first frame 1410, a second frame 1420, a third frame 1430, and a fourth frame 1440.
  • the method analyses each of the first frame 1410, second frame 1420, third frame 1430, and fourth frame 1440 to identify any keypoints located within each respective frame.
  • the method identifies 5 keypoints in the first frame 1410
  • Table 1 Set of keypoints identified for each frame in Fig. 14
  • Each frame 1410, 1420, 1430, 1440 is assigned a respective coverage value, based on the number of keypoints in that frame relative to the number of unique keypoints in the global keypoint pool 1450.
  • the first frame 1410 includes 5 keypoints out of the total of 11 unique keypoints in the global keypoint pool 1450, resulting in a coverage value of 5/11 for the first frame 1410.
  • the second frame 1420 includes 4 keypoints out of the 11 unique keypoints in the global keypoint pool 1450, resulting in a coverage value of 4/11 for the second frame 1420.
  • the third frame 1430 includes 5 keypoints out of the 11 unique keypoints in the global keypoint pool 1450, resulting in a coverage value of 5/11 for the third frame 1430.
  • the fourth frame 1440 includes 6 keypoints out of the 11 unique keypoints in the global keypoint pool 1450, resulting in a coverage value of 6/11 for the fourth frame 1440.
  • the method utilises a quality criteria that defines a number of keyframes that are to be selected.
  • the method selects the required number of keyframes by selecting the frames with the highest coverage values.
  • the method utilises a quality criteria that defines a minimum coverage value required of the keyframes. The method then selects those frames having an associated coverage value that meets or exceeds the minimum coverage value defined in the quality criteria.
  • Each frame 1410, 1420, 1430, 1440 is optionally assigned a respective redundancy value, which is optionally utilised in the keyframe selection process.
  • the redundancy value is based on the number of keypoints in a given frame that appear in other frames within the frame sequence.
  • the redundancy value for a frame is based on the number of keypoints in that frame that are covered in the global keypoint pool by earlier selected keyframes. Referring to Fig.
  • the first frame 1410 includes 5 keypoints, 2 of which appear in one or more other frames in the video sequence of the example of Fig. 14, resulting in a redundancy value of 2/5 for the first frame 1410.
  • the second frame 1420 includes 4 keypoints, 4 of which appear in one or more other frames in the video sequence of the example of Fig. 14, resulting in a redundancy value of 4/4 for the second frame 1420.
  • the third frame 1430 includes 5 keypoints, 5 of which appear in one or more other frames in the video sequence of the example of Fig. 14, resulting in a redundancy value of 5/5 for the third frame 1430.
  • the fourth frame 1440 includes 6 keypoints, 4 of which appear in one or more other frames in the video sequence of the example of Fig. 14, resulting in a redundancy value of 4/6 for the fourth frame 1440.
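For illustration only, the coverage and redundancy values listed above can be reproduced with a short Python sketch. The keypoint labels "a" to "k" below are hypothetical; they are chosen merely to be consistent with the counts given for the example of Fig. 14 and do not appear in the patent.

```python
# Hypothetical keypoint labels consistent with the Fig. 14 example (11 unique keypoints).
frames = {
    1: {"a", "b", "c", "d", "e"},       # frame 1410: 5 keypoints
    2: {"f", "g", "h", "i"},            # frame 1420: 4 keypoints
    3: {"d", "e", "f", "g", "h"},       # frame 1430: 5 keypoints
    4: {"f", "g", "h", "i", "j", "k"},  # frame 1440: 6 keypoints
}
global_pool = set().union(*frames.values())   # 11 unique keypoints

for idx, kps in frames.items():
    # keypoints of this frame that also appear in at least one other frame
    shared = {k for k in kps
              if any(k in other for j, other in frames.items() if j != idx)}
    # coverage (e.g., 5/11 for frame 1410) and redundancy (e.g., 2/5 for frame 1410)
    print(f"frame {idx}: coverage {len(kps)}/{len(global_pool)}, "
          f"redundancy {len(shared)}/{len(kps)}")
```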
  • the method selects a set of keyframes for the video sequence, based on a quality criteria.
  • the quality criteria includes a coverage threshold that each keyframe must satisfy.
  • a user is able to set a quality criteria that selects as keyframes those frames having a minimum coverage value of 40%.
  • such a quality criteria produces a set of keyframes consisting of frame 1 1410, frame 2 1420, and frame 3 1430.
  • a user sets a quality criteria that selects as keyframes those frames having a minimum coverage value of 50%, in which case the set of keyframes consists of frame 4 1440.
  • the quality criteria includes a minimum number, a maximum number, or an exact number of frames that are to be returned by the keyframe selection process.
  • the quality criteria include a coverage threshold that each keyframe must satisfy and a redundancy threshold that each keyframe must satisfy. If a user sets the quality criteria to include a coverage threshold that selects as keyframes those frames having a minimum coverage value of 40% and a maximum redundancy value of 70%, that quality criteria produces a set of keyframes consisting of frame 1 1410 and frame 4 1440.
  • the set of keyframes includes 2 frames, which is 50% less than the number of frames in the original video sequence, yet the set of keyframes includes all 11 keypoints in the global keypoint pool 1450.
  • the user is able to adjust the number of frames, the contribution threshold, and the redundancy threshold in the quality criteria to achieve the required level of summarisation or compression and quality for the particular application.
  • the quality criteria include an influence value derived from the coverage value and the redundancy value.
  • the influence value (selection value) of a frame is the coverage value of that frame less the redundancy value of that frame.
  • a weighting value is applied to either one or both of the coverage value and the redundancy value.
  • the selection value of a frame is a weighted difference between the coverage value and the redundancy value associated with that frame.
  • the quality criteria include a predefined number or percentage of frames that are to be selected.
  • a user wanting to reduce by half the data file associated with that sequence selects 100 frames for the compression.
  • the method selects 100 frames from the video sequence based on the coverage values associated with each frame, either alone or in combination with the redundancy values associated with each frame.
  • Fig. 2 is a functional schematic block diagram of a system 200, upon which methods of keyframe selection in accordance with the present disclosure may be performed.
  • the system 200 includes a camera 210, which in the example of Fig. 2 is a pan-tilt-zoom (PTZ) camera.
  • the camera 210 is adapted to capture a video sequence comprising one or more frames representing the visual content of a scene 220 appearing in a field of view of the camera 210.
  • the camera 210 includes a lens system 212 coupled to a photo-sensitive image sensor 214.
  • the image sensor receives an optical image of the scene 220 focussed by the lens system 212 and converts the optical image into an electronic signal.
  • the image sensor 214 may be implemented, for example, using a digital charge-coupled device (CCD) or complementary metal oxide semiconductor (CMOS) device.
  • the camera 210 also includes a pan, tilt, and zoom control module 222, which is adapted to receive control inputs from a user to control the lens system 212 to record a desired shot or sequence of frames.
  • Each of the lens system 212, the image sensor 214, and the pan, tilt, and zoom control module 222 is coupled to a bus 216 to facilitate the exchange of control and data signals.
  • the camera 210 also includes a memory 218 and a processor 220.
  • the memory 218 stores a computer program that when executed on the processor 220 controls operation of the lens system 212 and processing of recorded images stored in the memory 218.
  • a computer program stored in the memory 218 includes computer program code for implementing a keyframe selection method in accordance with the present disclosure.
  • the camera 210 also includes an input/output (I/O) module 224 that couples to a communications network 230.
  • the I/O module 224 may be implemented using a physical connection, such as a Universal Serial Bus (USB) interface, Firewire interface, or the like. Alternatively, the I/O module may be implemented using a wireless connection, such as Bluetooth, Wi-Fi, or the like.
  • the I/O module 224 may also be adapted to read and write to one or more physical storage media, such as external hard drives, flash memory, and memory cards, including CompactFlash cards, Memory Sticks, Secure Digital (SD) cards, miniSD cards, microSD cards,
  • the communications network 230 is also coupled to a database 240 and a computer server 250. Images recorded by the camera 210 may be transmitted via the communications network 230 to the database 240 for storage or to the computer server 250 for processing.
  • the server 250 includes a memory for storing a computer program that includes instructions that when executed on a processor of the server 250 implement a video compression method in accordance with the present disclosure.
  • a computer program executing on a processor of the computer server 250 is utilised to process a video sequence stored on the database 240.
  • Video frames captured by a camera are processed in accordance with instructions executing on the processor of the general purpose computer to identify keypoints in each frame, generate a global keypoint pool, and select keyframes on the basis of the identified keypoints.
  • a video camera is coupled to a general purpose computer for processing of the captured frames.
  • the general purpose computer may be co-located with the camera or may be located remotely from the camera and coupled by a communications link or network, such as the Internet.
  • video frames are retrieved from storage memory and are presented to the processor for keypoint analysis and keyframe selection.
  • Fig. 3 is a schematic block diagram of a system 300 that includes a general purpose computer 310.
  • the general purpose computer 310 includes a plurality of components, including: a processor 312, a memory 314, a storage medium 316, input/output (I/O) interfaces 320, and input/output (I/O) ports 322.
  • Components of the general purpose computer 310 generally communicate using a bus 348.
  • the memory 314 may include Random Access Memory (RAM), Read Only Memory (ROM), or a combination thereof.
  • the storage medium 316 may be implemented as one or more of a hard disk drive, a solid state "flash" drive, an optical disk drive, or other storage means.
  • the storage medium 316 may be utilised to store one or more computer programs, including an operating system, software applications, and data. In one mode of operation, instructions from one or more computer programs stored in the storage medium 316 are loaded into the memory 314 via the bus 348. Instructions loaded into the memory 314 are then made available via the bus 348 or other means for execution by the processor 312 to effect a mode of operation in accordance with the executed instructions.
  • One or more peripheral devices may be coupled to the general purpose computer 310 via the I/O ports 322.
  • the general purpose computer 310 is coupled to each of a speaker 324, a camera 326, a display device 330, an input device 332, a printer 334, and an external storage medium 336.
  • the speaker 324 may include one or more speakers, such as in a stereo or surround sound system.
  • the camera 326 may be a webcam, or other still or video digital camera for capturing a video sequence, and may download and upload information to and from the general purpose computer 310 via the I/O ports 322, dependent upon the particular implementation. For example, images recorded by the camera 326 may be uploaded to the storage medium 316 of the general purpose computer 310. Similarly, images stored on the storage medium 316 may be downloaded to a memory or storage medium of the camera 326.
  • the camera 326 may include a lens system, a sensor unit, and a recording medium.
  • the display device 330 may be a computer monitor, such as a cathode ray tube screen, plasma screen, or liquid crystal display (LCD) screen.
  • the display 330 may receive information from the computer 310 in a conventional manner, wherein the information is presented on the display device 330 for viewing by a user.
  • the display device 330 may optionally be implemented using a touch screen, such as a capacitive touch screen, to enable a user to provide input to the general purpose computer 310.
  • the input device 332 may be a keyboard, a mouse, or both, for receiving input from a user.
  • the external storage medium may be an external hard disk drive (HDD), an optical drive, a floppy disk drive, or a flash drive.
  • the I/O interfaces 320 facilitate the exchange of information between the general purpose computing device 310 and other computing devices.
  • the I/O interfaces may be
  • the I/O interfaces 320 are coupled to a communications network 338 and directly to a computing device 342.
  • the computing device 342 is shown as a personal computer, but may be equally be practised using a smartphone, laptop, or a tablet device. Direct communication between the general purpose computer 310 and the computing device 342 may be effected using a wireless or wired transmission link.
  • the communications network 338 may be implemented using one or more wired or wireless transmission links and may include, for example, a dedicated communications link, a local area network (LAN), a wide area network (WAN), the Internet, a telecommunications network, or any combination thereof.
  • a telecommunications network may include, but is not limited to, a telephony network, such as a Public Switched Telephone Network (PSTN), a mobile telephone cellular network, a short message service (SMS) network, or any combination thereof.
  • the general purpose computer 310 is able to communicate via the communications network 338 to other computing devices connected to the communications network 338, such as the mobile telephone handset 344, the touchscreen smartphone 346, the personal computer 340, and the computing device 342.
  • the general purpose computer 310 may be utilised to implement a keyframe selection system in accordance with the present disclosure.
  • the memory 314 and storage 316 are utilised to store data relating to captured video frames, a set of keypoints associated with each frame, and a global keypoint pool.
  • the software includes computer program code for effecting method steps in accordance with the method of keyframe selection described herein.
  • the SIFT descriptor of each keypoint is a 128-dimension feature vector (a 4x4 array of orientation histograms with 8 orientation bins in each) that provides context for the neighbourhood of the keypoint.
  • a ratio test criterion may be, for example, that the ratio of the nearest neighbour distance to the second nearest neighbour distance is greater than a given threshold, in which case the match is rejected.
  • one arrangement of a keyframe selection method of the present disclosure implements a matching strategy that considers only those candidate keypoints within a certain radius R of the target keypoint. Imposing such a constraint also reduces false matching.
  • RANSAC RANdom Sample Consensus algorithm
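As one possible illustration of the keypoint extraction and matching described above, the sketch below uses OpenCV's SIFT implementation together with a nearest/second-nearest ratio test and the radius constraint. The specific radius (100 pixels) and ratio threshold (0.8) are assumptions for the example, and this is not presented as the patent's reference implementation.

```python
import cv2
import numpy as np

def match_keypoints(img1, img2, radius=100.0, ratio=0.8):
    """Match SIFT keypoints between two frames, considering only candidate
    keypoints within `radius` pixels of the target keypoint and applying a
    nearest/second-nearest ratio test. Parameter values are illustrative."""
    sift = cv2.SIFT_create()
    kps1, desc1 = sift.detectAndCompute(img1, None)
    kps2, desc2 = sift.detectAndCompute(img2, None)
    if desc1 is None or desc2 is None or len(kps2) < 2:
        return []
    pts2 = np.array([kp.pt for kp in kps2])
    matches = []
    for i, (kp, d) in enumerate(zip(kps1, desc1)):
        # radius constraint: restrict the search to spatially nearby candidates
        near = np.where(np.linalg.norm(pts2 - kp.pt, axis=1) <= radius)[0]
        if len(near) < 2:
            continue
        dists = np.linalg.norm(desc2[near] - d, axis=1)
        order = np.argsort(dists)
        best, second = dists[order[0]], dists[order[1]]
        if best < ratio * second:    # ratio test: reject ambiguous matches
            matches.append((i, int(near[order[0]])))
    return matches
```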
  • Fig. 4 is a schematic representation of a video sequence 400 comprising a plurality of frames to which an Inter-window Keypoint Chaining scheme is applied.
  • the video sequence 400 is broken into a first temporal window 410 of frames and a second temporal window 420 of frames, wherein the first temporal window 410 and the second temporal window 420 overlap.
  • the video sequence includes a first frame with keypoint k1, a second frame with keypoint k2, and a third frame with keypoint k3, where k1, k2, and k3 are matched keypoints. Keypoints are only matched within a window and chained across multiple windows.
  • the window size can be adaptively determined by calculating visual variations between consecutive frames in terms of distribution correlation.
  • true keypoint matches may be dropped during matching.
  • one arrangement also utilises Intra-Window Keypoint Chaining.
  • Fig. 5 is a schematic representation of Intra-window chaining of keypoints.
  • Fig. 5 shows a video sequence 500 comprising a plurality of frames. A subset of the frames is presented in a temporal window 510.
  • the temporal window 510 includes a first frame 512 with keypoint k1, a second frame 514 with keypoint k2, and a third frame 516 with keypoint k3.
  • k1 is matched with k3 but not k2
  • k2 is matched with k3
  • k1, k2, and k3 are linked by a single chain, which eases the problem of missed matching (e.g., k1 is a true match with k2).
  • After the keypoint chaining on frames, each keypoint either belongs to a chain of matched keypoints or becomes a singleton keypoint. The method discards all singleton keypoints, which are very likely to be noisy keypoints.
  • Each chain is represented by its HEAD keypoint and the number of keypoints on that chain, denoted by (k_x, N_x).
  • the HEAD keypoint is the first instance of the keypoint in the chain.
  • one arrangement optionally further filters less important/unstable global keypoints by setting a threshold T for N_x, wherein T is a minimum number of frames that are chained, in order to eliminate transient keypoints.
  • In one arrangement, T is set to 5.
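A minimal sketch of the chaining and filtering steps is given below, assuming pairwise matches between keypoints (identified here as (frame index, keypoint id) pairs) have already been computed. The union-find representation and the data layout are assumptions made for the example, not the patent's data structures.

```python
from collections import defaultdict

def build_chains(keypoints, matches, T=5):
    """keypoints: list of (frame_index, keypoint_id) pairs.
    matches:   list of ((f_i, k_i), (f_j, k_j)) matched keypoint pairs.
    Returns {HEAD keypoint: chain members}, discarding singleton keypoints and
    chains spanning fewer than T frames (T = 5 follows the setting above)."""
    parent = {kp: kp for kp in keypoints}

    def find(x):                                  # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matches:                          # link matched keypoints into one chain
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)     # keep the earliest keypoint as the root

    chains = defaultdict(list)
    for kp in keypoints:
        chains[find(kp)].append(kp)

    result = {}
    for members in chains.values():
        if len(members) < 2:                      # discard singleton (likely noisy) keypoints
            continue
        if len({f for f, _ in members}) < T:      # drop transient chains
            continue
        head = min(members)                       # HEAD = first instance of the keypoint
        result[head] = sorted(members)
    return result
```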
  • the goal of keyframe selection is to represent a video shot with a minimal number of keyframes whilst retaining the highest quality of the scene depicted in the shot. That is, the selected keyframes are best able to represent the video shot while minimizing redundancy among those keyframes.
  • keyframes are selected to cover as many keypoints in the global keypoint pool as possible. Since this can be formulated as a variation of the well-known Set Cover Problem, which has been proven to be nondeterministic polynomial time (NP)-complete, one arrangement implements a greedy algorithm to approximately tackle this issue.
  • the method selects a frame of the video sequence having the highest number of keypoints with reference to the keypoint pool as an initial keyframe in a set of keyframes.
  • the method selects one of the remaining frames outside the set of keyframes as a keyframe if that frame best helps improve the coverage of keypoints in the global keypoint pool while minimising redundancy.
  • the method and system of the present disclosure utilise a coverage (contribution) value, and optionally a redundancy value, to guide the selection process.
  • the global keypoint pool is separated into two sets, K_covered and K_uncovered.
  • Initially, K_uncovered contains all keypoints in the global keypoint pool K and K_covered is empty.
  • the Coverage of a frame f_i with respect to the global keypoint pool is defined as the cardinality of the intersection between FP_i, the set of keypoints identified in frame f_i, and the uncovered set: C(f_i) = |FP_i ∩ K_uncovered|.
  • Redundancy is defined as how many keypoints the frame contains in K_covered, reflecting how redundant the frame is based on the content already covered in the shot: R(f_i) = |FP_i ∩ K_covered|.
  • the influence of frame f_i at an iteration is calculated in Eqn(3) as a balance of C(f_i) and R(f_i) controlled by a weighting factor a: I(f_i) = C(f_i) - a × R(f_i) ... Eqn(3).
  • a is set to 1.
  • f_1 has higher coverage, but also higher redundancy, than f_2, so f_2 is favoured during the selection of keyframes.
  • the method selects the frame with the highest influence value and positive coverage as a keyframe, and updates K_covered and K_uncovered based on the keypoints of the selected keyframe.
  • the iteration repeats until all the keypoints are covered or until a predefined coverage threshold of the global keypoint pool K is reached.
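A minimal sketch of this greedy selection is given below, assuming each frame has already been reduced to a set of global-pool keypoint identifiers (for example, the HEAD keypoints of the chains it contributes to). The function signature and stopping parameters are illustrative and not prescribed by the patent; the same sketch is reused for the Fig. 14 example after the walkthrough that follows.

```python
def select_keyframes(frame_keypoints, alpha=1.0, coverage_target=1.0, max_frames=None):
    """frame_keypoints: dict {frame_index: set of global keypoint ids}.
    Greedily selects keyframes by influence I(f) = C(f) - alpha * R(f),
    where C(f) = |FP_f ∩ K_uncovered| and R(f) = |FP_f ∩ K_covered|."""
    pool = set().union(*frame_keypoints.values())
    covered, uncovered = set(), set(pool)
    keyframes = []
    while uncovered and (max_frames is None or len(keyframes) < max_frames):
        # influence of every frame not yet selected as a keyframe
        scores = {f: len(kps & uncovered) - alpha * len(kps & covered)
                  for f, kps in frame_keypoints.items() if f not in keyframes}
        best = max(scores, key=scores.get)
        if len(frame_keypoints[best] & uncovered) == 0:   # require positive coverage
            break
        keyframes.append(best)
        covered |= frame_keypoints[best]
        uncovered -= frame_keypoints[best]
        if len(covered) / len(pool) >= coverage_target:   # coverage threshold reached
            break
    return keyframes
```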
  • Figs 15A to 15D illustrate the keyframe selection process with reference to the example of Fig. 14.
  • the quality criteria is set to return 2 frames from the video sequence.
  • Fig. 15A shows the same video sequence of Fig. 14, with first frame 1410, second frame 1420, third frame 1430, and fourth frame 1440.
  • the global keypoint pool 1450 is separated into 2 sets, K_covered and K_uncovered, wherein K_uncovered contains all keypoints in the global keypoint pool 1450 and K_covered is empty.
  • Fig. 15B illustrates a first iteration of the keyframe selection process, in which the method selects the frame having the highest number of keypoints as an initial keyframe of a set of keyframes.
  • the method selects the fourth frame 1440, having 6 keypoints.
  • the coverage value of the fourth frame 1440 is the cardinality of the intersection between the set of keypoints in the fourth frame and the present uncovered set, K_uncovered.
  • the fourth frame 1440 contains a set of 6 keypoints and K_uncovered initially includes all of the keypoints, so the coverage value for the fourth frame 1440 is C(f_4) = 6.
  • K_covered is amended to include the set of keypoints contained in the fourth frame 1440 and K_uncovered is amended by removing the set of keypoints contained in the fourth frame 1440.
  • redundancy is defined as how many keypoints contained in the selected frame are in K_covered, which is initially the empty set.
  • Fig. 15C shows the next iterative step, in which the remaining frames (frames 1410, 1420, 1430) outside the set of keyframes, are considered for selection as the second keyframe in the set of keyframes.
  • the method determines a coverage value and a redundancy value for each of the frames 1410, 1420, 1430 with reference to the present state of K_covered and K_uncovered.
  • After this first iteration, K_covered includes the 6 keypoints contained in the fourth frame 1440, and K_uncovered includes the remaining 5 keypoints of the global keypoint pool 1450.
  • the method determines an influence value for each frame 1410, 1420, 1430, based on the respective coverage values and redundancy values, and then selects the frame with the highest influence value in that iteration as the next keyframe.
  • the weighting factor a for the influence value is set to 1.
  • the influence value for first frame 1410 is I(f_1) = C(f_1) - R(f_1) = 5 - 0 = 5, since none of the 5 keypoints in the first frame 1410 are contained in the fourth frame 1440 (the two frames together cover all 11 keypoints of the global keypoint pool).
  • the second iteration selects first frame 1410 as the second keyframe, as first frame 1410 has a higher influence value than second frame 1420 and third frame 1430.
  • Fig. 15D illustrates the selected set of keyframes, which includes first frame 1410 and fourth frame 1440 and provides a coverage of 100% of the global keypoint pool.
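Applied to the hypothetical keypoint sets introduced earlier for the Fig. 14 example, the select_keyframes sketch above reproduces the order of selection shown in Figs 15A to 15D: the fourth frame is chosen first, then the first frame, giving 100% coverage.

```python
frames = {
    1: {"a", "b", "c", "d", "e"},
    2: {"f", "g", "h", "i"},
    3: {"d", "e", "f", "g", "h"},
    4: {"f", "g", "h", "i", "j", "k"},
}
print(select_keyframes(frames, alpha=1.0, max_frames=2))   # -> [4, 1]
```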
  • the first dataset relates to case studies, consisting of 4 videos including the Foreman and Coastguard videos, which are widely used in the art, and two TV news shots (a Tennis video and a Zooming video).
  • the second dataset was constructed from the Open Video Project (http://www.openvideo.org) for quantitative evaluation. Table 2 describes the content of this dataset, which consists of 10 video shots across several genres.
  • Table 2 The Testing Videos from the Open Video Project (10 video shots, including segments of Oceanfloor Legacy and Hurricane Force - A Coastal Perspective, listing each shot's start frame, end frame, and number of frames)
  • the radius R is set to 100 (i.e., 100 pixels around a target keypoint) to reduce matching search space without sacrificing matching accuracy even in fast motion scenes, and is set to 5 so as to balance the computational cost and chaining accuracy.
  • the threshold T used to filter the unstable global keypoints affects the size of the keypoint pool and thus the granularity of details it captures.
  • the keypoint based keyframe selection (KBKS) approach of the present disclosure is compared against three known approaches: Iso-content distance, Iso-content distortion, and Clustering.
  • When implementing the Iso-content distance and Iso-content distortion approaches for comparison with the KBKS approach, the same Color Layout Descriptor as adopted in Panagiotakis was used.
  • When implementing the clustering based method for comparison, the CEDD feature described in "CEDD: color and edge directivity descriptor: a compact descriptor for image indexing and retrieval", S. A. Chatzichristofis and Y. S. Boutalis, International Conference on Computer Vision Systems, 2008, was used; CEDD is a histogram representing color and texture features.
  • Fig. 7 shows the four shots, consisting of a first shot 710 sourced from the Foreman video, a second shot 720 sourced from the Coastguard video, a third shot 730 sourced from a Tennis video of a news shot, and a fourth shot 740 sourced from a Zoom video of a news shot.
  • the sample frames for the four shots 710, 720, 730, 740 are labelled in Fig. 7.
  • Figs 8A to 8E illustrate the results of applying the four techniques (KBKS, Iso-content distance, Iso-content distortion, and Clustering) to the first shot 710 of Fig. 7, derived from the Foreman video sequence.
  • Fig. 8A shows the result of applying the KBKS approach of the present disclosure to the first shot 710, with a quality criteria set to return 2 frames.
  • the KBKS approach returns frames 32 and 229, which yield a coverage of the global keypoints of 73%.
  • Fig. 8B shows the result of applying the KBKS approach of the present disclosure to the first shot 710, with quality criteria set to return 3 frames.
  • the KBKS approach returns frames 44, 97, and 229, which yield a coverage of the global keypoints of 84%.
  • Fig. 8C shows the result of applying the KBKS approach of the present disclosure to the first shot 710, with a quality criteria set to return 5 frames.
  • the KBKS approach returns frames 1, 44, 97, 101, and 229, which yield a coverage of the global keypoints of 95%.
  • Fig. 8D shows the result of applying the Clustering technique, with 5 clusters and a quality criteria set to return 5 frames.
  • the Clustering technique returns frames 43, 74, 103, 183, and 194 and yields a coverage of 88%.
  • Fig. 8E shows the result of applying the Iso-content distance technique, with quality criteria set to return 5 frames.
  • the Iso-content distance technique returns frames 1, 161, 178, 191, and 268 and yields a coverage of 85%.
  • Fig. 8F shows the result of applying the Iso-content distortion technique, with a quality criteria set to return 5 frames.
  • the Iso-content distortion technique returns frames 1, 95, 167, 191, and 268 and yields a coverage of 87%.
  • the KBKS approach captures different details when a different number of frames is selected in the quality criteria. For example, the two frames under 73% coverage (frames 32, 229) capture the key content of the foreman and the building. When the number of frames increases to 5 and the returned coverage is increased to 95%, shown in Fig. 8C, different stages of the smiling face of the foreman are captured. In contrast, such details are missing in the results of the other methods, since those methods rely on global features. It is also noticed that the KBKS approach misses the keyframe on the tower and sky. There are two reasons for this omission. One is that the transition is very short and some keypoint chains are discarded. The other is that there are not many keypoints, due to a large portion of the frame being a uniform region, and the influence scores of those frames are affected accordingly.
  • Fig. 9 shows the results of taking global features into consideration, as noted above with respect to Eqn(4) and Eqn(5), and a quality criteria of 5 frames, which results in frames 1, 45, 97, 169, and 229 and a coverage of 94%.
  • frame 169 is a keyframe showing the tower and sky, which are the features that were missing from the earlier application, illustrating that this approach is able to effectively resolve the "missing sky" problem.
  • Figs 10A to 10E illustrate the results of applying the four techniques (KBKS, Iso-content distance, Iso-content distortion, and Clustering) to the second shot 720 of Fig. 7, derived from the Coastguard video sequence.
  • the second shot 720 captures a sequence of frames in which a first boat overtakes a second boat.
  • Fig. 10A shows the result of applying the KBKS approach of the present disclosure to the second shot 720 with a quality criteria set to 4 frames, which returns frames 19, 107, 161, and 264 and yields a coverage 95%.
  • Fig. 10B shows the result of applying the Clustering technique, with 4 clusters, and a quality criteria set to 4 frames, which returns frames 58, 118, 197, and 271 and yields a coverage of 89%.
  • Fig. 10C shows the result of applying the Iso-content distance technique and a quality criteria set to 4 frames, which returns frames 1, 70, 179, and 300 and yields a coverage of 90%.
  • FIG. 10D shows the result of applying the Iso-content distortion technique with a quality criteria of 4 frames, which returns frames 1, 68, 175, and 300 and yields a coverage of 91%.
  • the KBKS approach of the present disclosure selects not only the frames with both boats, but also more frames to get a higher coverage of keypoints as the background of the boat (e.g., the building and trees) keeps changing. The other methods do capture both boats, but do not reflect the background change very well.
  • the overtaking process is more readily understandable to a viewer.
  • the Tennis video being the third shot 730 of Fig. 7, contains two actions of a tennis player with a very short panning and fading transition in between.
  • the KBKS algorithm clearly identifies these two action frames with a high keypoint coverage of 97%.
  • the clustering-based method achieves a similar result with the help of a predefined number of clusters (i.e., 2), and the Equidistance method selects the first and last frames.
  • The fourth shot 740 is a short sequence of zoom-out footage.
  • the KBKS approach selects one keyframe near the end of the shot 740 with a high coverage of 86%, since the content (and corresponding keypoints) in the frames at the beginning of the shot 740 are part of the zoomed-out later frame in the sequence that is selected as the keyframe.
  • For the clustering-based method, if the number of clusters is set to 1, the returned keyframe is the middle frame of the shot. That is, clustering based approaches generally take the frame with average information as the representative frame.
  • the Equidistance method has the limitation of selecting both the first and the last frames as a starting point, which is not necessary for many cases such as zooming.
  • Quantitative evaluation of the different approaches was performed by manually selecting "ground-truth" keyframes from the videos described in Table 2.
  • the manual selection was performed by three university students with video processing backgrounds and when calculating metrics, the results were averaged among the three sets of ground-truth of keyframes.
  • the number of target keyframes is set to 5.
  • an initial coverage value of 50% was utilised and then varied until five keyframes were generated.
  • the following metrics are chosen: Precision, Recall, F-score, and Dissimilarity.
  • a candidate keyframe is considered to be a match if that keyframe is located no more than X frames apart from a ground-truth keyframe.
  • a ground-truth keyframe matches at most one candidate keyframe.
  • F-score is a combination of both the precision and recall indicating the overall quality. Dissimilarity measures the difference between the candidate keyframes and the ground-truth keyframes. Dissimilarity is defined as:
  • Dissimilarity = Σ_{f_c} min_{f_t} d(f_c, f_t) ... Eqn(6), where f_c is a candidate keyframe, f_t is a ground-truth keyframe, and d(f_c, f_t) is a distance measure between two keyframes, defined as the difference of their frame indices.
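For illustration, the matching rule and metrics above might be computed as in the sketch below. The default X value follows the setting used in the later experiments, while the greedy one-to-one matching order is an assumption of this sketch rather than something the patent specifies.

```python
def evaluate(candidates, ground_truth, X=15):
    """Precision, recall, F-score and dissimilarity for candidate keyframes,
    where a candidate matches a ground-truth keyframe if their frame indices
    differ by at most X, and each ground-truth keyframe matches at most one
    candidate (greedy matching order is assumed here)."""
    unmatched_gt = set(ground_truth)
    matched = 0
    for c in candidates:
        hits = [g for g in unmatched_gt if abs(c - g) <= X]
        if hits:
            unmatched_gt.remove(min(hits, key=lambda g: abs(c - g)))
            matched += 1
    precision = matched / len(candidates) if candidates else 0.0
    recall = matched / len(ground_truth) if ground_truth else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    # Eqn (6): each candidate contributes its index distance to the closest ground-truth keyframe
    dissimilarity = sum(min(abs(c - g) for g in ground_truth) for c in candidates)
    return precision, recall, f_score, dissimilarity
```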
  • Fig. 11 is a plot of the influence of X, the number of frames a potential keyframe is apart from a ground-truth keyframe, on the F-score for each of the four approaches: Clustering, Iso-content distance, Iso-content distortion, and KBKS.
  • the F-score of every method increases and stabilises as the value of X increases from 10 to 20. While setting a high value for X does not reflect a true match, X was set to 15 in the following experiments. Similarly, experiments were conducted to explore the influence of a in Eqn(3) by setting X to 15 and T to 5, and varying a from 0 to 2.
  • Fig. 12 is a plot of F-score as a varies when applying the KBKS approach and shows that a does influence the selection result, however, not in a significant way.
  • F-score grows as a increases from 0 to 0.3, and stabilises between 0.3 and 1.2. This could be explained by the fact that a frame with a higher coverage introduces more new visual content and is more likely to introduce less redundancy. For the sake of simplicity, a was set to 1 in the following experiments.
  • Fig. 13 is a plot of the metrics precision, recall, and F-score for each of the four approaches described with reference to Figs 7 to 12. Fig. 13 illustrates that the KBKS approach achieves better performance in regards to precision, recall, and F-score relative to the other approaches.
  • Table 3 shows the dissimilarity scores for each of the approaches and indicates that the results of the KBKS approach are more similar to the ground truth than the other methods.
  • the KBKS-fast approach is one embodiment of the KBKS approach and is described in more detail below. Table 3 Dissimilarity scores for the Clustering, Iso-Content Distance, Iso-Content Distortion, KBKS, and KBKS-fast approaches
  • the frame size of Foreman and Coastguard is 352 x 288, and frame size of the videos in the Open Video project is 352 x 240.
  • the total time needed to apply the KBKS approach is roughly 150 seconds broken down into: 150 seconds for the first step (Section II.A) and the second step (Section II.B); and less than 1 second for the third step (Section II.C) and the fourth step (Section III).
  • Time Cost = N × 0.02 + W × N × 0.1 + 1 ... Eqn(7), and the complexity is O(N).
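As a rough check on Eqn (7), if W is taken to denote the chaining window size and is assumed, purely for illustration, to be 5, the predicted cost for a 300-frame shot is of the same order as the reported total of roughly 150 seconds.

```python
def predicted_time_cost(N, W):
    """Eqn (7): per-shot processing time in seconds for N frames and window size W.
    Interpreting W as the chaining window size, and W = 5, are assumptions here."""
    return N * 0.02 + W * N * 0.1 + 1

print(predicted_time_cost(300, 5))   # 157.0 seconds, roughly the reported ~150 s
```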
  • the matching speed is approximately ten times faster than the conventional matching algorithm. That is, the computational cost of the fast matching algorithm is about 15 seconds for 300 frames.
  • the performance of the fast algorithm, namely KBKS-fast, is still comparable to the original scheme, though approximated matching is employed in the randomised kd-tree forest based matching algorithm.
  • the keypoint based keyframe selection (KBKS) approach described herein provides a keyframe selection method and system based on discriminative keypoints.
  • a video shot is first represented by a global pool of keypoints through keypoint chaining.
  • Second, a greedy algorithm is developed to select suitable keyframes based on the metric of coverage and optionally based on the metric of redundancy.
  • the arrangements described are applicable to the computer and data processing industries and particularly for the video, imaging, and security industries.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

Disclosed herein are a keypoint based keyframe selection (KBKS) video compression system and method. The disclosed method of compressing a video sequence identifies unique keypoints in each frame in a video sequence (110) and forms a global keypoint pool, based on the identified unique keypoints. The method then associates each frame with a selection value based on a number of unique keypoints identified in that frame (115, 120) and selects keyframes of the video sequence, based upon the selection values (125). The selection values include a coverage value and an optional redundancy value.

Description

KEYPOINT BASED KEYFRAME SELECTION Related Application
The present application claims priority to Australian Provisional Patent Application No.
2011904344, entitled "Keypoint based keyframe selection" and filed on 20 October 2011 in the name of The University of Sydney, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates generally to processing of video content and, in particular, to a method and system for keyframe selection.
Background
The proliferation of video acquisition devices and the mounting interest of consumers in the access to video repositories have significantly boosted the demand for effective and efficient methods in retrieving and managing such multimedia data. Video acquisition devices may be generally referred to as video cameras and include standalone video cameras, such as Pan-Tilt-Zoom (PTZ) cameras, and video cameras integrated into digital cameras, mobile telephone handsets, smartphones, and tablet computing devices. Video cameras capture more data (video content) than human viewers are able to process in a timely manner. Consequently there is a need for automated analysis of video content.
A video is structurally composed of a number of stories. Each story is depicted with a number of video shots and each shot comprises a sequence of images or frames. Thus, each frame is an image in an image sequence (video sequence). It is often desirable to reduce the volume of data associated with a video sequence in order to reduce storage requirements or to reduce transmission capacity. For example, it may be advantageous to reduce the volume of data associated with a video sequence in order to stream that video to a mobile telephone handset or smartphone over a telecommunications network, such as a mobile (cellular) telephone network. It may also be advantageous to reduce the volume of data associated with a video sequence in order to provide a quick preview of that video sequence.
A reduction in the volume of data associated with a video sequence may be referred to as compression of the video sequence. Compression of a video sequence may be achieved by reducing the content in each frame or by reducing the number of frames in a video sequence, or a combination thereof. Compression of a video sequence to produce a reduced set of frames, through identification of keyframes, may be referred to as summarisation of the video sequence.
Various encoding and compression techniques are known for use with video sequences. Compression can be either lossless or lossy. Lossless compression reduces bits by identifying and eliminating statistical redundancy. No information is lost in lossless compression. Lossy compression reduces bits by identifying and removing less important information within each frame of a video sequence. Consequently, the quality of a video subjected to lossy compression is reduced. A reduction in quality is often acceptable in exchange for a reduction in the size of the data associated with the video sequence. The actual extent of the reduction in quality and the reduction in the size of the data depends on the particular compression technique and the nature of the source video.
Due to the inherent temporal continuity of the consecutive frames within a video shot, there exists a great deal of redundant information among those frames. Therefore, summarising a video sequence by selecting a set of frames known as keyframes to represent a video shot is crucial for effective and efficient video content analysis. Early attempts at keyframe selection relied on a set of heuristic rules, such as choosing the first, middle, and last frames of each shot within a video sequence. In some application domains, sampling frames at a predefined interval is a plausible compromise between effectiveness and the cost of video shot detection. While these approaches may provide a reasonable subset of the frames, visual content within the frames is largely neglected by such approaches.
Since then, visual feature based methods have attracted more attention. By following the idea in shot boundary detection, a frame is chosen as a keyframe if that frame is significantly different from neighbouring frames in terms of visual features. Visual features may include, for example, colour histogram, motion information, and distribution correlation. Such approaches, however, focus on visual content change within a small temporal window and fail to take into account a global view of video shots within the video sequence that is being processed.
An alternative approach is to cluster all frames in a video shot and identify cluster centres as keyframes. This approach is intuitive, as the selected keyframes represent the prominent visual appearances and variations within a shot. "MINMAX optimal video summarization", Z. Li, G. M. Schuster, and A. K. Katsaggelos, IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, pp. 1245-1256, 2005 turned the task of keyframe selection into a MINMAX rate distortion optimization problem for video summarization and "Video summarization and scene detection by graph modeling", C.-W. Ngo, Y.-F. Ma, and
H.-J. Zhang, IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, pp. 296-305, 2005 tackled the clustering problem with the normalized cut algorithm. Recently, "Equivalent key frames selection based on iso-content principles", C. Panagiotakis,
A. Doulamis, and G. Tziritas, IEEE Transactions on Circuits and Systems for Video
Technology, vol. 19, pp. 447-451, 2009 proposed a novel keyframe selection algorithm based on three iso-content principles (Iso-Content Distance, Iso-Content Error and Iso-Content Distortion). According to the specified principle, the selected keyframes are equidistant in the video content curve and the most appropriate number of keyframes is automatically estimated in a supervised or unsupervised manner.
Most of these approaches are, however, subject to the following two limitations. First, they rely highly on global features such as colour, texture, and motion information. As a result, local details of frames are neglected, which makes the selected keyframes less
representative, though global features coarsely represent visual characteristics of an image. Second, it is difficult to decide how many keyframes should be selected. For example, it is always challenging to set an appropriate threshold when two adjacent frames are compared and there is no global context in relation to the video sequence that is being processed. For the clustering based approaches, it is generally an open issue to set a reasonable number of clusters without prior knowledge.
Recently, local features such as the scale-invariant feature transform (SIFT) descriptor, as described in "Distinctive image features from scale-invariant keypoints" D. G. Lowe, International Journal of Computer Vision, vol. 60, pp. 91-110, 2004, have played a significant role in many application domains of visual content analysis such as object recognition and image classification due to their distinctive representation capacity and being invariant to location, scale and rotation, and robust to affine transformation.
Thus, a need exists to provide an improved method and system for keyframe selection.
Summary
The present disclosure relates to a method and system for keyframe selection based on the identification of keypoints within each frame of a video sequence. The method determines a coverage value and optionally a redundancy value for each frame in the video sequence, based on the keypoints contained within that frame relative to a global pool of keypoints, and subsequently selects keyframes based on the coverage value.
In a first aspect, the present disclosure provides a method of keyframe selection, comprising the steps of: identifying unique keypoints in each frame in a video sequence; forming a global keypoint pool, based on the identified unique keypoints; associating each frame with a selection value based on a number of unique keypoints identified in that frame; and selecting keyframes of the video sequence, based upon the selection values.
In a second aspect, the present disclosure provides a computer readable storage medium having recorded thereon a computer program for keyframe selection. The computer program comprises code for performing the steps of: identifying unique keypoints in each frame in a video sequence; forming a global keypoint pool, based on the identified unique keypoints; associating each frame with a selection value based on a number of unique keypoints identified in that frame; and selecting keyframes of the video sequence, based upon the selection values.
In a third aspect, the present disclosure provides an apparatus for performing keyframe selection. The apparatus includes a storage device for storing a computer program and a processor for executing a program. The program includes code for performing the method steps of: identifying unique keypoints in each frame in a video sequence; forming a global keypoint pool, based on said identified unique keypoints; associating each frame with a selection value based on a number of unique keypoints identified in that frame; and selecting keyframes of said video sequence, based upon said selection values.
In a fourth aspect, the present disclosure provides method of keyframe selection, comprising the steps of: identifying a set of keypoints in each frame of a video sequence, said video sequence having a plurality of frames; forming a global keypoint pool derived from said identified sets of keypoints, said global keypoint pool including mutually exclusive sets of covered keypoints and uncovered keypoints; selecting a frame containing a highest number of keypoints as an initial keyframe in a set of keyframes; amending said set of covered keypoints to include keypoints contained in said initial keyframe and amending said set of uncovered keypoints to exclude keypoints contained in said initial keyframe; and iteratively performing a set of steps until said set of keyframes satisfies a quality criteria. The set of steps includes: determining a selection value for each frame outside said set of keyframes; selecting one of said frames outside said set of keyframes as a keyframe in said set of keyframes, based on said selection value of said frame being higher than selection values associated with other frames outside said set of keyframes; and amending said set of covered keypoints to include keypoints contained in said selected frame and amending said set of uncovered keypoints to exclude keypoints contained in said selected frame.
In a fifth aspect, the present disclosure provides a camera system for keyframe selection, said camera system comprising: a lens system for focussing on a scene; a camera module coupled to said lens system to capture a video sequence of said scene; a storage device for storing a computer program; and a processor for executing the program. The program includes: code for identifying unique keypoints in each frame in a video sequence; code for forming a global keypoint pool, based on said identified unique keypoints; code for associating each frame with a selection value based on a number of unique keypoints identified in that frame; and code for selecting keyframes of said video sequence, based upon said selection values.
According to another aspect, the present disclosure provides an apparatus for implementing any one of the aforementioned methods.
According to another aspect, the present disclosure provides a computer program product including a computer readable medium having recorded thereon a computer program for implementing any one of the methods described above.
Other aspects of the present disclosure are also provided.
Brief Description of the Drawings
One or more embodiments of the present disclosure will now be described by way of specific example(s) with reference to the accompanying drawings, in which:
Fig. 1 is a flow diagram of a method for keypoint based keyframe selection;
Fig. 2 is a schematic representation of a system on which one or more embodiments of the present disclosure may be practised;
Fig. 3 is a schematic block diagram representation of a system that includes a general purpose computer on which one or more embodiments of the present disclosure may be practised;
Fig. 4 illustrates inter-window keypoint chaining across overlapping temporal windows;
Fig. 5 is a schematic representation of Intra-window chaining of keypoints;
Fig. 6 illustrates selection of a keyframe based on an influence value;
Fig. 7 shows four shots on which an assessment of keyframe selection techniques was performed;
Figs 8A to 8F illustrate the results of applying four techniques to the first shot 710 of Fig. 7, derived from the Foreman video sequence;
Fig. 9 shows the result of a keypoint based keyframe selection approach, having regard to global features;
Figs 10A to 10E illustrate the results of applying four techniques to the second shot 720 of Fig. 7, derived from the Coastguard video sequence;
Fig. 11 is a plot of the F-score for each of four keyframe selection approaches;
Fig. 12 is a plot of F-score as the weighting factor α varies;
Fig. 13 is a plot of the metrics precision, recall and F-score for each of the four keyframe selection approaches described with reference to Figs 7 to 12;
Fig. 14 is a schematic representation illustrating contribution of keypoints in each frame of a video sequence to a global keypoint pool;
Figs 15A to 15D illustrate a keyframe selection process; and
Fig. 16 illustrates SIFT keypoints on a frame of the Foreman video.
Detailed Description
Method steps or features in the accompanying drawings that have the same reference numerals are to be considered to have the same function(s) or operation(s), unless the contrary intention is expressed or implied.
A selected set of keyframes of a video sequence is preferably representative of video content of that video sequence and contains minimum redundancy. Disclosed herein are a computer-implemented method, system, and computer program product for selecting keyframes of a video sequence based on a keypoint based framework, such that local features contained within frames of the video sequence are employed in selecting keyframes. The keyframe selection method and system define a global pool of keypoints identified from the video sequence and utilise a coverage attribute determined for each frame in the video sequence to select keyframes. The coverage attribute corresponds to the number of keypoints in each frame relative to the global keypoint pool or a portion of the global keypoint pool. The keyframe selection method and system optionally also utilise a redundancy attribute in the process of selecting the set of keyframes.
The method characterises each frame in a video sequence with a set of keypoints for that frame. A keypoint is an item of interest within a frame. Depending on the particular application, a keypoint may be a SIFT keypoint or other local feature, such as a predefined portion, block, or patch of a frame. For example, in the case in which a frame consists of Discrete Cosine Transform (DCT) blocks, as used in JPEG images in a motion-JPEG stream, a keypoint may correspond to an 8x8 DCT block or a group of such blocks. Fig. 16 illustrates SIFT keypoints identified on a frame of the Foreman video that is widely known in the art. Each keypoint in the set of keypoints is described with visual descriptors. Since a video shot generally captures a scene, being a part of the physical world, the content of each frame corresponds to a fraction of the scene. Thus, all the frames in a video sequence contribute a respective set of keypoints to the depiction of the scene. The process of keyframe selection identifies a number of frames of a video sequence having an associated set of keypoints that are representative of the scene depicted in the video sequence.
The keypoint-based keyframe selection method of the present disclosure identifies keypoints in each frame of a video sequence that is being analysed and extracts visual descriptors for each keypoint. The method forms a global keypoint pool of unique keypoints identified from the frames of the video sequence to represent the whole video shot through keypoint matching. The method then selects representative frames which best cover the global keypoint pool as keyframes. Selection of the keyframes is based on a coverage measure, a redundancy measure, or a combination thereof. The coverage measure is the extent to which each frame covers the keypoints in the global keypoint pool at a given time. The redundancy measure is an indication of the extent to which the keypoints in a set of keypoints associated with a frame are present in a set of keypoints associated with another frame in the video sequence.
Applying the keypoint based keyframe selection (KBKS) method to a video sequence produces a set of keyframes, which may be used in many applications, including, for example, but not limited to: summarising or compressing a video sequence, previewing large quantities of video data, reviewing video footage (e.g., closed circuit television (CCTV) footage, security camera footage, and medical imaging data), video editing software applications, broadcasting in reduced bandwidth scenarios, and reviewing historical video data. For example, keyframe selection may be utilised to select an I-frame in a step of a video compression application.
A system utilising the keypoint based keyframe selection method may be implemented on-camera, in which a software application executing on a processor of the camera processes a window of a video sequence in near real-time or at a later time. Alternatively, a system utilising the keypoint based keyframe selection method may be implemented on a computing device, wherein software executing on a processor of the computing device executes instructions in accordance with the keypoint based keyframe selection to process a stored video sequence and produce a set of keyframes of that video sequence. Other arrangements may equally be practised without departing from the spirit and scope of the present disclosure.
Fig. 1 is a flow diagram illustrating a method 100 of keypoint-based keyframe selection in accordance with the present disclosure. The method 100 begins at a Start step 105 and proceeds to step 110, which identifies a set of unique keypoints, or key features, across a window of a video sequence. The window may be all frames in a video sequence or a subset thereof comprising one or more frames. The set of keypoints defines a global keypoint pool for the window. In one implementation, the method analyses each frame in the window and identifies the unique keypoints in each frame. Each frame is associated with a set of keypoints identified in that frame. All of the unique keypoints identified in the frames of the window form the global keypoint pool.
Control passes from step 110 to step 115, which assesses each frame in the window for a coverage value, corresponding to the contribution of unique keypoints from that frame to the global keypoint pool, and a redundancy value in relation to the global keypoint pool. Control passes to step 120, which selects keyframes from the window, based on predefined quality criteria. The quality criteria is based on a number of frames to be returned, a selected coverage value, a selected redundancy value, or a combination thereof. In one implementation, the quality criteria is an influence value determined from the coverage value and redundancy value of each frame.
One arrangement repeats steps 115 and steps 120 in an iterative fashion to select consecutive keyframes of a set of keyframes. During each iteration, the method determines in step 115 a new coverage value for each frame that has not already been selected as a keyframe. The new coverage value is based on the number of keypoints in each frame that, at the time of that iteration, remain uncovered in the global keypoint pool by the keypoints identified in the already selected keyframes. Step 120 selects a new keyframe to add to the set of keyframes based on the new coverage values determined for that iteration. One implementation determines a redundancy value for each frame in step 115 of each iteration and the selection of a new keyframe in step 120 is based on the coverage values and the redundancy values calculated during a present iteration.
Control passes from step 125 to an End step 130 and the method 100 terminates.
Fig. 14 is a schematic representation of one example of a method of keyframe selection in accordance with the present disclosure. Fig. 14 shows a video sequence comprising a first frame 1410, a second frame 1420, a third frame 1430, and a fourth frame 1440. The method analyses each of the first frame 1410, second frame 1420, third frame 1430, and fourth frame 1440 to identify any keypoints located within each respective frame.
In the example of Fig. 14, the method identifies 5 keypoints in the first frame 1410,
4 keypoints in the second frame 1420, 5 keypoints in the third frame 1430, and 6 keypoints in the fourth frame 1440. The unique keypoints identified across all of the frames 1410, 1420, 1430, and 1440 form keypoints in a global keypoint pool 1450. In the example of Fig. 14, there are 11 unique keypoints in the global keypoint pool 1450. The identified keypoints for the respective frames and the global keypoint pool are shown in Table 1.
Table 1: Set of keypoints identified for each frame in Fig. 14
Each frame 1410, 1420, 1430, 1440 is assigned a respective coverage value, based on the number of keypoints in that frame relative to the number of unique keypoints in the global keypoint pool 1450. The first frame 1410 includes 5 keypoints out of the total of 11 unique keypoints in the global keypoint pool 1450, resulting in a coverage value of 5/11 for the first frame 1410. The second frame 1420 includes 4 keypoints out of the 11 unique keypoints in the global keypoint pool 1450, resulting in a coverage value of 4/11 for the second frame 1420. The third frame 1430 includes 5 keypoints out of the 11 unique keypoints in the global keypoint pool 1450, resulting in a coverage value of 5/11 for the third frame 1430. The fourth frame 1440 includes 6 keypoints out of the 11 unique keypoints in the global keypoint pool 1450, resulting in a coverage value of 6/11 for the fourth frame 1440.
In one embodiment, the method utilises a quality criteria that defines a number of keyframes that are to be selected. The method selects the required number of keyframes by selecting the frames with the highest coverage values.
In another embodiment, the method utilises a quality criteria that defines a minimum coverage value required of the keyframes. The method then selects those frames having an associated coverage value that meets or exceeds the minimum coverage value defined in the quality criteria.
Each frame 1410, 1420, 1430, 1440 may also be assigned a respective redundancy value, which is optionally utilised in the keyframe selection process. In one example, the redundancy value is based on the number of keypoints in a given frame that appear in other frames within the frame sequence. In another example, the redundancy value for a frame is based on the number of keypoints in that frame that are covered in the global keypoint pool by earlier selected keyframes. Referring to Fig. 14, the first frame 1410 includes 5 keypoints, 2 of which appear in one or more other frames in the video sequence of the example of Fig. 14, resulting in a redundancy value of 2/5 for the first frame 1410. Similarly, the second frame 1420 includes 4 keypoints, 4 of which appear in one or more other frames in the video sequence of the example of Fig. 14, resulting in a redundancy value of 4/4 for the second frame 1420. The third frame 1430 includes 5 keypoints, 5 of which appear in one or more other frames in the video sequence of the example of Fig. 14, resulting in a redundancy value of 5/5 for the third frame 1430. The fourth frame 1440 includes 6 keypoints, 4 of which appear in one or more other frames in the video sequence of the example of Fig. 14, resulting in a redundancy value of 4/6 for the fourth frame 1440.
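As a concrete illustration of how these two measures can be computed, the sketch below uses hypothetical keypoint labels (the letters are invented here purely so that the per-frame counts match the example of Fig. 14) and plain Python set operations:

```python
# Hypothetical keypoint labels chosen so that the per-frame counts match
# the example of Fig. 14 (5, 4, 5 and 6 keypoints; 11 unique in the pool).
frames = {
    1: {"a", "b", "c", "d", "e"},
    2: {"d", "e", "f", "g"},
    3: {"e", "f", "g", "h", "i"},
    4: {"f", "g", "h", "i", "j", "k"},
}

# Global keypoint pool: the union of the unique keypoints of every frame.
pool = set().union(*frames.values())

for idx, kps in frames.items():
    coverage = len(kps & pool) / len(pool)  # fraction of the pool covered by this frame
    # Redundancy in the sense of the first example above: keypoints that also
    # appear in at least one other frame of the sequence.
    others = set().union(*(v for k, v in frames.items() if k != idx))
    redundancy = len(kps & others) / len(kps)
    print(f"frame {idx}: coverage {coverage:.2f}, redundancy {redundancy:.2f}")
# Expected: coverages 5/11, 4/11, 5/11, 6/11 and redundancies 2/5, 4/4, 5/5, 4/6.
```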
Other methods of determining coverage values and redundancy values may equally be utilised, without departing from the spirit and scope of the present disclosure. A further example of determining coverage values and redundancy values will be described below with reference to Figs 15A to 15D.
Once the keypoints have been identified in each frame and a coverage value and
redundancy value determined for each frame, the method selects a set of keyframes for the video sequence, based on a quality criteria. In one example, the quality criteria includes a coverage threshold that each keyframe must satisfy. Thus, a user is able to set a quality criteria that selects as keyframes those frames having a minimum coverage value of 40%. In the example of Fig. 14, such a quality criteria produces a set of keyframes consisting of frame 1 1410, frame 3 1430, and frame 4 1440. In another example, a user sets a quality criteria that selects as keyframes those frames having a minimum coverage value of 50%, in which case the set of keyframes consists of frame 4 1440. In a further example, the quality criteria includes a minimum number, a maximum number, or an exact number of frames that are to be returned by the keyframe selection process.
In another arrangement, the quality criteria include a coverage threshold that each keyframe must satisfy and a redundancy threshold that each keyframe must satisfy. If a user sets the quality criteria to include a coverage threshold that selects as keyframes those frames having a minimum coverage value of 40% and a maximum redundancy value of 70%, that quality criteria produces a set of keyframes consisting of frame 1 1410 and frame 4 1440. In that particular example, the set of keyframes includes 2 frames, which is 50% less than the number of frames in the original video sequence, yet the set of keyframes includes all 11 keypoints in the global keypoint pool 1450.
The user is able to adjust the number of frames, the contribution threshold, and the redundancy threshold in the quality criteria to achieve the required level of summarisation or compression and quality for the particular application.
In one arrangement, the quality criteria include an influence value derived from the coverage value and the redundancy value. In one implementation, the influence value (selection value) of a frame is the coverage value of that frame less the redundancy value of that frame. In another implementation, a weighting value is applied to either one or both of the coverage value and the redundancy value. In that implementation, the selection value of a frame is a weighted difference between the coverage value and the redundancy value associated with that frame.
In a further arrangement, the quality criteria include a predefined number or percentage of frames that are to be selected. Thus, for a video sequence of 200 frames, a user wanting to reduce by half the data file associated with that sequence selects 100 frames for the compression. The method selects 100 frames from the video sequence based on the coverage values associated with each frame, either alone or in combination with the redundancy values associated with each frame.
Fig. 2 is a functional schematic block diagram of a system 200, upon which methods of keyframe selection in accordance with the present disclosure may be performed. The system 200 includes a camera 210, which in the example of Fig. 2 is a pan-tilt-zoom (PTZ) camera. The camera 210 is adapted to capture a video sequence comprising one or more frames representing the visual content of a scene 220 appearing in a field of view of the camera 210.
The camera 210 includes a lens system 212 coupled to a photo-sensitive image sensor 214. The image sensor receives an optical image of the scene 220 focussed by the lens system 212 and converts the optical image into an electronic signal. The image sensor 214 may be implemented, for example, using a digital charge-coupled device (CCD) or complementary metal oxide semiconductor (CMOS) device. The camera 210 also includes a pan, tilt, and zoom control module 222, which is adapted to receive control inputs from a user to control the lens system 212 to record a desired shot or sequence of frames. Each of the lens system 212, the image sensor 214, and the pan, tilt, and zoom control module 222 is coupled to a bus 216 to facilitate the exchange of control and data signals.
The camera 210 also includes a memory 218 and a processor 220. The memory 218 stores a computer program that when executed on the processor 220 controls operation of the lens system 212 and processing of recorded images stored in the memory 218. In one arrangement, a computer program stored in the memory 218 includes computer program code for implementing a keyframe selection method in accordance with the present disclosure. The camera 210 also includes an input/output (I/O) module 224 that couples to a communications network 230. The I/O module 224 may be implemented using a physical connection, such as a Universal Serial Bus (USB) interface, Firewire interface, or the like. Alternatively, the I/O module may be implemented using a wireless connection, such as Bluetooth, Wi-Fi, or the like. The I/O module 224 may also be adapted to read and write to one or more physical storage media, such as external hard drives, flash memory, and memory cards, including CompactFlash cards, Memory Sticks, Secure Digital (SD) cards, miniSD cards, microSD cards, and the like.
The communications network 230 is also coupled to a database 240 and a computer server 250. Images recorded by the camera 210 may be transmitted via the
communications network 230 for storage on the database 240 or the computer server 250. In one arrangement, the server 250 includes a memory for storing a computer program that includes instructions that when executed on a processor of the server 250 implement a video compression method in accordance with the present disclosure. In one example, such a computer program executing on a processor of the computer server 250 is utilised to process a video sequence stored on the database 240.
Methods of keyframe selection in accordance with the present disclosure may equally be practised on a general purpose computer. Video frames captured by a camera are processed in accordance with instructions executing on the processor of the general purpose computer to identify keypoints in each frame, generate a global keypoint pool, and select keyframes on the basis of the identified keypoints. In one arrangement, a video camera is coupled to a general purpose computer for processing of the captured frames. The general purpose computer may be co-located with the camera or may be located remotely from the camera and coupled by a communications link or network, such as the Internet. In another arrangement, video frames are retrieved from storage memory and are presented to the processor for keypoint analysis and keyframe selection.
The keyframe selection system of the present disclosure may be practised using a computing device, such as a general purpose computer or computer server. Fig. 3 is a schematic block diagram of a system 300 that includes a general purpose computer 310. The general purpose computer 310 includes a plurality of components, including: a processor 312, a memory 314, a storage medium 316, input/output (I/O) interfaces 320, and input/output (I/O) ports 322. Components of the general purpose computer 310 generally communicate using a bus 348.
The memory 314 may include Random Access Memory (RAM), Read Only Memory (ROM), or a combination thereof. The storage medium 316 may be implemented as one or more of a hard disk drive, a solid state "flash" drive, an optical disk drive, or other storage means. The storage medium 316 may be utilised to store one or more computer programs, including an operating system, software applications, and data. In one mode of operation, instructions from one or more computer programs stored in the storage medium 316 are loaded into the memory 314 via the bus 348. Instructions loaded into the memory 314 are then made available via the bus 348 or other means for execution by the processor 312 to effect a mode of operation in accordance with the executed instructions.
One or more peripheral devices may be coupled to the general purpose computer 310 via the I/O ports 322. In the example of Fig. 3, the general purpose computer 310 is coupled to each of a speaker 324, a camera 326, a display device 330, an input device 332, a printer 334, and an external storage medium 336. The speaker 324 may include one or more speakers, such as in a stereo or surround sound system.
The camera 326 may be a webcam, or other still or video digital camera for capturing a video sequence, and may download and upload information to and from the general purpose computer 310 via the I/O ports 322, dependent upon the particular implementation. For example, images recorded by the camera 326 may be uploaded to the storage medium 316 of the general purpose computer 310. Similarly, images stored on the storage medium 316 may be downloaded to a memory or storage medium of the camera 326. The camera 326 may include a lens system, a sensor unit, and a recording medium.
The display device 330 may be a computer monitor, such as a cathode ray tube screen, plasma screen, or liquid crystal display (LCD) screen. The display 330 may receive information from the computer 310 in a conventional manner, wherein the information is presented on the display device 330 for viewing by a user. The display device 330 may optionally be implemented using a touch screen, such as a capacitive touch screen, to enable a user to provide input to the general purpose computer 310.
The input device 332 may be a keyboard, a mouse, or both, for receiving input from a user. The external storage medium may be an external hard disk drive (HDD), an optical drive, a floppy disk drive, or a flash drive.
The I/O interfaces 320 facilitate the exchange of information between the general purpose computing device 310 and other computing devices. The I/O interfaces may be
implemented using an internal or external modem, an Ethernet connection, or the like, to enable coupling to a transmission medium. In the example of Fig. 3, the I/O interfaces 322 are coupled to a communications network 338 and directly to a computing device 342. The computing device 342 is shown as a personal computer, but may equally be implemented using a smartphone, laptop, or tablet device. Direct communication between the general purpose computer 310 and the computing device 342 may be effected using a wireless or wired transmission link.
The communications network 338 may be implemented using one or more wired or wireless transmission links and may include, for example, a dedicated communications link, a local area network (LAN), a wide area network (WAN), the Internet, a telecommunications network, or any combination thereof. A telecommunications network may include, but is not limited to, a telephony network, such as a Public Switch Telephony Network (PSTN), a mobile telephone cellular network, a short message service (SMS) network, or any combination thereof. The general purpose computer 310 is able to communicate via the communications network 338 to other computing devices connected to the communications network 338, such as the mobile telephone handset 344, the touchscreen smartphone 346, the personal computer 340, and the computing device 342. The general purpose computer 310 may be utilised to implement a keyframe selection system in accordance with the present disclosure. In such an embodiment, the memory 314 and storage 316 are utilised to store data relating to captured video frames, a set of keypoints associated with each frame, and a global keypoint pool. Software for
implementing the keyframe selection method is stored in one or both of the memory 314 and storage 316 for execution on the processor 312. The software includes computer program code for effecting method steps in accordance with the method of keyframe selection described herein.
Keypoint Matching
One arrangement utilises a scale-invariant feature transform (SIFT) to implement step 110 of Fig. 1 to detect and describe keypoints in frames of a video sequence. However, it will be appreciated by a person skilled in the relevant art that other methods of identifying keypoints, such as that described in "A performance evaluation of local descriptors", K. Mikolajczyk and C. Schmid, IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 27, no. 10, pp. 1615-1630, 2005, may be equally practised without departing from the spirit and scope of the present disclosure.
For each detected keypoint, there are three steps to calculate an associated SIFT descriptor. First, the image gradient magnitudes and orientations are calculated, sampled from a neighbouring 16x16 region around the keypoint. Second, in order to eliminate the influence introduced by small changes in the position of the window, the magnitude of each sample point is weighted by a Gaussian weighting function. Third, those samples are accumulated into orientation histograms summarising the contents over 4x4 subregions. The length of each orientation vector corresponds to the sum of the gradient magnitudes near that direction of the region. Therefore, the SIFT descriptor of each keypoint is a 128-dimension feature vector (a 4x4 array of orientation histograms with 8 orientation bins in each) that provides context for the neighbourhood of the keypoint.
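By way of illustration only, keypoint detection and 128-dimensional SIFT descriptor extraction of this kind can be performed with an off-the-shelf library; the sketch below assumes OpenCV 4.4 or later, where SIFT is exposed as cv2.SIFT_create, and the file name is a placeholder rather than part of the present disclosure:

```python
import cv2

# Load one frame of a video sequence (the path is illustrative only).
frame = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)

# Detect SIFT keypoints and compute their 128-dimensional descriptors
# (4x4 orientation histograms with 8 bins each, as described above).
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(frame, None)

print(f"{len(keypoints)} keypoints, descriptor shape {descriptors.shape}")  # (N, 128)
```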
Straightforward keypoint matching based on SIFT descriptors results in many false matches. Lowe proposed to improve matching robustness by imposing a ratio test criterion. An example of such a ratio test criterion is to accept a match only when the ratio of the nearest neighbour distance to the second nearest neighbour distance is below a given threshold.
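A minimal sketch of such ratio-test matching between the descriptors of two frames, again assuming OpenCV; the 0.8 threshold is an illustrative value rather than one prescribed by the present disclosure:

```python
import cv2

def ratio_test_matches(desc_a, desc_b, ratio=0.8):
    """Keep a match only when the nearest neighbour is sufficiently closer
    than the second nearest neighbour (Lowe's ratio test)."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    candidates = matcher.knnMatch(desc_a, desc_b, k=2)
    good = []
    for pair in candidates:
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    return good
```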
However, there still exist two challenging problems. First, the cost of keypoint matching between two target frames is high. In order to match keypoints exhaustively, it is necessary to calculate the distance between every pair of keypoints in both target frames, which is computationally expensive.
In order to relieve this problem and take advantage of the continuity among adjacent frames, one arrangement of a keyframe selection method of the present disclosure implements a matching strategy that considers only those candidate keypoints within a certain radius R of the target keypoint. Meanwhile, false matching can also be reduced by imposing such a constraint.
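One straightforward way to impose such a spatial constraint is to discard candidate matches whose keypoint locations are more than R pixels apart; the helper below is a sketch under that assumption, with R = 100 chosen to match the experimental setting reported later:

```python
def within_radius(matches, kps_a, kps_b, radius=100.0):
    """Keep only matches whose two keypoints lie within `radius` pixels
    of each other, exploiting the continuity of adjacent frames."""
    kept = []
    for m in matches:
        xa, ya = kps_a[m.queryIdx].pt
        xb, yb = kps_b[m.trainIdx].pt
        if (xa - xb) ** 2 + (ya - yb) ** 2 <= radius ** 2:
            kept.append(m)
    return kept
```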
Second, there are a number of false-positive matches, and as a result, the global pool of keypoints K would contain noisy keypoints. To filter these false matches, one arrangement of a keyframe selection method of the present disclosure implements the RANdom Sample Consensus algorithm (RANSAC) described in "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography", M. A. Fischler and R. C. Bolles, Communications of the ACM, vol. 24, pp. 381-395, 1981 iteratively to detect sets of geometrically consistent keypoint matches. This process is repeated until no further large set of matches (e.g., five matches in a group) can be found.
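Geometric filtering of this kind can be sketched with OpenCV's RANSAC-based homography estimation; the homography model and the 3-pixel reprojection threshold are illustrative assumptions, since the disclosure only requires sets of geometrically consistent matches:

```python
import numpy as np
import cv2

def ransac_filter(matches, kps_a, kps_b, threshold=3.0):
    """Keep the matches that are inliers to a RANSAC-fitted homography."""
    if len(matches) < 4:  # a homography needs at least 4 correspondences
        return []
    src = np.float32([kps_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kps_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    _, mask = cv2.findHomography(src, dst, cv2.RANSAC, threshold)
    if mask is None:
        return []
    return [m for m, keep in zip(matches, mask.ravel()) if keep]
```

The disclosure applies RANSAC iteratively until no further large consistent set (e.g., five matches in a group) can be found; a single-pass helper such as the one above would simply be re-run on the matches that remain after each accepted set is removed.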
Keypoint Pool Construction
In order to build a global keypoint pool K from all keypoints kx in each frame fi to represent the content of a video shot, it is preferable that every two frames fi and fj (a pair) within the shot go through keypoint matching. However, such a strategy is very costly. For example, there are approximately 60,000 frame pairs for a 10 second video shot at 25 frames per second (fps). Utilising the inherent nature of visual continuity among consecutive video frames, one arrangement utilises an Inter-window Keypoint Chaining scheme to constrain the pairing within a temporal window of size W without losing the discriminative power of keypoint matching, as illustrated in Fig. 4.
Fig. 4 is a schematic representation of a video sequence 400 comprising a plurality of frames to which an Inter-window Keypoint Chaining scheme is applied. In the example of Fig. 4, the video sequence 400 is broken into a first temporal window 410 of frames and a second temporal window 420 of frames, wherein the first temporal window 410 and the second temporal window 420 overlap. The video sequence includes a frame fi with keypoint k1, a frame fj with keypoint k2, and a frame fm with keypoint k3, where k1, k2, and k3 are matched keypoints. Keypoints are only matched within a window and chained across multiple windows. When a keypoint k1 in frame fi is matched with another keypoint k2 in frame fj, and the same keypoint k2 is matched with a third keypoint k3 in frame fm, satisfying |i - j| <= W and |m - j| <= W, those matches are joined into a chain, which finally contributes to the same unique keypoint in the global keypoint pool K without matching keypoints between fi and fm.
The window size can be adaptively determined by calculating visual variations between consecutive frames in terms of distribution correlation. On the other hand, true keypoint matches may be dropped during matching. In order to make the matching more reliable, one arrangement also utilises Intra-Window Keypoint Chaining.
Fig. 5 is a schematic representation of Intra-window chaining of keypoints. Fig. 5 shows a video sequence 500 comprising a plurality of frames. A subset of the frames is presented in a temporal window 510. The temporal window 510 includes a first frame 512 with keypoint k1, a second frame 514 with keypoint k2, and a third frame 516 with keypoint k3. In this example, k1 is matched with k3 but not k2, and k2 is matched with k3. In this case, k1, k2, and k3 are linked by a single chain, which eases the problem of missed matching (e.g., k1 is a true match with k2).
After the keypoint chaining on frames, each keypoint either belongs to a chain of matched keypoints or becomes a singleton keypoint. The method discards all singleton keypoints, which are very likely to be noisy keypoints. Each chain is represented by its head keypoint and the number of keypoints on that chain, denoted by (kx, Nx). The head keypoint is the first instance of the keypoint in the chain. The global keypoint pool K is then formed by aggregating all (kx, Nx). In order to reduce noisy chains, one arrangement optionally further filters less important/unstable global keypoints by setting a threshold T for Nx, wherein T is a minimum number of frames that are chained, in order to eliminate transient keypoints. In one arrangement, T is set to 5.
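Chaining of matched keypoints can be realised with a disjoint-set (union-find) structure over (frame index, keypoint index) pairs; the sketch below is one possible realisation rather than the specific chaining procedure of the embodiments, and the min_length default of 5 mirrors the threshold T described above:

```python
class KeypointChains:
    """Union-find over (frame_index, keypoint_index) pairs."""

    def __init__(self):
        self.parent = {}

    def find(self, node):
        self.parent.setdefault(node, node)
        while self.parent[node] != node:
            self.parent[node] = self.parent[self.parent[node]]  # path halving
            node = self.parent[node]
        return node

    def union(self, a, b):
        """Record an accepted match between two keypoints."""
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

    def chains(self, min_length=5):
        """Group keypoints by chain root and drop short (unstable) chains."""
        groups = {}
        for node in list(self.parent):
            groups.setdefault(self.find(node), []).append(node)
        return [sorted(g) for g in groups.values() if len(g) >= min_length]

# Each accepted match between keypoint p of frame i and keypoint q of frame j
# (with |i - j| <= W) contributes one union((i, p), (j, q)) call; the surviving
# chains, each represented by its head keypoint, form the global keypoint pool K.
```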
Keyframe Selection
The goal of keyframe selection is to represent a video shot with a minimal number of keyframes whilst retaining the highest quality of the scene depicted in the shot. That is, the selected keyframes are best able to represent the video shot while minimizing redundancy among those keyframes.
In accordance with the method and system of the present disclosure, keyframes are selected to cover as many keypoints in the global keypoint pool as possible. Since this can be formulated as a variation of the well-known Set Cover Problem, which has been proven to be nondeterministic polynomial time (NP)-complete, one arrangement implements a greedy algorithm to approximately tackle this issue.
At first, the method selects a frame of the video sequence having the highest number of keypoints with reference to the keypoint pool as an initial keyframe in a set of keyframes. At each iteration, the method selects one of the remaining frames outside the set of keyframes as a keyframe if that frame best helps improve the coverage of keypoints in the global keypoint pool while minimising redundancy. As described above with reference to Fig. 14, the method and system of the present disclosure utilise a coverage (contribution) value, and optionally a redundancy value, to guide the selection process.
In the selection process, the global keypoint pool is separated into two sets, Kcovered and Kuncovered. At the beginning of the process, Kuncovered contains all keypoints in the global keypoint pool K and Kcovered is empty. For a frame fi, denote the keypoint set associated with that frame as FPi; then the Coverage of the frame with respect to the global keypoint pool is defined as the cardinality of the intersection between FPi and the uncovered set:
C(fi) = |FPi ∩ Kuncovered| ... Eqn (1)
Likewise, Redundancy is defined as how many keypoints the frame contains in Kcovered, reflecting how redundant the frame is based on the covered content in the shot:
R(fi) = |FPi ∩ Kcovered| ... Eqn (2)
The influence of frame fi at an iteration is calculated in Eqn (3) as a balance of C(fi) and R(fi) controlled by a weighting factor α. In one implementation, α is set to 1.
Influence(fi) = C(fi) − αR(fi) ... Eqn (3)
A simplified illustration of the calculation is presented in Fig. 6, which shows two frames f1 and f2. Frame f1 has a coverage value of 20 and a redundancy value of 16, resulting in an influence value of 20 − 16 = 4. Frame f2 has a coverage value of 16 and a redundancy value of 8, resulting in an influence value of 16 − 8 = 8. In this example, f1 has higher coverage, but also higher redundancy, than f2, so f2 is favoured during the selection of keyframes.
At the end of each iteration, the method selects the frame with the highest influence value and positive coverage as a keyframe, and updates Kcovered and Kuncovered based on the keypoints of the selected keyframe. The iteration repeats until all the keypoints are covered or a predefined coverage threshold of the global keypoint pool K is reached.
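The greedy selection loop follows directly from Eqns (1) to (3); the sketch below assumes each frame is represented by the set of global-pool keypoint identifiers it contains, and uses a target coverage fraction as the stopping criterion (a fixed number of keyframes or other quality criteria would plug in at the same point):

```python
def select_keyframes(frame_keypoints, alpha=1.0, target_coverage=0.95):
    """Greedy keypoint-based keyframe selection (Eqns (1)-(3)).

    frame_keypoints: dict mapping frame index -> set of global keypoint ids.
    Returns the selected keyframe indices in selection order.
    """
    pool = set().union(*frame_keypoints.values())   # global keypoint pool K
    uncovered = set(pool)                           # K_uncovered
    covered = set()                                 # K_covered
    keyframes = []
    remaining = dict(frame_keypoints)

    # Initial keyframe: the frame containing the highest number of keypoints.
    first = max(remaining, key=lambda i: len(remaining[i]))
    keyframes.append(first)
    covered |= remaining[first]
    uncovered -= remaining[first]
    del remaining[first]

    while remaining and len(covered) / len(pool) < target_coverage:
        def influence(i):
            c = len(remaining[i] & uncovered)       # Coverage, Eqn (1)
            r = len(remaining[i] & covered)         # Redundancy, Eqn (2)
            return c - alpha * r                    # Influence, Eqn (3)

        best = max(remaining, key=influence)
        if len(remaining[best] & uncovered) == 0:   # no positive coverage left
            break
        keyframes.append(best)
        covered |= remaining[best]
        uncovered -= remaining[best]
        del remaining[best]

    return keyframes
```

Applied to hypothetical keypoint sets such as those in the earlier sketch, this routine selects the frame with 6 keypoints first and the frame with 5 previously uncovered keypoints second, matching the outcome illustrated in Figs 15A to 15D.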
Figs 15A to 15D illustrate the keyframe selection process with reference to the example of Fig. 14. In this example, the quality criteria is set to return 2 frames from the video sequence. Fig. 15A shows the same video sequence of Fig. 14, with first frame 1410, second frame 1420, third frame 1430, and fourth frame 1440. At the beginning of the keyframe selection process, the global keypoint pool 1450 is separated into 2 sets, Kcovered and Kuncovered, wherein Kuncovered contains all keypoints in the global keypoint pool 1450 and Kcovered is empty.
Fig. 15B illustrates a first iteration of the keyframe selection process, in which the method selects the frame having the highest number of keypoints as an initial keyframe of a set of keyframes. In this example, the method selects the fourth frame 1440, having 6 keypoints. As defined in Eqn (1), the coverage value of the fourth frame 1440 is the cardinality of the intersection between the set of keypoints in the fourth frame and the present uncovered set, Kuncovered. In this example, the fourth frame 1440 contains a set of 6 keypoints and Kuncovered includes all of the keypoints, so the coverage value for the fourth frame 1440 is C(f4) = 6. Kcovered is amended to include the set of keypoints contained in the fourth frame 1440 and Kuncovered is amended by removing the set of keypoints contained in the fourth frame 1440.
In the case in which a redundancy value is determined for the selected keyframe, the first selected keyframe has a redundancy value of 0, as redundancy is defined as how many keypoints contained in the selected frame are in Kcovered, which is initially the empty set.
Fig. 15C shows the next iterative step, in which the remaining frames (frames 1410, 1420, 1430) outside the set of keyframes are considered for selection as the second keyframe in the set of keyframes. In this step, the method determines a coverage value and a redundancy value for each of the frames 1410, 1420, 1430 with reference to the present state of Kcovered and Kuncovered. At this stage of the process, Kcovered includes the 6 keypoints contained in the fourth frame 1440 and Kuncovered includes the remaining 5 keypoints of the global keypoint pool 1450. The method determines an influence value for each frame 1410, 1420, 1430, based on the respective coverage values and redundancy values, and then selects the frame with the highest influence value in that iteration as the next keyframe. First frame 1410 contains a set of 5 keypoints, 5 of which are in the present Kuncovered and 0 of which are in Kcovered. Accordingly, the coverage value of first frame 1410 is C(f1) = 5 and the redundancy value of first frame 1410 is R(f1) = 0. In this example, the weighting factor α for the influence value is set to 1. The influence value for first frame 1410 is Influence(f1) = C(f1) − αR(f1) = 5 − 0 = 5.
Second frame 1420 contains a set of 4 keypoints, 2 of which are in the present Kuncovered and 2 of which are in Kcovered. Accordingly, the coverage value of second frame 1420 is C(f2) = 2 and the redundancy value of second frame 1420 is R(f2) = 2. The influence value for second frame 1420 is Influence(f2) = C(f2) − αR(f2) = 2 − 2 = 0.
Third frame 1430 contains a set of 5 keypoints, 1 of which is in the present Kuncovered and 4 of which are in Kcovered. Accordingly, the coverage value of third frame 1430 is C(f3) = 1 and the redundancy value of third frame 1430 is R(f3) = 4. The influence value for third frame 1430 is Influence(f3) = C(f3) − αR(f3) = 1 − 4 = −3.
Thus, the second iteration selects first frame 1410 as the second keyframe, as first frame 1410 has a higher influence value than second frame 1420 and third frame 1430.
Fig. 15D illustrates the selected set of keyframes, which includes first frame 1410 and fourth frame 1440 and provides a coverage of 100% of the global keypoint pool.
Experimental Results
Experiments were conducted by applying the keypoint based keyframe selection method to 2 datasets. The first dataset relates to case studies, consisting of 4 videos including the Foreman and Coastguard videos, which are widely used in the art, and two TV news shots (a Tennis video and a Zooming video). The second dataset was constructed from the Open Video Project (http://www.openvideo.org) for quantitative evaluation. Table 2 describes the content of this second dataset, which consists of 10 video shots across several genres (e.g., documentary, education, and history).

Video | Name | From frame | To frame | # of frames
v25 | A New Horizon, segment 02 | 664 | 900 | 237
v28 | A New Horizon, segment 05 | 3223 | 3440 | 218
v33 | Take Pride in America, segment 03 | 540 | 650 | 111
v39 | Senses And Sensitivity, Introduction to Lecture 4 presenter | 1838 | 1934 | 97
v40 | Exotic Terrane, segment 01 | 1790 | 1989 | 200
v49 | America's New Frontier, segment 07 | 150 | 500 | 351
v57 | Oceanfloor Legacy, segment 04 | 1600 | 1800 | 201
v58 | Oceanfloor Legacy, segment 08 | 540 | 633 | 94
v63 | Hurricane Force - A Coastal Perspective, segment 03 | 867 | 1012 | 146
v66 | Drift Ice as a Geologic Agent, segment 05 | 766 | 977 | 212

Table 2: The Testing Videos from the Open Video Project
Experimental data indicates that results generally are not affected when the matching radius R is set above 100 and the window size W above 5. Hence, in these examples, the radius R is set to 100 (i.e., 100 pixels around a target keypoint) to reduce the matching search space without sacrificing matching accuracy even in fast motion scenes, and W is set to 5 so as to balance the computational cost and chaining accuracy. The threshold T to filter unstable global keypoints affects the size of the keypoint pool and thus the granularity of details it captures. The experimental results to be described are based on a threshold T = 5 to reduce noisy keypoints without losing noticeable details.
The keypoint based keyframe selection (KBKS) approach of the present disclosure is compared against three known approaches: Iso-content distance, Iso-content distortion, and Clustering. The Iso-content distance and Iso-content distortion approaches were applied using the same Color Layout Descriptor as adopted by Panagiotakis et al. The clustering based method was applied using the CEDD feature described in "CEDD: color and edge directivity descriptor: a compact descriptor for image indexing and retrieval", S. A. Chatzichristofis and Y. S. Boutalis, in International Conference on Computer Vision Systems, 2008, which is a histogram representing colour and texture features.
Fig. 7 shows the four shots, consisting of a first shot 710 sourced from the Foreman video, a second shot 720 sourced from the Coastguard video, a third shot 730 sourced from a Tennis video of a news shot, and a fourth shot 740 sourced from a Zoom video of a news shot. The sample frames for the four shots 710, 720, 730, 740 are labelled in Fig. 7.
Figs 8A to 8F illustrate the results of applying the four techniques (KBKS, Iso-content distance, Iso-content distortion, and Clustering) to the first shot 710 of Fig. 7, derived from the Foreman video sequence. Fig. 8A shows the result of applying the KBKS approach of the present disclosure to the first shot 710, with a quality criteria set to return 2 frames. The KBKS approach returns frames 32 and 229, which yield a coverage of the global keypoints of 73%. Fig. 8B shows the result of applying the KBKS approach of the present disclosure to the first shot 710, with a quality criteria set to return 3 frames. The KBKS approach returns frames 44, 97, and 229, which yield a coverage of the global keypoints of 84%. Fig. 8C shows the result of applying the KBKS approach of the present disclosure to the first shot 710, with a quality criteria set to return 5 frames. The KBKS approach returns frames 1, 44, 97, 101, and 229, which yield a coverage of the global keypoints of 95%.
Fig. 8D shows the result of applying the Clustering technique, with 5 clusters and a quality criteria set to return 5 frames. The Clustering technique returns frames 43, 74, 103, 183, and 194 and yields a coverage of 88%. Fig. 8E shows the result of applying the Iso-content distance technique, with quality criteria set to return 5 frames. The Iso-content distance technique returns frames 1, 161, 178, 191, and 268 and yields a coverage of 85%. Fig. 8F shows the result of applying the Iso-content distortion technique, with a quality criteria set to return 5 frames. The Iso-content distortion technique returns frames 1, 95, 167, 191, and 268 and yields a coverage of 87%.
It is observed that the KBKS approach captures different details when a different number of frames is selected in the quality criteria. For example, the two frames under 73% coverage (frames 32, 229) capture the key content of the foreman and the building. When the number of frames increases to 5 and the returned coverage is increased to 95%, shown in Fig. 8C, different stages of the smiling face of the foreman are captured. In contrast, such details are missing in the results of the other methods, since those methods rely on global features. It is also noticed that the KBKS approach misses the keyframe on the tower and sky. There are two reasons for this omission. One is that the transition is very short and some keypoint chains are discarded. The other is that there are not many keypoints, due to a large portion of the uniform region, and the influence score of those frames has been affected.
In order to remedy this issue, one arrangement of the KBKS approach of the present disclosure takes global features into account by replacing Eqn (3) with Eqn(4):
InfluenceNew(fi) = (C(fi) − αR(fi)) / GlobalSim(fi) ... Eqn (4)
where
GlobalSim(fi) = ∑j Similarity(fi, fj) ... Eqn (5)
That is, the influence of a frame fi will be increased if that frame shares low similarity (i.e., a small GlobalSim(fi)) with other frames in terms of colour and edge histogram. Fig. 9 shows the results of taking global features into consideration, as noted above with respect to Eqn (4) and Eqn (5), and a quality criteria of 5 frames, which results in frames 1, 45, 97, 169, and 229 and a coverage of 94%. It is noted that frame 169 is a keyframe showing the tower and sky, which are the features that were missing from the earlier application, illustrating that this approach is able to effectively resolve the "missing sky" problem.
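A sketch of the modified influence score of Eqn (4), assuming a simple colour-histogram intersection as the per-pair Similarity function; the disclosure leaves the exact colour and edge similarity measure open, so this choice, the bin count, and the function names are illustrative only:

```python
import cv2
import numpy as np

def colour_histogram(frame_bgr, bins=32):
    """Per-channel colour histogram, concatenated and normalised."""
    hists = [cv2.calcHist([frame_bgr], [c], None, [bins], [0, 256]) for c in range(3)]
    hist = np.concatenate(hists).ravel()
    return hist / (hist.sum() + 1e-9)

def influence_new(coverage, redundancy, hist_i, all_hists, alpha=1.0):
    """Eqn (4): the influence of Eqn (3) divided by GlobalSim of Eqn (5),
    here approximated as the sum of histogram intersections with all frames."""
    global_sim = sum(float(np.minimum(hist_i, h).sum()) for h in all_hists)
    return (coverage - alpha * redundancy) / (global_sim + 1e-9)
```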
Figs 10A to 10E illustrate the results of applying the four techniques (KBKS, Iso-content distance, Iso-content distortion, and Clustering) to the second shot 720 of Fig. 7, derived from the Coastguard video sequence. The second shot 720 captures a sequence of frames in which a first boat overtakes a second boat.
Fig. 10A shows the result of applying the KBKS approach of the present disclosure to the second shot 720 with a quality criteria set to 4 frames, which returns frames 19, 107, 161, and 264 and yields a coverage of 95%. Fig. 10B shows the result of applying the Clustering technique, with 4 clusters, and a quality criteria set to 4 frames, which returns frames 58, 118, 197, and 271 and yields a coverage of 89%. Fig. 10C shows the result of applying the Iso-content distance technique and a quality criteria set to 4 frames, which returns frames 1, 70, 179, and 300 and yields a coverage of 90%. Fig. 10D shows the result of applying the Iso-content distortion technique with a quality criteria of 4 frames, which returns frames 1, 68, 175, and 300 and yields a coverage of 91%. It can be seen from Figs 10A to 10E that the KBKS approach of the present disclosure selects not only the frames with both boats, but also more frames to get a higher coverage of keypoints as the background of the boat (e.g., the building and trees) keeps changing. The other methods do capture both boats, but do not reflect the background change very well. In addition, from the selected keyframes resulting from the KBKS approach, the overtaking process is more readily understandable to a viewer.
The Tennis video, being the third shot 730 of Fig. 7, contains two actions of a tennis player with a very short panning and fading transition in between. The KBKS algorithm clearly identifies these two action frames with a high keypoint coverage of 97%. The clustering-based method achieves a similar result with the help of a predefined number of clusters (i.e., 2), and the Equidistance method selects the first and last frames.
The fourth shot 740 is a short sequence of zoom-out footage. The KBKS approach selects one keyframe near the end of the shot 740 with a high coverage of 86%, since the content (and corresponding keypoints) in the frames at the beginning of the shot 740 are part of the zoomed-out later frame in the sequence that is selected as the keyframe. For the clustering-based method, if the number of clusters is set to 1, the returned keyframe is the middle frame of the shot. That is, clustering based approaches generally take the frame with average information as the representative frame. The Equidistance method has the limitation of selecting both the first and the last frames as a starting point, which is not necessary for many cases such as zooming.
Quantitative Evaluation
Quantitative evaluation of the different approaches was performed by manually selecting "ground-truth" keyframes from the videos described in Table 2. The manual selection was performed by three university students with video processing backgrounds and, when calculating metrics, the results were averaged among the three sets of ground-truth keyframes. The number of target keyframes is set to 5. In order to generate 5 keyframes using the KBKS approach of the present disclosure, an initial coverage value of 50% was utilised and then varied until five keyframes were generated. The following metrics are chosen: Precision, Recall, F-score, and Dissimilarity.
A candidate keyframe is considered to be a match if that keyframe is located no more than X frames apart from a ground-truth keyframe. A ground-truth keyframe matches at most one candidate keyframe. F-score is a combination of both the precision and recall indicating the overall quality. Dissimilarity measures the difference between the candidate keyframes and the ground-truth keyframes. Dissimilarity is defined as:
Dissimilarity = Σ_fc min_ft d(fc, ft) ... Eqn (6)

where fc is a candidate keyframe, ft is a ground-truth keyframe, and d(fc, ft) is a distance measure between two keyframes, taken here as the difference of their frame indices. In order to explore the influence of X, various experiments were conducted by varying X from 10 to 20 while fixing α to 1 and T to 5.
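For illustration only, the following Python sketch shows one way these metrics could be computed from frame indices; the function and argument names are assumptions introduced here for illustration and are not part of the disclosed method.

def evaluate_keyframes(candidates, ground_truth, x=15):
    # Hedged sketch: each ground-truth keyframe matches at most one candidate
    # located no more than x frames away; Precision, Recall, F-score and the
    # Dissimilarity of Eqn (6) are then computed from the frame indices.
    unmatched = set(candidates)
    matches = 0
    for gt in ground_truth:
        near = [c for c in unmatched if abs(c - gt) <= x]
        if near:
            unmatched.discard(min(near, key=lambda c: abs(c - gt)))
            matches += 1
    precision = matches / len(candidates) if candidates else 0.0
    recall = matches / len(ground_truth) if ground_truth else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    # Eqn (6): sum over candidates of the distance to the closest ground truth
    dissimilarity = sum(min(abs(c - gt) for gt in ground_truth)
                        for c in candidates)
    return precision, recall, f_score, dissimilarity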
Fig. 11 is a plot of the influence of X, the number of frames a candidate keyframe may be apart from a ground-truth keyframe, on the F-score for each of the four approaches: Clustering, Iso-content distance, Iso-content distortion, and KBKS. As shown in Fig. 11, the F-score of every method increases and stabilises as the value of X increases from 10 to 20. Since setting a very high value for X does not reflect a true match, X was set to 15 in the following experiments. Similarly, experiments were conducted to explore the influence of α in Eqn (3) by setting X to 15 and T to 5, and varying α from 0 to 2.
Fig. 12 is a plot of F-score as α varies when applying the KBKS approach, and shows that α does influence the selection result, although not significantly. F-score grows as α increases from 0 to 0.3, and stabilises between 0.3 and 1.2. This can be explained by the observation that a frame with a higher coverage introduces more new visual content and is therefore likely to introduce less redundancy. For the sake of simplicity, α was set to 1 in the following experiments.
Fig. 13 is a plot of the metrics precision, recall, and F-score for each of the four approaches described with reference to Figs 7 to 12. Fig. 13 illustrates that the KBKS approach achieves better performance in regard to precision, recall, and F-score relative to the other approaches. Table 3 shows the dissimilarity scores for each of the approaches and indicates that the results of the KBKS approach are more similar to the ground truth than those of the other methods. The KBKS-fast approach is one embodiment of the KBKS approach and is described in more detail below.

Clustering   Iso-Content Distance   Iso-Content Distortion   KBKS   KBKS-fast
35.3         29.72                  30.72                    27.5   28.1

Table 3: Quantitative Evaluation On The Second Dataset: Dissimilarity
Computational Complexity
In this experiment, the frame size of Foreman and Coastguard is 352 x 288, and the frame size of the videos in the Open Video project is 352 x 240. With a standard 3.0 GHz dual-core desktop computer, for a video shot of 300 frames (i.e., 10 seconds), the total time needed to apply the KBKS approach is roughly 150 seconds, broken down into: approximately 150 seconds for the first step (Section II.A) and the second step (Section II.B); and less than 1 second for the third step (Section II.C) and the fourth step (Section III).
The computational cost of the KBKS approach is largely determined by the efficiency of Keypoint Extraction and Matching. Keypoint Extraction costs approximately 0.02 seconds per frame. Keypoint Matching takes approximately 0.1 seconds per frame-pair. Therefore, the time cost of keyframe selection on a video shot with N frames is roughly:
Time Cost = N * 0.02 + W * N * 0.1 + 1 ... Eqn(7) and complexity is O(N).
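As a minimal illustration of Eqn (7), assuming the per-frame costs quoted above (0.02 seconds for extraction, 0.1 seconds per frame-pair for matching) and a matching window of W frames, the expected running time can be estimated as in the sketch below; the function name and default values are illustrative assumptions only.

def estimate_kbks_time(n_frames, window=5,
                       extract_cost=0.02, match_cost=0.1, selection_cost=1.0):
    # Eqn (7): keypoint extraction + windowed keypoint matching + selection
    return n_frames * extract_cost + window * n_frames * match_cost + selection_cost

# Example: a 300-frame shot (10 seconds) with a matching window W = 5
print(estimate_kbks_time(300))  # about 157 seconds, i.e. roughly the 150 seconds reported above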
When N = 300 and W = 5, the time cost is about 150 seconds. In order to reduce the computational cost, one arrangement utilises a randomised kd-tree forest based matching algorithm within the window, as described in C. Silpa-Anan and R. Hartley, "Optimised kd-trees for fast image descriptor matching", IEEE Conference on Computer Vision and Pattern Recognition, 2008. The matching speed is approximately ten times faster than the conventional matching algorithm. That is, the computational cost of the fast matching algorithm is about 15 seconds for 300 frames. As shown in the rightmost column of Fig. 13 and Table 3, the performance of the fast algorithm (namely KBKS-fast) remains comparable to the original scheme, even though approximate matching is employed in the randomised kd-tree forest based matching algorithm.
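A minimal sketch of this kind of approximate matching is shown below, assuming OpenCV's FLANN-based matcher with a randomised kd-tree index over SIFT descriptors; the parameter values (number of trees, checks, ratio threshold) are illustrative assumptions rather than the settings used in the reported experiments.

import cv2

def match_keypoints_fast(desc_a, desc_b, ratio=0.7, trees=4):
    # desc_a, desc_b: float32 SIFT descriptor arrays for two frames
    # (e.g. from cv2.SIFT_create().detectAndCompute).
    # Approximate descriptor matching with a randomised kd-tree forest (FLANN),
    # typically around an order of magnitude faster than brute-force matching.
    index_params = dict(algorithm=1, trees=trees)  # FLANN_INDEX_KDTREE = 1
    search_params = dict(checks=32)
    matcher = cv2.FlannBasedMatcher(index_params, search_params)
    knn_matches = matcher.knnMatch(desc_a, desc_b, k=2)
    # Lowe's ratio test keeps only distinctive matches
    return [m for m, n in knn_matches if m.distance < ratio * n.distance]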
Conclusion

The keypoint based keyframe selection (KBKS) approach described herein provides a keyframe selection method and system based on discriminative keypoints. A video shot is first represented by a global pool of keypoints through keypoint chaining. Second, a greedy algorithm selects suitable keyframes based on the metric of coverage and, optionally, the metric of redundancy.
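To make the greedy selection step concrete, the following Python sketch (the data structure, weighting parameter α, and stopping rule are assumptions for illustration, not the precise embodiment) scores each remaining frame by its coverage of uncovered keypoints minus a weighted redundancy against covered keypoints, and adds the highest-scoring frame to the set of keyframes until a target coverage of the global pool is reached.

def greedy_keyframe_selection(frame_keypoints, alpha=1.0, target_coverage=0.9):
    # frame_keypoints: dict mapping frame index -> set of keypoint identifiers
    # drawn from the global keypoint pool built by keypoint chaining.
    # Assumes at least one frame contains at least one keypoint.
    pool = set().union(*frame_keypoints.values())
    covered = set()
    keyframes = []

    def score(frame):
        kps = frame_keypoints[frame]
        uncovered = pool - covered
        coverage = len(kps & uncovered) / len(uncovered) if uncovered else 0.0
        redundancy = len(kps & covered) / len(covered) if covered else 0.0
        # selection value: weighted difference of coverage and redundancy
        return coverage - alpha * redundancy

    # initial keyframe: the frame containing the highest number of keypoints
    first = max(frame_keypoints, key=lambda f: len(frame_keypoints[f]))
    keyframes.append(first)
    covered |= frame_keypoints[first]

    while pool and len(covered) / len(pool) < target_coverage:
        remaining = [f for f in frame_keypoints if f not in keyframes]
        if not remaining:
            break
        best = max(remaining, key=score)
        keyframes.append(best)
        covered |= frame_keypoints[best]
    return sorted(keyframes)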
Industrial Applicability
The arrangements described are applicable to the computer and data processing industries and particularly for the video, imaging, and security industries.
The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.
In the context of this specification, the word "comprising" and its associated grammatical constructions mean "including principally but not necessarily solely" or "having" or "including", and not "consisting only of". Variations of the word "comprising", such as "comprise" and "comprises", have correspondingly varied meanings.
As used throughout this specification, unless otherwise specified, the use of ordinal adjectives "first", "second", "third", "fourth", etc., to describe common or related objects, indicates that reference is being made to different instances of those common or related objects, and is not intended to imply that the objects so described must be provided or positioned in a given order or sequence, either temporally, spatially, in ranking, or in any other manner.
Although the invention has been described with reference to specific examples, it will be appreciated by those skilled in the art that the invention may be embodied in many other forms.

Claims

We claim:
1. A method of keyframe selection, comprising the steps of:
identifying unique keypoints in each frame in a video sequence;
forming a global keypoint pool, based on said identified unique keypoints;
associating each frame with a selection value based on a number of unique keypoints identified in that frame; and
selecting keyframes of said video sequence, based upon said selection values.
2. The method according to claim 1, wherein each said selection value includes a coverage value based on the number of unique keypoints identified in that frame relative to a number of keypoints in said global keypoint pool.
3. The method according to either one of claims 1 and 2, wherein said selection value includes a redundancy value based on the number of unique keypoints identified in that frame that were identified in other frames of said video sequence.
4. The method according to claim 1, comprising the further steps of:
assigning a frame having a highest number of keypoints as an initial keyframe of a set of keyframes;
determining said selection value for each frame outside said set of keyframes by:
determining a coverage value for that frame, said coverage value being based on the number of unique keypoints identified in that frame relative to a number of keypoints in a set of uncovered keypoints of said global keypoint pool;
determining a redundancy value for that frame, said redundancy value being based on the number of unique keypoints identified in that frame relative to a number of keypoints in a set of covered keypoints of said global keypoint pool; and
computing a weighted difference between said coverage value and said redundancy value of that frame.
5. The method according to claim 1, wherein said selection value includes a redundancy value based on the number of keypoints identified in that frame relative to a number of keypoints in a set of covered keypoints of said global keypoint pool.
6. The method according to any one of claims 1 to 5, wherein said selected keyframes form a summarisation of said video sequence.
7. A computer readable storage medium having recorded thereon a computer program for keyframe selection, said computer program comprising code for performing the steps of: identifying unique keypoints in each frame in a video sequence;
forming a global keypoint pool, based on said identified unique keypoints;
associating each frame with a selection value based on a number of unique keypoints identified in that frame; and
selecting keyframes of said video sequence, based upon said selection values.
8. An apparatus for performing keyframe selection, said apparatus comprising:
a storage device for storing a computer program; and
a processor for executing a program, said program comprising code for performing the method steps of:
identifying unique keypoints in each frame in a video sequence;
forming a global keypoint pool, based on said identified unique keypoints;
associating each frame with a selection value based on a number of unique keypoints identified in that frame; and
selecting keyframes of said video sequence, based upon said selection values.
9. The apparatus according to claim 8, wherein said storage device and processor are components of one of a camera and a general purpose computer.
10. A method of keyframe selection, comprising the steps of:
identifying a set of keypoints in each frame of a video sequence, said video sequence having a plurality of frames;
forming a global keypoint pool derived from said identified sets of keypoints, said global keypoint pool including mutually exclusive sets of covered keypoints and uncovered keypoints;
selecting a frame containing a highest number of keypoints as an initial keyframe in a set of keyframes;
amending said set of covered keypoints to include keypoints contained in said initial keyframe and amending said set of uncovered keypoints to exclude keypoints contained in said initial keyframe; and
iteratively performing the following steps until said set of keyframes satisfies a quality criteria:
determining a selection value for each frame outside said set of keyframes;
selecting one of said frames outside said set of keyframes as a keyframe in said set of keyframes, based on said selection value of said frame being higher than selection values associated with other frames outside said set of keyframes; and
amending said set of covered keypoints to include keypoints contained in said selected frame and amending said set of uncovered keypoints to exclude keypoints contained in said selected frame.
11. The method according to claim 10, wherein said selection value is derived from a coverage value based on the set of keypoints identified in that frame and keypoints in said set of uncovered keypoints during that iteration.
12. The method according to either one of claims 10 and 11, wherein said selection value is further derived from a redundancy value associated with that frame, said redundancy value being based on the set of keypoints identified in that frame and keypoints in said set of covered keypoints during that iteration.
13. The method according to claim 12, wherein said selection value is an influence value determined by a difference between said coverage value and said redundancy value.
14. The method according to any one of claims 12 and 13, wherein said selection criteria includes at least one of a predefined number of keyframes, a coverage threshold, and a redundancy threshold.
15. The method according to claim 14, wherein said coverage threshold defines a minimum coverage value associated with keyframes.
16. The method according to claim 14, wherein said redundancy threshold defines a maximum redundancy value associated with keyframes.
17. The method according to any one of claims 11 to 13, wherein said set of keyframes is a summarisation of said video sequence.
18. The method according to any one of claims 10 to 17, comprising the further step of: utilising said set of keyframes as a summarisation of said video sequence in an application selected from the group consisting of: previewing video data, reviewing video footage, video editing, broadcasting to a mobile computing device, and selecting I-frames in a video compression application.
19. A camera system for keyframe selection, said camera system comprising:
a lens system for focussing on a scene;
a camera module coupled to said lens system to capture a video sequence of said scene;
a storage device for storing a computer program; and
a processor for executing the program, said program comprising:
code for identifying unique keypoints in each frame in a video sequence;
code for forming a global keypoint pool, based on said identified unique keypoints;
code for associating each frame with a selection value based on a number of unique keypoints identified in that frame; and
code for selecting keyframes of said video sequence, based upon said selection values.
PCT/AU2012/001272 2011-10-20 2012-10-18 Keypoint based keyframe selection WO2013056311A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
AU2011904344A AU2011904344A0 (en) 2011-10-20 Keypoint Based Keyframe Selection
AU2011904344 2011-10-20

Publications (1)

Publication Number Publication Date
WO2013056311A1 true WO2013056311A1 (en) 2013-04-25

Family

ID=48140230

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2012/001272 WO2013056311A1 (en) 2011-10-20 2012-10-18 Keypoint based keyframe selection

Country Status (1)

Country Link
WO (1) WO2013056311A1 (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017107394A1 (en) * 2015-12-23 2017-06-29 深圳Tcl数字技术有限公司 Method of previewing video playback progress and device
US9728230B2 (en) 2014-02-20 2017-08-08 International Business Machines Corporation Techniques to bias video thumbnail selection using frequently viewed segments
CN108111537A (en) * 2018-01-17 2018-06-01 杭州当虹科技有限公司 A kind of method of the online video contents of streaming media of rapid preview MP4 forms
US10321160B2 (en) 2017-07-13 2019-06-11 International Business Machines Corporation Compressing multiple video files using localized camera meta data
CN110740290A (en) * 2018-07-20 2020-01-31 浙江宇视科技有限公司 Monitoring video previewing method and device
CN111062836A (en) * 2018-10-16 2020-04-24 杭州海康威视数字技术股份有限公司 Video-based scoring method and device and electronic equipment
CN111104816A (en) * 2018-10-25 2020-05-05 杭州海康威视数字技术股份有限公司 Target object posture recognition method and device and camera
CN111339952A (en) * 2020-02-27 2020-06-26 腾讯科技(北京)有限公司 Image classification method and device based on artificial intelligence and electronic equipment
CN112016437A (en) * 2020-08-26 2020-12-01 中国科学院重庆绿色智能技术研究院 Living body detection method based on face video key frame
US11470343B2 (en) * 2018-08-29 2022-10-11 Intel Corporation Apparatus and method for feature point tracking using inter-frame prediction
CN115914741A (en) * 2021-04-27 2023-04-04 武汉星巡智能科技有限公司 Baby video collection and capture method, device and equipment based on motion classification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KOGLER, M ET AL.: "Global vs. Local Feature in Video Summarization: Experimental Results.", 2009, Retrieved from the Internet <URL:http://ceur-ws.org/Vol-539/paper_19.pdf> [retrieved on 20121221] *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9728230B2 (en) 2014-02-20 2017-08-08 International Business Machines Corporation Techniques to bias video thumbnail selection using frequently viewed segments
US10893335B2 (en) 2015-12-23 2021-01-12 Shenzhen Tcl Digital Technology Ltd. Method and device for previewing video playback progress
WO2017107394A1 (en) * 2015-12-23 2017-06-29 深圳Tcl数字技术有限公司 Method of previewing video playback progress and device
US10321160B2 (en) 2017-07-13 2019-06-11 International Business Machines Corporation Compressing multiple video files using localized camera meta data
US10904575B2 (en) 2017-07-13 2021-01-26 International Business Machines Corporation Compressing multiple video files using localized camera meta data
CN108111537A (en) * 2018-01-17 2018-06-01 杭州当虹科技有限公司 A kind of method of the online video contents of streaming media of rapid preview MP4 forms
CN108111537B (en) * 2018-01-17 2021-03-23 杭州当虹科技股份有限公司 Method for quickly previewing online streaming media video content in MP4 format
CN110740290A (en) * 2018-07-20 2020-01-31 浙江宇视科技有限公司 Monitoring video previewing method and device
US11470343B2 (en) * 2018-08-29 2022-10-11 Intel Corporation Apparatus and method for feature point tracking using inter-frame prediction
CN111062836A (en) * 2018-10-16 2020-04-24 杭州海康威视数字技术股份有限公司 Video-based scoring method and device and electronic equipment
CN111062836B (en) * 2018-10-16 2023-03-07 杭州海康威视数字技术股份有限公司 Video-based scoring method and device and electronic equipment
CN111104816A (en) * 2018-10-25 2020-05-05 杭州海康威视数字技术股份有限公司 Target object posture recognition method and device and camera
CN111104816B (en) * 2018-10-25 2023-11-03 杭州海康威视数字技术股份有限公司 Object gesture recognition method and device and camera
CN111339952A (en) * 2020-02-27 2020-06-26 腾讯科技(北京)有限公司 Image classification method and device based on artificial intelligence and electronic equipment
CN111339952B (en) * 2020-02-27 2024-04-02 腾讯科技(北京)有限公司 Image classification method and device based on artificial intelligence and electronic equipment
CN112016437A (en) * 2020-08-26 2020-12-01 中国科学院重庆绿色智能技术研究院 Living body detection method based on face video key frame
CN112016437B (en) * 2020-08-26 2023-02-10 中国科学院重庆绿色智能技术研究院 Living body detection method based on face video key frame
CN115914741A (en) * 2021-04-27 2023-04-04 武汉星巡智能科技有限公司 Baby video collection and capture method, device and equipment based on motion classification

Similar Documents

Publication Publication Date Title
WO2013056311A1 (en) Keypoint based keyframe selection
US12033082B2 (en) Maintaining fixed sizes for target objects in frames
Guan et al. Keypoint-based keyframe selection
US10956749B2 (en) Methods, systems, and media for generating a summarized video with video thumbnails
US10303984B2 (en) Visual search and retrieval using semantic information
CN104508682B (en) Key frame is identified using the openness analysis of group
US8384791B2 (en) Video camera for face detection
US8780756B2 (en) Image processing device and image processing method
US20070183497A1 (en) Extracting key frame candidates from video clip
US20050228849A1 (en) Intelligent key-frame extraction from a video
CN101287089B (en) Image capturing apparatus, image processing apparatus and control methods thereof
CA2753978A1 (en) Clustering videos by location
Guan et al. Video summarization with global and local features
CN112291634B (en) Video processing method and device
Mishra Video shot boundary detection using hybrid dual tree complex wavelet transform with Walsh Hadamard transform
CN107516084B (en) Internet video author identity identification method based on multi-feature fusion
Ghani et al. Key frames extraction using spline curve fitting for online video summarization
Wang et al. Visual saliency based aerial video summarization by online scene classification
JP2009049667A (en) Information processor, and processing method and program thereof
Sulaiman et al. Shot boundaries detection based video summary using dynamic time warping and mean shift
Cooray et al. Identifying an efficient and robust sub-shot segmentation method for home movie summarisation
WO2003084249A1 (en) Methods for summarizing video through mosaic-based shot and scene clustering
JP2010011501A (en) Information processing apparatus, information processing method, program and recording medium
Choudhary et al. The Significance of Metadata and Video Compression for Investigating Video Files on Social Media Forensic
Harvey Spatio-temporal video copy detection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12841825

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12841825

Country of ref document: EP

Kind code of ref document: A1