WO2002023891A2 - Method for highlighting important information in a video program using visual cues - Google Patents
- Publication number
- WO2002023891A2 (WO 2002/023891 A2), PCT/EP2001/010112
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- cue
- video clip
- preselected
- frames
- video
- Prior art date
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7847—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
- G06F16/785—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content using colour or luminescence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7834—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7844—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7847—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
- G06F16/7857—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content using texture
Definitions
- the present invention relates to content-based video retrieval and browsing, and more particularly, to a method for automatically identifying important information or developments in video clips of sports events.
- Video applications call for browsing methods which enable one to browse through a large amount of video material to find clips which are of a certain importance.
- Such applications may include for example, interactive TV and pay-per-view systems.
- Customers who use interactive TV and pay-per-view systems want to see sections of programs before renting them.
- Video browsers enable the customers to find programs of interest.
- Existing browsing methods typically index video content by low-level features such as color, texture, shape and camera motion.
- While low-level features can be useful for certain applications, many other interesting applications require the use of higher level semantic information. Bridging the gap between low-level features and high-level semantic information is not always easy. In most cases where higher level semantic information is required, manual annotation using keywords is used.
- One of the important applications for video archiving and retrieval is for sports such as soccer, football, etc. Accordingly, a method is needed which enables automatic extraction of high level information using low level features.
- the present invention is directed to a method for automatically identifying important developments in video clips of sporting events, especially soccer matches.
- the method comprises detecting sequences of frames in a video clip of a sporting event that have a preselected cue indicative of a possible important development in frames of the video clip immediately preceding the frame sequences having the preselected cue; comparing the number of frames in each of the frame sequences having the cue to a predefined threshold number; and declaring an important development in the frames immediately preceding each frame sequence if the number of frames in that sequence is equal to or greater than the threshold number.
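The detection-and-thresholding step described above can be sketched as follows. This is an illustrative sketch, not part of the patent disclosure; the per-frame boolean cue representation and the sentinel handling are assumptions of this sketch:

```python
def find_important_developments(cue_frames, threshold):
    """Given a per-frame boolean list (True = the frame shows the
    preselected cue), return the index of the frame immediately
    preceding each run of cue frames whose length meets the
    threshold, i.e. the frames declared to hold an important
    development."""
    developments = []
    run_start = None
    # append a False sentinel so the final run is also flushed
    for i, has_cue in enumerate(cue_frames + [False]):
        if has_cue and run_start is None:
            run_start = i
        elif not has_cue and run_start is not None:
            if i - run_start >= threshold:
                # the important development lies just before the cue run
                developments.append(max(run_start - 1, 0))
            run_start = None
    return developments
```

With a threshold of 3, a clip whose frames 1-3 show the cue would yield frame 0 as the location of the important development.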
- the method further involves acquiring the preselected cue from low level features in the image in each frame of the sequence.
- the preselected cue is based on changes in the camera's center of attention. More particularly, when an important development occurs in the video clip, the camera typically focuses on the viewers or players, and thus, the images in the sequence of frames immediately subsequent to the frames with the important development have little or no grass areas.
- Fig. 1 is a flowchart outlining an algorithm that performs an illustrative embodiment of the method of the present invention
- Fig. 2 is a block diagram of a computer for implementing the present invention.
- Fig. 3 is a block diagram of the internal structure of the computer for implementing the present invention.
- the method of the present invention extracts high level information from multiple images or video using low level features in order to achieve advancements in content-based retrieval and browsing. This is accomplished in the present invention by specifying a particular domain of interest and using knowledge specific to that domain to automatically extract high level information based on low level features.
- One especially useful application for the present invention is in highlighting segments of important developments in video clips of sports events, including but not limited to soccer matches and football games. Such video clips typically include video, audio, and textual (close- captioning) information.
- the method of the present invention highlights important developments in a video clip by inferring the developments from one or more cues which are provided from low level features and textual information of the video clip. More particularly, the method detects sequences of frames in the video clip having a certain preselected visual, audible, and/or textual (close captioning) cue. The number of frames in each sequence having the cue(s) is then compared to a predefined threshold number. If the number of frames in a sequence is equal to or greater than the threshold number, an important development is declared in the frames immediately preceding the threshold meeting frame sequence with the cue. It has been found that important developments in video clips of sports events are typically marked with a visual cue which relates to changes in the camera's center of attention.
- the video camera usually focuses on the stadium viewers or the players.
- When the camera focuses on the viewers or players, little or none of the grass of the playing field can be seen in the camera's field of view.
- the method of the present invention detects sequences of frames in the video clip with images that have little or no grass areas of the playing field.
- the number of frames in each sequence is compared to a predefined threshold number. If the number of frames in the sequence is equal to or greater than the threshold number, an important development is declared in the frames immediately preceding the threshold meeting frame sequence that has little or no grass areas.
- the threshold is based on the assumption that if the number of frames in the sequence with little or no grass areas of the playing field is significant, the camera must be focusing on the viewers or the players. Consequently, it is likely that the frames immediately preceding that sequence of frames includes an important development such as the scoring of a goal in the case of a soccer match.
- Fig. 1 shows a flowchart which outlines an illustrative embodiment of an algorithm for performing the method of the present invention as it applies to highlighting segments of important events in a video clip of a soccer match.
- the algorithm in step S1 detects sequences of frames in the video clip in which there are little or no grass areas.
- In step S2, if the number of frames in a sequence is larger than a predefined threshold, then in step S3 an important development is declared in the frames immediately preceding that sequence.
- the algorithm detects green areas which have colors similar to grass.
- the algorithm is trained to differentiate the green colors from the other colors in each frame so that the grass areas in the frame can be identified. This is accomplished using patches from a training set of images of grass areas which have been extracted from the soccer match in the video clip, or from one or more previous soccer matches.
- the algorithm learns from the patches how the grass areas translate into the values of the color green. Given an image in a frame of the video clip, the training is used to judge whether a given pixel in the frame is grass.
- a color histogram of an image is obtained by dividing a color space, such as red, green, and blue, into discrete image colors (called bins) and counting the number of times each discrete color appears by traversing every pixel in the image.
- This normalized histogram can be considered as the probability density function for the class grass, p(pixel value | grass).
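The histogram training described above can be sketched as follows. This is an illustrative sketch, not the patent's implementation; the bin count of 8 per channel and the tuple-based pixel representation are assumptions of this sketch:

```python
def train_grass_histogram(patches, bins=8):
    """Build a normalized RGB color histogram from training patches of
    grass pixels.  `patches` is an iterable of lists of (r, g, b)
    tuples with channel values in 0-255.  The normalized counts
    approximate p(pixel value | grass)."""
    counts = {}
    total = 0
    for patch in patches:
        for r, g, b in patch:
            # quantize each 0-255 channel into `bins` discrete colors
            key = (r * bins // 256, g * bins // 256, b * bins // 256)
            counts[key] = counts.get(key, 0) + 1
            total += 1
    # normalize so the histogram sums to 1 over all observed bins
    return {key: n / total for key, n in counts.items()}
```

Bins never seen in the training patches are simply absent from the returned dictionary, which corresponds to a probability of zero for those colors.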
- the detection step S1 is accomplished in the algorithm by marking pixels in each frame that have a value of p(pixel value | grass) above a given threshold.
- If only small grass color components are detected for a short period of time in step S2, for example in only three or four frames, then no important event is declared in step S3. However, if small grass color components are detected for a relatively long period of time, for example in 200-300 frames, then an important event is declared in step S3.
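The per-frame cue decision that feeds this frame-count test can be sketched as follows. This is an illustrative sketch; the probability threshold, the grass-area threshold, and the bin count are assumed values, not taken from the patent:

```python
def frame_has_cue(frame_pixels, grass_hist, p_thresh=0.001,
                  area_thresh=0.1, bins=8):
    """Return True when a frame shows the 'little or no grass' cue.
    A pixel counts as grass when its p(pixel value | grass) from the
    trained histogram exceeds p_thresh; the frame carries the cue when
    the fraction of grass pixels falls below area_thresh."""
    grass_pixels = 0
    for r, g, b in frame_pixels:
        # quantize the pixel into the same bins used during training
        key = (r * bins // 256, g * bins // 256, b * bins // 256)
        if grass_hist.get(key, 0.0) > p_thresh:
            grass_pixels += 1
    # little or no visible grass suggests the camera has left the field
    return grass_pixels / len(frame_pixels) < area_thresh
```

Running this test on every frame produces the boolean cue sequence whose run lengths are then compared against the 200-300 frame threshold.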
- the results obtained with the algorithm can be further refined using other cues either from the same modality or from other modalities, such as audio or closed captions. Cues from the same modalities or different modalities can be used to confirm the identity of the detected important occurrences or activities and more importantly, to classify the detected important occurrences or activities into semantic classes, such as goals, attempted goals, penalties, injuries, fights between players and the like, and rank them by importance.
- the method of Figure 1 is implemented by computer readable code executed by a data processing apparatus.
- the code may be stored in a memory within the data processing apparatus or read/downloaded from a memory medium such as a CD-ROM or floppy disk.
- hardware circuitry may be used in place of, or in combination with, software instructions to implement the invention.
- the invention can also be implemented, for example, on a computer 30 shown in Fig. 2.
- the computer 30 may include a network connection 31 for interfacing to a data network, such as a variable-bandwidth network or the Internet, and a fax/modem connection 32 for interfacing with other remote sources such as a video or a digital camera (not shown).
- the computer 30 may also include a display for displaying information (including video data) to a user, a keyboard for inputting text and user commands, a mouse for positioning a cursor on the display and for inputting user commands, a disk drive for reading from and writing to floppy disks installed therein, and a CD-ROM drive for accessing information stored on CD-ROM.
- the computer 30 may also have one or more peripheral devices 38 attached thereto for inputting images, or the like, and a printer for outputting images, text, or the like.
- Fig. 3 shows the internal structure of the computer 30 which includes a memory 40 that may include a Random Access Memory (RAM), Read-Only Memory (ROM) and a computer-readable medium such as a hard disk.
- the items stored in the memory 40 include an operating system 41, data 42 and applications 43.
- the operating system 41 may be a windowing operating system, such as UNIX, although the invention may be used with other operating systems as well, such as Microsoft Windows 95.
- the applications stored in the memory 40 include a video coder 44, a video decoder 45 and a frame grabber 46.
- the video coder 44 encodes video data in a conventional manner
- the video decoder 45 decodes video data which has been coded in the conventional manner.
- the frame grabber 46 allows single frames from a video signal stream to be captured and processed.
- the CPU 50 comprises a microprocessor or the like for executing computer readable code, i.e., applications, such as those noted above, out of the memory 40.
- applications may be stored in memory 40 (as noted above) or, alternatively, on a floppy disk in disk drive 36 or a CD-ROM in CD-ROM drive 37.
- the CPU 50 accesses the applications (or other data) stored on a floppy disk via the memory interface 52 and accesses the applications (or other data) stored on a CD-ROM via CD-ROM drive interface 53.
- Input video data may be received through the video interface 54 or the communication interface 51.
- the input video data may be decoded by the video decoder 45.
- Output video data may be coded by the video coder 44 for transmission through the video interface 54 or the communication interface 51.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Library & Information Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Television Signal Processing For Recording (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2002527199A JP2004509529A (en) | 2000-09-13 | 2001-08-30 | How to use visual cues to highlight important information in video programs |
EP01971992A EP1320992A2 (en) | 2000-09-13 | 2001-08-30 | Method for highlighting important information in a video program using visual cues |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US66091800A | 2000-09-13 | 2000-09-13 | |
US09/660,918 | 2000-09-13 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2002023891A2 true WO2002023891A2 (en) | 2002-03-21 |
WO2002023891A3 WO2002023891A3 (en) | 2002-05-30 |
Family
ID=24651479
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2001/010112 WO2002023891A2 (en) | 2000-09-13 | 2001-08-30 | Method for highlighting important information in a video program using visual cues |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP1320992A2 (en) |
JP (1) | JP2004509529A (en) |
WO (1) | WO2002023891A2 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4577774B2 (en) * | 2005-03-08 | 2010-11-10 | Kddi株式会社 | Sports video classification device and log generation device |
JP2011015129A (en) * | 2009-07-01 | 2011-01-20 | Mitsubishi Electric Corp | Image quality adjusting device |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3728775B2 (en) * | 1995-08-18 | 2005-12-21 | 株式会社日立製作所 | Method and apparatus for detecting feature scene of moving image |
KR100206804B1 (en) * | 1996-08-29 | 1999-07-01 | 구자홍 | The automatic selection recording method of highlight part |
JPH1155613A (en) * | 1997-07-30 | 1999-02-26 | Hitachi Ltd | Recording and/or reproducing device and recording medium using same device |
AU719329B2 (en) * | 1997-10-03 | 2000-05-04 | Canon Kabushiki Kaisha | Multi-media editing method and apparatus |
KR20010041607A (en) * | 1998-03-04 | 2001-05-25 | 더 트러스티스 오브 콜롬비아 유니버시티 인 더 시티 오브 뉴욕 | Method and system for generating semantic visual templates for image and video retrieval |
US6163510A (en) * | 1998-06-30 | 2000-12-19 | International Business Machines Corporation | Multimedia search and indexing system and method of operation using audio cues with signal thresholds |
-
2001
- 2001-08-30 WO PCT/EP2001/010112 patent/WO2002023891A2/en active Application Filing
- 2001-08-30 EP EP01971992A patent/EP1320992A2/en not_active Withdrawn
- 2001-08-30 JP JP2002527199A patent/JP2004509529A/en active Pending
Non-Patent Citations (2)
Title |
---|
CHANG YUH-LIN E.A.: "Integrated image and speech analysis for content-based video indexing", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON MULTIMEDIA COMPUTING AND SYSTEMS, 17 June 1996 (1996-06-17), pages 306 - 313 |
See also references of EP1320992A2 |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2007073349A1 (en) * | 2005-12-19 | 2007-06-28 | Agency For Science, Technology And Research | Method and system for event detection in a video stream |
WO2008154292A1 (en) * | 2007-06-08 | 2008-12-18 | Apple Inc. | Assembling video content |
US9047374B2 (en) | 2007-06-08 | 2015-06-02 | Apple Inc. | Assembling video content |
US9508012B2 (en) | 2014-03-17 | 2016-11-29 | Fujitsu Limited | Extraction method and device |
US9892320B2 (en) | 2014-03-17 | 2018-02-13 | Fujitsu Limited | Method of extracting attack scene from sports footage |
Also Published As
Publication number | Publication date |
---|---|
EP1320992A2 (en) | 2003-06-25 |
JP2004509529A (en) | 2004-03-25 |
WO2002023891A3 (en) | 2002-05-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7339992B2 (en) | System and method for extracting text captions from video and generating video summaries | |
Truong et al. | Scene extraction in motion pictures | |
JP4643829B2 (en) | System and method for analyzing video content using detected text in a video frame | |
US7120873B2 (en) | Summarization of sumo video content | |
JP5420199B2 (en) | Video analysis device, video analysis method, digest automatic creation system and highlight automatic extraction system | |
CN110381366B (en) | Automatic event reporting method, system, server and storage medium | |
US8340498B1 (en) | Extraction of text elements from video content | |
EP2089820B1 (en) | Method and apparatus for generating a summary of a video data stream | |
WO2019007020A1 (en) | Method and device for generating video summary | |
US8051446B1 (en) | Method of creating a semantic video summary using information from secondary sources | |
Snoek et al. | Time interval maximum entropy based event indexing in soccer video | |
WO2002023891A2 (en) | Method for highlighting important information in a video program using visual cues | |
Choroś | Highlights extraction in sports videos based on automatic posture and gesture recognition | |
US20070124678A1 (en) | Method and apparatus for identifying the high level structure of a program | |
Brezeale | Learning video preferences using visual features and closed captions | |
Jung et al. | Player information extraction for semantic annotation in golf videos | |
Bailer et al. | Skimming rushes video using retake detection | |
CN117221669B (en) | Bullet screen generation method and device | |
US11417100B2 (en) | Device and method of generating video synopsis of sports game | |
Lotfi | A Novel Hybrid System Based on Fractal Coding for Soccer Retrieval from Video Database | |
Hsieh et al. | Constructing a bowling information system with video content analysis | |
Gupta | A Survey on Video Content Analysis | |
Brezeale et al. | Learning video preferences from video content | |
Pande | Mapping of Low Level to High Level Audio-Visual Features: A Survey of the Literature | |
Manickam et al. | Fast lead star detection in entertainment videos |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A2 Designated state(s): JP |
|
AL | Designated countries for regional patents |
Kind code of ref document: A2 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR |
|
ENP | Entry into the national phase |
Ref country code: JP Ref document number: 2002 527199 Kind code of ref document: A Format of ref document f/p: F |
|
AK | Designated states |
Kind code of ref document: A3 Designated state(s): JP |
|
AL | Designated countries for regional patents |
Kind code of ref document: A3 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2001971992 Country of ref document: EP |
|
WWP | Wipo information: published in national office |
Ref document number: 2001971992 Country of ref document: EP |