WO2009001278A1 - System and method for generating a summary from a plurality of multimedia items - Google Patents


Info

Publication number
WO2009001278A1
Authority
WO
WIPO (PCT)
Application number
PCT/IB2008/052470
Other languages
French (fr)
Inventor
Prarthana Shrestha
Johannes Weda
Mauro Barbieri
Original Assignee
Koninklijke Philips Electronics N.V.
Application filed by Koninklijke Philips Electronics N.V.
Publication of WO2009001278A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40: Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/41: Indexing; Data structures therefor; Storage structures

Definitions

  • the present invention relates to a system and method for generating a summary from a plurality of multimedia items.
  • the multimedia content produced by camcorders and mobile terminals is commonly used for entertainment purposes.
  • the multimedia content is used to provide an overview of an event such as a vacation, a birthday, a party, a wedding, etc.
  • the inventors have developed a system for generating a summary of a plurality of multimedia items.
  • This system comprises a plurality of client devices (for example, ambient cameras, storage devices, mobile devices, etc.) and a server device.
  • each client device transmits a multimedia file to a server device.
  • the server device receives the transmitted multimedia files, extracts features from each multimedia file, synchronizes the features, and automatically generates a summary that includes the most suitable parts of each multimedia file.
  • the server device then transmits the generated summary to the client devices.
  • This system provides client devices with a personalised summary of the multimedia items.
  • each client device is required to transmit entire multimedia files to a server device, even if the final summary only requires the exchange of short parts of the multimedia files.
  • This system is, therefore, inefficient in terms of network bandwidth.
  • the multimedia files have to be decrypted at the server device to enable the server device to extract features from the multimedia files, the privacy of the client is not preserved.
  • the present invention seeks to provide a system that uses minimum bandwidth and minimum time for exchanging data with a server.
  • a system for generating a summary from a plurality of multimedia items comprising: a plurality of client devices, each client device extracting at least one feature from a multimedia item; and network means for interconnecting the plurality of client devices to enable generation of a summary from the extracted at least one feature of each of a plurality of multimedia items.
  • a client device for enabling generation of a summary from a plurality of multimedia items, the device comprising: an extractor for extracting at least one feature from a multimedia item; and a transceiver for transmitting the extracted at least one feature to at least one other device and for receiving a summary generated from the extracted at least one feature of each of a plurality of multimedia items.
  • a server device for generating a summary from a plurality of multimedia items
  • the device comprising: a transceiver for receiving at least one feature extracted from a multimedia item from at least one client device; and means for generating a summary from the received at least one feature of each of a plurality of multimedia items, the transceiver transmitting the generated summary to at least one client device.
  • a method for generating a summary of a plurality of multimedia items comprising the steps of: receiving at least one feature extracted from a multimedia item from at least one client device; and generating a summary from the received at least one feature of each of a plurality of multimedia items.
  • transmitting the extracted features preserves the privacy of the client devices since the content of the multimedia items cannot be reconstructed from the features alone. For example, extracted features such as the date and time of when a picture was taken, or the GPS coordinates of where a picture was taken do not reveal any information regarding the content of the picture. Also, features extracted for matching similar faces do not allow a face to be reconstructed, but can be used to detect multimedia items that include the same person. The extracted features, therefore, only provide a sufficient quantity of information to enable the generation of a summary of multimedia items representative of an entire event.
  • the client devices can control their privacy and the quality of the generated summary that they receive. For example, if the client device requires greater privacy and does not require a summary of particularly high quality, the client device can choose to extract only very low-level features (such as GPS coordinates). If, on the other hand, the client device requires a summary of higher quality, the client device can choose to extract high-level features (such as environment recognition, face recognition, and event recognition). In this way, the client devices are able to preserve their privacy by controlling the features that are extracted and transmitted.
  • the network means may comprise means for enabling the extracted at least one feature of each client device to be exchanged between the plurality of client devices, each of the plurality of client devices generating a summary from the exchanged features.
  • each client device can generate a summary by exchanging the extracted features among other client devices, without having to transmit the features to a central server, thus conserving bandwidth.
  • the network means may comprise a central server device for generating the summary from the extracted at least one feature received from each of the plurality of clients.
  • the central server device may generate a skeleton summary from the extracted at least one feature received from each of the plurality of client devices and each of the client devices may generate a full summary from said skeleton summary.
  • the skeleton summary may include, for example, references to the multimedia items originating from multiple users.
  • the full summary includes parts of the multimedia items.
  • At least one of the plurality of client devices may be enabled to receive, from any other client device, the partial content of any of the multimedia items that the skeleton summary requires to generate the full summary.
  • the central server device may generate a new skeleton summary if one of the other client devices is unavailable.
  • the central server device may also update the skeleton summary based upon content of multimedia items of available client devices.
  • the skeleton summary may include instructions to generate a summary in the event that a client is unavailable.
  • the client device may further comprise an output device for playback of the summary and/or an input means for manually editing the summary.
  • the summary can easily be edited and shared.
  • the present invention can be applied to video content or digital photograph collections and is not limited to audiovisual data but can also be applied to multimedia streams including other sensor data, such as place, time, temperature, physiological data, etc. It can be easily applied to purposes that require combining audio/video/images from multiple recordings such as news summarization, creating mash-ups from individual videos, surveillance, etc.
  • a summary may be considered a subset of the entire content provided by all the client devices.
  • Fig. 1 is a simplified schematic of a system for generating a summary from a plurality of multimedia items according to an embodiment of the present invention
  • Fig. 2 is a simplified schematic of one example of a client device according to an embodiment of the present invention.
  • Fig. 3 is a simplified schematic of a server device according to an embodiment of the present invention.
  • Figure 1 is a simplified schematic of a system according to an embodiment of the present invention.
  • the system of Figure 1 is based on a client-server architecture.
  • the system of Figure 1 comprises a plurality of interconnected client devices 102, 104, 106 and a central server device 108.
  • the client devices may be, for example, mobile devices/phones with embedded cameras or ambient devices such as digital video still cameras, surveillance cameras, microphones, etc.
  • each client device 102, 104, 106 comprises a capture device 202, for example, a camera or microphone.
  • the capture device 202 may be embedded in the client device (as shown) or separate and connected to the client device.
  • the capture device is connected to an extractor 204.
  • the output of the extractor 204 is connected to a generating means 212.
  • the output of the extractor 204 is also connected to a transceiver 206.
  • the output of the transceiver 206 is output on an output terminal 208.
  • the client device 102 also comprises an input terminal 210 for input into the transceiver 206.
  • the output of the transceiver 206 is connected to the generating means 212.
  • the client device 102 also comprises an input interface 214 (for example, a keypad).
  • the output of the input interface 214 is connected to the generating means 212.
  • the output of the generating means 212 is connected to an output device 216.
  • the output device 216 may, for example, be a display and display driver, a storage device, or a network connection for remote storage.
  • a multimedia item is input into the extractor 204 from the capture device 202 of the client device 102.
  • the extractor 204 extracts at least one feature from the multimedia item.
  • the extracted features may be, for example, the camera time or audio fingerprints or may be specific features such as colour, faces, camera angle, audio volume, etc.
  • the server device may instruct certain client devices to extract certain features based on the capabilities of the client devices. For example, a very powerful client device (such as a PC) could be instructed by the server to extract all types of advanced features, whereas a less powerful client device (such as a mobile terminal) could be instructed by the server device to extract only basic, low-level features.
  • the extracted features are input into the transceiver 206 and the transceiver 206 transmits the extracted features to the server device 108 via the output terminal 208.
  • the transceiver 206 may encode the extracted features before transmitting them using, for example, the MPEG-7 standard.
  • the extracted features may be transmitted, for example, wirelessly or by a wired link via a dedicated network or the Internet.
  • each client device 102, 104, 106 performs feature extraction locally and transmits only the extracted features to the server device 108.
  • the client devices 102, 104, 106 are not required to transmit their entire multimedia files to the server device 108, which minimizes the usage of bandwidth. In practice, the bandwidth required to transmit the extracted features is almost negligible compared to the bandwidth required to transmit the entire multimedia files.
  • the server device 108 comprises an input terminal 302 for input into a transceiver 304.
  • the output of the transceiver 304 is connected to a processor 306.
  • the output of the processor 306 is connected to the transceiver 304 for output on an output terminal 308.
  • the output of the processor 306 is also connected to a storage means 310.
  • the transceiver 304 of the server device 108 receives the features transmitted from each of the client devices 102, 104, 106 via the input terminal 302 and inputs the features into the processor 306.
  • the processor 306 may also synchronise the content using audio fingerprinting, for example. Alternatively, the processor 306 may automatically select multimedia items from all the multimedia items received without synchronising the content, instead observing the content features.
  • the processor 306 then analyses the received features and generates a skeleton summary for each client device 102, 104, 106 based on the received features.
  • the processor 306 may generate a single skeleton summary for all the client devices 102, 104, 106 or, alternatively, the processor 306 may generate personalised skeleton summaries for each client device based on preferences indicated by a user of each client device.
  • the skeleton summary is a list of references to the multimedia content required and to the client devices that own the required multimedia content.
  • the skeleton summary may be a list of time stamps that refer to the camera times or recording times of the client devices, and may additionally contain editing instructions, such as which filters to apply, where and when to apply them, which transitions to use between the different multimedia segments, and when to enhance video quality (e.g. to correct shaking or blur).
  • the processor 306 inputs the generated skeleton summary into the transceiver 304.
  • the transceiver 304 transmits the generated skeleton summary to each client device 102, 104, 106 via the output terminal 308. If the processor 306 generates multiple personalised skeleton summaries, different skeleton summaries may be transmitted to each of the client devices 102, 104, 106.
  • the transceiver 304 may encode the skeleton summary before transmitting it using, for example, the MPEG-7 standard.
  • the processor 306 may also input the generated skeleton summary into a storage means 310 and the storage means 310 stores the generated skeleton summary.
  • the transceiver 206 of each client device 102, 104, 106 receives the generated skeleton summary from the server device 108 via the input terminal 210.
  • the transceiver 206 of each client device 102, 104, 106 communicates with the transceivers of the other client devices via the output terminal 208 to retrieve the required multimedia content indicated by the references in the skeleton summary. In this way, the usage of bandwidth is minimised as only the multimedia content that is required for the final video summary is retrieved.
  • Each client device may have various privacy settings to allow or deny other client devices access to the multimedia content of that particular client device.
  • When a client device denies another client device access to its multimedia content, or when multimedia content is unavailable because the client device is offline or the content has been moved, renamed or changed, the other client device is informed. The client device then either waits until the multimedia content becomes available or requests that the server device 108 compose a new skeleton summary.
  • the server device 108 may constantly update the generated skeleton summary based on the multimedia content that is currently available. Also, the skeleton summary may include instructions on the actions that could be taken if certain multimedia content is unavailable, such as an instruction to use another multimedia content.
  • the transceiver 206 inputs the retrieved multimedia content into the generating means 212.
  • the generating means 212 then generates a full summary based on the skeleton summary.
  • the generating means 212 may receive preferences indicated by a user via the user interface 214 and may generate the summary based on the received preferences. In this way, a user can manually edit the summary.
  • the generating means 212 outputs the final summary into the output device 216.
  • the output device 216 can then playback the summary.
  • the system is based on a fully distributed peer-to-peer architecture.
  • This system is similar to that shown in Figure 1 in that it comprises a plurality of interconnected client devices 102, 104, 106 that are enabled to communicate with each other, the only difference being that the system does not require a central server device.
  • each client device 102, 104, 106 is configured as previously described with reference to Figure 2.
  • a multimedia item is input into the extractor 204 from the capture device 202 of the client device 102.
  • the extractor 204 extracts at least one feature from the multimedia item.
  • the extractor 204 inputs the extracted features into the generating means 212 and the transceiver 206.
  • the transceiver 206 transmits the extracted features to the other client devices via the output terminal 208 and receives the features transmitted from the other client devices via the input terminal 210. In this way, each client device receives all the features from all the other client devices.
  • the transceiver 206 may encode the extracted features before transmitting them using, for example, the MPEG-7 standard.
  • the transceiver 206 inputs the received features into the generating means 212.
  • the generating means 212 synchronises the extracted features and the received features and generates a summary. In this way, each client device performs synchronisation and summary generation locally.
  • the generating means 212 may receive preferences indicated by a user via the user interface 214 and may generate the summary based on the received preferences. In this way, a user can manually edit the summary.
  • the generating means 212 outputs the generated summary on the output device 216.
  • the output device 216 can then playback the summary.
  • the invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer.
  • In an apparatus claim enumerating several means, several of these means can be embodied by one and the same item of hardware.
  • 'Computer program product' is to be understood to mean any software product stored on a computer-readable medium, such as a floppy disk, downloadable via a network, such as the Internet, or marketable in any other manner.
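For illustration only (this sketch is not part of the patent disclosure), the fully distributed peer-to-peer variant described in the list above can be modelled as each client broadcasting its extracted features to every peer and then generating a summary locally. Features are reduced here to a single relevance score per client; real features (fingerprints, histograms, face data) would be richer, and all names are assumptions.

```python
def exchange_features(clients):
    """clients: dict mapping client id -> its locally extracted feature(s).
    After the exchange, every client holds every peer's features."""
    return {cid: dict(clients) for cid in clients}

def generate_local_summary(all_features, top_n=2):
    """Toy stand-in for summary generation: keep the top-scoring items."""
    ranked = sorted(all_features.items(), key=lambda kv: kv[1], reverse=True)
    return [cid for cid, _ in ranked[:top_n]]
```

Because every client ends up holding the same feature set, each one can generate the same summary locally, with no central server involved.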

Abstract

A summary is generated from a plurality of multimedia items. The system comprises a plurality of client devices (102, 104, 106) interconnected by network means. Each client device (102, 104, 106) extracts at least one feature from a multimedia item. The interconnection of the plurality of client devices (102, 104, 106) enables a summary to be generated from the extracted at least one feature of each of a plurality of multimedia items.

Description

System and method for generating a summary from a plurality of multimedia items
FIELD OF THE INVENTION
The present invention relates to a system and method for generating a summary from a plurality of multimedia items.
BACKGROUND OF THE INVENTION
As camcorders have become less expensive, more people are using them to record events. A growing number of people are also using mobile devices, such as mobile telephones with embedded capturing devices such as cameras and microphones, since these devices make recordings readily and effortlessly available. Ambient devices mounted on ceilings or walls, or embedded in cars to record the surroundings for security or surveillance purposes, may also be used where their recordings can be made available.
The multimedia content produced by camcorders and mobile terminals is commonly used for entertainment purposes. For example, the multimedia content is used to provide an overview of an event such as a vacation, a birthday, a party, a wedding, etc. On these occasions, it is common for multiple participants to record the event from different perspectives. It is, therefore, beneficial for participants to obtain a summary of the recorded multimedia content that provides the best representation of the event, according to their preferences and interests.
The inventors have developed a system for generating a summary of a plurality of multimedia items. This system comprises a plurality of client devices (for example, ambient cameras, storage devices, mobile devices, etc.) and a server device. In this system, each client device transmits a multimedia file to a server device. The server device receives the transmitted multimedia files, extracts features from each multimedia file, synchronizes the features, and automatically generates a summary that includes the most suitable parts of each multimedia file. The server device then transmits the generated summary to the client devices.
This system provides client devices with a personalised summary of the multimedia items. However, each client device is required to transmit entire multimedia files to a server device, even if the final summary only requires the exchange of short parts of the multimedia files. This system is, therefore, inefficient in terms of network bandwidth. Furthermore, since the multimedia files have to be decrypted at the server device to enable the server device to extract features from the multimedia files, the privacy of the client is not preserved.
SUMMARY OF THE INVENTION
The present invention seeks to provide a system that uses minimum bandwidth and minimum time for exchanging data with a server.
This is achieved, according to an aspect of the present invention, by a system for generating a summary from a plurality of multimedia items, the system comprising: a plurality of client devices, each client device extracting at least one feature from a multimedia item; and network means for interconnecting the plurality of client devices to enable generation of a summary from the extracted at least one feature of each of a plurality of multimedia items. This is also achieved, according to another aspect of the present invention, by a client device for enabling generation of a summary from a plurality of multimedia items, the device comprising: an extractor for extracting at least one feature from a multimedia item; and a transceiver for transmitting the extracted at least one feature to at least one other device and for receiving a summary generated from the extracted at least one feature of each of a plurality of multimedia items.
This is also achieved, according to yet another aspect of the present invention, by a server device for generating a summary from a plurality of multimedia items, the device comprising: a transceiver for receiving at least one feature extracted from a multimedia item from at least one client device; and means for generating a summary from the received at least one feature of each of a plurality of multimedia items, the transceiver transmitting the generated summary to at least one client device.
This is also achieved, according to yet another aspect of the present invention, by a method for generating a summary of a plurality of multimedia items, the multimedia items being created by a plurality of interconnected client devices, the method comprising the steps of: receiving at least one feature extracted from a multimedia item from at least one client device; and generating a summary from the received at least one feature of each of a plurality of multimedia items.
In this way, only extracted features of the multimedia items are transmitted to a central server device, which uses less bandwidth compared to transmitting entire multimedia items. In practice, the bandwidth required to transmit the extracted features is almost negligible compared to the bandwidth required to transmit the entire multimedia items.
Furthermore, transmitting the extracted features preserves the privacy of the client devices since the content of the multimedia items cannot be reconstructed from the features alone. For example, extracted features such as the date and time of when a picture was taken, or the GPS coordinates of where a picture was taken do not reveal any information regarding the content of the picture. Also, features extracted for matching similar faces do not allow a face to be reconstructed, but can be used to detect multimedia items that include the same person. The extracted features, therefore, only provide a sufficient quantity of information to enable the generation of a summary of multimedia items representative of an entire event.
It is possible for the client devices to control their privacy and the quality of the generated summary that they receive. For example, if a client device requires greater privacy and does not require a summary of particularly high quality, it can choose to extract only very low-level features (such as GPS coordinates). If, on the other hand, the client device requires a summary of higher quality, it can choose to extract high-level features (such as environment recognition, face recognition, and event recognition). In this way, the client devices are able to preserve their privacy by controlling the features that are extracted and transmitted.
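For illustration only (not part of the patent disclosure), the privacy/quality trade-off above can be sketched as a client choosing its feature set from its privacy setting. The feature names come from the examples in the text; the function name and privacy levels are assumptions.

```python
LOW_LEVEL = ["gps_coordinates", "date_time"]
HIGH_LEVEL = LOW_LEVEL + ["environment_recognition", "face_recognition",
                          "event_recognition"]

def select_features(privacy):
    """High privacy: only low-level features ever leave the device."""
    return list(LOW_LEVEL) if privacy == "high" else list(HIGH_LEVEL)
```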
According to an embodiment of the present invention, the network means may comprise means for enabling the extracted at least one feature of each client device to be exchanged between the plurality of client devices, each of the plurality of client devices generating a summary from the exchanged features. In this way, each client device can generate a summary by exchanging the extracted features among other client devices, without having to transmit the features to a central server, thus conserving bandwidth. Furthermore, since only extracted features are exchanged between client devices and not entire multimedia items, each client device preserves their privacy. Alternatively, the network means may comprise a central server device for generating the summary from the extracted at least one feature received from each of the plurality of clients.
The central server device may generate a skeleton summary from the extracted at least one feature received from each of the plurality of client devices and each of the client devices may generate a full summary from said skeleton summary. The skeleton summary may include, for example, references to the multimedia items originating from multiple users. The full summary includes parts of the multimedia items.
At least one of the plurality of client devices may be enabled to receive, from any other client device, the partial content of any of the multimedia items that the skeleton summary requires to generate the full summary.
The central server device may generate a new skeleton summary if one of the other client devices is unavailable. The central server device may also update the skeleton summary based upon content of multimedia items of available client devices. The skeleton summary may include instructions to generate a summary in the event that a client is unavailable.
The client device may further comprise an output device for playback of the summary and/or an input means for manually editing the summary. In this way, the summary can easily be edited and shared. The present invention can be applied to video content or digital photograph collections and is not limited to audiovisual data but can also be applied to multimedia streams including other sensor data, such as place, time, temperature, physiological data, etc. It can be easily applied to purposes that require combining audio/video/images from multiple recordings such as news summarization, creating mash-ups from individual videos, surveillance, etc.
A summary may be considered a subset of the entire content provided by all the client devices.
BRIEF DESCRIPTION OF THE DRAWINGS For a more complete understanding of the present invention, reference is made to the following description in conjunction with the accompanying drawings, in which:
Fig. 1 is a simplified schematic of a system for generating a summary from a plurality of multimedia items according to an embodiment of the present invention;
Fig. 2 is a simplified schematic of one example of a client device according to an embodiment of the present invention; and
Fig. 3 is a simplified schematic of a server device according to an embodiment of the present invention.
DETAILED DESCRIPTION OF THE EMBODIMENTS Figure 1 is a simplified schematic of a system according to an embodiment of the present invention. The system of Figure 1 is based on a client-server architecture.
The system of Figure 1 comprises a plurality of interconnected client devices 102, 104, 106 and a central server device 108. The client devices may be, for example, mobile devices/phones with embedded cameras or ambient devices such as digital video still cameras, surveillance cameras, microphones, etc.
With reference to Figure 2, each client device 102, 104, 106 comprises a capture device 202, for example, a camera or microphone. The capture device 202 may be embedded in the client device (as shown) or separate and connected to the client device. The capture device is connected to an extractor 204. The output of the extractor 204 is connected to a generating means 212. The output of the extractor 204 is also connected to a transceiver 206. The output of the transceiver 206 is output on an output terminal 208. The client device 102 also comprises an input terminal 210 for input into the transceiver 206. The output of the transceiver 206 is connected to the generating means 212. The client device 102 also comprises an input interface 214 (for example, a keypad). The output of the input interface 214 is connected to the generating means 212. The output of the generating means 212 is connected to an output device 216. The output device 216 may, for example, be a display and display driver, a storage device, or a network connection for remote storage.
A multimedia item is input into the extractor 204 from the capture device 202 of the client device 102. The extractor 204 extracts at least one feature from the multimedia item. The extracted features may be, for example, the camera time or audio fingerprints or may be specific features such as colour, faces, camera angle, audio volume, etc. The server device may instruct certain client devices to extract certain features based on the capabilities of the client devices. For example, a very powerful client device (such as a PC) could be instructed by the server to extract all types of advanced features, whereas a less powerful client device (such as a mobile terminal) could be instructed by the server device to extract only basic, low-level features.
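As an illustrative sketch only (not part of the patent disclosure), client-side extraction of two of the features mentioned above, the camera time and a coarse colour feature, might look as follows. The `FeatureSet` shape and the 8-bucket histogram scheme are assumptions; a real extractor would also produce audio fingerprints, faces, and so on.

```python
from dataclasses import dataclass

@dataclass
class FeatureSet:
    camera_time: float      # capture time reported by the client's clock
    colour_histogram: list  # 8 coarse RGB buckets (1 bit per channel)

def extract_features(pixels, camera_time):
    """pixels: iterable of (r, g, b) tuples in the 0..255 range."""
    hist = [0] * 8
    for r, g, b in pixels:
        # One bit per channel gives 8 coarse colour buckets.
        hist[(r // 128) * 4 + (g // 128) * 2 + (b // 128)] += 1
    return FeatureSet(camera_time, hist)
```

A feature set like this reveals when something was captured and its rough colour balance, but the picture itself cannot be reconstructed from it.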
The extracted features are input into the transceiver 206 and the transceiver 206 transmits the extracted features to the server device 108 via the output terminal 208. The transceiver 206 may encode the extracted features before transmitting them using, for example, the MPEG-7 standard. The extracted features may be transmitted, for example, wirelessly or by a wired link via a dedicated network or the Internet. In this way, each client device 102, 104, 106 performs feature extraction locally and transmits only the extracted features to the server device 108. In other words, the client devices 102, 104, 106 are not required to transmit their entire multimedia files to the server device 108, which minimizes the usage of bandwidth. In practice, the bandwidth required to transmit the extracted features is almost negligible compared to the bandwidth required to transmit the entire multimedia files.
With reference to Figure 3, the server device 108 comprises an input terminal 302 for input into a transceiver 304. The output of the transceiver 304 is connected to a processor 306. The output of the processor 306 is connected to the transceiver 304 for output on an output terminal 308. The output of the processor 306 is also connected to a storage means 310. The transceiver 304 of the server device 108 receives the features transmitted from each of the client devices 102, 104, 106 via the input terminal 302 and inputs the features into the processor 306. The processor 306 may synchronise the content using, for example, audio fingerprinting. Alternatively, the processor 306 may automatically select multimedia items from all the multimedia items received without synchronising the content, but instead by analysing the content features.
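The synchronisation by audio fingerprinting mentioned above can be illustrated with a simple sketch. Real fingerprinting schemes are considerably more robust; modelling the fingerprints as short hashable tokens sampled at a fixed frame rate is purely an illustrative assumption.

```python
# Sketch of synchronising two recordings of the same event by matching
# audio fingerprints. Each matching token pair votes for a candidate time
# offset; the offset with the most votes wins. Illustrative only.

from collections import Counter

def estimate_offset(fp_a: list[str], fp_b: list[str]) -> int:
    """Return the most likely frame offset of recording B relative to A."""
    index_a: dict[str, list[int]] = {}
    for i, token in enumerate(fp_a):
        index_a.setdefault(token, []).append(i)

    votes: Counter[int] = Counter()
    for j, token in enumerate(fp_b):
        for i in index_a.get(token, []):
            votes[i - j] += 1          # each matching token votes for an offset
    offset, _ = votes.most_common(1)[0]
    return offset
```

With the offsets known, the server can place all recordings on a common timeline before selecting segments for the summary.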
The processor 306 then analyses the received features and generates a skeleton summary for each client device 102, 104, 106 based on the received features. The processor 306 may generate a single skeleton summary for all the client devices 102, 104, 106 or, alternatively, the processor 306 may generate personalised skeleton summaries for each client device based on preferences indicated by a user of each client device.
The skeleton summary is a list of references to the multimedia content required and to the client devices that own the required multimedia content. The skeleton summary may be a list of time stamps that refer to the camera times or recording times of the client devices and may additionally contain editing instructions, for example, which filters to apply, where and when to apply the filters, which transitions to use between the different multimedia segments, and when to enhance video quality (for example, to correct shaking or blur).
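One possible representation of such a skeleton summary is sketched below. The field names are illustrative assumptions; the description above only requires references to the required content, the owning client device, and optional editing instructions.

```python
# Sketch of a skeleton summary: a list of references to content owned by
# particular client devices, each with time stamps and optional editing
# instructions. Field names are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class SkeletonEntry:
    owner_device: str                 # client device that owns the content
    start_time: float                 # camera/recording time stamp (seconds)
    end_time: float
    instructions: dict = field(default_factory=dict)  # filters, transitions, ...

skeleton = [
    SkeletonEntry("client_102", 12.0, 18.5,
                  {"filter": "stabilise", "transition": "crossfade"}),
    SkeletonEntry("client_104", 18.5, 25.0),   # no editing instructions
]
```

Because the skeleton carries only references and instructions, it remains small regardless of the size of the underlying multimedia files.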
The processor 306 inputs the generated skeleton summary into the transceiver 304. The transceiver 304 transmits the generated skeleton summary to each client device 102, 104, 106 via the output terminal 308. If the processor 306 generates multiple personalised skeleton summaries, different skeleton summaries may be transmitted to each of the client devices 102, 104, 106. The transceiver 304 may encode the skeleton summary before transmitting it using, for example, the MPEG-7 standard. The processor 306 may also input the generated skeleton summary into a storage means 310 and the storage means 310 stores the generated skeleton summary.
The transceiver 206 of each client device 102, 104, 106 receives the generated skeleton summary from the server device 108 via the input terminal 210. The transceiver 206 of each client device 102, 104, 106 communicates with the transceivers of the other client devices via the output terminal 208 to retrieve the required multimedia content indicated by the references in the skeleton summary. In this way, the usage of bandwidth is minimised as only the multimedia content that is required for the final video summary is retrieved. Each client device may have various privacy settings to allow or deny other client devices access to the multimedia content of that particular client device. When a client device denies another client device access to its multimedia content, or when multimedia content is not available because the client device is offline or because the content has been moved, renamed or changed, the requesting client device is informed. The requesting client device then either waits until the multimedia content becomes available or requests that the server device 108 compose a new skeleton summary. The server device 108 may constantly update the generated skeleton summary based on the multimedia content that is currently available. The skeleton summary may also include instructions on the actions to be taken if certain multimedia content is unavailable, such as an instruction to use alternative multimedia content.
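The retrieval-with-fallback behaviour described above can be sketched as follows. The peer record structure and the failure reasons are illustrative assumptions; the description only requires that the requesting client be informed and either wait or request a new skeleton summary.

```python
# Sketch of retrieving skeleton-referenced content from peer client devices,
# falling back to a request for a new skeleton summary when a peer is
# offline or denies access. The peer record fields are illustrative.

def retrieve_segment(peers: dict, owner: str, request_new_skeleton):
    """Fetch a segment from its owning peer, or invoke the fallback."""
    peer = peers.get(owner)
    if peer is None or not peer.get("online", False):
        return request_new_skeleton("offline")        # peer unavailable
    if not peer.get("access_granted", False):
        return request_new_skeleton("access_denied")  # privacy setting denies access
    return peer["content"]
```

The `request_new_skeleton` callback stands in for the message to the server device 108; a client could equally choose to wait and retry when the reason is transient.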
Once the transceiver 206 has retrieved the required multimedia content via the input terminal 210, the transceiver 206 inputs the retrieved multimedia content into the generating means 212. The generating means 212 then generates a full summary based on the skeleton summary. The generating means 212 may receive preferences indicated by a user via the input interface 214 and may generate the summary based on the received preferences. In this way, a user can manually edit the summary. The generating means 212 outputs the final summary to the output device 216. The output device 216 can then play back the summary.
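The final assembly step can be sketched as concatenating the retrieved segments in skeleton order, optionally filtered by a user preference. The minimum-duration preference is an illustrative assumption, not part of the disclosure.

```python
# Sketch of assembling the full summary from retrieved segments in skeleton
# order. As an illustrative user preference, segments shorter than a
# minimum duration may be dropped.

def generate_summary(segments, min_duration=0.0):
    """segments: list of (start, end, content) tuples in skeleton order."""
    return [content for start, end, content in segments
            if (end - start) >= min_duration]
```

Any other preference (favourite faces, preferred cameras, and so on) could be applied in the same pass before the summary is output for playback.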
In an alternative embodiment of the present invention, the system is based on a fully distributed peer-to-peer architecture. This system is similar to that shown in Figure 1 in that it comprises a plurality of interconnected client devices 102, 104, 106 that are enabled to communicate with each other, the only difference being that the system does not require a central server device. According to the alternative embodiment of the present invention, each client device 102, 104, 106 is configured as previously described with reference to Figure 2.
With reference to Figure 2, a multimedia item is input into the extractor 204 from the capture device 202 of the client device 102. The extractor 204 extracts at least one feature from the multimedia item. The extractor 204 inputs the extracted features into the generating means 212 and the transceiver 206.
The transceiver 206 transmits the extracted features to the other client devices via the output terminal 208 and receives the features transmitted from the other client devices via the input terminal 210. In this way, each client device receives all the features from all the other client devices. The transceiver 206 may encode the extracted features before transmitting them using, for example, the MPEG-7 standard. The transceiver 206 inputs the received features into the generating means 212.
The generating means 212 synchronises the extracted features and the received features and generates a summary. In this way, each client device performs synchronisation and summary generation locally. The generating means 212 may receive preferences indicated by a user via the input interface 214 and may generate the summary based on the received preferences. In this way, a user can manually edit the summary. The generating means 212 outputs the generated summary on the output device 216. The output device 216 can then play back the summary.
Although embodiments of the present invention have been illustrated in the accompanying drawings and described in the foregoing detailed description, it will be understood that the invention is not limited to the embodiments disclosed, but is capable of numerous modifications without departing from the scope of the invention as set out in the following claims. The invention resides in each and every novel characteristic feature and each and every combination of characteristic features. Reference numerals in the claims do not limit their protective scope. Use of the verb "to comprise" and its conjugations does not exclude the presence of elements other than those stated in the claims. Use of the article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. 'Means', as will be apparent to a person skilled in the art, are meant to include any hardware (such as separate or integrated circuits or electronic elements) or software (such as programs or parts of programs) which reproduce in operation or are designed to reproduce a specified function, be it solely or in conjunction with other functions, be it in isolation or in co-operation with other elements. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer.
In the apparatus claim enumerating several means, several of these means can be embodied by one and the same item of hardware. 'Computer program product' is to be understood to mean any software product stored on a computer-readable medium, such as a floppy disk, downloadable via a network, such as the Internet, or marketable in any other manner.

Claims

CLAIMS:
1. A system for generating a summary from a plurality of multimedia items, said system comprising: a plurality of client devices, each client device extracting at least one feature from a multimedia item; and network means for interconnecting said plurality of client devices to enable generation of a summary from said extracted at least one feature of each of a plurality of multimedia items.
2. A system according to claim 1, wherein said network means comprises means for enabling said extracted at least one feature of each client device to be exchanged between said plurality of client devices, each of said plurality of client devices generating a summary from said exchanged features.
3. A system according to claim 1, wherein said network means comprises a central server device for generating said summary from said extracted at least one feature received from each of said plurality of client devices.
4. A system according to claim 3, wherein said central server device generates a skeleton summary from said extracted at least one feature received from each of said plurality of client devices.
5. A system according to claim 4, wherein each of said client devices generates a full summary from said skeleton summary.
6. A system according to claim 5, wherein at least one of said plurality of client devices is enabled to receive partial content of any of said plurality of multimedia items from any other client devices required by said skeleton summary to generate said full summary.
7. A system according to claim 3, wherein said central server device generates a full summary from said extracted at least one feature received from each of said plurality of client devices.
8. A client device for enabling generation of a summary from a plurality of multimedia items, the device comprising: an extractor for extracting at least one feature from a multimedia item; and a transceiver for transmitting the extracted at least one feature to at least one other device and for receiving a summary generated from the extracted at least one feature of each of a plurality of multimedia items.
9. A client device according to claim 8, wherein said at least one other device is at least one other client device and wherein said transceiver is operative to receive said summary from said at least one other client device.
10. A client device according to claim 8, wherein said client device further comprises an output device for playback of said summary.
11. A computer program product comprising a plurality of program code portions for enabling a programmable device to act as the client according to any one of claims 8 to 10.
12. A server device for generating a summary from a plurality of multimedia items, the device comprising: a transceiver for receiving at least one feature extracted from a multimedia item from at least one client device; and means for generating a summary from said received at least one feature of each of a plurality of multimedia items, the transceiver transmitting said generated summary to at least one client device.
13. A method for generating a summary of a plurality of multimedia items, said multimedia items being created by a plurality of interconnected client devices, the method comprising the steps of: receiving at least one feature extracted from a multimedia item from at least one client device; and generating a summary from said received at least one feature of each of a plurality of multimedia items.
14. A method according to claim 13, wherein generating a summary comprises generating a skeleton summary.
15. A method according to claim 14, wherein the method further comprises the step of: generating a new skeleton summary if one of said other client devices is unavailable.
16. A method according to claim 14 or 15, wherein the method further comprises the step of: updating said skeleton summary based upon content of multimedia items of available client devices.
17. A method according to any one of claims 14 to 16, wherein said skeleton summary includes instructions to generate a summary in the event that a client is unavailable.
18. A method according to claim 13, wherein generating a summary comprises generating a full summary.
19. A computer program product comprising a plurality of program code portions for carrying out the method according to any one of claims 13 to 18.
PCT/IB2008/052470 2007-06-28 2008-06-23 System and method for generating a summary from a plurality of multimedia items WO2009001278A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP07111288.2 2007-06-28
EP07111288 2007-06-28

Publications (1)

Publication Number Publication Date
WO2009001278A1 (en) 2008-12-31

Family

ID=39830250

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2008/052470 WO2009001278A1 (en) 2007-06-28 2008-06-23 System and method for generating a summary from a plurality of multimedia items

Country Status (1)

Country Link
WO (1) WO2009001278A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011095781A1 (en) 2010-02-05 2011-08-11 Kind Consumer Limited A simulated smoking device
US9582574B2 (en) 2015-01-06 2017-02-28 International Business Machines Corporation Generating navigable content overviews

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030179294A1 (en) * 2002-03-22 2003-09-25 Martins Fernando C.M. Method for simultaneous visual tracking of multiple bodies in a closed structured environment
US20030218696A1 (en) * 2002-05-21 2003-11-27 Amit Bagga Combined-media scene tracking for audio-video summarization
GB2423383A (en) * 2005-02-21 2006-08-23 Motorola Inc Method for generating a personalised content summary

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHANG E Y ET AL: "Toward building a robust and intelligent video surveillance system: a case study", MULTIMEDIA AND EXPO, 2004. ICME '04. 2004 IEEE INTERNATIONAL CONFERENC E ON TAIPEI, TAIWAN JUNE 27-30, 2004, PISCATAWAY, NJ, USA,IEEE, vol. 2, 27 June 2004 (2004-06-27), pages 1391 - 1394, XP010771177, ISBN: 978-0-7803-8603-7 *


Similar Documents

Publication Publication Date Title
US7577636B2 (en) Network-extensible reconfigurable media appliance
US8150807B2 (en) Image storage system, device and method
US7271780B2 (en) Display device and system
JP5247700B2 (en) Method and apparatus for generating a summary
US20050071519A1 (en) Stand alone printer with hardware / software interfaces for sharing multimedia processing
US20030142216A1 (en) Audio-based attention grabber for imaging devices
JP2003520008A (en) Authentication of metadata and embedding of metadata in watermarks of media signals
US20090157696A1 (en) Image sharing system, image managing server, and control method and program thereof
EP1908069A1 (en) Method and system for remote digital editing using narrow band channels
WO2009001278A1 (en) System and method for generating a summary from a plurality of multimedia items
JP6677237B2 (en) Image processing system, image processing method, image processing device, program, and mobile terminal
WO2022239281A1 (en) Image processing device, image processing method, and program
JP3218489U (en) AR content provision system
JP2006094128A (en) Method for processing electronic conference recording data
JP2006080773A (en) Server
Perry Service takes over in the networked world
JP2004336543A (en) Recording apparatus
JP2009021883A (en) Online moving image editing system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 08763421; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 08763421; Country of ref document: EP; Kind code of ref document: A1)