WO2004109549A2

WO2004109549A2 - System and method for performing media content augmentation on an audio signal

Info

Publication number: WO2004109549A2
Application number: PCT/IB2004/050822
Authority: WO
Inventors: Martin Franciscus Mckinney; Jan Alexis Daniel Nesvadba; Dirk Jeroen Breebaart
Original assignee: Koninklijke Philips Electronics N. V.
Priority date: 2003-06-05
Filing date: 2004-06-02
Publication date: 2004-12-16
Also published as: WO2004109549A3

Abstract

The invention describes a system (1) for performing media content augmentation on an audio signal (2). The system comprises a speech identifier (3) for identifying speech content in the audio signal (2); a speech-to-text converter (5) for converting the speech content into a digital text format (6); a key phrase identifier (7) for identifying key phrases (19) in the digital text (6); a search engine (8) for searching a source of information (9) for material relating to the key phrases (19), and a search result compiler (10) to provide a user with results of the search (11). Moreover the invention describes an appropriate method for performing media content augmentation on an audio signal (2).

Description

System and method for performing media content augmentation on an audio signal

This invention relates in general to a system and method for performing media content augmentation on an audio signal, and, in particular, to a system and method for providing media content augmentation in an audio device.

An audio signal can be received by a user as, for example, a radio signal or as part of an audio-visual signal originating from a television broadcast or an audio-visual device. An audio device can be, for example, a radio, a receiver, etc., or an audio-visual device such as a television set, a DVD player, VCR, multimedia system, mobile telephone, etc. Regardless of the source of the audio signal, such a signal may consist of speech, music, sound effects and other audio contents.

The programs received by a user can be of different natures, e.g. news broadcasts, commercials, feature films, online shopping, health programs etc. While listening to or viewing such programs, a user may hear a reference to a word or phrase that he does not recognise. The word or phrase may be a new buzzword, a slang word, or may denote a product with which the user is not acquainted.

Sometimes words are created and used by particular groups, for example rappers, who write music and communicate using words which are not used by the majority of people. Examples of other groups might be small ethnic communities, whose language might be partially adopted by the surrounding communities. Some of these words might be taken over into the vernacular and used commonly in everyday speech, whilst other words might remain local to the groups that created them. Some words might persist in language whereas others may be of a short-lived nature.

If the user wishes to find out more about a particular word or phrase, he is limited to consulting standard dictionaries, which may be available in printed form, or as online internet dictionaries. In the case of new words, such as buzzwords or slang, the user may not be able to find a reference in the dictionaries available, since some time usually elapses before such words are included in new editions of the dictionary, if at all. A similar problem might be experienced by persons learning a foreign language, who listen to or view foreign language programs. Foreign language dictionaries are also generally restricted to "normal" language and may include only a relatively small proportion of new words, slang, buzzwords etc.

In the case of commercials, the user might hear a reference to a product with which he is not familiar, and wish to learn more about it. The product might be new on the market and therefore of interest to the user. For example, he might wish to know more about the ingredients in the case of a food product, or technical information regarding a device, or information about side-effects in the case of a medicinal product. Equally, the user might see the product advertised or mentioned in a foreign-language broadcast, and might wish to locate a supplier. Such information can be difficult to locate, and requires a relatively high level of effort on the part of the user, for example by looking for suppliers in telephone directories or by first locating and then making contact with the manufacturers.

Other information of interest regarding a product might be its price or availability. Since products are often priced differently by various suppliers, the user might be interested in locating the supplier offering the most attractive price. To this end, the user must compare prices himself, by shopping around to find suppliers of the product in which he is interested. The user might also make use of a price agency, which may, in return for a percentage of the sale price, locate the supplier with the lowest price for a particular product. The user must first locate such an agency and then request a price comparison.

If the user is watching a program about a foreign country, he might wish to find out more about how to plan a visit to that country. Interesting information in this case might be travel connections, route planning, visa requirements, vaccine recommendations, currency information etc. The user can locate such information by contacting travel bureaus, researching in libraries or online, and reading appropriate literature. Again, to locate all the desired information requires effort and dedication on the part of the user.

Therefore, an object of the present invention is to provide a system and a method which can be used to easily provide informative media content augmentation on an audio input. To this end, the present invention provides a system for performing media content augmentation on an audio input signal, wherein the system comprises a speech identifier for identifying the speech content in the audio signal, a speech-to-text converter for converting the speech content into a digital text format, a key phrase identifier for identifying key phrases in the digital text; a search engine for searching a source of information for material relating to the key phrases; and a result compiler for providing the user with the results of the search. Here, the key phrases may be single words or may consist of groups of words e.g. complete sentences. Therefore, for the sake of simplicity, any reference to "word" in the following text is assumed to refer also to "phrase", and vice versa, without restriction of the invention.

An appropriate method for media content augmentation of an audio input comprises identifying the speech content in the audio-visual signal, converting the speech content into a digital text format, identifying key phrases in the digital text, searching a source of information for material relating to the key phrases, and providing the user with results of the search.
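By way of illustration only, the five steps of the method might be chained as in the following Python sketch. The class name AugmentationPipeline, the field names, and the stand-in stages of the demo object are assumptions made for this example and are not taken from the description; in particular, the demo treats the incoming bytes as if they were already text, whereas a real system would use actual speech identification and speech-to-text components.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AugmentationPipeline:
    """Chains the five steps of the method; each stage is a pluggable callable."""
    identify_speech: Callable[[bytes], bytes]
    speech_to_text: Callable[[bytes], str]
    identify_key_phrases: Callable[[str], List[str]]
    search: Callable[[str], List[str]]
    compile_results: Callable[[Dict[str, List[str]]], str]

    def run(self, audio: bytes) -> str:
        speech = self.identify_speech(audio)          # step 1: isolate speech
        text = self.speech_to_text(speech)            # step 2: digital text
        phrases = self.identify_key_phrases(text)     # step 3: key phrases
        found = {p: self.search(p) for p in phrases}  # step 4: information search
        return self.compile_results(found)            # step 5: results for the user


# Stand-in stages (hypothetical): capitalised words are treated as key phrases.
demo = AugmentationPipeline(
    identify_speech=lambda audio: audio,
    speech_to_text=lambda speech: speech.decode("utf-8"),
    identify_key_phrases=lambda text: [w for w in text.split() if w.istitle()],
    search=lambda phrase: [f"result for {phrase}"],
    compile_results=lambda found: "\n".join(
        f"{phrase}: {', '.join(hits)}" for phrase, hits in found.items()
    ),
)

print(demo.run(b"try the new Zorbex cleaner"))
```

Keeping each stage behind a plain callable reflects the statement that the modules may be realised with off-the-shelf components; any stage could be swapped without touching the others.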

The system thus provides an easy way of quickly searching for and locating information of any type concerning words identified on an audio signal which are of interest to the user, who no longer needs to invest time and resources by initiating and carrying out such a search on his own. The dependent claims and the subsequent description disclose particularly advantageous embodiments and features of the invention.

The modules which perform speech identification and conversion to digital text can be realised by one skilled in the art by using off-the-shelf components. These modules may also be realised as a single component, using available software and/or hardware components. The digital text created by the speech-to-text converter can then be analysed for content by the key phrase identifier which first processes the digital text to filter out uninteresting words, such as definite/indefinite articles, conjunctions etc. What remains is a list of possible key phrases which might be candidates for an information search. One possible method of operation of the key phrase identifier is described further on in the following text.
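The initial filtering step might, for instance, be sketched in Python as follows. The stop-word list shown is a small hypothetical sample (the description names only articles and conjunctions as examples), and the function name candidate_key_phrases is likewise an assumption of this sketch.

```python
import re
from typing import List

# Hypothetical sample of "uninteresting" words: articles and conjunctions.
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "nor", "so", "yet", "for"}


def candidate_key_phrases(digital_text: str) -> List[str]:
    """Return words that survive stop-word filtering, in first-seen order, without repeats."""
    words = re.findall(r"[A-Za-z][A-Za-z'-]*", digital_text.lower())
    seen = set()
    candidates = []
    for word in words:
        if word in STOP_WORDS or word in seen:
            continue
        seen.add(word)
        candidates.append(word)
    return candidates


print(candidate_key_phrases("The cat and the dog"))
```

The list returned here is only a list of possible candidates; whether a candidate becomes a key phrase is decided in a later step.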

The source of information searched with regard to information relating to the key phrases might be, for example, an information database, the internet, or an intranet. The type of information to be located might be in the form of sound clips, text, video clips, URLs, pictures etc. The result compiler organises the located material into a manner suitable for presentation to the user, for example in the form of a text summary with embedded graphics and hyperlinks, or a collection of video clips. In one embodiment of the invention, the result compiler is incorporated into the system in such a way as to be able to present to the user the results of the information in the same device which contains the audio receiver. The results of the information search may equally be stored in a memory for later retrieval and perusal, and may be made available for processing or viewing on another device.

In a preferred embodiment of the invention, a dictionary is incorporated in the system, for use by the key phrase identifier in identification of key phrases. The user can access the dictionary by means of a suitable interface. Thus, the dictionary can be updated and extended as required. A particularly advantageous embodiment of the invention is such that the dictionary contains a list of known words (phrases). The words already contained in the dictionary are excluded from any information search, since they are words already known to the user. Any word that is "new", i.e. does not exist in the current version of the dictionary, is a potential candidate for an information search. The dictionary can actively be updated, i.e. once a search has been successfully carried out for a new, hitherto unknown word, the user can indicate, via the interface, that this word is to be entered in the dictionary and therefore excluded from all future searches.

Further, the user can specify words (phrases) that are to be included in an information search, for example words that are used in everyday language might also be used in the name of a product in which the user is interested, or appear in the title of a book. This would avoid automatic exclusion from an information search of products owing to their product names consisting of common words. Context analysis can be performed on the digital text using state-of-the-art techniques to identify phrases which consist of everyday words, e.g. "Gone with the Wind", "cure for the common cold" etc., but which might as a whole be of interest to the user.
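One possible way of representing such a dictionary is sketched below in Python. The class PhraseDictionary and its members (known, always_search, mark_known, select) are hypothetical names chosen for this example; the sketch assumes simple case-insensitive matching and does not attempt the context analysis mentioned above.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class PhraseDictionary:
    known: Set[str] = field(default_factory=set)          # excluded from searches
    always_search: Set[str] = field(default_factory=set)  # explicitly included by the user

    def is_key_phrase(self, phrase: str) -> bool:
        """A phrase is searched if the user forced it in, or if it is not yet known."""
        p = phrase.lower()
        if p in self.always_search:
            return True
        return p not in self.known

    def mark_known(self, phrase: str) -> None:
        """Enter a phrase in the dictionary so that it is skipped in future searches."""
        self.known.add(phrase.lower())

    def select(self, candidates: Iterable[str]) -> List[str]:
        return [c for c in candidates if self.is_key_phrase(c)]


dictionary = PhraseDictionary(
    known={"wind", "gone with the wind"},
    always_search={"gone with the wind"},
)
print(dictionary.select(["wind", "Gone with the Wind", "podcast"]))
dictionary.mark_known("podcast")  # the user indicates the word is now known
print(dictionary.select(["podcast"]))
```

In this sketch an explicit inclusion takes precedence over the list of known words, so that a title made up of everyday words is still searched.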

In a particularly advantageous embodiment of the invention, the system makes use of a computer network interface to search a computer network for references to the key phrases, for example in the form of URLs, links, etc. The interface can be realised by means of, for example, a modem, ISDN or DSL connection, and any hardware and software required. A further embodiment of the interface might use a wireless connection to make contact with the computer network. The computer network with which the system makes contact might be a local intranet or the world-wide web (internet). The search engine of the system might also make use of the services of existing, possibly more powerful search engines (for example a meta-crawler) to perform parallel searches, thereby minimising the amount of time required to obtain the desired results.

A further preferred embodiment of the invention allows the user to control the manner in which a search is to be carried out and the manner in which the results of the search are to be presented, by specifying a set of preferences, for example, automatic or manual result presentation. Therefore, the system preferably comprises a suitable interface, which may be the same interface as utilized for dictionary access.

In automatic mode, the system might continually perform the information search based on words identified in the audio signal, and update a list of relevant internet pages that the user can choose to view immediately or later on. In a manual mode, the system might continuously display lists of words recently identified on the audio signal that the user can highlight and then choose to initiate an (internet) search based on the highlighted word(s).
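The difference between the two modes might be sketched in Python as follows. The names Mode, AugmentationSession, on_key_phrase and search_highlighted, as well as the stand-in function fake_search, are assumptions of this example; the description does not prescribe any particular interface.

```python
from enum import Enum
from typing import Callable, Iterable, List


class Mode(Enum):
    AUTOMATIC = "automatic"
    MANUAL = "manual"


def fake_search(phrase: str) -> List[str]:
    """Stand-in for a real (internet) search; returns a made-up page title."""
    return [f"page about {phrase}"]


class AugmentationSession:
    """Collects identified phrases and decides when a search is triggered."""

    def __init__(self, mode: Mode, search: Callable[[str], List[str]]):
        self.mode = mode
        self.search = search
        self.recent_phrases: List[str] = []  # list displayed in manual mode
        self.relevant_pages: List[str] = []  # list the user can view now or later

    def on_key_phrase(self, phrase: str) -> None:
        """Called whenever a phrase is identified in the audio signal."""
        self.recent_phrases.append(phrase)
        if self.mode is Mode.AUTOMATIC:
            self._search_and_store(phrase)

    def search_highlighted(self, highlighted: Iterable[str]) -> None:
        """Manual mode: search only the phrases the user highlighted."""
        for phrase in highlighted:
            if phrase in self.recent_phrases:
                self._search_and_store(phrase)

    def _search_and_store(self, phrase: str) -> None:
        for page in self.search(phrase):
            if page not in self.relevant_pages:
                self.relevant_pages.append(page)
```

In automatic mode every identified phrase extends the list of relevant pages at once, whereas in manual mode nothing is searched until the user highlights a phrase from the displayed list.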

The type of information sought on the internet might be specified in the user preferences or may depend on the type of program being viewed. Genre information extracted from the audio-visual signal (electronic program guide) or obtained from an external source, for example a meta-data service provider on the internet, might be used to identify the type of program being received. For example, the user might be watching a history program and may wish to look for educational material relating to the subject matter of the broadcast. Another use could involve internet comparison shopping engines to quickly compare prices of items advertised. Further, the user might wish to learn of the search results as soon as they are available, in which case the user would specify by means of appropriate commands that the results of the information search are to visually overlay the program which is currently being watched on TV, for example by inserting closed captions into the audio-visual signal or by presenting the results in picture-in-picture form; or that the program being listened to or watched is to be interrupted to present the results immediately. The user might require the option of having the results displayed continuously on a separate screen such as a television or computer screen. On the other hand, it might suit the user better to examine the results of the search at a later stage, in which case the user would indicate this by means of entering the appropriate commands to store the search results in a memory until required.

The commands and preferences entered by a user might also be stored in a personal user profile. To this end, the system preferably offers the possibility of storing at least one user profile, more preferably a plurality of user profiles, so that anyone using the system can activate his own previously stored profile without having to enter anew his own preferences each time.

A user profile might cover all preferences about the manner in which the search is to be carried out, the type of information to be searched for, and the manner in which the results are to be presented to the user, for example, whether a user wishes to observe the results immediately in closed-caption form, that certain types of product advertised in commercials are to be excluded from or included in augmentation, and that nature documentaries are to be augmented whereas talk-shows are to be excluded, that a particular search engine or meta-crawler is to be invoked in augmentation, that the augmentation is to be supplemented using information provided by third-party metadata service providers etc. Global preferences might also be stored so that a particular mode might be activated by any user, for example an educational mode, in which documentaries and history programs are singled out for augmentation while excluding other types of program; commercial mode where only the commercials are used to locate information regarding the products and services advertised, etc.
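A stored user profile of this kind might, purely as an example, be represented as in the following Python sketch. The field names, the presentation values and the genre labels are hypothetical; only the idea of per-user stored preferences and genre-based inclusion or exclusion is taken from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class UserProfile:
    name: str
    # Hypothetical presentation values: "overlay", "interrupt", "separate_screen", "store".
    presentation: str = "store"
    included_genres: Set[str] = field(default_factory=set)  # empty set: no restriction
    excluded_genres: Set[str] = field(default_factory=set)

    def should_augment(self, genre: str) -> bool:
        """Exclusions win; otherwise augment if the genre is allowed."""
        if genre in self.excluded_genres:
            return False
        return not self.included_genres or genre in self.included_genres


class ProfileStore:
    """Keeps a plurality of profiles so each user can re-activate his own."""

    def __init__(self) -> None:
        self._profiles: Dict[str, UserProfile] = {}

    def save(self, profile: UserProfile) -> None:
        self._profiles[profile.name] = profile

    def activate(self, name: str) -> UserProfile:
        return self._profiles[name]


store = ProfileStore()
store.save(UserProfile("educational", presentation="overlay",
                       included_genres={"documentary", "history"}))
store.save(UserProfile("no-talk", excluded_genres={"talk-show"}))
```

A global mode such as the educational mode mentioned above can be stored in the same way as a personal profile, as the first saved entry suggests.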

Preferably, the system comprises a comparator which is used to compare qualitatively similar information in the results of the information search. If the key phrase denotes, for example, a new model of car, the information returned might contain items of text relating to technical performance, or details regarding suppliers, or purchase and leasing conditions. It is advantageous to compare similar types of information, so that duplicate information can be removed from the results, and so that intelligent comparisons can be made. The user might wish to have the results sorted according to the type of information they represent, for example, a list of prices might be sorted in ascending order with the most attractive price at the top of the list. The comparator can also organise the search results into a manner suitable for the desired mode of perusal, for example, audio clips for outputting on a radio, video and text material for viewing on a screen, or a document suitable for printing.
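The grouping, duplicate removal and price sorting performed by the comparator might be sketched in Python as follows. The record type SearchHit, its fields and the category labels are assumptions of this example, and "duplicate" is taken here to mean the same category, the same normalised text and the same price; a real comparator would likely need a more tolerant notion of similarity.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class SearchHit:
    kind: str                      # hypothetical categories: "price", "technical", ...
    source: str
    text: str
    price: Optional[float] = None  # only meaningful for kind == "price"


def compare(hits: Iterable[SearchHit]) -> Dict[str, List[SearchHit]]:
    """Group hits by kind, drop duplicates, and sort prices in ascending order."""
    grouped: Dict[str, List[SearchHit]] = defaultdict(list)
    seen = set()
    for hit in hits:
        key = (hit.kind, hit.text.strip().lower(), hit.price)
        if key in seen:
            continue  # same information already present, possibly from another source
        seen.add(key)
        grouped[hit.kind].append(hit)
    if "price" in grouped:
        # Hits without a price are placed last.
        grouped["price"].sort(key=lambda h: float("inf") if h.price is None else h.price)
    return dict(grouped)


hits = [
    SearchHit("price", "shop-b", "Model X", 21000.0),
    SearchHit("price", "shop-a", "Model X", 19500.0),
    SearchHit("technical", "review-1", "0-100 km/h in 8 s"),
    SearchHit("technical", "review-2", "0-100 km/h in 8 s "),
]
result = compare(hits)
```

With these sample records the most attractive price appears at the top of the price list, and the repeated technical statement is kept only once.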

For application of the invention to an audio-visual signal, the invention advantageously comprises an audio identifier to identify the audio content of an audio-visual signal, to facilitate use of such a system in, for example, home entertainment centres which can receive audio-visual signals from various sources (TV, VCR, DVD etc). A copy of the audio content is diverted to the speech-to-text identifier for further processing as already described.

A preferred feature of the invention comprises a computer program for performing all the steps involved in identifying speech in the audio content of the input signal and the key phrases therein, and carrying out an information search concerning the key phrases according to the user's specifications, i.e. most or all of the components of the system, such as speech identifier, speech-to-text converter, key phrase identifier, augmentation module etc. are realised in the form of software and/or hardware modules. Any required software might be encoded on a processor of the audio device, or be encoded on a separate processor, so that an existing audio device might be adapted to benefit from the features of this invention.

Other objects and features of the present invention will become apparent from the following detailed descriptions considered in conjunction with the accompanying drawing. It is to be understood, however, that the drawing is designed solely for the purposes of illustration and not as a definition of the limits of the invention, for which reference should be made to the appended claims.

The sole figure, Fig. 1, is a schematic block diagram of a system for automatic media content augmentation in accordance with an embodiment of the present invention.

In the description of the following figure, which does not exclude other possible realisations of the invention, the system is shown to incorporate an audiovisual device 17, for example a home entertainment system, TV, multimedia device or similar. For the sake of clarity, an interface 14 between the user and the system has been included only schematically in the diagram. It is understood, however, that the system includes a means of interpreting commands issued by the user in the usual manner of a user interface and also means for outputting the audio-visual signal, for example, TV loudspeakers, TV screen etc.

Fig. 1 shows a media content augmentation system 1 in which an audio identifier 15 identifies the audio content of an audio-visual input stream 16, and passes an audio signal 2, which is a copy of the identified audio content, to a speech processing module 4. Meanwhile, the original audio-visual stream 16 is passed to the audio-visual device 17.

The speech processing module 4 comprises a speech identifier 3 which identifies the speech content on the audio signal 2, and a speech-to-text converter 5 which converts the identified speech content to a digital text 6. The digital text 6 is passed on to a key phrase identifier 7. The key phrase identifier 7 performs some initial processing on the digital text 6 and isolates potential words that might be of interest to the user 20. The key phrase identifier 7 performs a check to see whether an identified word is already covered by a dictionary 12, or specifically tagged for exclusion from or inclusion in an information search. A word not already covered by the dictionary 12 and not excluded from a search is a key phrase 19 and is passed on accordingly to the augmentation module 25.

The augmentation module 25 in this example comprises a search engine 8, a comparator 18, and a result compiler 10. The search engine 8 can access an external computer network 9, for example the internet, by means of a computer network interface 13. By means of appropriate commands and parameters, an information search is initiated, and the results of the search are analysed by the comparator 18, which can categorise the results into similar types of information and perform intelligent comparisons. The result compiler 10 is used to compile the results 11 of the search and/or the comparison into a manner suitable for presentation to the user 20. The user 20 might wish to view the results on the television screen of the audio-visual device 17, or he might want a printout or other hard copy of the results.

The user 20 can influence the media augmentation procedure by entering preferences and commands 21 via the user interface 14 of the audio-visual device 17. The preferences and commands 21 are stored in a local database 22 and are used to control the augmentation procedure. The augmentation may be further supplemented by external program genre information 26 obtained from the external computer network 9, e.g. by downloading relevant information from the internet 9 via the augmentation module 25, which passes this information 26 to the key phrase identifier 7. The user may also update the dictionary 12 by specifying words that are to be excluded from or included in an information search.

The system 1 described in this example is shown as an extension of an audio-visual device 17. However, all of the additional components described (automatic speech recognition 4, key phrase identifier 7, dictionary 12, preferences memory 22, augmentation module 25) might be integrated to present a single device along with the audio-visual device 17, or might be realised as part of a personal computer system which is connected to an audio-visual device 17. The system might also be realised, for example, as a set-top box connected to an audio-visual device 17.

Although the present invention has been disclosed in the form of preferred embodiments and variations thereon, it will be understood that numerous additional modifications and variations could be made thereto without departing from the scope of the invention. For example, the dictionaries can be updated or replaced as desired by downloading new versions from the internet. In this way, the media content augmentation system can make use of the most up-to-date data available.

For the sake of clarity, it is to be understood that the use of "a" or "an" throughout this application does not exclude a plurality, and "comprising" does not exclude other steps or elements.

Claims

CLAIMS:

1. A system (1) for performing media content augmentation on an audio signal (2), said system comprising: a speech identifier (3) for identifying speech content in the audio signal (2); a speech-to-text converter (5) for converting the speech content into a digital text format (6); a key phrase identifier (7) for identifying key phrases (19) in the digital text (6); a search engine (8) for searching a source of information (9) for material relating to the key phrases (19); and a search result compiler (10) to provide a user with results (11) of the search.

2. The system of claim 1, wherein the system (1) contains a dictionary (12) to store a list of phrases which are to be included in or excluded from a search for material relating to the key phrases (19).

3. The system of claim 1 or claim 2, containing a computer network interface (13) for locating references to the key phrases in a computer network (9).

4. The system according to any preceding claim, wherein the system (1) comprises an interface (14) for inputting phrases and/or user preferences (21).

5. The system according to any preceding claim, comprising a comparator (18) for comparing similar types of information relating to the key phrases (19) in the located material.

6. The system according to any preceding claim, comprising an audio identifier (15) for identifying the audio content (2) in an audio-visual signal (16).

7. An audio device (17) comprising a system according to any of the preceding claims.

8. A method for automatic media content augmentation of an audio signal (2), which method comprises: identifying the speech content in the audio-visual signal (2); converting the speech content into a digital text format (6); identifying key phrases in the digital text (6); searching a source of information (9) for material relating to the key phrases; providing the user with results of the search.

9. A method according to claim 8 wherein the search of the source of information for augmentation of the key phrases, is performed according to preferences specified in a user profile.

10. A computer program to carry out all the steps of a method according to claim 8 or 9, whereby the computer program is implemented as part of an audio device.