GB2459308A - Creating a metadata enriched digital media file - Google Patents
- Publication number
- GB2459308A (application GB0807180A)
- Authority
- GB
- United Kingdom
- Prior art keywords
- digital
- data
- audio content
- text
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/68—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/683—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/685—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using automatically derived transcript of audio data, e.g. lyrics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7844—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
-
- G06F17/30—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/02—Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
- G11B27/031—Electronic editing of digitised analogue information signals, e.g. audio or video signals
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/10—Indexing; Addressing; Timing or synchronising; Measuring tape travel
- G11B27/11—Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information not detectable on the record carrier
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/10—Indexing; Addressing; Timing or synchronising; Measuring tape travel
- G11B27/19—Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
- G11B27/28—Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/10—Indexing; Addressing; Timing or synchronising; Measuring tape travel
- G11B27/19—Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
- G11B27/28—Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
- G11B27/30—Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording on the same track as the main recording
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/10—Indexing; Addressing; Timing or synchronising; Measuring tape travel
- G11B27/19—Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
- G11B27/28—Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
- G11B27/30—Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording on the same track as the main recording
- G11B27/3027—Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording on the same track as the main recording used signal is digitally coded
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
Abstract
A method for creating a metadata enriched digital media file comprises (i) providing a digital media file (step A1) containing digital audio data, the digital audio data comprising digital speech audio content, (ii) analysing the digital speech audio content (step A2) using a speech-to-text processing engine to identify key text-data in the digital speech audio content, (iii) storing the identified key text-data as metadata (step A3), and (iv) attaching the metadata to the digital media file (step A4). The digital audio data may additionally comprise digital non-speech audio content, and the method may thus further comprise separating the digital speech audio content and the digital non-speech audio content (step A5, Figure 2) prior to analysing the digital speech audio content. The method may additionally comprise analysing the digital non-speech audio content using a non-speech audio processing engine and generating text information descriptive of the analysed digital non-speech audio content (step A6, Figure 2), and may further comprise storing the generated text information as part of the metadata (step A7, Figure 2) before attaching the metadata to the digital media file (step A4, Figure 2).
Description
TITLE
Metadata Enriched Digital Media
TECHNICAL FIELD
The present invention relates generally to metadata enriched digital media. More particularly, embodiments of the present invention relate to a method for creating a metadata enriched digital media file and/or to a computer-implemented method for creating a metadata enriched digital media file and/or to a computer program for creating a metadata enriched digital media file.
BACKGROUND ART
Vast archives of media content, especially video and film content, have been created and established by the media industry over many years (in some cases for over 120 years). Historically, this content has been stored in archives on physical media, such as film or video tape. In order to be able to retrieve relevant content from such archives, it has been known to classify the content by creating a basic description of the content and storing the description on microfiche or paper record cards, for example. Much of the existing content is, however, not classified.
More recently, there has been a trend to store newly created video and film content digitally, in the form of digital media files. Such digital media files typically comprise digital video data, digital audio data, and may also comprise other digital data.
There is also an increasing desire to convert content stored on physical media, such as film or video tapes, into digital media files, for example due to the problems that can be associated with deterioration over time of the physical media and any associated content description, and due to the vast amount of storage space that is required to store physical media.
Digital media content, such as video and film content, is a valuable commercial resource that can be sold, or more commonly licensed, to third parties by the owner of the content, for example the broadcaster in the case of news footage. For example, digital media content can be readily accessed and distributed via a digital data network, such as the internet or a GPRS network, or can be provided on tape or disk.
The cost to the content owner of supplying the digital media content is, therefore, minimal. The income that the content owner can generate from licensing their digital media content is, however, very significant, resulting in a potentially very high profit margin. For example, the income generated by providing worldwide usage rights for a one year period in a recent thirty second news clip may be in the order of £4,000 whilst the income generated by providing usage rights for a news clip of the same length for a one year period in the UK only may be in the order of £500.
Whilst the cost of supplying the digital media content may be minimal relative to the income that can be generated, there are currently significant up-front costs associated with the indexing and classification of the digital media content, whether it is stored digitally from the outset or whether it is converted into digital format from an existing physical medium. For example, in order to make the digital media content readily accessible via a digital data network, it is necessary for a potential purchaser to be able to search for relevant content.
Currently, trained individuals, commonly known as 'shotlisters', are employed to view media content, typically video data and audio data together, and generate a list of keywords relevant to the media content. Those keywords are then embedded as metadata in a digital media file containing the digital media content, and that metadata can then be searched using an appropriate search engine to facilitate retrieval of appropriate digital media files. The ITN Source® website, www.itnsource.com, is an example of an internet-based search engine through which digital media content, created by ITN and other sources, can be purchased directly from ITN as downloadable digital electronic files.
The 'shotlisting' process is expensive and very time consuming. For example, it can take a shotlister between five and ten hours to generate suitable keywords to classify one hour of video footage. Whilst it may be practically and economically feasible to employ shotlisters to view and classify all newly generated media content, the task of viewing and classifying archived media content that needs to be converted into digital format from existing physical media is simply not feasible due to the large volume of such archived media content.
SUMMARY OF THE INVENTION
According to a first aspect of the present invention, there is provided a method for creating a metadata enriched digital media file, the method comprising: (i) providing a digital media file containing digital audio data, the digital audio data comprising digital speech audio content; (ii) analysing the digital speech audio content using a speech-to-text processing engine to identify key text-data in the digital speech audio content; (iii) storing the identified key text-data as metadata; (iv) attaching the metadata to the digital media file.
According to a second aspect of the present invention, there is provided a system for creating a metadata enriched digital media file, the system comprising a processor operable to: (i) provide a digital media file containing digital audio data, the digital audio data comprising digital speech audio content; (ii) analyse the digital speech audio content using a speech-to-text processing engine to identify key text-data in the digital speech audio content; (iii) store the identified key text-data as metadata; (iv) attach the metadata to the digital media file.
According to a third aspect of the present invention, there is provided a computer-implemented method for creating a metadata enriched digital media file, said computer-implemented method comprising: (i) providing a digital media file containing digital audio data, the digital audio data comprising digital speech audio content; (ii) analysing the digital speech audio content using a speech-to-text processing engine to identify key text-data in the digital speech audio content; (iii) storing the identified key text-data as metadata; (iv) attaching the metadata to the digital media file.
According to a fourth aspect of the present invention, there is provided a computer program comprising computer program instructions for creating a metadata enriched digital media file, and comprising: means for providing a digital media file containing digital audio data, the digital audio data comprising digital speech audio content; means for analysing the digital speech audio content using a speech-to-text processing engine to identify key text-data in the digital speech audio content; means for storing the identified key text-data as metadata; means for attaching the metadata to the digital media file.
According to a fifth aspect of the present invention, there is provided a computer program for creating a metadata enriched digital media file, the computer program comprising computer program instructions that when loaded into a computer provide: means for providing a digital media file containing digital audio data, the digital audio data comprising digital speech audio content; means for analysing the digital speech audio content using a speech-to-text processing engine to identify key text-data in the digital speech audio content; means for storing the identified key text-data as metadata; means for attaching the metadata to the digital media file.
According to a sixth aspect of the present invention, there is provided a record medium embodying the computer program defined above.
Optional, and sometimes preferred, features of the invention are defined in the dependent claims.
The text-data may comprise words and/or phrases. The key text-data may therefore comprise key words and/or key phrases. The pre-defined text-data may therefore comprise pre-defined words and/or pre-defined phrases.
DRAWINGS
Figure 1 is a flow diagram illustrating a first embodiment of a method for creating a metadata enriched digital media file according to the present invention; Figure 2 is a flow diagram illustrating a second embodiment of a method for creating a metadata enriched digital media file according to the present invention; and Figure 3 is a flow diagram illustrating a third embodiment of a method for creating a metadata enriched digital media file according to the present invention.
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
Embodiments of the present invention will now be described by way of example only, and with reference to the accompanying drawings.
In the following description there are presented various embodiments of methods for creating a metadata enriched digital media file. In all of the embodiments, the digital media file may be an existing (previously created) digital media file or may be created by digitising existing analogue media content, for example a radio broadcast or a television broadcast such as a film, a news programme, or a documentary.
Alternatively, the digital media file may be created based on a live feed of digital media. For example, most live television broadcasts are now transmitted in the form of a digital media stream. Typically, the digital media file is an MPEG format digital media file, for example an MPEG-2 or an H.264 digital media file. Other digital file formats are, however, entirely within the scope of the present invention.
Depending on the type of media content, the digital media file may comprise only digital audio data, which would typically be the case when the media content is a radio broadcast for example, or may comprise both digital audio data and digital video data, which would typically be the case when the media content is a television broadcast, such as a film, a news programme or a documentary, for example.
Furthermore, the digital audio data may comprise only digital speech audio content or may comprise both digital speech audio content and digital non-speech audio content, as will be discussed in more detail later in this specification.
As discussed in the background art section of this specification, digital media files which comprise digital media content such as radio broadcasts (audio only) or television broadcasts (audio and video) such as films, news programmes, documentaries, and the like, are a valuable commercial resource since the rights in the media content can be sold, and more commonly licensed, to third parties. However, in order to facilitate access to the digital media files, for example by downloading the digital media files via a digital data network such as the internet, it is necessary for potential purchasers to be able to search for relevant media content.
Embodiments of the present invention aim to provide the required searchability by enriching digital media files with metadata which is indicative of the media content of the digital media files.
Figure 1 illustrates one embodiment of a method for creating a metadata enriched digital media file. The method comprises initially providing a digital media file (step A1) as described above. The step of providing the digital media file may, for example, comprise retrieving a previously created digital media file or may comprise creating a digital media file in the manner outlined above.
In the embodiment of Figure 1, the digital media file contains only digital audio data.
For example, the digital media file may be created based on a live or recorded radio broadcast, either analogue or digital, or on a radio broadcast archived in an analogue format. Moreover, the digital audio data comprises only digital speech audio content.
In the case of a radio broadcast, for example, the digital speech audio content typically comprises the speech of a presenter, newsreader, guest, or the like.
After providing the digital media file in step A1, the digital speech audio content of the digital audio data is analysed using a speech-to-text processing engine to thereby identify key text-data in the digital speech audio content (step A2). For example, the speech-to-text processing engine may be configured to identify key text-data in the form of key words and/or key phrases in the digital speech audio content.
After the key text-data has been identified using the speech-to-text processing engine, it is stored as metadata (step A3). Finally, the stored metadata is attached to the digital media file (step A4) created in step A1 to create the metadata enriched digital media file.
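The four steps of Figure 1 could be sketched in outline as follows. This is an illustrative sketch only, not the patented implementation: the speech-to-text engine is represented by a stub returning a fixed transcript, and the function names (`speech_to_text`, `identify_key_text`, `enrich`) and dictionary contents are invented for the example.

```python
import json

def speech_to_text(audio_bytes):
    # Stub standing in for a real speech-to-text processing engine.
    return "the prime minister announced new funding for schools today"

def identify_key_text(transcript, dictionary):
    # Step A2: identify key words present in the converted transcript.
    return [w for w in transcript.split() if w in dictionary]

def enrich(audio_bytes, dictionary):
    transcript = speech_to_text(audio_bytes)   # step A2 (conversion)
    key_text = identify_key_text(transcript, dictionary)
    metadata = {"keywords": key_text}          # step A3 (store as metadata)
    return metadata                            # step A4 would attach this to the file

dictionary = {"minister", "funding", "schools"}
print(json.dumps(enrich(b"...", dictionary)))  # {"keywords": ["minister", "funding", "schools"]}
```

In a real system the attachment of step A4 would embed the metadata in the container format (for example as an MPEG metadata stream) rather than return a dictionary.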
In the above embodiment, the digital audio data contains only digital speech audio content. As indicated above, in some circumstances, it is possible that the digital media file may contain digital audio data that comprises digital non-speech audio content in addition to digital speech audio content. For example, where the digital audio data is derived from an outside news radio broadcast, there is often digital non-speech audio content in the form of background noise as well as digital speech audio content in the form of the commentary provided by the reporter.
In some situations, the non-speech audio content can be indicative of the content of the digital audio data, and hence ultimately indicative of the content of the digital media file containing that digital audio data. For example, the non-speech audio content could comprise explosions, gunfire, or the like, which could be indicative of the fact that the content of the digital media file relates to a war report. In such situations, it may, therefore, be desirable to analyse the digital non-speech audio content in addition to analysing the digital speech audio content to provide further information about the content of the digital media file in addition to the information provided by the analysis of the digital speech audio content, as described above with reference to Figure 1.
Figure 2 illustrates an embodiment of a method for creating a metadata enriched digital media file containing digital audio data that comprises both digital speech audio content and digital non-speech audio content. The method illustrated in Figure 2 is similar to the method illustrated in Figure 1, and corresponding method steps are therefore designated using corresponding reference numerals.
After providing the digital media file in step A1, which contains digital audio data comprising both digital speech audio content and digital non-speech audio content, the digital speech audio content and digital non-speech audio content are separated into separate audio content streams (step A5). After separation, the stream of digital speech audio content is analysed using a speech-to-text processing engine in the manner described above with reference to Figure 1 (step A2) to identify key text-data in the digital speech audio content. Optionally, the stream of digital non-speech audio content is analysed separately, using a non-speech audio processing engine (step A6).
The non-speech audio processing engine generates text information that is descriptive of the analysed digital non-speech audio content. For example, where the non-speech audio processing engine identifies that the non-speech audio content is an explosion or gunfire, the generated text information may be the keywords 'explosion' and/or 'gunfire'.
Where the optional analysis of the digital non-speech audio content is carried out using the non-speech audio processing engine, both the key text-data identified by the speech-to-text processing engine in step A2 and the text information generated by the non-speech audio processing engine in step A6 are stored as metadata in step A7, before the stored metadata is finally attached to the digital media file in step A4. Of course, if the optional analysis of the digital non-speech audio content is omitted from the method described with reference to Figure 2, only the key text-data identified by the speech-to-text processing engine in step A2 is stored as metadata, and step A7 thus corresponds to step A3 described with reference to Figure 1.
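Steps A5 to A7 of Figure 2 might be sketched as follows. The frame labels, the event-to-keyword table, and the function names are hypothetical; a real non-speech audio processing engine would classify audio signals rather than ready-labelled strings.

```python
def separate(frames):
    # Step A5: route labelled audio frames into speech and non-speech streams.
    speech = [f for kind, f in frames if kind == "speech"]
    non_speech = [f for kind, f in frames if kind != "speech"]
    return speech, non_speech

def describe_non_speech(non_speech):
    # Step A6: generate descriptive text for recognised non-speech events.
    labels = {"bang": "explosion", "rattle": "gunfire"}
    return sorted({labels[f] for f in non_speech if f in labels})

frames = [("speech", "reporting live"), ("noise", "bang"), ("noise", "rattle")]
speech, non_speech = separate(frames)
# Step A7: store both kinds of text as metadata before attachment (step A4).
metadata = {"keywords": speech, "descriptions": describe_non_speech(non_speech)}
print(metadata)  # {'keywords': ['reporting live'], 'descriptions': ['explosion', 'gunfire']}
```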
As indicated above, there are situations in which the digital media file may contain both digital video data and digital audio data, the latter comprising at least digital speech audio content and possibly also comprising digital non-speech audio content.
Referring now to Figure 3, there is illustrated an embodiment of a method for creating a metadata enriched digital media file when the digital media file contains both digital video data and digital audio data. Some of the steps of the method illustrated in Figure 3 correspond to the steps described above with reference to Figures 1 and 2, and corresponding method steps are therefore designated using corresponding reference numerals.
As discussed above, typical examples of digital media files containing both digital video data and digital audio data are derived from (but are of course not limited to) television news broadcasts, documentaries, films, and the like. In the case of a studio-based television news broadcast, for example, the digital media file will typically contain digital video data and digital audio data comprising only digital speech audio content. On the other hand, in the case of an outside television news broadcast, the digital media file will typically contain digital video data and digital audio data comprising both digital speech audio content and digital non-speech audio content.
In accordance with the method illustrated in Figure 3, a digital media file is initially provided (step A1) containing digital video data and digital audio data. In the illustrated embodiment, the digital audio data comprises both digital speech audio content and digital non-speech audio content, and such a digital media file may, therefore, be derived from an outside television news broadcast, for example.
After providing the digital media file in step A1, the digital video data is optionally analysed in step A8 to partition it into individual digital video data portions.
Practically, the individual digital video data portions correspond to individual 'shots' within the digital video data. For example, where the digital media file is derived from a television news broadcast, one shot may relate to a first news article, another shot may relate to a second news article, and so on. Partitioning of the digital video data in this manner is highly advantageous since it enables particular metadata to be associated with a particular portion of the digital media file provided in step A1, and thus enables a relevant portion of the metadata enriched digital media file to be retrieved and purchased, for example by downloading via a digital data network such as the internet.
Both the digital video data and the digital audio data of the digital media file share a common time datum, and in accordance with the method of Figure 3, the digital audio data is also partitioned into corresponding individual digital audio data portions in step A8. Each partitioned individual digital video data portion thus has an associated individual digital audio data portion.
As indicated above, partitioning of the digital video data, and hence of the digital audio data, into individual digital video data portions and individual digital audio data portions, is optional, and may be omitted from the method described above with reference to Figure 3 if it is not necessary. For example, partitioning may be unnecessary if the content of the digital media file is of a very short length in time.
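The partitioning of step A8, exploiting the common time datum shared by the video and audio data, can be illustrated as below. Shot boundaries are supplied explicitly here for simplicity; an actual system would detect them in the digital video data.

```python
def partition(shot_boundaries, audio_samples, sample_rate):
    # For each shot (start_s, end_s) in seconds, slice the matching audio
    # samples using the shared time datum.
    portions = []
    for start_s, end_s in shot_boundaries:
        lo, hi = int(start_s * sample_rate), int(end_s * sample_rate)
        portions.append({"shot": (start_s, end_s), "audio": audio_samples[lo:hi]})
    return portions

audio = list(range(10))  # 10 samples at a notional 1 Hz, for illustration
parts = partition([(0, 4), (4, 10)], audio, sample_rate=1)
print([p["audio"] for p in parts])  # [[0, 1, 2, 3], [4, 5, 6, 7, 8, 9]]
```

Because each audio portion carries the time range of its shot, metadata mined from that portion can later be associated with the correct part of the file.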
After the partitioning has taken place in step A8, or directly after providing the digital media file in step A1 if the partitioning step is omitted, the digital video data and the digital audio data are separated in step A9. For example, this separation can be carried out by demultiplexing the digital audio data packets from the digital video data packets using their packet identifiers.
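The separation of step A9 by packet identifier can be sketched as follows, in the style of MPEG transport stream demultiplexing. The PID values used are arbitrary example values, not values taken from the patent.

```python
VIDEO_PID, AUDIO_PID = 0x100, 0x101  # arbitrary example packet identifiers

def demultiplex(packets):
    # Step A9: split the multiplexed stream by packet identifier (PID).
    audio = [payload for pid, payload in packets if pid == AUDIO_PID]
    video = [payload for pid, payload in packets if pid == VIDEO_PID]
    return audio, video

stream = [(0x100, b"v0"), (0x101, b"a0"), (0x100, b"v1"), (0x101, b"a1")]
audio, video = demultiplex(stream)
print(audio, video)  # [b'a0', b'a1'] [b'v0', b'v1']
```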
The remainder of the method of the embodiment of Figure 3 is the same as that described above with reference to Figure 2. Specifically, in step A5, the partitioned digital audio data is separated into digital speech audio content and digital non-speech audio content. The digital speech audio content is then analysed in step A2, using a speech-to-text processing engine, to identify key text-data in the digital speech audio content, and the digital non-speech audio content is optionally analysed in step A6, using a non-speech audio processing engine, to generate text information that is descriptive of the analysed digital non-speech audio content. The identified key text-data and optionally the generated text information (when optional step A6 is performed) are then stored as metadata in step A7, before the metadata is finally attached to the digital media file in step A4.
Although in the embodiment described above with reference to Figure 3 the digital media file provided in step A1 contains digital audio data comprising both digital speech audio content and digital non-speech audio content, it is perfectly feasible that the digital audio data could comprise only digital speech audio content, such that the resultant digital media file would contain digital video data and digital audio data comprising digital speech audio content alone. In this case, the method described above with reference to Figure 3 would be adapted by omitting steps A5 (the separation of the digital speech audio content and the digital non-speech audio content) and A6 (the analysis of the digital non-speech audio content). Consequently, in step A7, only the key text-data identified from the digital speech audio content by the speech-to-text processing engine (in step A2) would be stored as metadata, before that metadata was finally attached to the digital media file in step A4.
In some or all of the above described embodiments, the speech-to-text processing engine is desirably configured so that it is operable to analyse the digital speech audio content in step A2 at a speed much faster than real-time. This is particularly advantageous where existing archived media content is being converted into digital media files for storage in digital format, since it enables large quantities of media content to be rapidly and efficiently converted into metadata enriched digital media files.
In typical embodiments, the speech-to-text processing engine is operable to convert the digital speech audio content into text and thereafter data-mine the converted text to identify key text-data in the converted text. Data-mining of the converted text is typically carried out with reference to a dictionary containing key text-data, such as key words and/or key phrases, for example. The dictionary may, for example, be a 'Master' dictionary.
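By way of example, the data-mining of converted text against such a dictionary might be sketched as follows. The dictionary entries below are invented for the purposes of illustration only.

```python
import re

# Illustrative 'Master' dictionary of key words and key phrases;
# the entries are invented for this sketch.
MASTER_DICTIONARY = {"election", "parliament", "prime minister"}

def data_mine(converted_text, dictionary):
    """Return the key text-data from the dictionary found in the converted text."""
    text = converted_text.lower()
    found = []
    for entry in sorted(dictionary):
        # match whole words or phrases only, ignoring case
        if re.search(r"\b" + re.escape(entry) + r"\b", text):
            found.append(entry)
    return found
```

The key text-data returned by such a routine would then be stored as metadata in step A7.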
In some embodiments, the speech-to-text processing engine may comprise a plurality of contextual dictionaries, in addition to the aforesaid 'Master' dictionary, containing key text-data relevant to a particular context, such as politics or science to name but two examples. If converted digital speech audio content is data-mined based on the key text-data contained in a relevant contextual dictionary, the accuracy of the identification of key text-data in the digital audio content is improved.
Each of the contextual dictionaries is typically associated with pre-defined text data, such as pre-defined words or phrases, relevant to the particular context, and each of the contextual dictionaries is actuable in response to the identification of the pre-defined text data by the speech-to-text processing engine.
Embodiments of the method may thus comprise analysing the digital speech audio content using the speech-to-text processing engine to identify the presence of said pre-defined text data in the digital speech audio content. For example, the speech-to-text processing engine may be configured to analyse all of the digital speech audio content to identify whether said pre-defined text data is present. Alternatively, and perhaps more desirably, in order to increase the speed of analysis the speech-to-text processing engine may be configured to analyse only a portion of the digital speech audio content to identify whether said pre-defined text data is present. This latter proposition is quite realistic since, if the portion of the digital speech audio content that is analysed is an initial portion, there is a high probability that any pre-defined text data will be present in the initial portion. For example, if the digital media file is a news article, the pre-defined text data is likely to be present in the digital speech audio content of the digital audio data at the beginning of the news article to properly introduce the news article.
In the event that pre-defined text data is identified by the speech-to-text processing engine, the contextual dictionary associated with the identified pre-defined text data is activated. Thereafter, the digital speech audio content is converted into text by the speech-to-text processing engine and data-mined to identify key text-data contained in the activated contextual dictionary.
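By way of a non-limiting illustration, the activation of a contextual dictionary based on pre-defined text data found in an initial portion of the transcript might be sketched as follows. All trigger phrases, dictionary names and dictionary entries are invented for the example.

```python
# Sketch of contextual dictionary activation. Each contextual dictionary is
# associated with pre-defined trigger text; identifying that text in an
# initial portion of the transcript activates the dictionary. All trigger
# phrases and dictionary entries here are invented for illustration.
CONTEXTUAL_DICTIONARIES = {
    "politics": {"triggers": {"westminster", "general election"},
                 "keys": {"manifesto", "ballot", "coalition"}},
    "science":  {"triggers": {"laboratory", "researchers"},
                 "keys": {"experiment", "hypothesis", "peer review"}},
}

def activate_dictionary(transcript, initial_words=50):
    """Analyse only an initial portion of the transcript for trigger text."""
    portion = " ".join(transcript.lower().split()[:initial_words])
    for name, dictionary in CONTEXTUAL_DICTIONARIES.items():
        if any(trigger in portion for trigger in dictionary["triggers"]):
            return name  # this contextual dictionary is now activated
    return None  # fall back to the 'Master' dictionary alone

def mine_with_context(transcript):
    """Data-mine the full transcript using the activated contextual dictionary."""
    context = activate_dictionary(transcript)
    if context is None:
        return []
    keys = CONTEXTUAL_DICTIONARIES[context]["keys"]
    return sorted(k for k in keys if k in transcript.lower())
```

Restricting the trigger search to an initial portion reflects the speed advantage discussed above for news-article-style content.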
As described above, in the embodiments of Figures 2 and 3 in which the digital media file contains digital audio data comprising digital speech audio content and digital non-speech audio content, separation of the digital speech audio content and the digital non-speech audio content is carried out in step A5. In addition to being desirable from the point of view of enhancing the analysis of the digital speech audio content by the speech-to-text processing engine due to the elimination of background noise, the optional analysis of the digital non-speech audio content by the non-speech audio processing engine may provide further information (the aforesaid generated text information) which is descriptive of the content of the digital media file and which therefore enhances the metadata that is attached to the digital media file in step A4.
Perfect separation of a mixed source of digital speech audio content and digital non-speech audio content may be difficult. Typically, however, the digital speech audio content and digital non-speech audio content are separated into separate channels using the time or frequency domains, or a combination of the two. The separated digital non-speech audio content is then subtracted from the original mixed source of digital speech audio content and digital non-speech audio content to leave only digital speech audio content which is passed to the speech-to-text processing engine for analysis in step A2, as aforesaid. As indicated above, the separated digital non-speech audio content is optionally also passed to the non-speech audio processing engine for analysis in step A6, for example if there is a desire to generate text information descriptive of the content of the digital media file based on the digital non-speech audio content of the digital audio data.
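The subtraction step described above can be sketched in outline as follows. This is a deliberately minimal time-domain illustration which assumes the separated non-speech channel is already sample-aligned with the mixed source; a practical system would form its estimate in the time or frequency domain as described, rather than being handed an ideal aligned channel.

```python
# Minimal time-domain sketch of the subtraction described above: given the
# mixed source and the separated non-speech channel (assumed sample-aligned),
# the non-speech samples are subtracted to leave the speech channel.
def subtract_non_speech(mixed, non_speech):
    """Subtract separated non-speech samples from the mixed signal."""
    if len(mixed) != len(non_speech):
        raise ValueError("channels must be sample-aligned")
    return [m - n for m, n in zip(mixed, non_speech)]
```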
Where analysis of the separated digital non-speech audio content is performed using a non-speech audio processing engine, any suitable methodology or combination of methodologies may be used to perform the analysis. The analysis may be performed using time-based features, for example, such as the High Zero-Crossing Ratio, the Low Short Energy Ratio, the Noise Frame Ratio or the Attack, Decay, Sustain, Release (ADSR) envelope. Alternatively or additionally, the analysis may be performed using frequency-based features, for example, such as the Mel Frequency Cepstral Coefficients, Fast Fourier Transform or Wavelet Transform.
Typically, a number of methodologies are employed and the results combined to provide a 'classifier' on the basis of which a sound can be identified, and each classifier has text information associated therewith that is descriptive of the identified sound. It is this descriptive text information that is generated by the non-speech audio processing engine in optional step A6 and stored as part of the metadata in step A7.
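By way of example, a single time-based feature (a zero-crossing ratio) feeding a classifier with associated descriptive text might be sketched as follows. The threshold and the label-to-text mapping are invented for the example; a practical engine would combine several of the methodologies listed above.

```python
# Sketch of a time-based feature combined into a 'classifier' whose label
# has descriptive text information associated with it. The threshold and
# the label-to-text mapping below are invented for illustration only.
CLASSIFIER_TEXT = {
    "speech-like": "spoken dialogue",
    "noise-like": "background noise or effects",
}

def zero_crossing_ratio(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / max(len(frame) - 1, 1)

def classify_frame(frame, zcr_threshold=0.5):
    """Crude single-feature classifier: a high ZCR is treated as noise-like."""
    label = ("noise-like"
             if zero_crossing_ratio(frame) > zcr_threshold
             else "speech-like")
    return label, CLASSIFIER_TEXT[label]
```

It is the returned descriptive text (not the raw label) that would be stored as part of the metadata in step A7.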
As mentioned above in relation to the embodiment described with reference to Figure 3, in some circumstances it can be desirable to partition the digital video data into individual digital video data portions, to thereby permit partitioning of the digital audio data into corresponding individual digital audio data portions due to the common time datum shared by the digital video data and the digital audio data. Any suitable methodology may be employed to effect said partitioning.
One suitable methodology may be based on the detection of intra-coded frames (I-frames) in the digital video data. For example, I-frames are typically present in digital video data every 15 frames, and each I-frame, together with the predictive-coded frames (P-frames) and bidirectionally-predictive-coded frames (B-frames) that follow it, forms a group of pictures. An I-frame can, however, be inserted during creation of the digital media file using a suitable algorithm whenever there is a shot transition, and the partitioning of the digital video data into individual digital video data portions in step A8 can thus be implemented based on the detection of I-frames. Alternatively, if an I-frame detection methodology is not available, other methodologies such as shot boundary detection, hue histogram comparison and discrete cosine transformation (DCT) clustering may be employed.
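Partitioning at detected I-frames might be sketched as follows. Frames are modelled here as simple type labels ("I", "P", "B"); a real implementation would parse frame headers from the coded bitstream rather than receiving labels directly.

```python
# Sketch of step A8: partitioning digital video data into individual
# digital video data portions at I-frames. Frames are modelled as type
# labels; real code would detect frame types in the coded bitstream.
def partition_at_i_frames(frames):
    """Split a frame sequence into portions, each starting at an I-frame."""
    portions = []
    current = []
    for frame in frames:
        if frame == "I" and current:
            portions.append(current)  # close the previous group of pictures
            current = []
        current.append(frame)
    if current:
        portions.append(current)
    return portions
```

Because the digital audio data shares a common time datum with the digital video data, the boundaries found here can also be applied to partition the digital audio data into corresponding portions.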
As discussed earlier in this specification, it would be desirable to enable the content of the metadata enriched digital media files created in accordance with the methods of any of the embodiments of the present invention to be readily searched on the basis of key text-data, such as key words and/or key phrases, to facilitate the identification of digital media files with content relevant to a particular subject.
In order to enhance this searchability, embodiments of the method may optionally comprise storing the key text-data identified in step A2, and optionally the text information generated in step A6 where this step is performed, in a markup language-based format, for example as an Extensible Markup Language (XML) file. Such a file may be readily incorporated into a searchable database which may, as indicated above, be accessible via a digital data network such as the internet. Simple searches based on key text-data, such as key words and/or key phrases, can thus be performed via the searchable database, which is desirably configured to then allow the purchase of any identified digital media files for use, for example by downloading the digital media files, typically on a licence-based arrangement. Such an arrangement may, for example, specify the territories in which use of the digital media file is permitted and the period of time for which such use is permitted.
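By way of a non-limiting illustration, storing the identified key text-data and generated text information in a markup language-based format might be sketched as follows. The element names used here are invented; the method requires only that the format be markup-based, XML being one example.

```python
# Sketch of storing identified key text-data (and optionally the generated
# text information from step A6) in a markup language-based format. The
# element names are invented for this illustration.
import xml.etree.ElementTree as ET

def build_metadata_xml(key_text_data, text_information=None):
    """Serialise key text-data and optional text information as XML."""
    root = ET.Element("metadata")
    keys = ET.SubElement(root, "key-text-data")
    for entry in key_text_data:
        ET.SubElement(keys, "entry").text = entry
    if text_information:
        info = ET.SubElement(root, "text-information")
        for line in text_information:
            ET.SubElement(info, "entry").text = line
    return ET.tostring(root, encoding="unicode")
```

A file produced in this way could then be indexed by the searchable database described above.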
As will be understood by persons skilled in the art, the methods according to the different embodiments of the present invention are typically implemented in the form of a computer program comprising computer program instructions for creating a metadata enriched digital media file.
Although embodiments of the present invention have been described in the preceding paragraphs with reference to various examples, it should be understood that various modifications may be made to those examples without departing from the scope of the present invention, as claimed.
Claims (16)
CLAIMS
- 1. A method for creating a metadata enriched digital media file, the method comprising: (i) providing a digital media file containing digital audio data, the digital audio data comprising digital speech audio content; (ii) analysing the digital speech audio content using a speech-to-text processing engine to identify key text-data in the digital speech audio content; (iii) storing the identified key text-data as metadata; (iv) attaching the metadata to the digital media file.
- 2. A method according to claim 1, wherein during step (ii), the speech-to-text processing engine is operable to convert the digital speech audio content into text and to data-mine the converted text to identify key text-data in the converted text.
- 3. A method according to claim 1 or claim 2, wherein the digital media file additionally comprises digital video data, and the method comprises separating the digital video data and the digital audio data prior to said step of analysing the digital speech audio content of the digital audio data.
- 4. A method according to any preceding claim, wherein the digital audio data additionally comprises digital non-speech audio content, and the method comprises separating the digital speech audio content and the digital non-speech audio content prior to said step of analysing the digital speech audio content.
- 5. A method according to claim 4 when dependent on claim 3, wherein the method comprises separating the digital speech audio content and the digital non-speech audio content after said step of separating the digital video data and the digital audio data.
- 6. A method according to claim 3, claim 4 when dependent on claim 3, or claim 5, wherein the method comprises analysing the digital video data to partition the digital video data into individual digital video data portions, and partitioning the digital audio data associated with the digital video data into corresponding individual digital audio data portions, said step of analysing the digital video data being performed before said step of separating the digital video data and the digital audio data.
- 7. A method according to any preceding claim, wherein the speech-to-text processing engine comprises a plurality of contextual dictionaries each of which is associated with pre-defined text data and actuable in response to the identification of the pre-defined text data by the speech-to-text processing engine, and wherein the method comprises: analysing at least part of the digital speech audio content using the speech-to-text processing engine to identify pre-defined text-data in the digital speech audio content; upon identification of pre-defined text data, activating the contextual dictionary associated with the identified predefined text-data.
- 8. A method according to claim 7, wherein the method comprises performing said analysis of at least part of the digital speech audio content and activating the contextual dictionary prior to performing steps (ii) to (iv) such that during step (ii), the digital speech audio content is analysed using the speech-to-text processing engine to identify key text-data contained in the activated contextual dictionary.
- 9. A method according to any of claims 4 to 8, wherein the method further comprises analysing the digital non-speech audio content using a non-speech audio processing engine and generating text information descriptive of the analysed digital non-speech audio content using said processing engine.
- 10. A method according to claim 9, wherein the generated text information is stored as part of the metadata and attached to the digital media file during step (iv).
- 11. A method according to any preceding claim, wherein the method comprises storing the identified key text-data in a markup language-based format.
- 12. A method according to claim 9, claim 10, or claim 11, when dependent or ultimately dependent on claim 9, wherein the method comprises storing the generated text information in a markup language-based format.
- 13. A computer program comprising computer program instructions for creating a metadata enriched digital media file, and comprising: means for providing a digital media file containing digital audio data, the digital audio data comprising digital speech audio content; means for analysing the digital speech audio content using a speech-to-text processing engine to identify key text-data in the digital speech audio content; means for storing the identified key text-data as metadata; means for attaching the metadata to the digital media file.
- 14. A computer program for creating a metadata enriched digital media file, the computer program comprising computer program instructions that when loaded into a computer provide: means for providing a digital media file containing digital audio data, the digital audio data comprising digital speech audio content; means for analysing the digital speech audio content using a speech-to-text processing engine to identify key text-data in the digital speech audio content; means for storing the identified key text-data as metadata; means for attaching the metadata to the digital media file.
- 15. A record medium embodying the computer program defined in claim 13 or claim 14.
- 16. A method for creating a metadata enriched digital media file substantially as hereinbefore described and/or as shown in the accompanying drawings.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB0807180A GB2459308A (en) | 2008-04-18 | 2008-04-18 | Creating a metadata enriched digital media file |
PCT/GB2009/000791 WO2009127805A1 (en) | 2008-04-18 | 2009-03-26 | Metadata enriched digital media |
Publications (2)
Publication Number | Publication Date |
---|---|
GB0807180D0 GB0807180D0 (en) | 2008-05-21 |
GB2459308A true GB2459308A (en) | 2009-10-21 |
Family
ID=39472410
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB0807180A Withdrawn GB2459308A (en) | 2008-04-18 | 2008-04-18 | Creating a metadata enriched digital media file |
Country Status (2)
Country | Link |
---|---|
GB (1) | GB2459308A (en) |
WO (1) | WO2009127805A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024091780A1 (en) * | 2022-10-24 | 2024-05-02 | Qualcomm Incorporated | Sound search |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114531557B (en) * | 2022-01-25 | 2024-03-29 | 深圳佳力拓科技有限公司 | Digital television signal acquisition method and device based on mixed data packet |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1378911A1 (en) * | 2002-07-02 | 2004-01-07 | RAI RADIOTELEVISIONE ITALIANA (S.p.A.) | Metadata generator device for identifying and indexing of audiovisual material in a video camera |
WO2004109549A2 (en) * | 2003-06-05 | 2004-12-16 | Koninklijke Philips Electronics N. V. | System and method for performing media content augmentation on an audio signal |
US6833865B1 (en) * | 1998-09-01 | 2004-12-21 | Virage, Inc. | Embedded metadata engines in digital capture devices |
US6845485B1 (en) * | 1999-07-15 | 2005-01-18 | Hotv, Inc. | Method and apparatus for indicating story-line changes by mining closed-caption-text |
WO2005086029A1 (en) * | 2004-03-03 | 2005-09-15 | British Telecommunications Public Limited Company | Data handling system |
US20070174326A1 (en) * | 2006-01-24 | 2007-07-26 | Microsoft Corporation | Application of metadata to digital media |
US20070250526A1 (en) * | 2006-04-24 | 2007-10-25 | Hanna Michael S | Using speech to text functionality to create specific user generated content metadata for digital content files (eg images) during capture, review, and/or playback process |
EP1876596A2 (en) * | 2006-07-06 | 2008-01-09 | Samsung Electronics Co., Ltd. | Recording and reproducing data |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2008021459A2 (en) * | 2006-08-17 | 2008-02-21 | Anchorfree, Inc. | Software web crawlwer and method thereof |
Also Published As
Publication number | Publication date |
---|---|
WO2009127805A1 (en) | 2009-10-22 |
GB0807180D0 (en) | 2008-05-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) |