WO2006077196A1 - Method for generating a text-based index from a voice annotation

Info

Publication number: WO2006077196A1
Application number: PCT/EP2006/050193
Authority: WO
Inventors: Delphine Charlet; Michel Plu
Original assignee: France Telecom
Priority date: 2005-01-19
Filing date: 2006-01-12
Publication date: 2006-07-27
Also published as: EP1839213A1

Abstract

The invention relates to a method for generating a text-based index (16) associated with images from a voice annotation (15). The inventive method consists in carrying out a speech recognition (25) applied to the voice annotation (15) with a predetermined recognisable vocabulary (17) in such a way that it makes it possible to search at least one word contained in the recognisable vocabulary (17) in the voice annotation, wherein said word or words identified during the search form the text-based index (16). Said method also comprises a step for defining the recognisable vocabulary consisting in defining (22) a context and in searching (24) a list of words associated with the context in a language system and in forming the recognisable vocabulary (17).

Description

METHOD FOR GENERATING A TEXTUAL INDEX FROM A VOICE ANNOTATION

1. Field of the invention

The field of the invention is that of image management. More specifically, the invention relates to a technique for producing textual indexes (also called textual annotations) and associating them with images. The invention applies in particular, but not exclusively, to the association of textual indexes with an image or a video sequence taken by a portable electronic device such as a digital camera, a digital video camera, a mobile telephone or a handheld computer.

2. Solutions and disadvantages of the prior art

In general, digital photography has profoundly changed access to images. Indeed, the multiplication of electronic devices with a "digital shooting" function and the zero cost associated with each shot have led to the proliferation of digital photos. Currently, more and more users have a digital camera allowing them to take and store a large number of images in a memory that is removable or integrated in the device.

Generally, descriptor indexes (hereinafter also referred to as descriptor elements) are used to facilitate the management of these many images. Two methods of description are known in the state of the art for producing and associating descriptor indexes with images.

The "objective" description by image analysis is a first method, according to which the analysis of an image makes it possible to provide descriptor elements, for example of the portrait, countryside landscape, sea, mountain, etc. type. This first method also allows the recognition of a person or monument contained in a reference dictionary and thus, where applicable, the corresponding descriptor indexes to be obtained.

The "intentional" description by textual or vocal annotation is a second method, according to which the user annotates a photograph so as to state what seems most relevant to him and/or what may be missing from the image, for example a family relationship, the age of a person, etc. It is well known in the prior art that the "intentional" description can be used as a complement to the "objective" description.

These description methods, used alone or in combination, have been an important advance in the mechanisms for producing descriptor elements and managing images. However, they have a number of disadvantages.

First of all, the "objective" description by image analysis only makes it possible to generate descriptor indexes concerning the strict content of the photo, without any other information.

Another disadvantage of this first known method lies in the fact that performance in the recognition of faces, monuments, etc. is still quite limited.

Contrary to the "objective" description by image analysis, the "intentional" description by annotation is a technique for producing descriptor indexes that makes it possible to acquire and provide annotations relating to the details of digital images and/or the meaning given to them by the user.

Nevertheless, when a textual annotation is performed almost simultaneously with the shooting, the ergonomics of this second known method are limited by the fact that the user must, in shooting conditions that are sometimes uncomfortable and difficult (rain, snow, etc.), enter the annotation manually using an alphanumeric keyboard, which is usually small.

On the other hand, when a textual annotation is carried out after the shooting, for example at the moment when a user classifies his many photos in a digital album by means of a personal computer, this method suffers from the user's limited knowledge of the content of each photo and from the number of photos to index. Indeed, since this classification can be done long after the shooting, the user is forced to look at each photo to determine its content and produce its textual annotation. Moreover, the user may not remember what was photographed, for example the name of a particular monument. In addition, the person who classifies the photos in the digital album may not be the user who took the different shots, so this person does not necessarily know the exact content of each photo and can therefore produce irrelevant or erroneous textual annotations.

To remedy these problems, it is traditionally envisaged to use an "intentional" description by voice annotations. This allows a user, for example, to record a sentence just after shooting, by means of a microphone embedded in a device of the mobile phone, digital camera, etc. type.

Thus, the user knows what he has just photographed, and recording a voice annotation is a much simpler and more ergonomic task than entering a text annotation. The difficulty is, however, shifted to the exploitation of these voice annotations. Indeed, it is not enough to store voice annotations associated with the photos; it is necessary to produce textual indexes from these voice annotations.

For this purpose, it is necessary to perform voice recognition on each voice annotation, to transcribe the entire voice annotation or only one or more keywords contained in it. Several modes can be considered to perform this speech recognition.

Among these modes of processing, we find:

a "phonetic indexing" mode, as described in particular in the following publication: Ferrieux, A. & Peillon, S.: "Phoneme-level indexing for fast and vocabulary-independent voice/voice retrieval", ESCA ETRW workshop on Accessing Information in Spoken Audio, Cambridge, England, 19-20 April 1999, pp. 60-63;

a "keyword detection" mode, as described in particular in the following publication: Rose, R.C., Paul, D.B.: "A Hidden Markov Model Based Keyword Recognition System", ICASSP 1990, pp. 129-132; and

a "speech-to-text transcription" mode (also called "oral-written transcription"), as described in particular in the following publication: Makhoul et al., "Speech and Language Technologies for Audio Indexing and Retrieval", Proceedings of the IEEE, vol. 88, no. 8, August 2000, pp. 1338-1353.

The current technique of generating text indexes from voice annotations, by applying a voice recognition to these voice annotations, has drawbacks.

Indeed, in "phonetic indexing" mode, as in "keyword detection" mode, it is essential to define the words or expressions that are being sought (also called the searched vocabulary). Moreover, in "oral-written transcription" mode, it is also necessary to define the complete model of the vocabulary that one wishes to transcribe (and the relations between the words of the vocabulary, called the language model). Since the performance of voice recognition depends on the size of the vocabulary to be recognized, it is not realistic to search voice annotations for "all possible words", even supposing such a list exists. The user who wishes to exploit his voice annotations (to generate text indexes by voice recognition) is compelled to establish, for each voice annotation, a list of keywords or expressions that he is looking for (or, more precisely, that the voice recognition is to look for) in the annotation. In other words, the user must himself define, for each voice annotation, a vocabulary to be recognized in this voice annotation.

The disadvantage of this current solution is that it is a task that can be extremely tedious and has an intrinsic limit: the user can find in his annotation only the keywords he is looking for. Moreover, since this search is done after the shooting, the user may have forgotten the precise names of the monuments he photographed (or may not know which monuments were photographed, if the person doing the classification is not the one who took the photos). As a result, he will not be able to search for these specific names by voice recognition.

In addition, and particularly for the reasons set out above, this method of producing text indexes from voice annotations is not optimal.

3. Objectives of the invention

The object of the invention is notably to overcome these disadvantages of the prior art. More specifically, an objective of the invention is to provide a technique for generating reliable textual indexes from voice annotations, which is simple and effective to implement, especially in terms of defining the vocabulary to be recognized by the voice recognition performed on these voice annotations.

Another object of the invention is to provide such a technique, which is ergonomic and eliminates, or at least limits, the manual input operations to be performed by the user to define the vocabulary to be recognized by voice recognition.

Yet another object of the invention is to provide such a technique, which is particularly well suited to users wishing, after the shooting, to search and/or easily sort a multitude of images without having to remember the exact content of each image.

The invention also aims to provide such a technique which, in at least one embodiment, is inexpensive and compatible with all existing digital cameras.

4. Summary of the invention

These objectives, as well as others that will appear later, are achieved by means of a method of generating at least one text index associated with a set of images comprising at least one image, starting from at least one voice annotation previously associated with said set of images, said method comprising a voice recognition step applied to said at least one voice annotation with a predetermined vocabulary to be recognized, so as to search said at least one voice annotation for at least one word contained in said vocabulary to be recognized, the one or more words identified during the search forming said text index.

According to the invention, such a method advantageously comprises a step of defining said vocabulary to be recognized, which itself comprises the following steps: definition of a context; search in a linguistic system for a list of words associated with said context and forming said vocabulary to be recognized.

Thus, the invention is based on an entirely new and inventive approach to the definition of the vocabulary to be recognized. Indeed, this definition of the vocabulary to be searched is done automatically, from a context.

The user is freed from this tedious task. It should be noted that, although the user can be involved in defining the context (for example to provide a theme for a series of photos), this is not mandatory and is in any case much less restrictive than having to define the vocabulary to be searched by himself (as in the current technique).

It should be noted that the present invention covers both the case in which one or more words of the vocabulary to be recognized can be identified in the voice annotation and the case in which no word of the vocabulary to be recognized is present (and therefore none could be identified) in the voice annotation.

Indeed, it is also a result of knowing that there are no vocabulary words to recognize in the voice annotation. Furthermore, when a user has previously recorded and associated a voice annotation to a set of photos, the textual indexes produced according to the invention are assigned to this set of photos.

According to an advantageous aspect of the invention, said set of images comprises at least one photo and/or at least one video sequence.

Preferably, said context comprises at least one context element belonging to the group comprising:

at least one piece of information relating to said set of images, provided by at least one user through a man/machine interface;

at least one piece of information relating to the geographical position of the place of shooting of said set of images, provided by a location device;

at least one piece of information relating to said set of images, resulting from the processing of said set of images by an image analysis module;

at least one user profile comprising at least one piece of profile information relating to a user;

at least one piece of information included in a default context.

Generally speaking, a context element is information relating to the set of images (for example a photo) coming from a software tool, from the creator (photographer) or from any other user having knowledge of the existence of this photo. In addition, contextual information of different natures can be combined to provide an even more precise context.

By default context, we mean for example the most common first names and/or surnames. In this case, in the absence of any other context element, this default context is used, which makes it possible to search among the common first names and/or surnames.

Advantageously, said method comprises a step of selecting a language model according to at least one context element of said context, and said voice recognition step is performed in transcription mode with the selected language model.

The invention also relates to a computer program product comprising program code instructions for executing the steps of the aforementioned method, when said program is executed on a computer.

The invention further relates to a storage medium, possibly totally or partially removable, readable by a computer, storing a set of instructions executable by said computer to implement the above method.

The invention also relates to a device for generating at least one text index associated with a set of images comprising at least one image, from at least one voice annotation previously associated with said set of images, comprising means for defining said vocabulary to be recognized, themselves comprising:

means for defining a context;

means for searching in a linguistic system for a list of words associated with said context and forming said vocabulary to be recognized.

The invention also relates to an apparatus for taking images and recording associated voice annotations, comprising the aforementioned device for generating at least one text index.

The invention also relates to an apparatus for managing / viewing images and recording associated voice annotations, comprising the aforementioned device for generating at least one text index.

5. List of figures

Other features and advantages of the invention will appear more clearly on reading the following description of a preferred embodiment, given as a simple illustrative and nonlimiting example, and the appended drawings, among which:

FIG. 1 shows a functional chain for generating textual indexes in a particular embodiment of the generation method of the invention;

FIG. 2 shows a flowchart of a particular embodiment of the generation method of the invention;

FIG. 3 shows the structure of a particular embodiment of a device for generating textual indexes according to the invention;

FIG. 4 shows the structure of a particular embodiment of an image taking apparatus according to the invention; and

FIG. 5 shows the structure of a particular embodiment of an image management / viewing apparatus according to the invention.

6. Detailed description of the invention

The general principle of the invention is based on a technique for automatically generating a vocabulary to be recognized (from a context), which is used to identify keywords or expressions by voice recognition applied to a voice annotation, so as to generate a textual index associated with an image. The technique for automatically producing a vocabulary to be recognized according to the invention can notably make different modules that provide context elements coexist. Such a technique can notably use, in parallel and/or jointly, several context elements of identical and/or distinct natures to provide a more precise vocabulary to be recognized. This method of generating textual indexes according to the invention uses, for example, a linguistic system comprising a lexicon and a semantic network.

This approach is illustrated in particular in FIG. 1.

In a conventional manner, it is assumed that a voice recognition module 12 receives:

on a first input, a voice annotation 15 associated with a photo 14 (for example by means of a device having both the digital shooting function and the function of recording associated voice annotations);

on a second input, a vocabulary 17 to be recognized, so as to allow the voice annotation 15 to be searched for the words contained in the vocabulary to be recognized, and then to automatically generate a textual index 16 which will be associated with this photo.

According to the invention, the vocabulary to be recognized 17 is automatically generated by a generation module 11 which receives the photo 14 as input. This module 11 for generating the vocabulary to be recognized comprises several modules for providing context elements, among which are an image processing module 111, a textual input module 112, a geographic location module 113 (for example of the GPS type), and a user profile management module 116. This module 11 also comprises a module 114 for extracting the vocabulary to be recognized, which receives the aforementioned context elements and queries a linguistic system 115 according to them. The linguistic system 115 thus receives as input a set of context elements (themes) specific to a photo or a series of photos.

We now describe a first example of application of the method according to the invention, wherein the context elements are themes entered by the user. For example, it is assumed that with the help of the text input module 112, the user has entered a theme (for example "Pink Granite Coast") for the application to search the linguistic system for the vocabulary (tourist in this example) associated with this theme.

In a first step, the linguistic system 115 determines, from these context elements, the language used by the user.

In a second step, this system looks up each context element in a data structure specific to this language, also called a lexicon. Each lexicon is composed of words associated with concepts. The set of words corresponds to all the inflected forms of the language and to the names of places, monuments, or any geographical element such as streams and rivers. These elements and names are for example defined in a thesaurus such as the TGN (Getty Thesaurus of Geographic Names) (see www.getty.edu/research/tools/vocabulary/tgn/).

The linguistic system 115 then searches, for each concept associated in the lexicon with each context element, for the neighboring concepts in a data structure called a semantic network. This semantic network connects concepts by typed relationships. These relationships can be, for example, synonymy, composition, geographic inclusion or other relationships. By definition, a concept is said to be a neighbor of another if there is a relationship between them. The linguistic system 115 then returns the set of words associated in the lexicon with each of these neighboring concepts.
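The lookup described above (context element, then concept in the lexicon, then neighboring concepts in the semantic network, then their words) can be sketched as follows. This is only a minimal illustration: the lexicon entries, concept identifiers and relation names are hypothetical toy data, and each word is assumed to map to a single concept, which the description does not require.

```python
from collections import defaultdict

# Hypothetical toy lexicon: word -> concept identifier (illustrative data only).
LEXICON = {
    "pink granite coast": "C_PINK_GRANITE_COAST",
    "ploumanac'h": "C_PLOUMANACH",
    "perros-guirec": "C_PERROS_GUIREC",
    "eiffel tower": "C_EIFFEL_TOWER",
}

# Hypothetical semantic network: typed relations between concepts.
RELATIONS = [
    ("C_PINK_GRANITE_COAST", "geographic_inclusion", "C_PLOUMANACH"),
    ("C_PINK_GRANITE_COAST", "geographic_inclusion", "C_PERROS_GUIREC"),
]


def build_indexes(lexicon, relations):
    """Return (words per concept, neighbors per concept)."""
    words_by_concept = defaultdict(set)
    for word, concept in lexicon.items():
        words_by_concept[concept].add(word)
    neighbors = defaultdict(set)
    for source, _relation_type, target in relations:
        # Two concepts are neighbors as soon as any relation links them.
        neighbors[source].add(target)
        neighbors[target].add(source)
    return words_by_concept, neighbors


def vocabulary_for(context_elements, lexicon=LEXICON, relations=RELATIONS):
    """Build the vocabulary to be recognized from a set of context elements."""
    words_by_concept, neighbors = build_indexes(lexicon, relations)
    vocabulary = set()
    for element in context_elements:
        key = element.lower()
        vocabulary.add(key)  # the context elements themselves are kept
        concept = lexicon.get(key)
        if concept is None:
            continue  # element unknown to the lexicon: nothing more to add
        for neighbor in neighbors[concept]:
            vocabulary |= words_by_concept[neighbor]
    return vocabulary
```

With this toy data, `vocabulary_for(["Pink Granite Coast"])` returns the theme itself plus the two place names linked to it, and leaves out unrelated entries such as "eiffel tower".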

The vocabulary to be searched is then composed of the words returned by the linguistic system 115 (including the context elements entered by the user) for each set of context elements entered by a user. It should be noted that several of these sets can be associated with the same photo if several users have produced context elements for this photo.

In a second example of application of the method of the invention, the theme of the photo series is coupled with information resulting from the analysis of the photo by the image processing module 111. For example, the user has taken many photos all over Brittany, and when he enters the theme "Brittany" in his indexing application, the linguistic system generates a vocabulary too large to allow speech recognition with satisfactory performance (tens of thousands of entries, covering religious, prehistoric, maritime, monumental and landscape heritage). The image processing module 111 recognizes a church in the photo, and the linguistic system 115 is then restricted (by the extraction module 114) to the list of the remarkable churches of Brittany alone, which is of a more reasonable size for speech recognition.
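The restriction described in this second example can be sketched as a simple filter. The entry format `(word, theme, category)`, the category labels and the sample entries are assumptions made for illustration; the description does not specify how the linguistic system stores or filters its entries.

```python
# Hypothetical entries of the linguistic system: (word, theme, category).
ENTRIES = [
    ("Brélévenez", "Brittany", "church"),
    ("Carnac", "Brittany", "prehistoric"),
    ("Saint-Malo", "Brittany", "maritime"),
    ("Chartres", "Centre", "church"),
]


def restrict_vocabulary(entries, theme, detected_categories=None):
    """Return the words for a theme, optionally narrowed to the categories
    reported by an image analysis step (for example {"church"})."""
    words = []
    for word, entry_theme, category in entries:
        if entry_theme != theme:
            continue
        if detected_categories and category not in detected_categories:
            continue
        words.append(word)
    return words
```

Called with the theme alone, the function returns every entry for that theme; passing the categories detected in the image shrinks the list, which is the effect the example relies on to keep the vocabulary at a size the recognizer can handle.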

In a third example of application of the method of the invention, the image processing module 111 recognizes that there is a person in the photo, and the application (more specifically the extraction module 114) also launches a voice search for first names or familiar names, in a list that can be generic (i.e. the 2000 most common first names, as well as the names of family relationships such as mom, dad, etc.), personalized by the user, or a combination of both. By running the voice recognition module 12 in keyword detection mode on the vocabulary of churches and on the first names, the application is thus able to produce, from the voice annotation "That's Patrick in front of Brélévenez", the textual indexes "Patrick, Brélévenez".
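The keyword detection result in this third example can be illustrated on text. The sketch below is not a speech recognizer: it assumes, purely for illustration, that the voice annotation is already available as a string, and it only shows how words of the combined vocabulary (churches plus first names) become the textual index, in order of appearance.

```python
import re


def tokenize(text):
    """Split text into case-insensitive word tokens (accented letters kept)."""
    return [token.casefold() for token in re.findall(r"\w+", text)]


def spot_keywords(transcript, vocabulary):
    """Return the vocabulary entries found in the transcript, in order of
    first appearance. Multi-word entries are matched as token sequences."""
    tokens = tokenize(transcript)
    hits = []
    for entry in vocabulary:
        entry_tokens = tokenize(entry)
        size = len(entry_tokens)
        if size == 0:
            continue
        for start in range(len(tokens) - size + 1):
            if tokens[start:start + size] == entry_tokens:
                hits.append((start, entry))
                break  # keep only the first occurrence of each entry
    return [entry for _position, entry in sorted(hits)]


churches = ["Brélévenez"]
first_names = ["Marie", "Patrick"]
index = spot_keywords("That's Patrick in front of Brélévenez", churches + first_names)
```

An empty result is also meaningful here: it indicates that no word of the vocabulary to be recognized is present in the annotation.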

In a fourth example of application of the method of the invention, the application having knowledge that the photo contains a person and a church, uses a language model defined specifically for the theme (person, place) and allows the recognition module 12 to produce the written transcript of the entire vocal annotation: "That's Patrick in front of Brélévenez".

In a fifth example of application of the method of the invention, the application has a profile of the user (names of his relatives, hobbies, vocabulary of text annotations of previously indexed photos), provided by the management module of user profiles 116, and generates the vocabulary to search according to this profile.

In a sixth example of application of the method of the invention, the extraction module 114 queries the linguistic system 115 with a theme (for example "the Pink Granite Coast") while imposing a geographical restriction derived from position information given, for example, by a GPS function available on some cameras.

FIG. 2 illustrates the sequence of the different steps in a particular embodiment of the method according to the invention.
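One possible way to apply such a geographical restriction is to keep only the place names lying within a given radius of the shooting position. The sketch below uses the haversine great-circle distance; the radius, the data layout and the approximate coordinates are assumptions for illustration and are not taken from the description.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0

# Hypothetical place entries with approximate coordinates (latitude, longitude).
PLACES = {
    "Ploumanac'h": (48.83, -3.48),
    "Perros-Guirec": (48.81, -3.44),
    "Saint-Malo": (48.65, -2.01),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def places_near(position, places=PLACES, radius_km=10.0):
    """Return the sorted names of places within radius_km of the position."""
    lat, lon = position
    return sorted(
        name
        for name, (place_lat, place_lon) in places.items()
        if haversine_km(lat, lon, place_lat, place_lon) <= radius_km
    )
```

A photo taken at roughly (48.83, -3.48) would then keep only the nearby names in the vocabulary, while places about a hundred kilometres away are dropped.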

A shooting phase comprises a first step 20, during which the user takes a picture by means of an electronic device having both "digital pictures" and "voice annotations" functions.

A voice recording phase comprises a step 21, during which the user records a voice annotation using the apparatus used in step 20.

The following steps relate to a definition phase of the vocabulary to be recognized.

In step 22, context elements are defined from one or more pieces of information that can be provided directly by the user himself (via a human/machine interface 112 of the microphone, keyboard, etc. type), by a positioning system 113 of the GPS type, by an image processing module 111 (information relating to an image analysis), by a user profile management module 116, etc.

In step 23, the language used by the user is searched from at least one of these context elements.

Next, during step 24, a list of words associated with the context elements is searched in an appropriate linguistic system 115, so as to establish a vocabulary to be recognized.

The next step is related to a phase of generating textual indexes. Finally, during step 25, the voice activity of the user is processed by voice recognition means 12, so as to perform a search in the voice annotation previously recorded in step 21 of words or expressions contained in the vocabulary to be recognized. Words or phrases identified during the search form the textual index.
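The chaining of steps 22 to 25 can be summarized in a short sketch. The three callables passed as parameters are hypothetical stand-ins for the language determination, the linguistic system and the voice recognition means; their signatures are assumptions, and the fallback to a default context follows the summary of the invention rather than FIG. 2 itself.

```python
from typing import Callable, Iterable, List, Sequence, Set


def generate_text_index(
    voice_annotation,
    context_elements: Iterable[str],
    detect_language: Callable[[List[str]], str],
    lookup_words: Callable[[str, List[str]], Set[str]],
    recognize: Callable[[object, Set[str]], List[str]],
    default_context: Sequence[str] = (),
) -> List[str]:
    """Run steps 22 to 25 on an annotation that was recorded in step 21."""
    # Step 22: define the context, falling back to a default context
    # when no other context element is available.
    elements = list(context_elements) or list(default_context)
    # Step 23: determine the user's language from the context elements.
    language = detect_language(elements)
    # Step 24: query the linguistic system for the vocabulary to be recognized.
    vocabulary = lookup_words(language, elements)
    # Step 25: search the annotation for words of that vocabulary.
    # An empty list is a valid result: no vocabulary word was found.
    return recognize(voice_annotation, vocabulary)
```

Passing the components as parameters keeps the sketch independent of any particular recognizer or lexicon, so each stage can be replaced or tested on its own.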

FIG. 3 shows the structure of a text index generation device 32 according to the invention, which comprises a memory 322, and a processing unit 321 equipped with a microprocessor μP, which is controlled by a computer program (or application) 323 implementing the method according to the invention. The processing unit 321 receives as input a voice annotation 31 associated with a set of images. The microprocessor μP processes this voice annotation, according to the instructions of the program 323, to generate textual indexes 33 representative of the words identified in the voice annotation.

FIG. 4 shows the structure of an image taking apparatus 41 according to the invention, which comprises the text index generating device 32 described in FIG. 3, and an image taking unit 411 equipped with an image sensor Ci, which cooperates with a voice recording unit 412. The image taking unit 411 receives a signal representative of an image 42 captured by the image sensor Ci, and the voice recording unit 412 receives a signal representative of a voice annotation 43. These two signals are transmitted to the text index generating device 32, which first analyzes the signal representative of an image and then uses the signal representative of a voice annotation, so as to automatically produce relevant textual indexes.

FIG. 5 shows the structure of an image management / viewing apparatus 51 according to the invention, which comprises the text index generating device 32 described in FIG. 3, and an image management / viewing unit 511, which cooperates with a voice recording unit 512. The image management / viewing unit 511 captures an image 52 in a given context, and the voice recording unit 512 receives a voice annotation 53 associated with this image. These two pieces of information are transmitted to the text index generating device 32, which first analyzes the captured image, then performs voice recognition on the voice annotation, so as to produce textual indexes representative of the words identified in the voice annotation.

In summary, the invention proposes a method for generating a text index associated with at least one image from a voice annotation, recorded by means of an image management / viewing apparatus or a digital shooting apparatus equipped with a voice annotation recorder. This method has many advantages, including that of automatically defining a vocabulary to be recognized from a context (for example a theme entered by the user, GPS data, a user profile, an image analysis, etc.), so as to search the voice annotation for words or phrases contained in that vocabulary.

Claims

1. A method of generating at least one text index associated with a set of images comprising at least one image (14), from at least one voice annotation (15) previously associated with said set of images, said method comprising a voice recognition step (25) applied to said at least one voice annotation (15) with a predetermined vocabulary to be recognized (17), so as to search said at least one voice annotation (15) for at least one word contained in said vocabulary to be recognized (17), the word or words identified during the search forming said textual index (16), characterized in that it comprises a step of defining said vocabulary to be recognized, itself comprising: a step of defining a context (22); a search step (24) in a linguistic system for a list of words associated with said context and forming said vocabulary to be recognized.

2. Method according to claim 1, characterized in that said set of images comprises at least one photo (14) and / or at least one video sequence.

3. Method according to any one of claims 1 and 2, characterized in that said context comprises at least one context element belonging to the group comprising: at least one piece of information relating to said set of images, provided by at least one user through a man/machine interface (112); at least one piece of information relating to the geographical position of the place of shooting of said set of images, provided by a location device (113); at least one piece of information relating to said set of images, resulting from the processing of said set of images by an image analysis module (111); at least one user profile comprising at least one piece of profile information relating to a user; at least one piece of information included in a default context.

4. Method according to any one of claims 1 to 3, characterized in that said method comprises a step of selecting (23) a language model according to at least one context element of said context, and in that said voice recognition step (25) is performed in transcription mode with the selected language model.

5. Computer program product (323), characterized in that it comprises program code instructions for performing the steps of the method according to any one of claims 1 to 4, when said program is executed on a computer.

6. Storage medium (322), possibly totally or partially removable, readable by a computer, storing a set of instructions executable by said computer to implement the method according to any one of claims 1 to 4.

7. Device (32) for generating at least one textual index associated with a set of images comprising at least one image, from at least one voice annotation previously associated with said set of images, said device comprising voice recognition means which, when they process said at least one voice annotation with a predetermined vocabulary to be recognized, search said at least one voice annotation for at least one word contained in said vocabulary to be recognized, the word or words identified during the search forming said textual index, characterized in that it comprises means for defining said vocabulary to be recognized, themselves comprising: means for defining a context; means for searching in a linguistic system for a list of words associated with said context and forming said vocabulary to be recognized.

8. Apparatus for taking pictures (41) and recording associated voice annotations, for taking a set of images comprising at least one image, and associating at least one voice annotation with said set of images, characterized in that it comprises a device (32) for generating at least one textual index according to claim 7.

9. Apparatus for managing / viewing images (51) and recording associated voice annotations, for receiving a set of images comprising at least one image, and receiving or generating at least one voice annotation associated with said set of images, characterized in that it comprises a device (32) for generating at least one textual index according to claim 7.