WO2004008344A1 - Annotation of digital images using text - Google Patents

Annotation of digital images using text

Info

Publication number
WO2004008344A1
Authority
WO
WIPO (PCT)
Prior art keywords
predetermined
image
cluster
digital images
titles
Prior art date
Application number
PCT/SG2002/000157
Other languages
English (en)
Inventor
Seng Chu Tele Tan
Philippe Mulhem
Jiayi Chen
Original Assignee
Laboratories For Information Technology
Centre National De La Recherche Scientifique
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Laboratories For Information Technology, Centre National De La Recherche Scientifique filed Critical Laboratories For Information Technology
Priority to PCT/SG2002/000157 priority Critical patent/WO2004008344A1/fr
Priority to AU2002345519A priority patent/AU2002345519A1/en
Publication of WO2004008344A1 publication Critical patent/WO2004008344A1/fr

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually

Definitions

  • the present invention relates to the annotation of digital images using text and refers particularly, though not exclusively, to the annotation of images using text recorded in a plurality of fields, each field having a predetermined title.
  • a further object of the invention is to record the key information under a plurality of fields, each field having a predetermined title.
  • the present invention provides a method for annotating images wherein key information for each image is stored with each image as text.
  • the key information is stored under a plurality of fields, with each field having a predetermined title.
  • the predetermined titles form part of the key information.
  • the predetermined titles of the fields may be determined by a user or may be pre-set by a supplier. They may be, for example, Event, Place, People and Date, in any order.
  • the key information may be input as audio and converted to text using an automatic speech recognition engine.
  • the key information may be input by keyboard, keypad, touch screen or imaging.
  • each predetermined title is input before the key information for the field relevant for that predetermined title.
  • each of the predetermined titles may be matched to its counterpart word in the audio input. All words that are after the predetermined title and before the next occurring predetermined title or the end of the audio input, whichever occurs first, may then be extracted as a description for that field.
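  • As a minimal illustration of the extraction rule just described, the following sketch splits a transcribed annotation at each predetermined title; the titles and the sample transcripts are illustrative assumptions, not taken from the patent.

```python
FIELD_TITLES = ["event", "place", "people", "date"]

def segment_annotation(transcript: str, titles=FIELD_TITLES) -> dict:
    """Split a transcript into field descriptions: every word after a
    recognized field title, up to the next title or the end of the input
    (whichever occurs first), is taken as that field's description."""
    words = transcript.lower().split()
    fields, current = {}, None
    for word in words:
        if word in titles:          # a predetermined title opens a new field
            current = word
            fields[current] = []
        elif current is not None:   # words before the first title are ignored
            fields[current].append(word)
    return {title: " ".join(desc) for title, desc in fields.items()}

# Titles may be spoken in any order; both annotations yield the same fields.
print(segment_annotation("event wedding ceremony place city hall date june tenth"))
print(segment_annotation("date june tenth event wedding ceremony place city hall"))
```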
  • the automatic speech recognition engine preferably should be able to edit its vocabulary, correct frequently occurring transcription errors, incorporate new words into its vocabulary, and provide alternatives in addition to the final recognition result while recognizing the audio input.
  • character information for an image may also be recorded for that image.
  • the character information may be, for example, global positioning system coordinates and camera-related information.
  • the key information and the character information are preferably stored on a database.
  • the database may store the digital images singularly as a single image database. Alternatively, or additionally, the database may store the digital images in clusters as a multiple image database.
  • the key information for a plurality of digital images may be clustered by phonetically similar words that occur in a majority of the digital images in the cluster. Clustering may be achieved by using a nearest neighbour clustering algorithm based on a threshold. When the descriptions for a given predetermined title are the same for the plurality of digital images, the digital images are clustered by descriptions of a different one of the predetermined titles. To prevent misplacement of a digital image, further clustering processing is conducted for fringe relocation. Data from the clusters can then be used to update the descriptions.
  • Words occurring in the majority of the digital images may be used in the clustering process. Also, clustering may be achieved by using a nearest-neighbour clustering algorithm based on a threshold. Words that occur in the majority of the key information of all digital images in a cluster may be taken into account in the clustering process. One or more of the predetermined titles may be dependent on another of the predetermined titles during the clustering process.
  • Fringe relocation may begin by assigning a value representing the dominant element in a field throughout all images in the cluster.
  • a distance between the single image value for a single digital image and its cluster is then determined, as are subsequent distances between the single image value and its adjacent clusters.
  • a normalized value of the distance and the subsequent distances is then obtained.
  • the normalized distance and the normalized subsequent distances are compared, and the single digital image is placed in the cluster for which the normalized distance, or the normalized subsequent distance, is lowest.
  • a weighting rule may be applied across the plurality of fields in determining the subsequent distances, as formalized below.
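  • The comparison above can be formalized as follows; the formulas are a plausible reading of the described steps, not reproduced from the patent. Let d_{i,f}(x) be the per-field distance between a single image x and cluster i (i = 0 being the image's current cluster and i = 1..n its adjacent clusters), and w_f the field weights:

```latex
\[
  d_i(x) = \sum_{f \in \mathrm{fields}} w_f \, d_{i,f}(x), \qquad
  \hat{d}_i(x) = \frac{d_i(x)}{\sum_{j=0}^{n} d_j(x)}, \qquad
  \mathrm{cluster}(x) = \arg\min_{i \in \{0,\dots,n\}} \hat{d}_i(x)
\]
```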
  • the matching of the predetermined titles is by keyword spotting; or may be by using the automatic speech recognition results and searching for the predetermined titles.
  • the present invention also provides a computer useable medium comprising a computer program code that is configured to cause a processor to execute one or more functions as described above.
  • Figure 1 is a flow chart for a speech annotation process.
  • Figure 2 is an illustration of an annotation speech structure.
  • Figure 3 is an illustration of the field segmentation process.
  • the indexing process assumes that the image captured, and the results of the speech annotation process, are stored in a memory device in any multi-media format (e.g. MPEG, AVI and so forth).
  • Figure 1 shows the indexing processing after the digital still/moving images, and the associated speech annotations, are downloaded to the host.
  • the first module 10 is to separate the multi-media content (image plus audio) into the image 12 and audio fields 14.
  • the image content can either be
  • the further processing is to: (a) increase the signal-to-noise ratio
  • CBIR Content-Based Image Retrieval
  • the audio or speech content is first preprocessed at 20 to enhance the quality of the speech signals before being transcribed into text at the Automatic Speech Recognition ("ASR") module 22.
  • the ASR 22 can be any commercial off-the-shelf engine 28 that preferably has the flexibility for word editing and training in its vocabulary structure. This is a useful feature as some words of local flavor, or the names of places and persons, may not reside in the indigenous vocabulary of the ASR. The ability to add these words can further improve the performance of the speech recognition engine 28.
  • the engine 28 should preferably support the following additional customization functions:
  • the pre-determined field titles can be emphasized using the first function (a) in order to achieve high recognition rates for the field words.
  • the names of family members can also be trained into the ASR engine as, for home photographs, family members will quite often be in the photograph.
  • the field titles and commonly used field descriptions are stored in a field-based dictionary 36.
  • the second function (b) can be especially helpful to determine the original content of the speech by providing more detailed information about the process.
  • a pre-determined syntax for the input structure of the speech is preferably used.
  • Structured speech has been used to control many speech-activated devices such as cell phones, and other handheld devices.
  • a relatively high recognition accuracy of these implementations is achieved by restricting the vocabulary of commands and indexed words.
  • the key information of images can be extracted from the speech annotation. These extracted terms will be used as index descriptions of the image.
  • the user may be given the flexibility to define content sub-categories or titles that best suit their indexing needs. Alternatively, these may be pre-set by the supplier.
  • for a digital camera intended for general domestic use, the titles may be pre-set by the manufacturer as Event, Place, People and Date, as these would be the most commonly used, and most relevant, titles or sub-categories for everyday general domestic use. These sub-categories or titles are the fields in the speech structure.
  • the basic speech structure is shown in Figure 2.
  • the field titles can vary with the categories of the photos/video. For example, as is explained above, in a home photo scenario, it may be useful to categorize the photos into the following fields: Event, Place, People and Date. Following each field is the list of the elements or description of the field. Querying the elements of these fields will enable retrieving the content of the "album".
  • a field word-detecting algorithm is used to ascertain the location of the fields within the text.
  • the first is using the ASR results 24. This algorithm sequentially searches through the words in the text and their alternatives, and matches each selected word against the list of keywords comprising the field title words.
  • the prior word-level training of the ASR module enables the field title words to be detected with relatively high confidence.
  • the interval between the detected field titles, i.e. between a field title Fi and the next detected field title Fi+1, contains the description for the field Fi.
  • the field segmentation can also be regarded as a form of keyword spotting
  • WLS may be carried out based on signal-level processing methods. Because the set of pre-defined field words or titles is preferably relatively small (e.g. four), it is practical to create a template for each word. The minimum number of fields is two, with no theoretical maximum, although practical constraints such as available memory and processing speed will set the effective maximum in each case. Additional filler templates may also be needed to absorb all other words. Templates may be represented by a sequence of feature vectors. Once the templates are established, the speech annotation is compared with them through, for example, dynamic-time warping to determine which part of the speech signal is most similar to which template. The beginning and ending points of the field words can thus be approximately determined, which facilitates the recognition of content in each field.
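  • The following sketch shows the dynamic-time-warping comparison in compact form; the feature representation (fixed-width frames of MFCC-like vectors) and the template store are assumptions for illustration.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Length-normalized DTW cost between two sequences of feature
    vectors, each of shape (frames, dims)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)

def best_template(segment: np.ndarray, templates: dict) -> str:
    """Return the field word whose template is closest to the segment."""
    return min(templates, key=lambda t: dtw_distance(segment, templates[t]))

# Toy check: a noisy copy of the "place" template is matched back to it.
rng = np.random.default_rng(0)
templates = {t: rng.normal(size=(12, 13))           # hypothetical MFCC frames
             for t in ["event", "place", "people", "date"]}
segment = templates["place"] + rng.normal(scale=0.1, size=(12, 13))
print(best_template(segment, templates))   # -> "place"
```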
  • Figure 3 shows an illustration of the field segmentation process.
  • the transcribed annotation is shown at the top.
  • the four field titles in this illustration are Event, Place, People and Date.
  • the field segmentation process will yield the locations of the uttered field words within the text, either through ASR 24 results or through signal level processing 26.
  • the ensuing text (within the shaded rectangular sub-field box) before the next field word (for example "wedding ceremony") will contain the description of the respective field. Because field words are detected sequentially, there are no restrictions on how these field words need to be ordered in each audio input. For example, a speech annotation in the form "Date... People... Place... Event" will, after segmentation, lead to the same field content as "Event... Place... People... Date". This allows flexibility in defining the sub-categories or titles in a speech structure, and in performing annotations for images of different categories.
  • the content following a field title, before the next field title or the end of the annotation, whichever occurs first, is the description of that field category.
  • These are extracted at 38.
  • "wedding ceremony" is thus the textual element describing the Event field. These elements can be stored directly as the field metadata of the accompanying digital image, or may be passed through a parser to extract higher-level information.
  • every segment generated can be fed back into the ASR engine 28, for the ASR engine 28 to recognize as belonging to the corresponding field. This may improve recognition, as well as the resulting extraction performance.
  • Digital images may be associated with information stored in any character format representation (for example, ASCII, ISO-8859-1 or UNICODE). Examples of such information are:
  • GPS Global Positioning System
  • camera-related information such as aperture, zoom information, speed, use of flash, landscape/portrait mode, and so forth.
  • This information describes each image, and is processed by an adequate character-based extraction process 30 and stored in the database 16 to be used for visualization and/or retrieval purposes.
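  • As one concrete possibility (an assumption; the patent does not name a format or library), camera-related character information of this kind can be read from an image file's EXIF block, for example with Pillow:

```python
from PIL import Image, ExifTags

def camera_metadata(path: str) -> dict:
    """Return the image's EXIF tags keyed by their human-readable names.
    Which tags are present depends on the camera and the file."""
    exif = Image.open(path).getexif()
    return {ExifTags.TAGS.get(tag, tag): value for tag, value in exif.items()}

# e.g. camera_metadata("photo.jpg") might include 'Model', 'DateTime',
# 'FNumber' (aperture) and GPS-related tags.
```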
  • the extracted fields are stored in the Single Image Database ("SID") 32.
  • SID 32 stores the speech-based and character-based extracted fields for each image, with each image's fields stored separately from those of all other images.
  • the database 32 allows effective and efficient storage of index information pertaining to the image.
  • the database 32 facilitates the retrieval of relevant images, as well as providing the required information for report generation 34 purposes.
  • the accuracy of the field element extraction process 38 is dependent on the recognition performance of the ASR engine 28.
  • clustering 40 may be used.
  • the collection of images is partitioned using a predefined structure of at least one field content.
  • the clustering process is then used to group together images that are similar. Images that are similar according to the clustering fields may also have strong similarities in the non-clustering fields.
  • the cluster can then be indexed using representative elements of the image fields. For fields corresponding to text extracted from speech, it is possible to represent the cluster fields by a group of phonetically similar words that occur in a majority of the images of the cluster.
  • intersection of field attributes may be used to represent the major features of the images of the cluster.
  • a nearest neighbour clustering algorithm based on threshold may achieve the clustering of digital images.
  • the general nearest neighbour clustering algorithm is described below.
  • the clusters should be of a manageable size. It is preferred that the number of images in a cluster be limited to a maximum, which may be predetermined or may be determined by the processing capabilities of the host. In this clustering rule, the similarity between images is based on a field-specific distance metric, as sketched below.
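  • A minimal sketch of such threshold-based nearest-neighbour clustering over the Date field; the image representation, threshold and size limit are illustrative assumptions.

```python
from datetime import datetime

def cluster_by_date(images, threshold_hours=6.0, max_size=20):
    """Group time-sorted images: start a new cluster when the gap to the
    nearest neighbour (the previous image) exceeds the threshold, or when
    the current cluster reaches the maximum size."""
    images = sorted(images, key=lambda im: im["date"])
    clusters = []
    for im in images:
        if clusters:
            gap = (im["date"] - clusters[-1][-1]["date"]).total_seconds() / 3600
            if gap <= threshold_hours and len(clusters[-1]) < max_size:
                clusters[-1].append(im)
                continue
        clusters.append([im])
    return clusters

photos = [{"name": f"photo{i}", "date": datetime(2002, 7, 9, h)}
          for i, h in enumerate([9, 10, 11, 20, 21])]
print([[im["name"] for im in c] for c in cluster_by_date(photos)])
# -> [['photo0', 'photo1', 'photo2'], ['photo3', 'photo4']]
```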
  • Photo 8: Event: USA Holiday; Place: Ferry Ride to Ellis Island; People: Jack, Jill; Date: 14 April 1992, 2.55 pm; Cluster: TS 3 (row excerpt from Table 1).
  • the fields that describe a cluster have the same names as those coming from the images. For each field of the images in a cluster that is dependent on the clustering criterion, the value for the corresponding field of the cluster is computed.
  • a dependent attribute, relative to a clustering criterion, is an attribute whose value is expected to be similar across the images of the cluster.
  • the Event and Place attributes, for example, are dependent on the Date clustering criterion.
  • an adaptation algorithm may be used for fringe relocation. This may be implemented in any of a number of ways, including by the following steps (a simplified sketch follows the list):
  • 1. Each dependent field of a cluster is assigned a value to represent its dominant element throughout all images in this cluster. Meanwhile, each image in a cluster is labeled with a field-specific distance metric to describe its disparity relative to the dominant value in each field. The image with the largest distance is treated as the fringe image.
  • 2. Compute the index of the closest cluster as arg min_i [d_i(FringeImage)], where i indexes the n+1 clusters in the comparison (n being the number of clusters adjacent to the initial cluster of the FringeImage) and d_i(FringeImage) is the normalized distance between the fringe image and cluster i. If the index of the closest cluster corresponds to the fringe image's current cluster, no fringe relocation needs to be performed; proceed directly to step 4. Otherwise, re-categorize the fringe image into the cluster corresponding to the index of the closest cluster, update the dominant elements in each field of both clusters involved in the relocation, and recalculate the related distance values.
  • 3. Identify the image with the next largest value of distance and repeat steps 2 and 3 until no more fringe images need to be relocated.
  • 4. Select another cluster and repeat the process until all clusters are processed.
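  • A simplified sketch of the relocation loop above: it always re-examines the current worst-fitting image rather than walking down the distance ranking, and the distance function, cluster representation and adjacency map are assumptions.

```python
def relocate_fringes(clusters, dist, adjacency):
    """For each cluster, repeatedly move its fringe image (the image
    farthest from the cluster's dominant field values, as measured by
    dist) into whichever of the current and adjacent clusters is closest."""
    for c in range(len(clusters)):
        moved = True
        while moved and clusters[c]:
            fringe = max(clusters[c], key=lambda im: dist(im, clusters[c]))
            candidates = [c] + adjacency[c]        # current plus adjacent clusters
            closest = min(candidates, key=lambda i: dist(fringe, clusters[i]))
            moved = closest != c
            if moved:  # re-categorize; dominant values are recomputed inside dist
                clusters[c].remove(fringe)
                clusters[closest].append(fringe)
    return clusters
```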
  • the number of neighbouring clusters may be greater than two when there are more than two clustering fields.
  • Photo 6, for example, is relocated to cluster TS 2 after fringe relocation.
  • the data from the cluster fields may also be used to update the image field database at 42. The information related to the clusters is considered trustworthy because it abstracts information coming from several images. For instance, in the Place field of photograph 3 of Table 1, "Cited" is a wrongly extracted word. If phonetic similarity determines that "Cited" and "City" are close to each other, and that "Cited" is not a word of the Place field of the cluster containing photograph 3, then the Place field of photograph 3 of Table 1 should be updated to contain "City", not "Cited".
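  • The update step can be sketched as follows. True phonetic similarity would use an encoding such as Soundex or Metaphone; as a simple stand-in, this sketch uses normalized edit distance, which also pairs "cited" with "city". The vocabulary and threshold are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def correct_word(word, cluster_vocabulary, max_ratio=0.5):
    """Replace an extracted word with the closest cluster word, provided
    the two are similar enough relative to their length."""
    best = min(cluster_vocabulary, key=lambda c: edit_distance(word, c), default=None)
    if best and edit_distance(word, best) / max(len(word), len(best)) <= max_ratio:
        return best
    return word

print(correct_word("cited", ["city", "hall"]))   # -> "city"
```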
  • the data concerning the cluster composition and the cluster fields are stored in the multiple image database 44.
  • the multiple image database 44 should allow efficient and effective retrieval of clusters that are relevant to queries formulated by a user or an external system.
  • the database can be used as a tool to complement report generation 34. This would usually be done after a surveying operation.
  • the single 32 and multiple 44 image databases may provide descriptions of single or multiple images at various granularity of content. This, together with manual intervention, may help expedite the report generation process.
  • the present invention also extends to a computer useable medium having a computer program code that is configured to cause a processor to execute one or more functions described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Processing Or Creating Images (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention concerns a method for indexing the information contained in a digital image for communication and retrieval purposes. A structured speech input is used to attach the contained information to field categories. These fields can be adapted to the needs of the user. An automatic speech recognition engine transcribes the speech signals and extracts the content descriptions. The descriptions can be stored in the fields as metadata describing the content of the corresponding image.
PCT/SG2002/000157 2002-07-09 2002-07-09 Annotation of digital images using text WO2004008344A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/SG2002/000157 WO2004008344A1 (fr) 2002-07-09 2002-07-09 Annotation of digital images using text
AU2002345519A AU2002345519A1 (en) 2002-07-09 2002-07-09 Annotation of digital images using text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/SG2002/000157 WO2004008344A1 (fr) 2002-07-09 2002-07-09 Annotation of digital images using text

Publications (1)

Publication Number Publication Date
WO2004008344A1 (fr) 2004-01-22

Family

ID=30113484

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2002/000157 WO2004008344A1 (fr) 2002-07-09 2002-07-09 Annotation of digital images using text

Country Status (2)

Country Link
AU (1) AU2002345519A1 (fr)
WO (1) WO2004008344A1 (fr)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0699941B1 (fr) * 1994-08-30 2002-05-02 Eastman Kodak Company Camera equipped with a speech recognition device
US6054990A (en) * 1996-07-05 2000-04-25 Tran; Bao Q. Computer system with handwriting annotation
US5940121A (en) * 1997-02-20 1999-08-17 Eastman Kodak Company Hybrid camera system with electronic album control
US6301586B1 (en) * 1997-10-06 2001-10-09 Canon Kabushiki Kaisha System for managing multimedia objects
US6128446A (en) * 1997-12-11 2000-10-03 Eastman Kodak Company Method and apparatus for annotation of photographic film in a camera
US6397181B1 (en) * 1999-01-27 2002-05-28 Kent Ridge Digital Labs Method and apparatus for voice annotation and retrieval of multimedia data
EP1081588A2 (fr) * 1999-09-03 2001-03-07 Sony Corporation Information processing apparatus, information processing method, and program recording medium
US20020069070A1 (en) * 2000-01-26 2002-06-06 Boys Donald R. M. System for annotating non-text electronic documents
WO2001086511A2 (fr) * 2000-05-11 2001-11-15 Lightsurf Technologies, Inc. System and methodology for accessing photographic images and attributes from multiple disparate client devices
US20020022960A1 (en) * 2000-05-16 2002-02-21 Charlesworth Jason Peter Andrew Database annotation and retrieval
US20020087564A1 (en) * 2001-01-03 2002-07-04 International Business Machines Corporation Technique for serializing data structure updates and retrievals without requiring searchers to use locks

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006077196A1 (fr) 2005-01-19 2006-07-27 France Telecom Method for generating a textual index from a voice annotation
EP1770599A3 (fr) * 2005-09-29 2008-04-02 Sony Corporation Information processing apparatus and method, and program used therewith
US7693870B2 (en) 2005-09-29 2010-04-06 Sony Corporation Information processing apparatus and method, and program used therewith
EP1850251A3 (fr) * 2006-04-28 2008-05-14 FUJIFILM Corporation Image display device
CN105931641A (zh) * 2016-05-25 2016-09-07 Tencent Technology (Shenzhen) Co., Ltd. Subtitle data generation method and device
CN105931641B (zh) * 2016-05-25 2020-11-10 Tencent Technology (Shenzhen) Co., Ltd. Subtitle data generation method and device

Also Published As

Publication number Publication date
AU2002345519A1 (en) 2004-02-02
AU2002345519A8 (en) 2004-02-02

Similar Documents

Publication Publication Date Title
  • JP5230358B2 (ja) Information retrieval device, information retrieval method, program, and storage medium
  • US8107689B2 (en) Apparatus, method and computer program for processing information
  • JP5801395B2 (ja) Automatic media sharing via shutter click
  • JP5576384B2 (ja) Data processing device
US7672508B2 (en) Image classification based on a mixture of elliptical color models
  • CN100568238C (zh) Image search method and apparatus
US9009163B2 (en) Lazy evaluation of semantic indexing
  • JP4367355B2 (ja) Photographic image retrieval device, photographic image retrieval method, recording medium, and program
US8117210B2 (en) Sampling image records from a collection based on a change metric
  • JP2005510775A (ja) Camera metadata for categorizing content
US20070255695A1 (en) Method and apparatus for searching images
US7451090B2 (en) Information processing device and information processing method
  • EP2406734A1 (fr) Automatic and semi-automatic image classification, annotation and tagging using image acquisition parameters and metadata
  • JP2000276484A (ja) Image retrieval device, image retrieval method, and image display device
US20060026127A1 (en) Method and apparatus for classification of a data object in a database
US8255395B2 (en) Multimedia data recording method and apparatus for automatically generating/updating metadata
  • CN104798068A (zh) Video retrieval method and apparatus
  • WO2009031924A1 (fr) Method for creating an indexing system for searching for objects contained in digital images
  • JP5289211B2 (ja) Image retrieval system, image retrieval program, and server device
  • WO2004008344A1 (fr) Annotation of digital images using text
  • JP2001357045A (ja) Image management device, image management method, and recording medium for an image management program
Kuo et al. MPEG-7 based dozen dimensional digital content architecture for semantic image retrieval services
Chen et al. A method for photograph indexing using speech annotation
  • JP2006202237A (ja) Image classification device and image classification method
AOKI et al. MPEG-7 Content-Based Image Retrieval with Spatial and Temporal Information

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG US UZ VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LU MC NL PT SE SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP