EP1194870A1 - Fundamental entity-relationship models for the generic audio visual data signal description
- Publication number
- EP1194870A1
- Authority
- EP
- European Patent Office
- Prior art keywords
- relationships
- levels
- semantic
- syntactic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/71—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/288—Entity relationship models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
Definitions
- the present invention relates to techniques for describing multimedia information and, more specifically, to techniques which describe both video and image information, or audio information, as well as the content of such information.
- the techniques disclosed are for content-sensitive indexing and classification of digital data signals (e.g., multimedia signals).
- textual annotations are used for indexing: a cataloguer manually assigns a set of key words or expressions to describe an image. Users can then perform text-based queries or browse through manually assigned categories.
- recent techniques in content-based retrieval have focused on indexing images based on their visual content. Users can perform queries by example (e.g., images that look like this one) or user-sketch (e.g., image that looks like this sketch). More recent efforts attempt automatic classification of images based on their content: a system classifies each image, and assigns it a label (e.g., indoor, outdoor, contains a face, etc.). In both paradigms there are classification issues which are often overlooked, particularly in the content-based retrieval community.
- the main difficulty in appropriately indexing visual information can be summarized as follows: (1) there is a large amount of information present in a single image (e.g., what to index?), and (2) different levels of description are possible (e.g., how to index?).
- a portrait of a man wearing a suit: it would be possible to label the image with the terms "suit" or "man".
- a category label conveys explicit information (e.g., the person in the image is a man, not a woman) and leaves other information implicit or undefined (e.g., from that term alone it is not possible to know what the man is wearing).
- multimedia databases which permit users to search for pictures using characteristics such as color, texture and shape information of video objects embedded in the picture.
- the need to search for multimedia content is not limited to databases, but extends to other applications, such as digital broadcast television and multimedia telephony.
- the MPEG-7 standardization effort. Launched in October 1996, MPEG-7 aims to standardize content descriptions of multimedia data in order to facilitate content-focused applications like multimedia searching, filtering, browsing and summarization. A more complete description of the objectives of the MPEG-7 standard is contained in the International Organisation for Standardisation document ISO/IEC JTC1/SC29/WG11 N2460 (Oct. 1998), the content of which is incorporated by reference herein.
- the MPEG-7 standard has the objective of specifying a standard set of descriptors as well as structures (referred to as "description schemes") for the descriptors and their relationships to describe various types of multimedia information.
- MPEG-7 also proposes to standardize ways to define other descriptors as well as “description schemes” for the descriptors and their relationships. This description, i.e. the combination of descriptors and description schemes, shall be associated with the content itself, to allow fast and efficient searching and filtering for material of a user's interest.
- MPEG-7 also proposes to standardize a language to specify description schemes, i.e. a Description Definition Language ("DDL”), and the schemes for binary encoding the descriptions of multimedia content.
- MPEG is soliciting proposals for techniques which will optimally implement the necessary description schemes for future integration into the MPEG-7 standard.
- three different multimedia-application arrangements can be considered. These are the distributed processing scenario, the content-exchange scenario, and the format which permits the personalized viewing of multimedia content.
- a description scheme must provide the ability to interchange descriptions of multimedia material independently of any platform, any vendor, and any application, which will enable the distributed processing of multimedia content.
- the standardization of interoperable content descriptions will mean that data from a variety of sources can be plugged into a variety of distributed applications, such as multimedia processors, editors, retrieval systems, filtering agents, etc. Some of these applications may be provided by third parties, generating a sub-industry of providers of multimedia tools that can work with the standardized descriptions of the multimedia data.
- a user should be permitted to access various content providers' web sites to download content and associated indexing data, obtained by some low-level or high-level processing, and proceed to access several tool providers' web sites to download tools (e.g. Java applets) to manipulate the heterogeneous data descriptions in particular ways, according to the user's personal interests.
- An example of such a multimedia tool will be a video editor.
- an MPEG-7 compliant video editor will be able to manipulate and process video content from a variety of sources if the description associated with each video is MPEG-7 compliant.
- Each video may come with varying degrees of description detail, such as camera motion, scene cuts, annotations, and object segmentations.
- MPEG-7 aims to provide the means to express, exchange, translate, and reuse existing descriptions of multimedia material.
- multimedia players and viewers that employ the description schemes must provide the users with innovative capabilities such as multiple views of the data configured by the user.
- the user should be able to change the display's configuration without requiring the data to be downloaded again in a different format from the content broadcaster.
- the Generic Visual DS has evolved in the AHG on Description Schemes to the Generic Audio Visual Description Scheme ("AV DS") (AHG on Description Scheme, “Generic Audio Visual Description Scheme for MPEG-7 (V0.3)", ISO/IEC JTC1/SC29/WG11 MPEG99/M4677, Vancouver, Canada, July 1999).
- the Generic AV DS describes the visual content of video sequences or images and, partially, the content of audio sequences; it does not address multimedia or archive content.
- the basic components of the Generic AV DS are the syntactic structure DS, the semantic structure DS, the syntactic-semantic links DS, and the analytic/synthetic model DS.
- the syntactic structure DS is composed of region trees, segment trees, and segment/region relation graphs.
- the semantic structure DS is composed of object trees, event trees, and object/event relation graphs.
- the syntactic-semantic links DS provide a mechanism to link the syntactic elements (regions, segments, and segment/region relations) with the semantic elements (objects, events, and event/object relations), and vice versa.
- the analytic/synthetic model DS specifies the projection/registration/conceptual correspondence between the syntactic and the semantic structure.
- the semantic and syntactic elements, which we will refer to generally as content elements, have associated attributes. For example, a region is described by color/texture, shape, 2-D geometry, motion, and deformation descriptors. An object is described by type, object-behavior, and semantic annotation DSs.
- the Generic AV DS includes content elements and entity-relation graphs.
- the content elements have associated features, and the entity-relation graphs describe general relationships among the content elements. This follows the Entity-Relationship (ER) modeling technique (P. P-S. Chen, "The Entity-Relationship Model - Toward a Unified View of Data", ACM Transactions on Database Systems, Vol. 1, No. 1, pp. 9-36, March 1976).
- the current specification of these elements in the Generic AV DS is too generic to become a useful and powerful tool to describe audio-visual content.
- the Generic AV DS also includes hierarchies and links between the hierarchies, which is typical of physical hierarchical models. Consequently, the Generic AV DS is a mixture of different conceptual and physical models. Other limitations of this DS may be the rigid separation of the semantic and the syntactic structures and the lack of explicit and unified definitions of its content elements.
- the Generic AV DS describes images, video sequences, and, partially, audio sequences following the classical approach for book content descriptions: (1) definition of the physical or syntactic structure of the document, the Table of Contents; (2) definition of the semantic structure, the Index; and (3) definition of the locations where semantic notions appear. It consists of (1) the syntactic structure DS; (2) the semantic structure DS; (3) the syntactic-semantic links DS; (4) the analytic/synthetic model DS; (5) the visualization DS; (6) the meta information DS; and (7) the media information DS.
- the syntactic DS is used to specify the physical structures and the signal properties of an image or a video sequence, defining the table of contents of the document.
- the segment DS may be used to define trees of segments that specify the linear temporal structure of the video program. A segment is a group of continuous frames in a video sequence with associated features: time DS, meta information DS, and media information DS. A special type of segment, the shot, includes an editing effect DS, key frame DS, mosaic DS, and camera motion DS.
- the region DS may be used to define a tree of regions. A region is defined as a group of connected pixels in a video sequence or an image with associated features: geometry DS, color/texture DS, motion DS, deformation DS, media information DS, and meta information DS.
- the segment/region relation graph DS specifies general relationships among segments and regions, e.g. spatial relationships such as "To The Left Of"; temporal relationships such as "Sequential To"; and semantic relationships such as "Consists Of".
- the semantic DS is used to specify semantic features of an image or a video sequence in terms of semantic objects and events. It can be viewed as a set of indexes. It consists of (1) event DS; (2) object DS; and (3) event/object relation graph DS.
- the event DS may be used to form trees of events that define a semantic index table for the segments in the segment DS. Events contain an annotation DS.
- the object DS may be used to form trees of objects that define a semantic index table for the regions in the region DS.
- the event/object relation graph DS specifies general relationships among events and objects.
- the syntactic-semantic links DS are bi-directional between the syntactic elements (segments, regions, or segment/region relations) and the semantic elements (events, objects, or event/object relations).
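The bi-directional character of these links can be illustrated with a small sketch. This is a hypothetical Python illustration, not part of any DS specification; the element identifiers are invented for the example. Registering a link once makes it traversable from the syntactic side to the semantic side and vice versa.

```python
from collections import defaultdict

class SyntacticSemanticLinks:
    """Hypothetical sketch of bi-directional syntactic-semantic links."""

    def __init__(self):
        # syntactic element -> set of linked semantic elements
        self.to_semantic = defaultdict(set)
        # semantic element -> set of linked syntactic elements
        self.to_syntactic = defaultdict(set)

    def link(self, syntactic_id, semantic_id):
        # One registration updates both directions of the link.
        self.to_semantic[syntactic_id].add(semantic_id)
        self.to_syntactic[semantic_id].add(syntactic_id)

links = SyntacticSemanticLinks()
links.link("region:pitcher-mound", "object:pitcher")  # hypothetical ids
```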
- the analytic/synthetic model DS specifies the projection/registration/conceptual correspondence between syntactic and semantic structure DSs.
- the media information DS and the meta information DS contain descriptors of the storage media and the author-generated information, respectively.
- the visualization DS contains a set of view DS to enable efficient visualization of a video program. It includes the following views: multi-resolution space-frequency thumbnail, key-frame, highlight, event, and alternate views. Each one of these views is independently defined.
- the Generic AV DS includes content elements (i.e. regions, objects, segments, and events) with associated features. It also includes entity-relation graphs to describe general relationships among content elements following the entity-relationship model.
- a drawback of the current DS is that the features and the relationships among elements can have a broad range of values, which reduces their usefulness and expressive power.
- a clear example is the semantic annotation feature in the object element.
- the value ofthe semantic annotation could be a generic ("Man"), a specific ("John Doe"), or an abstract (“Happiness”) concept.
- the initial goal of the development leading to the present invention was to define explicit entity-relationship structures for the Generic AV DS to address this drawback.
- the explicit entity-relationship structures would categorize the attributes and the relationships into relevant classes. During this process, especially during the generation of concrete examples (see the baseball example shown in Figure 6), it became apparent that the full specification of the Generic DS could be represented using an entity-relationship model.
- the entity-relation models provided in Figures 7-9 for the baseball example in Figure 6 include the functionality addressed by most of the components of the Generic AV DS (e.g. the event DS, the segment DS, the object DS, the region DS, the syntactic-semantic links DS, the segment/region relation graph DS, and the event/object relation graph DS) and more.
- the entity-relationship (E-R) model is a popular high-level conceptual data model, which is independent of the actual implementation as hierarchical, relational, or object-oriented models, among others.
- the current version of the Generic DS seems to be a mix of multiple conceptual and implementation data models: the entity-relationship model (e.g. segment/region relation graph), the hierarchical model (e.g. region DS, object DS, and syntactic-semantic links DS), and the object-oriented model (e.g. segment DS, visual segment DS, and audio segment DS).
- the current Generic AV DS separates the definition of "real objects" in the region and the object DSs, which may cause inefficient handling of the descriptions.
- the content elements, especially the object and the event, lack explicit and unified definitions in the Generic DS.
- the current Generic DS defines an object as having some semantic meaning and containing other objects.
- event/object relation graphs can describe general relationships among objects and events.
- objects are linked to corresponding regions in the syntactic DS by the syntactic-semantic links DS. Therefore, the object has a distributed definition across many components of the Generic Visual DS, which is less than clear.
- the definition of an event is very similar and equally vague.
Entity-Relationship Models For Generic AV DS
- the Entity-Relationship (E-R) model, first presented in P. P-S. Chen, "The Entity-Relationship Model - Toward a Unified View of Data", ACM Transactions on Database Systems, Vol. 1, No. 1, pp. 9-36, March 1976, describes data in terms of entities and their relationships. Both entities and relationships can be described by attributes.
- the basic components of the entity-relationship model are shown in Figure 1.
- the entity, the entity attribute, the relationship, and the relationship attribute correspond very closely to the noun (e.g. a boy and an apple), the adjective (e.g. young), the verb (e.g. eats), and the verb complement (e.g. slowly), which are essential components for describing general data.
- "A young boy eats an apple slowly", which could be the description of a video shot, is represented using an entity-relationship model in Figure 2. This modeling technique has been used to model the contents of pictures and their features for image retrieval.
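The mapping from sentence parts to model components can be sketched in code. This is an illustrative data structure only, assuming nothing beyond the entity/relationship/attribute vocabulary above; the class and attribute names are invented for the example.

```python
# Minimal entity-relationship sketch: entities and relationships,
# each carrying a dictionary of descriptive attributes.

class Entity:
    def __init__(self, name, **attributes):
        self.name = name                  # the noun, e.g. "boy"
        self.attributes = attributes      # entity attributes (adjectives)

class Relationship:
    def __init__(self, name, source, target, **attributes):
        self.name = name                  # the verb, e.g. "eats"
        self.source = source
        self.target = target
        self.attributes = attributes      # relationship attributes (verb complements)

# "A young boy eats an apple slowly" (Figure 2):
boy = Entity("boy", age="young")
apple = Entity("apple")
eats = Relationship("eats", boy, apple, manner="slowly")
```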
- An object of the present invention is to provide content description schemes for generic multimedia information. Another object of the present invention is to provide techniques for implementing standardized multimedia content description schemes.
- Still a further object of the present invention is to provide a technique for organizing content embedded in multimedia information based on the distinction of entity attributes into syntactic and semantic.
- Syntactic attributes can be categorized into different levels: type/technique, global distribution, local structure, and global composition.
- Semantic attributes can be categorized into different levels: generic object, generic scene, specific object, specific scene, abstract object, and abstract scene.
- Syntactic relationships can be categorized into spatial, temporal, and audio categories.
- Semantic relationships can be categorized into lexical and predicative categories. Spatial and temporal relationships can be topological or directional; audio relationships can be global, local, or composition; lexical relationships can be synonymy, antonymy, hyponymy/hypernymy, or meronymy/holonymy; and predicative relationships can be actions (events) or states.
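The attribute levels and relationship categories listed above can be collected into enumerations. This is a hypothetical sketch of how an implementation might encode the categorization of Figures 3 and 4; the enumeration names are invented here, not taken from any standard.

```python
from enum import Enum

class SyntacticLevel(Enum):
    """The four syntactic levels of the ten-level structure (Figure 3)."""
    TYPE_TECHNIQUE = 1
    GLOBAL_DISTRIBUTION = 2
    LOCAL_STRUCTURE = 3
    GLOBAL_COMPOSITION = 4

class SemanticLevel(Enum):
    """The six semantic levels of the ten-level structure (Figure 3)."""
    GENERIC_OBJECT = 5
    GENERIC_SCENE = 6
    SPECIFIC_OBJECT = 7
    SPECIFIC_SCENE = 8
    ABSTRACT_OBJECT = 9
    ABSTRACT_SCENE = 10

class SyntacticRelation(Enum):
    SPATIAL = "spatial"    # topological or directional
    TEMPORAL = "temporal"  # topological or directional
    AUDIO = "audio"        # global, local, or composition

class SemanticRelation(Enum):
    LEXICAL = "lexical"          # synonymy, antonymy, hyponymy/hypernymy, meronymy/holonymy
    PREDICATIVE = "predicative"  # actions (events) or states
```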
- a further object of the present invention is to describe each level, and entity relationships, in terms of video and audio signal classification.
- Another object of the present invention is to provide fundamental and explicit entity-relationship models to address these issues by indexing the content-element attributes, the relationships among content elements, and the content elements themselves.
- This work is based on the conceptual framework for indexing visual information presented in A. Jaimes and S.-F. Chang, "A Conceptual Framework for Indexing Visual Information at Multiple Levels", Submitted to Internet Imaging 2000, which has been adapted and extended for the Generic AV DS.
- the work in other references, e.g. S. Paek, A. B. Benitez, S.-F. Chang, C.-S. Li, J. R. Smith, L. D. Bergman, A. Puri, C. Swain, and J.
- the ten-level visual structure presented provides a systematic way of indexing images based on syntax (e.g., color, texture, etc.) and semantics (e.g., objects, events, etc.), and includes distinctions between general concept and visual concept.
- We define different types of relations, e.g. syntactic and semantic relations.
- a semantic information table summarizes important aspects related to an image (e.g., those that appear in the non-visual structure).
- the present invention proposes to index the attributes of the content elements based on the ten-level conceptual structure presented in A. Jaimes and S.-F. Chang, "A Conceptual Framework for Indexing Visual Information at Multiple Levels", Submitted to Internet Imaging 2000, which distinguishes the attributes based on syntax (e.g. color and texture) and semantics (e.g. semantic annotations) as shown in Figure 3.
- the first four levels of the visual structure refer to syntax, and the remaining six refer to semantics.
- the syntax levels are type/technique, global distribution, local structure, and global composition.
- the semantic levels are generic object, generic scene, specific object, specific scene, abstract object, and abstract scene.
- syntactic and semantic relationships as shown in Figure 4. Syntactic relationships are divided into spatial, temporal, and visual. Spatial and temporal relationships are classified into topological and directional classes. Syntactic-attribute relationships can be further indexed into global, local, and composition. Semantic relationships are divided into lexical and predicative.
- Lexical relationships are classified into synonymy, antonymy, hyponymy/hypernymy, and meronymy/holonymy. Predicative relationships can be further indexed into action (event) and state.
- Syntactic elements can be divided into region, animated-region, and segment elements; semantic elements can be indexed into object, animated-object, and event elements.
- Figure 1 is a generic Entity-Relationship (E-R) model
- Figure 2 provides an example of an entity-relation model for the scenario "A young boy eats an apple in 4 minutes”.
- Figure 3 represents the indexing visual structure by a pyramid
- Figure 4 shows relationships as proposed at different levels of the visual structure
- Figure 5 sets forth fundamental models of each proposed type of content element
- Figure 6 pictorially displays a baseball batting event image
- Figure 7 is a conceptual description of the Batting Event for the baseball batting event image displayed in Figure 6;
- Figure 8 is a conceptual description of the Hit and the Throw Events for the Batting Event of Figure 6;
- Figure 9 is a conceptual description of the Field Object for the Batting Event of Figure 6;
- Figure 10 conceptually represents analysis of non-visual information
- Figure 11 illustrates how visual and non-visual information may be used semantically to characterize an image or its parts.
- Figure 12 illustrates relationships at different levels of the audio structure. Elements within the syntactic levels are related according to syntactic relationships. Elements within the semantic levels are related according to syntactic and semantic relationships.
- entity-relationship models are the most widely used conceptual models. They provide a high degree of abstraction and are hardware and software independent. There exist specific procedures to transform these models into physical models for implementation, which are hardware and software dependent. Examples of physical models are the hierarchical model, the relational model, and the object-oriented model.
- the E-R conceptual framework in the context of MPEG-7 is discussed in J. R. Smith and C.-S. Li, "An E-R Conceptual Modeling Framework for MPEG-7", Contribution to ISO/IEC JTC1/SC29/WG11 MPEG99, Vancouver, Canada, July 1999.
- syntactic and semantic attributes can refer to several levels (the syntactic levels are type, global distribution, local structure, and global composition; the semantic levels are generic object/scene, specific object/scene, and abstract object/scene; see Figure 3).
- syntactic and semantic relationships can be further divided into sub-types referring to different levels (syntactic relationships are categorized into spatial, temporal, and visual relationships at generic and specific levels; semantic relationships are categorized into lexical and predicative; see Figure 4).
- An important difference from the Generic AV DS is that our semantic elements include not only semantic attributes but also syntactic attributes. Therefore, if an application would rather not distinguish between syntactic and semantic elements, it can avoid the distinction by implementing all the elements as semantic elements.
- Figure 6 shows a video shot of a baseball game represented as a Batting Event and a Batting Segment (segment and event as defined in the Generic AV DS).
- Figure 7 includes a possible description of the Batting Event as composed of a Field Object, a Hit Event, a Throw Event, a temporal relationship "Before” between the Throw and the Hit Events, and some visual attributes.
- Figure 8 presents descriptions of the Throw and the Hit Events and relationships among them.
- the Throw Event is the action that the Pitcher Object executes over a Ball Object towards the Batter Object, "Throws". We provide some semantic attributes for the Pitcher Object.
- the Hit Event is the action that the Batter Object executes over the same Ball Object, "Hit".
- Figure 9 shows the decomposition of the Field Object into three different regions, one of which is related to the Pitcher Object by the spatial relationship "On top of". Some visual attributes for one of these regions are provided.
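The baseball description of Figures 7-9 can be sketched as an entity-relation graph in which nodes are content elements and edges are typed relationships. This is a hypothetical illustration under the categorization above; the attribute values are taken from the figures where stated, and the region identifier is invented.

```python
# Entity-relation graph sketch for the Batting Event (Figures 7-9).
# Entities map names to attribute dictionaries; relations are
# (source, relation name, relation type, target) tuples.

entities = {
    "BattingEvent":  {"abstract_scene": "Good strategy"},
    "FieldObject":   {},
    "ThrowEvent":    {},
    "HitEvent":      {},
    "PitcherObject": {"abstract": "Speed"},
    "BatterObject":  {},
    "BallObject":    {},
    "FieldRegion1":  {},  # hypothetical id for one of the Field regions
}

relations = [
    ("BattingEvent",  "composed of", "semantic",    "FieldObject"),
    ("BattingEvent",  "composed of", "semantic",    "ThrowEvent"),
    ("BattingEvent",  "composed of", "semantic",    "HitEvent"),
    ("ThrowEvent",    "before",      "temporal",    "HitEvent"),
    ("PitcherObject", "throws",      "predicative", "BallObject"),
    ("BatterObject",  "hits",        "predicative", "BallObject"),
    ("PitcherObject", "on top of",   "spatial",     "FieldRegion1"),
]
```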
- the proposed visual structure contains ten levels: the first four refer to syntax, and the remaining six refer to semantics.
- An overview of the visual structure is given in Figure 3.
- the width of each level is an indication of the amount of knowledge required there.
- the indexing cost of an attribute can be included as a sub-attribute of the attribute.
- the syntax levels are type/technique, global distribution, local structure, and global composition.
- the semantic levels are generic object, generic scene, specific object, specific scene, abstract object, and abstract scene. While some of these divisions may not be strict, they should be considered because they have a direct impact on understanding what the user is searching for and how he tries to find it in a database. They also emphasize the limitations of different indexing techniques (manual and automatic) in terms of the knowledge required.
- the indexing visual structure is represented by a pyramid. It is clear that the lower the level in the pyramid, the more knowledge and information is required to perform the indexing there.
- the width of each level is an indication of the amount of knowledge required - for example, more information is needed to name specific objects in the same scene.
- the syntactic attribute includes an enumerated attribute, level, whose value is its corresponding syntactic level in the visual structure (Figure 3) - i.e. type, global distribution, local structure, or global composition - or "not specified".
- the semantic attributes also include an enumerated attribute, level, whose value is its corresponding semantic level in the semantic structure (Figure 3) - i.e. generic object, generic scene, specific object, specific scene, abstract object, or abstract scene - or "not specified".
- Another possibility for modeling the different types of syntactic and semantic attributes would be to subclass the syntactic and the semantic attribute elements to create type, global distribution, local structure, and global composition syntactic attributes, or generic object, generic scene, specific object, specific scene, abstract object, and abstract scene semantic attributes, respectively (some of these types do not apply to all of object, animated object, and event).
- the type/technique in the previous level gives general information about the visual characteristics of the image or the video sequence, but gives little information about the visual content.
- Global distribution aims to classify images or video sequences based on their global content and is measured in terms of low-level perceptual features such as spectral sensitivity (color) and frequency sensitivity (texture).
- global distribution features may include global color (e.g., dominant color, average, histogram), global texture (e.g., coarseness, directionality, contrast), global shape (e.g. aspect ratio), global motion (e.g. speed and acceleration), camera motion, global deformation (e.g. growing speed), and temporal/spatial dimensions (e.g. spatial area and temporal dimension).
- the Local Structure level is concerned with the extraction and characterization of the components. At the most basic level, those components result from low-level processing and include elements such as the Dot, Line, Tone, Color, and Texture.
- a binary shape mask describes the Batting Segment in Figure 6 (see Figure 7).
- Other examples of local structure attributes are temporal/spatial position (e.g. start time and centroid), local color (e.g. MxN Layout), local motion, local deformation, and local shape/2D geometry (e.g. bounding box).
- Such elements have also been used in content-based retrieval systems, mainly in query-by-user-sketch interfaces such as VisualSEEk.
- the concern here is not with objects, but rather with the basic elements that represent them and with combinations of such elements: a square, for example, is formed by four lines.
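Two of the local-structure attributes mentioned above, the bounding box and the centroid, can be derived directly from a region's pixels. This is a hypothetical sketch assuming a region is given as a list of (x, y) coordinates; it is illustrative, not any standardized descriptor extraction.

```python
# Local-structure attribute sketch: bounding box and centroid
# of a region given as a list of (x, y) pixel coordinates.

def bounding_box(pixels):
    """Return (x_min, y_min, x_max, y_max) enclosing the region."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return (min(xs), min(ys), max(xs), max(ys))

def centroid(pixels):
    """Return the mean (x, y) position of the region's pixels."""
    n = len(pixels)
    return (sum(x for x, _ in pixels) / n, sum(y for _, y in pixels) / n)

mask = [(2, 3), (3, 3), (2, 4), (3, 4)]  # a small 2x2 region
print(bounding_box(mask))  # (2, 3, 3, 4)
print(centroid(mask))      # (2.5, 3.5)
```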
- Specific Objects refer to identified and named objects. Specific knowledge of the objects in the image or the video sequence is required, and such knowledge is usually objective since it relies on known facts. Examples include individual persons (e.g., the semantic annotation "Peter Who, Player #3 of the Yankees" in Figure 6) or objects (e.g. the stadium name).
- This level is analogous to Generic Scene with the difference that here there is specific knowledge about the scene. While different objects in the visual material may contribute in different ways to determine the specific scene depicted, a single object is sometimes enough.
- a picture that clearly shows the White House, for example, can be classified as a scene ofthe White House, based only on that object.
- a specific-scene attribute with the value "Bat by player #32 Yankees" is specified.
- This indexing level is the most difficult one in the sense that it is completely subjective, and assessments by different users may vary greatly. The importance of this level was shown in experiments where viewers used abstract attributes to describe images. For example, a woman in a picture may represent anger to one observer and pensiveness to another. For the Pitcher Object in Figure 8, an abstract-object attribute with value "Speed" is specified.
- the Abstract Scene level refers to what the image as a whole represents. It may be very subjective. Users sometimes describe images in abstract terms such as sadness, happiness, power, heaven, and compassion, as for objects. For the Batting Event in Figure 7, an abstract-scene attribute with value "Good strategy" is specified.
Types of Relationships
- Relationships at the syntactic levels of the visual structure can only occur in 2D space because there is no knowledge of objects at these levels to determine 3D relationships.
- at the syntactic levels there can only be syntactic relationships, i.e. spatial (e.g. "Next to"), temporal (e.g. "In parallel"), and visual (e.g. "Darker than") relationships, which are based solely on syntactic knowledge.
- Spatial and temporal relationships are classified into topological and directional classes.
- Visual relationships can be further indexed into global, local, and composition.
- At the semantic levels, relationships among content elements can occur in 3D. As shown in Figure 4, elements within these levels can be associated not only with semantic relationships but also with syntactic relationships. Spatial relationships are classified into two classes: topological (i.e., how the boundaries of elements relate) and orientation or directional (i.e., where the elements are placed relative to each other) (see Table 1). Examples of topological relationships are "To be near to", "To be within", and "To be adjacent to"; examples of directional relationships are "To be in front of", "To be to the left of", and "To be on top of".
- Well-known spatial relationship graphs are 2D Strings, R2, and Attributed-Relational Graphs.
- In a similar fashion, we classify the temporal relationships into topological and directional classes (see Table 1). Examples of temporal topological relationships are "To happen in parallel", "To overlap", and "To happen within"; examples of directional temporal relationships are "To happen before" and "To happen after".
- The parallel and sequential relationships of SMIL (World Wide Web Consortium, SMIL web site, http://www.w3.org/AudioVideo/#SMIL) are examples of temporal topological relationships.
- Visual relationships relate elements based on their visual attributes or features. These relationships can be indexed into global, local, and composition classes (see Table 1).
- A visual global relationship could be "To be smoother than" (based on a global texture feature).
- A visual local relationship could be "To accelerate faster" (based on a motion feature).
- A visual composition relationship could be "To be more symmetric than" (based on a 2D geometry feature).
- Visual relationships can be used to cluster video shot/key frames based on any combination of visual features: color, texture, 2D geometry, time, motion, deformation, and camera motion.
- Table 1 Indexing structure for syntactic relationships and examples.
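The type/class indexing structure of Table 1 can be sketched as a small lookup; the relation strings below are the examples quoted in the text, and the function name is an illustrative assumption, not part of any standard:

```python
# Sketch of the Table 1 indexing structure for syntactic relationships:
# the nesting encodes the relationship type -> class hierarchy described above.
SYNTACTIC_RELATIONSHIPS = {
    "spatial": {
        "topological": ["near to", "within", "adjacent to"],
        "directional": ["in front of", "to the left of", "on top of"],
    },
    "temporal": {
        "topological": ["in parallel", "overlap", "within"],
        "directional": ["before", "after"],
    },
    "visual": {
        "global": ["smoother than"],             # global texture feature
        "local": ["accelerate faster"],          # motion feature
        "composition": ["more symmetric than"],  # 2D geometry feature
    },
}

def index_relation(relation):
    """Return the (type, class) pair under which a relation string is indexed."""
    for rel_type, classes in SYNTACTIC_RELATIONSHIPS.items():
        for rel_class, examples in classes.items():
            if relation in examples:
                return (rel_type, rel_class)
    return None
```

For example, `index_relation("before")` yields the temporal, directional entry of the table.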
- Figure 7 shows how the Batting Event is defined by its composing elements (i.e., the Batting Segment, the Field Object, the Hit Event, and the Throw Event) and the relationships among them (i.e., the temporal relationship "Before" from the Hit Event to the Throw Event).
- The Batting Event and its composing elements are associated by a spatial-temporal relationship "Composed of".
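The composition just described can be sketched as a small typed entity-relationship graph. The data model below is a hypothetical illustration (element and relationship names are taken from Figure 7 as quoted above; the class structure is not a normative API):

```python
# Minimal typed graph for the Batting Event description of Figure 7:
# nodes are content elements, edges are labeled, typed relationships.
from dataclasses import dataclass, field

@dataclass
class Element:
    name: str
    kind: str  # e.g. "event", "segment", "object"

@dataclass
class Relationship:
    source: str
    label: str     # e.g. "Before", "Composed of"
    rel_type: str  # e.g. "temporal", "spatial-temporal"
    target: str

@dataclass
class DescriptionGraph:
    elements: dict = field(default_factory=dict)
    relationships: list = field(default_factory=list)

    def add_element(self, name, kind):
        self.elements[name] = Element(name, kind)

    def relate(self, source, label, rel_type, target):
        self.relationships.append(Relationship(source, label, rel_type, target))

    def relations_from(self, name):
        return [(r.label, r.target) for r in self.relationships if r.source == name]

# Build the Batting Event description from Figure 7.
g = DescriptionGraph()
for name, kind in [("Batting Event", "event"), ("Batting Segment", "segment"),
                   ("Field Object", "object"), ("Hit Event", "event"),
                   ("Throw Event", "event")]:
    g.add_element(name, kind)
g.relate("Batting Event", "Composed of", "spatial-temporal", "Field Object")
g.relate("Batting Event", "Composed of", "spatial-temporal", "Hit Event")
g.relate("Batting Event", "Composed of", "spatial-temporal", "Throw Event")
g.relate("Hit Event", "Before", "temporal", "Throw Event")
```

Queries over the graph then recover the stated relationships, e.g. the "Before" edge from the Hit Event to the Throw Event.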
- Semantic relationships can only occur among content elements at the semantic levels of the ten-level conceptual structure. We divide the semantic relationships into lexical semantic and predicative relationships. Table 2 summarizes the semantic relationships, including examples.
- Table 2 Indexing structure for semantic relationships and examples.
- The lexical semantic relationships correspond to the semantic relationships among nouns used in WordNet. These relationships are synonymy (pipe is similar to tube), antonymy (happy is opposite to sad), hyponymy (a dog is an animal), hypernymy (animal is a generalization of dog), meronymy (a musician is a member of a musical band), and holonymy (a musical band is composed of musicians).
- The predicative semantic relationships refer to actions (events) or states among two or more elements. Examples of action relationships are "To throw" and "To hit". Examples of state relationships are "To belong" and "To own". Figure 8 includes two action relationships: "Throw" and "Hit". Instead of only dividing the predicative semantic relationships into actions and states, we could use the partial relational semantic decomposition used in WordNet. WordNet divides verbs into fifteen (15) semantic domains: verbs of bodily care and functions, change, cognition, communication, competition, consumption, contact, creation, emotion, motion, perception, possession, social interaction, and weather verbs. Only those domains that are relevant for the description of visual concepts could be used.
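The lexical hyponymy/hypernymy relationships mentioned above can be sketched as walks up a hypernym chain, in the spirit of WordNet's noun relations. The taxonomy fragment below is a hypothetical illustration built from the examples in the text, not actual WordNet data:

```python
# Toy hypernym chain: each term maps to its immediate hypernym
# (the fragment below is illustrative, not real WordNet data).
TAXONOMY = {"dog": "animal", "animal": "organism", "musician": "person"}

def is_hyponym(term, ancestor):
    """True if `ancestor` is reachable from `term` by following hypernym links."""
    while term in TAXONOMY:
        term = TAXONOMY[term]
        if term == ancestor:
            return True
    return False
```

So "a dog is an animal" (hyponymy) holds transitively along the chain, while the reverse direction (hypernymy) does not.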
- Figure 8 shows the use of semantic relationships to describe the actions of two objects: the Pitcher Object "Throws” the Ball Object at the Batter Object and the Batter Object "Hits” the Ball Object.
- Semantic elements can have not only semantic attributes and relationships but also syntactic attributes and relationships (e.g., an object can be described by color histogram and semantic annotation descriptors).
- Our approach differs from the current Generic AV DS in that our semantic (or high-level) elements include both syntactic and semantic information, resolving the rigid separation of the syntactic and semantic structures.
- As shown in Figure 5, we further classify the syntactic elements into region, animated-region, and segment elements. In a similar way, the semantic elements are classified into the following semantic classes: object, animated object, and event. Region and object are spatial entities. Segment and event are temporal entities. Finally, animated region and animated object are hybrid spatial-temporal entities. We explain each type below.
- A syntactic element is a content element in image or video data that is described only by syntactic attributes, i.e., type, global distribution, local structure, or global composition attributes (see Figure 5).
- Syntactic elements can only be related to other elements by visual relationships. We further categorize the syntactic elements into region, animated-region, and segment elements. These elements are derived from the syntactic element through inheritance relationships.
- the region element is a pure spatial entity that refers to an arbitrary, continuous or discontinuous section of an image or a video frame.
- A region is defined by a set of syntactic attributes and a graph of regions that are related by spatial and visual relationships (see Figure 5). It is important to point out that the composition relationship is of the spatial, topological type. Possible attributes of regions are color, texture, and 2D geometry.
- The segment element is a pure temporal entity that refers to an arbitrary set of contiguous or non-contiguous frames of a video sequence.
- a segment is defined by a set of syntactic features, and a graph of segments, animated regions, and regions that are related by temporal and visual relationships (see Figure 5).
- The composition relationship is of the temporal, topological type. Possible attributes of segments are camera motion and the syntactic features.
- The Batting Segment in Figure 7 is a segment element that is described by temporal duration (global distribution, syntactic) and shape mask (local structure, syntactic) attributes. This segment has a "Consist of" relationship with the Batting Event (spatial-temporal relationship, syntactic).
- The animated-region element is a hybrid spatial-temporal entity that refers to an arbitrary section of an arbitrary set of frames of a video sequence.
- An animated region is defined by a set of syntactic features, and a graph of animated regions and regions that are related by composition, spatial-temporal, and visual relationships (see Figure 5).
- Animated regions may contain any features from the region and the segment element.
- the animated region is a segment and a region at the same time.
- The Pitcher Region in Figure 8 is an animated region that is described by aspect ratio (global distribution, syntactic), shape mask (local structure, syntactic), and symmetry (global composition, syntactic) attributes. This animated region is "On top of" the Sand 3 Region (spatial-temporal relationship, syntactic). Semantic Entities
- The semantic element is a content element that is described not only by semantic features but also by syntactic features. Semantic elements can be related to other elements by semantic and visual relationships (see Figure 5). Therefore, we derive the semantic element from the syntactic element using inheritance. We further categorize the semantic elements into object, animated-object, and event elements. Pure semantic attributes are annotations, which are usually in text format (e.g., 6-W semantic annotations, free-text annotations).
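The inheritance structure just described (semantic elements derived from the syntactic element, as in Figure 5) can be sketched as a class hierarchy. The sketch below is an illustrative assumption: the class names follow the text, but the attribute handling is not a normative API:

```python
# Sketch of the element hierarchy of Figure 5: semantic elements inherit
# from the syntactic element, so they carry syntactic attributes too.
class SyntacticElement:
    """Content element described only by syntactic attributes."""
    def __init__(self, syntactic_attributes=None):
        self.syntactic_attributes = syntactic_attributes or {}

class Region(SyntacticElement):          # pure spatial entity
    pass

class Segment(SyntacticElement):         # pure temporal entity
    pass

class AnimatedRegion(Region, Segment):   # hybrid spatial-temporal entity
    pass

class SemanticElement(SyntacticElement):
    """Adds semantic annotations on top of the syntactic attributes."""
    def __init__(self, syntactic_attributes=None, annotations=None):
        super().__init__(syntactic_attributes)
        self.annotations = annotations or {}

class Object(SemanticElement, Region):   # semantic + spatial
    pass

class Event(SemanticElement, Segment):   # semantic + temporal
    pass

class AnimatedObject(Object, Event):     # an event and an object at the same time
    pass
```

An `AnimatedObject` instance then satisfies `isinstance` checks for both `Event` and `Region`, mirroring the statement that the animated object is an event and an object at the same time.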
- The object element is a semantic and spatial entity; it refers to an arbitrary section of an image or a frame of a video.
- An object is defined by a set of syntactic and semantic features, and a graph of objects and regions that are related by spatial (composition is a spatial relationship), visual, and semantic relationships (see Figure 5).
- the object is a region.
- The event element is a semantic and temporal entity; it refers to an arbitrary section of a video sequence.
- An event is defined by a set of syntactic and semantic features, and a graph of events, segments, animated regions, animated objects, regions, and objects that are related by temporal (composition is a temporal relationship), visual, and semantic relationships.
- the event is a segment with semantic attributes and relationships.
- The Batting Event in Figure 7 is an event element that is described by "Batting" (generic scene, semantic), "Bat by player #32, Yankees" (specific scene, semantic), and "Good Strategy" (abstract scene, semantic) attributes.
- The syntactic attributes of the Batting Segment can apply to the Batting Event (i.e., we could have avoided distinguishing between the Batting Event and the Batting Segment, and could have assigned the syntactic attributes of the Batting Segment to the Batting Event).
- The Batting Event is composed of the Field Object and the Throwing and Hitting Events, which represent the two main actions in the Batting Event (i.e., throwing and hitting the ball).
- the Throwing and the Hitting Events are related by a "Before" relationship (temporal relationship, syntactic).
- The animated-object element is a semantic and spatial-temporal entity; it refers to an arbitrary section in an arbitrary set of frames of a video sequence.
- An animated object is defined by a set of syntactic and semantic features, and a graph of animated objects, animated regions, regions, and objects that are related by composition, spatial-temporal, visual, and semantic relationships (see Figure 5).
- the animated object is an event and an object at the same time.
- The Pitcher Object in Figure 8 is an animated object that is described by "Man" (generic object, semantic), "Player #3, Yankees" (specific object, semantic), and "Speed" (abstract object, semantic) attributes.
- This animated object is "On top of" the Sand 3 Region shown in Figure 9 (spatial-temporal relationship, syntactic).
- The syntactic features of the Pitcher Region may apply to the Pitcher Object.
- Figure 5 provides fundamental models of each proposed type of content element. Attributes, elements, and relationships are categorized into two classes: syntactic and semantic. The semantic and syntactic attributes have an associated attribute, level, whose value corresponds to the level of the visual structure that they refer to. Syntactic elements are further divided into region, segment, and animated-region classes. Semantic elements are categorized into object, animated-object, and event classes.
- Figure 6 depicts an exemplary baseball batting event.
- Figure 7 provides a conceptual description of the Batting Event for the exemplary baseball batting event in Figure 6 in accordance with the present invention.
- Figure 8 provides a conceptual description of the Hit and the Throw Events for the Batting Event in Figure 6 in accordance with the present invention.
- Figure 9 provides a conceptual description of the Field Object for the Batting Event in Figure 6 in accordance with the present invention.
- the present invention may also be illustrated in connection with a discussion of percept and concept in analysis and classification of characteristics of images.
- One of the difficulties inherent in the indexing of images is the number of ways in which they can be analyzed.
- a single image may represent many things, not only because it contains a lot of information, but because what we see in the image can be mapped to a large number of abstract concepts.
- A distinction between those possible abstract descriptions and more concrete descriptions based only on the visual aspects of the image constitutes an important step in indexing.
- Images are multi-dimensional representations of information, but at the most basic level they simply cause a response to light (tonal-light or absence of light). At the most complex level, however, images represent abstract ideas that largely depend on each individual's knowledge, experience, and even particular mood. We can make distinctions between percept and concept.
- The percept refers to what our senses perceive; in the visual system, it is light. These patterns of light produce the perception of different elements such as texture and color. No interpretation process takes place when we refer to the percept; no knowledge is required.
- A concept refers to an abstract or generic idea generalized from particular instances. As such, it implies the use of background knowledge and an inherent interpretation of what is perceived. Concepts can be very abstract in the sense that they depend on an individual's knowledge and interpretation, which tends to be very subjective. Syntax and Semantics
- Syntax refers to the way visual elements are arranged without considering the meaning of such arrangements. Semantics, on the other hand, deals with the meaning of those elements and of their arrangements. As will be shown in the discussion that follows, syntax can refer to several perceptual levels, from simple global color and texture to local geometric forms such as lines and circles. Semantics can also be treated at different levels. General vs. Visual Concepts
- the first step in creating a conceptual indexing structure is to make a distinction between visual and non-visual content.
- The visual content of an image corresponds to what is directly perceived when the image is observed (i.e., descriptors stimulated directly by the visual content of the image or video in question: the lines, shapes, colors, objects, etc.).
- the non-visual content corresponds to information that is closely related to the image, but that is not explicitly given by its appearance. In a painting, for example, the price, current owner, etc. belong to the non-visual category.
- Each of the levels of analysis that follows is obtained only from the image.
- the viewer's knowledge always plays a role, but the general rule here is that information not explicitly obtained from the image does not go into this category (e.g., the price of a painting would not be part of visual content).
- any descriptors used for visual content are stimulated by the visual content ofthe image or video in question
- Our visual structure contains ten levels: the first four refer to syntax, and the remaining six refer to semantics.
- Levels one to four are directly related to percept, and levels five through ten to visual concept. While some of these divisions may not be strict, they should be considered because they have a direct impact on understanding what the user is searching for and how the user tries to find it in a database.
- The two main categories could be color and grayscale, with additional categories/descriptions that affect general visual characteristics. These could include number of colors, compression scheme, resolution, etc. We note that some of these may have some overlap with the non-visual indexing aspects described herein.
- Global Distribution aims to classify images or video sequences based on their global content and is measured in terms of low-level perceptual features such as spectral sensitivity (color) and frequency sensitivity (texture). Individual components of the content are not processed at this level (i.e., no "form" is given to these distributions in the sense that the measures are taken globally).
- Global distribution features may include global color (e.g., dominant color, average, histogram), global texture (e.g., coarseness, directionality, contrast), global shape (e.g. aspect ratio), global motion (e.g. speed, acceleration, and trajectory), camera motion, global deformation (e.g. growing speed), and temporal/spatial dimensions (e.g. spatial area and temporal dimension), among others.
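As a toy illustration of global-distribution attributes, the sketch below computes a histogram, the dominant value, and the average over a flat list of pixel intensities (a hypothetical stand-in for a real color space; the function name is an assumption):

```python
# Toy global-distribution attributes: the measures are taken over the whole
# signal, with no "form" given to individual components.
def global_distribution(pixels):
    histogram = {}
    for p in pixels:
        histogram[p] = histogram.get(p, 0) + 1
    dominant = max(histogram, key=histogram.get)  # most frequent value
    average = sum(pixels) / len(pixels)
    return {"histogram": histogram, "dominant": dominant, "average": average}
```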
- The Local Structure level is concerned with the extraction and characterization of the image's components. At the most basic level, those components result from low-level processing and include elements such as the dot, line, tone, color, and texture. In the Visual Literacy literature, some of these are referred to as the "basic elements" of visual communication and are regarded as the basic syntax symbols. Other examples of local structure attributes are temporal/spatial position (e.g., start time and centroid), local color (e.g., MxN layout), local motion, local deformation, and local shape/2D geometry (e.g., bounding box). There are various images in which attributes of this type may be of importance.
- Global Composition refers to the arrangement or spatial layout of elements in the image.
- Traditional analysis in art describes composition concepts such as balance, symmetry, center of interest (e.g., center of attention or focus), leading line, viewing angle, etc.
- At this level there is no knowledge of specific objects; only basic elements (i.e., dot, line, etc.) or groups of basic elements are considered.
- the view of an image is simplified to an image that contains only basic syntax symbols: an image is represented by a structured set of lines, circles, squares, etc.
- Specific Objects refers to objects that can be identified and named. Shatford refers to this level as "specific of". Specific knowledge of the objects in the image is required, and such knowledge is usually objective since it relies on known facts. Examples include individual persons and objects.
- This level is analogous to General Scene with the difference that here there is specific knowledge about the scene. While different objects in the image may contribute in different ways to determine that the image depicts a specific scene, a single object is sometimes enough. A picture that clearly shows the Eiffel Tower, for example, can be classified as a scene of Paris, based only on that object.
- the Abstract Scene level refers to what the image as a whole represents. It may be very subjective. Users sometimes describe images in affective (e.g. emotion) or abstract (e.g. atmosphere, theme) terms. Other examples at the abstract scene level include sadness, happiness, power, heaven, and compassion. Relationships across levels
- This structure accommodates relations between image elements at different levels and is based on the visual structure presented earlier. We note that relations at some levels are most useful when applied between entities to which the structure is applied (e.g., scenes from different images may be compared). Elements within each level are related according to two types of relations: syntactic and semantic (the latter only for levels 5 through 10). For example, two circles (local structure) can be related spatially (e.g., next to), temporally (e.g., before), and/or visually (e.g., darker than). Elements at the semantic levels (e.g., objects) can have syntactic and semantic relations (e.g., two people are next to each other, and they are friends).
- each relation can be described at different levels (generic, specific, and abstract).
- Relations between levels 1, 6, 8, and 10 can be most useful between entities represented by the structure (e.g., between images, between parts of images, scenes, etc.).
- The visual structure may be divided into syntax/percept and visual concept/semantics. To represent relations, we observe such division and take into consideration the following: (1) knowledge of an object embodies knowledge of the object's spatial dimensions, that is, of the gradable characteristics of its typical, possible, or actual extension in space; (2) knowledge of space implies the availability of some system of axes which determines the designation of certain dimensions of, and distances between, objects in space.
- Syntactic relations can occur between elements at any of the levels, but semantic relations occur only between elements of levels 5 through 10. Semantic relationships between different colors in a painting, for example, could be determined (e.g., the combination of colors is warm), but we do not include these at that level of our model.
- We classify spatial relationships into the following classes: (1) topological (i.e., how the boundaries of elements relate) and (2) orientation (i.e., where the elements are placed relative to each other). Topological relations include near, far, touching, etc., and orientation relations include diagonal to, in front of, etc.
- Temporal relations refer to those that connect elements with respect to time (e.g., in video these include before, after, between, etc.), and visual relations refer only to visual features (e.g., bluer, darker, etc.). Semantic relations are associated with meaning (e.g., owner of, friend of, etc.).
- relations can be defined at different levels.
- Syntactic relations can be generic (e.g., near) or specific (e.g., a numerical distance measure).
- Semantic relationships can be generic, specific, or abstract.
- Spatial global distribution could be represented by a distance histogram, local structure by relations between local components (e.g., distance between visual literacy elements), and global composition by global relations between visual literacy elements.
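The "distance histogram" representation of spatial global distribution can be sketched as follows; the binning scheme and function name are illustrative assumptions:

```python
# Hypothetical sketch: represent the spatial global distribution of an image
# by a histogram of pairwise distances between component centroids.
import math

def distance_histogram(centroids, bin_width=1.0):
    """centroids: list of (x, y) points; returns {bin_index: count}."""
    bins = {}
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            (x1, y1), (x2, y2) = centroids[i], centroids[j]
            b = int(math.hypot(x2 - x1, y2 - y1) // bin_width)
            bins[b] = bins.get(b, 0) + 1
    return bins
```

Two images with similar histograms then have similarly spread-out components, without any knowledge of what the components are.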
- Non-visual information refers to information that is not directly part of the image, but is rather associated with it in some way.
- We may divide attributes into biographical and relationship attributes. While it is possible for non-visual information to consist of sound, text, hyperlinked text, etc., our goal here is to present a simple structure that gives general guidelines for indexing. We will focus briefly on text information only. Figure 10 gives an overview of this structure.
- the source for the actual image may be direct (e.g., a photograph of a natural scene) or indirect (e.g., image of a sculpture, painting, building, drawing).
- Biographical Information is not directly related to the subject of the image, but rather to the image as a whole. Examples include the author, date, title, material, technique, etc.
- The second class of non-visual information is directly linked to the image in some way.
- Associated Information may include a caption, article, a sound recording, etc.
- This information helps perform some of the indexing in the visual structure, since it may contain specific information about what is depicted in the image (i.e., the subject). In that context, it is usually very helpful at the semantic levels, since those levels require more knowledge than is often present in the image alone. In some cases, however, the information is not directly related to the subject of the image, but is associated with the image in some way.
- A sound recording accompanying a portrait may include sounds that have nothing to do with the person being depicted; they are associated with the image, though, and could be indexed if desired.
- Physical Attributes simply refer to those that have to do with the image as a physical object. This may include the location of the image, the location of the original source, storage (e.g., size, compression), etc.
- A Semantic Information Table may be used to gather high-level information about the image (see Figure 11).
- The table can be used for individual objects, groups of objects, the entire scene, or parts of the image.
- Both visual and non-visual information contribute to filling in the table: simple scene classes such as indoor/outdoor may not be easily determined from the visual content alone; location may not be apparent from the image; etc.
- Individual objects can be classified and named based on the non-visual information, contributing to the mapping between visual object and conceptual object.
- visual and non-visual information can be used to semantically characterize an image or its parts.
- the way in which these two modalities contribute to answer the questions in the semantic table may vary depending on the content.
- The table helps answer questions such as: What is the subject (person/object, etc.)? What is the subject doing? Where is the subject? When? How? Why?
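A minimal sketch of such a table follows; the field names are hypothetical, derived from the questions listed above rather than from Figure 11 itself. The helper reports which questions still need visual or non-visual information to be filled in:

```python
# Toy Semantic Information Table: one row per object, group, scene, or
# image part ("scope"); unanswered questions remain None.
QUESTIONS = ("who", "what_action", "where", "when", "how", "why")

def semantic_table(scope, **answers):
    """Build a table row; unrecognized questions are simply left unanswered."""
    table = {"scope": scope}
    for q in QUESTIONS:
        table[q] = answers.get(q)
    return table

def unanswered(table):
    """List the questions that neither modality has answered yet."""
    return [q for q in QUESTIONS if table[q] is None]
```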
- Categorization can be defined as treating a group of entities as equivalent.
- A category is any of several fundamental and distinct classes to which entities or concepts belong; entities within categories appear more similar, and entities between categories appear less similar.
- We distinguish Sensory Perception categories (e.g., texture, color, or speech sounds) from Generic Knowledge (GK) categories. In our structure we can identify Sensory Perception categories such as color and texture. GK categories, however, play a very important role, since users are mainly interested in the objects that appear in the images and in what those objects may represent. Some theories in cognitive psychology express that classification into GK categories is done as follows:
- Rules: attribute values of the entity are used (e.g., rule: an image in the people category should have a person in it).
- Prototypes: a prototype of the category contains the characteristic attributes of its category's exemplars. These are attributes that are highly probable across category members, but are neither necessary nor sufficient for category membership. A new image is classified according to how similar it is to the category's prototype (e.g., a prototype for the landscape class could be a simple sketch of a sunset).
- Exemplars: an instance is classified according to its most similar exemplar's category (e.g., instead of having a rule for the people category, we could have a set of example images in that class and use those for classification).
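Exemplar-based categorization can be sketched as a nearest-exemplar lookup. The feature vectors and the squared-distance similarity below are toy assumptions; a real system would compare visual features of the images:

```python
# Toy exemplar-based categorization: assign an instance the category of its
# most similar stored exemplar (smallest squared distance).
def classify_by_exemplars(instance, exemplars):
    """exemplars: list of (feature_vector, category) pairs."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best_vector, best_category = min(exemplars,
                                     key=lambda e: sqdist(instance, e[0]))
    return best_category
```

Note the contrast with the rule-based scheme above: no explicit rule is written down, only stored examples per category.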
- Category structure is a crucial factor in a digital library and brings about several issues of importance which we briefly discuss here.
- Visual objects are being automatically detected using the Visual Apprentice.
- Visual object detectors are built by defining an object definition hierarchy (i.e., specifying the model of an object and its parts) and providing the system with examples. Multiple classifiers are learned automatically by the system at different levels of the hierarchy (region, perceptual, object-part, and object), and the best classifiers are automatically selected and combined when performing automatic classification.
- the structures are intuitive and highly functional and stress the need, requirements, and limitations of different indexing techniques (manual and automatic).
- The indexing cost (computational or in terms of human effort) for an audio segment, for example, is generally higher at the lower levels of the pyramid: automatically determining the type of content (music vs. voice) vs. recognizing generic objects (e.g., the voice of a man) vs. recognizing specific objects (e.g., the voice of Bill Clinton). This also implies that more information/knowledge is required at the lower levels.
- the proposed audio structure contains ten levels: the first four refer to syntax, and the remaining six refer to semantics.
- An overview for the audio structure can be drawn from Figure 3.
- The width of each level is an indication of the amount of knowledge/information required.
- the syntax levels are type/technique, global distribution, local structure, and global composition.
- the semantic levels are generic object, generic scene, specific object, specific scene, abstract object, and abstract scene.
- the syntax levels classify syntactic descriptors, that is, those that describe the content in terms of low-level features. In the visual structure, these referred to the colors and textures present in the image. In the audio structure of this document, they refer to the low-level features of the audio signal (whether it is music, voice, etc.). Examples include the fundamental frequency, harmonic peaks, etc.
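As a concrete (though simplified) example of such a low-level audio feature, the sketch below computes the zero-crossing rate of a sample sequence. Note this is a swapped-in, commonly used descriptor chosen for brevity; the document's own examples are fundamental frequency and harmonic peaks:

```python
# Zero-crossing rate: a toy syntactic (low-level) audio descriptor, often
# used to separate voiced, tonal content from noise-like content.
def zero_crossing_rate(samples):
    """Fraction of consecutive sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(samples) - 1)
```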
- The semantic levels of the visual structure classified attributes related to objects and scenes.
- The semantic levels in the audio structure are analogous, except that the classification is based on the attributes extracted from the audio signal itself: objects (e.g., the voice of a man, the sound of a trumpet, etc.) and scenes (e.g., street noise, opera, etc.).
- Attributes that describe the global content of audio measured in terms of low-level features.
- the attributes at this level are global because they are not concerned with individual components of the signal, but rather with a global description.
- A signal can be described as being Gaussian noise; such a description is global because it doesn't say anything about the local components (e.g., what elements or low-level features describe the noise signal).
- attributes here are meant to describe the local structure of the signal.
- In the visual structure, the local elements are given by the basic syntax symbols that are present in the image (e.g., lines, circles, etc.). This level serves the same function in audio, so any low-level (i.e., not semantic, such as a word or a letter in spoken content) local descriptor would be classified at this level.
- Global description of an audio segment based on the specific arrangement or composition of basic elements (i.e., the local structure descriptors). While Local Structure focuses on specific local features of the audio, Global Composition focuses on the structure of the local elements (i.e., how they are arranged). For example, an audio sequence can be represented (or modeled) by a Markov chain, or by any other structure that uses low-level local features.
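The Markov-chain modeling mentioned above can be sketched as follows, assuming the local structure has already been quantized into a sequence of discrete symbols (the symbol alphabet in the test is illustrative):

```python
# First-order Markov model of an audio segment's global composition:
# estimate transition probabilities between consecutive local symbols.
from collections import defaultdict

def transition_matrix(symbols):
    """Return {symbol: {next_symbol: probability}} from a symbol sequence."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(symbols, symbols[1:]):
        counts[a][b] += 1
    return {a: {b: c / sum(row.values()) for b, c in row.items()}
            for a, row in counts.items()}
```

Two segments with similar transition matrices then share a similar arrangement of local elements, independently of any semantic label.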
- An audio segment can be described in terms of semantics (e.g., via recognition); here, as in the visual structure, objects play an important role.
- Objects can be placed in categories at different levels: an apple can be classified as a Macintosh apple, as an apple, or as a fruit.
- The recognition of an object can be based on an audio segment, and therefore we can make a similar classification. For example, we can say that an audio entity (e.g., a voice) corresponds to a man, or to Bill Clinton.
- Generic Objects: the most general level of object description, which can be recognized with everyday knowledge. That means there is no knowledge of the specific identity of the object in question (e.g., explosion, rain, clap, man's voice, woman's voice, etc.). Audio entity descriptors can be classified at this level.
- While an audio segment can be indexed according to individual objects, it is also possible to index the audio segment as a whole, based on the set of all of the entities it contains and their arrangement.
- Examples of audio scene classes include street noise, stadium, office, people talking, concert, newsroom, etc.
- The guideline for this level is that only general knowledge is required. It is not necessary to recognize a specific audio entity (e.g., whose voice it is) or a specific audio scene (e.g., which concert it is) to obtain a descriptor at this level.
- Specific Objects refer to identified and named audio entities. Specific knowledge is required, and such knowledge is usually objective since it relies on known facts; at this level, noises or sounds are identified and named. Examples include the voices of individual persons (e.g., "Bill Clinton") or characteristic noises (e.g., the bell of the NY stock exchange), etc.
- This indexing level is the most difficult one in the sense that it is completely subjective, and assessments between different users may vary greatly. The importance of this level was shown, for images, in experiments where viewers used abstract attributes to describe images. Emotive attributes can also be assigned to objects in an audio segment. For example, a sound (e.g., in a movie or in music) may be described as scary, happy, etc.
- The Abstract Scene level refers to what the audio segment as a whole represents, and it may be very subjective. For images, it has been shown, for example, that users sometimes describe images in affective (e.g., emotion) or abstract (e.g., atmosphere, theme) terms. Similar descriptions can be assigned to audio segments; for example, attributes to describe an audio scene could include sadness (e.g., people crying), happiness (e.g., people laughing), etc.
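The levels described above — generic, specific, and abstract descriptors for both individual audio entities and the scene as a whole — can be sketched as a pair of record types. The class and field names below are illustrative assumptions, not the patent's description schemes.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical sketch of the description levels discussed in the text.
# Level names (generic/specific/abstract) follow the text; everything
# else is invented for illustration.

@dataclass
class AudioEntity:
    generic: str                    # everyday-knowledge label, e.g. "man's voice"
    specific: Optional[str] = None  # identified and named, e.g. "Bill Clinton"
    abstract: Optional[str] = None  # subjective attribute, e.g. "scary"

@dataclass
class AudioScene:
    entities: List[AudioEntity] = field(default_factory=list)
    generic: Optional[str] = None   # e.g. "newsroom"
    specific: Optional[str] = None  # e.g. a named broadcast
    abstract: Optional[str] = None  # e.g. "sadness"

voice = AudioEntity(generic="man's voice", specific="Bill Clinton")
scene = AudioScene(entities=[voice], generic="newsroom")
print(scene.generic, scene.entities[0].specific)  # newsroom Bill Clinton
```

The optional fields reflect the text's point that deeper levels require progressively more (specific or subjective) knowledge and may be absent.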
- Elements within these levels could be associated with not only semantic relationships (e.g., "the trumpet notes complement the violin notes") but also syntactic relationships (e.g., "the trumpet sounds near the violin").
- semantic relationships, e.g., "the trumpet notes complement the violin notes"
- lexical relationships such as synonymy, antonymy, hyponymy/hypernymy, and meronymy/holonymy
- predicative relationships referring to actions (events) or states.
- topological, i.e., how the boundaries of elements relate to each other
- orientation or directional, i.e., where the elements are placed relative to each other
- These relationships can often be extracted from an audio segment. Listening to a stereo broadcast of a news report, for example, it is often easy to assign syntactic attributes to the audio entities: it is possible to assess that one sound is near another, that is, to determine the syntactic relationships between different sound sources. In this respect, one could determine somewhat detailed topological and directional relationships that may not be explicit in the signal.
- Examples of topological relationships are "To be near to", "To be within", and "To be adjacent to"; examples of directional relationships are "To be in front of" and "To be to the left of". Note that the main difference between these relationships and those obtained from visual information lies in the extraction of the relationships themselves: it may be more difficult to determine some spatial relationships from the audio alone, but in the creation of synthetic audio models, these relationships play a very important role.
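The topological and directional relationships named above can be stored as typed triples between audio entities. The relationship vocabulary below follows the text; the storage scheme, function name, and entity labels are illustrative assumptions.

```python
# Sketch of recording syntactic relationships (topological and directional)
# as (subject, relation, object) triples. Scheme invented for illustration.
TOPOLOGICAL = {"To be near to", "To be within", "To be adjacent to"}
DIRECTIONAL = {"To be in front of", "To be to the left of"}

relations = []  # list of (subject, relation, object) triples

def relate(subject, relation, obj):
    """Record a syntactic relationship, rejecting unknown relation names."""
    if relation not in TOPOLOGICAL and relation not in DIRECTIONAL:
        raise ValueError("unknown syntactic relation: " + relation)
    relations.append((subject, relation, obj))

relate("trumpet", "To be near to", "violin")
relate("reporter's voice", "To be to the left of", "street noise")
print(len(relations))  # 2
```

Separating the two vocabularies mirrors the text's distinction between topological relations (boundaries) and directional relations (relative placement).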
- Audio relationships relate audio entities based on their audio attributes or features. These relationships can be indexed into global, local, and composition classes (see Table 3). For example, an audio global relationship could be "To be less noisy than" (based on a global noise feature), an audio local relationship could be "To be louder than" (based on a local loudness measure), and an audio composition relationship could be based on comparing the structures of Hidden Markov Models. In the same way that the elements of the audio structure have different levels (generic, specific, and abstract), these types of syntactic relationships (see Table 3) can be defined at a generic level ("Near") or a specific level ("10 meters from"). For example, operational relationships such as "To be the union of", "To be the intersection of", and "To be the negation of" are topological, specific relationships, either spatial or temporal (see Table 3).

Semantic Relationships
- Semantic relationships can only occur among content elements at the semantic levels of the ten-level conceptual structure.
- Table 4 summarizes the semantic relationships, including examples. Note that since semantic relationships are based on understanding of the content, we can make the same classification for relationships obtained from visual content as for relationships obtained from audio content. The semantic relationships here, therefore, are identical to those described in connection with video signals. The only difference lies in the way the semantic content is extracted (i.e., understanding the audio vs. understanding an image or video). To make the explanation clearer, we have used examples related to audio, although the original examples would also apply.
- The lexical semantic relationships correspond to the semantic relationships among nouns used in WordNet. These relationships are synonymy (a violin is similar to a viola), antonymy (flute is the opposite of drums), hyponymy (a guitar is a string instrument), hypernymy (a string instrument is a generalization of a guitar), meronymy (a musician is a member of a musical band), and holonymy (a musical band is composed of musicians).
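The lexical relations above can be illustrated as a toy lookup table populated with the text's own examples. A real system would query a lexical database such as WordNet; the table and function name below are invented for illustration only.

```python
# Toy table of WordNet-style lexical relations, using the text's examples.
# Keys are (term, relation); values list the related terms.
LEXICAL = {
    ("violin", "synonymy"): ["viola"],
    ("flute", "antonymy"): ["drums"],
    ("guitar", "hypernymy"): ["string instrument"],  # guitar IS-A string instrument
    ("string instrument", "hyponymy"): ["guitar"],   # the inverse direction
    ("musician", "holonymy"): ["musical band"],      # a musician is a member of a band
    ("musical band", "meronymy"): ["musician"],      # a band has musicians as members
}

def related(term, relation):
    """Return the terms related to `term` by the given lexical relation."""
    return LEXICAL.get((term, relation), [])

print(related("guitar", "hypernymy"))  # ['string instrument']
```

Hyponymy/hypernymy and meronymy/holonymy appear as inverse pairs, matching the pairings given in the text.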
- The predicative semantic attributes refer to actions (events) or states among two or more elements.
- WordNet divides verbs into 15 semantic domains: verbs of bodily care and functions, change, cognition, communication, competition, consumption, contact, creation, emotion, motion, perception, possession, social interaction, states, and weather verbs. Only those domains that are relevant for the description of visual concepts could be used.
- Table 3 Indexing structure for syntactic relationships and examples.
- Hyponymy - Generic Composition A is an opera
- The present invention includes not only methods, but also computer-implemented systems for multiple-level classification of digital signals (e.g., multimedia signals) for indexing and/or classification purposes.
- digital signals e.g., multimedia signals
- The methods described hereinabove have been described at a level of some generality in accordance with the fact that they can be applied within any system for processing digital signals of the type discussed herein, e.g., any of the art-recognized (or future-developed) systems compatible with the handling of digital multimedia signals or files under the MPEG-7 standards.
- any multimedia- compatible device for processing, displaying, archiving, or transmitting digital signals including but not limited to video, audio, still image, and other digital signals embodying human-perceptible content
- a personal computer workstation including a Pentium microprocessor, a memory (e.g., hard drive and random access memory capacity), video display, and appropriate multimedia appurtenances.
- the present invention proposes fundamental entity-relationship models for the current Generic AV DS to address the shortcomings relating to its global design.
- The fundamental entity-relationship models index (1) the attributes of the content elements,
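The entity-relationship modeling proposed above — content elements carrying attributes, linked by typed relationships — can be sketched minimally as follows. All class, method, and attribute names here are illustrative assumptions, not the patent's description schemes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Minimal entity-relationship sketch: elements carry attributes, and typed
# relationships link them. Invented for illustration.

@dataclass
class Element:
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)

@dataclass
class ERModel:
    elements: Dict[str, Element] = field(default_factory=dict)
    relationships: List[Tuple[str, str, str]] = field(default_factory=list)

    def add(self, element: Element) -> None:
        self.elements[element.name] = element

    def relate(self, subject: str, relation: str, obj: str) -> None:
        self.relationships.append((subject, relation, obj))

model = ERModel()
model.add(Element("trumpet", {"loudness": "high"}))
model.add(Element("violin"))
model.relate("trumpet", "To be near to", "violin")
print(len(model.elements), len(model.relationships))  # 2 1
```

Indexing both the elements' attributes and the relationships between elements reflects the two-part indexing the description enumerates.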
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14232599P | 1999-07-03 | 1999-07-03 | |
US142325P | 1999-07-03 | ||
PCT/US2000/018231 WO2001003008A1 (en) | 1999-07-03 | 2000-06-30 | Fundamental entity-relationship models for the generic audio visual data signal description |
Publications (2)
Publication Number | Publication Date |
---|---|
EP1194870A1 true EP1194870A1 (en) | 2002-04-10 |
EP1194870A4 EP1194870A4 (en) | 2008-03-26 |
Family
ID=22499415
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP00946974A Withdrawn EP1194870A4 (en) | 1999-07-03 | 2000-06-30 | Fundamental entity-relationship models for the generic audio visual data signal description |
Country Status (7)
Country | Link |
---|---|
EP (1) | EP1194870A4 (en) |
JP (1) | JP4643099B2 (en) |
KR (1) | KR100771574B1 (en) |
CN (1) | CN1312615C (en) |
AU (1) | AU6065400A (en) |
MX (1) | MXPA02000040A (en) |
WO (1) | WO2001003008A1 (en) |
Families Citing this family (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
FR2844079B1 (en) * | 2002-08-30 | 2005-08-26 | France Telecom | ASSOCIATIVE SYSTEM OF MULTIMEDIA OBJECT DESCRIPTION |
JP4027269B2 (en) * | 2003-06-02 | 2007-12-26 | キヤノン株式会社 | Information processing method and apparatus |
US7478038B2 (en) * | 2004-03-31 | 2009-01-13 | Microsoft Corporation | Language model adaptation using semantic supervision |
JP2007265341A (en) * | 2006-03-30 | 2007-10-11 | Sony Corp | Content utilization method, content utilization device, content recording method, content recording device, content providing system, content receiving method, content receiving device, and content data format |
BRPI0605994B1 (en) * | 2006-09-29 | 2019-08-06 | Universidade Estadual De Campinas - Unicamp | PROGRESSIVE RANDOMIZATION PROCESS FOR MULTIMEDIA ANALYSIS AND REASONING |
US8407241B2 (en) | 2009-06-12 | 2013-03-26 | Microsoft Corporation | Content mesh searching |
US10417263B2 (en) | 2011-06-03 | 2019-09-17 | Robert Mack | Method and apparatus for implementing a set of integrated data systems |
US9244924B2 (en) * | 2012-04-23 | 2016-01-26 | Sri International | Classification, search, and retrieval of complex video events |
US8537983B1 (en) | 2013-03-08 | 2013-09-17 | Noble Systems Corporation | Multi-component viewing tool for contact center agents |
DK2983763T3 (en) | 2013-04-10 | 2017-08-28 | Sanofi Sa | DRIVING MECHANISM FOR A PHARMACEUTICAL SUPPLY DEVICE |
KR101461183B1 (en) * | 2013-09-23 | 2014-11-28 | 장우용 | System and method for generating digital content |
CN104882145B (en) * | 2014-02-28 | 2019-10-29 | 杜比实验室特许公司 | It is clustered using the audio object of the time change of audio object |
US10349093B2 (en) * | 2014-03-10 | 2019-07-09 | Cisco Technology, Inc. | System and method for deriving timeline metadata for video content |
US9838759B2 (en) | 2014-06-20 | 2017-12-05 | Google Inc. | Displaying information related to content playing on a device |
US9946769B2 (en) | 2014-06-20 | 2018-04-17 | Google Llc | Displaying information related to spoken dialogue in content playing on a device |
US10206014B2 (en) | 2014-06-20 | 2019-02-12 | Google Llc | Clarifying audible verbal information in video content |
US9805125B2 (en) | 2014-06-20 | 2017-10-31 | Google Inc. | Displaying a summary of media content items |
US10349141B2 (en) | 2015-11-19 | 2019-07-09 | Google Llc | Reminders of media content referenced in other media content |
US10034053B1 (en) | 2016-01-25 | 2018-07-24 | Google Llc | Polls for media program moments |
US10432987B2 (en) | 2017-09-15 | 2019-10-01 | Cisco Technology, Inc. | Virtualized and automated real time video production system |
CN111341319B (en) * | 2018-12-19 | 2023-05-16 | 中国科学院声学研究所 | Audio scene identification method and system based on local texture features |
CN113673635B (en) * | 2020-05-15 | 2023-09-01 | 复旦大学 | Hand-drawn sketch understanding deep learning method based on self-supervision learning task |
CN113221566B (en) * | 2021-05-08 | 2023-08-01 | 北京百度网讯科技有限公司 | Entity relation extraction method, entity relation extraction device, electronic equipment and storage medium |
CN116821692A (en) * | 2023-08-28 | 2023-09-29 | 北京化工大学 | Method, device and storage medium for constructing descriptive text and space scene sample set |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3303543B2 (en) * | 1993-09-27 | 2002-07-22 | インターナショナル・ビジネス・マシーンズ・コーポレーション | How to organize and play multimedia segments, and how to organize and play two or more multimedia stories as hyperstory |
US5821945A (en) * | 1995-02-03 | 1998-10-13 | The Trustees Of Princeton University | Method and apparatus for video browsing based on content and structure |
- 2000
- 2000-06-30 WO PCT/US2000/018231 patent/WO2001003008A1/en active Application Filing
- 2000-06-30 AU AU60654/00A patent/AU6065400A/en not_active Abandoned
- 2000-06-30 MX MXPA02000040A patent/MXPA02000040A/en active IP Right Grant
- 2000-06-30 JP JP2001518680A patent/JP4643099B2/en not_active Expired - Fee Related
- 2000-06-30 EP EP00946974A patent/EP1194870A4/en not_active Withdrawn
- 2000-06-30 KR KR1020027000069A patent/KR100771574B1/en not_active IP Right Cessation
- 2000-06-30 CN CNB008124620A patent/CN1312615C/en not_active Expired - Fee Related
Non-Patent Citations (2)
Title |
---|
No further relevant documents disclosed * |
See also references of WO0103008A1 * |
Also Published As
Publication number | Publication date |
---|---|
KR100771574B1 (en) | 2007-10-30 |
JP2003507808A (en) | 2003-02-25 |
CN1312615C (en) | 2007-04-25 |
JP4643099B2 (en) | 2011-03-02 |
WO2001003008A1 (en) | 2001-01-11 |
CN1372669A (en) | 2002-10-02 |
EP1194870A4 (en) | 2008-03-26 |
KR20020050220A (en) | 2002-06-26 |
AU6065400A (en) | 2001-01-22 |
MXPA02000040A (en) | 2003-07-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6847980B1 (en) | Fundamental entity-relationship models for the generic audio visual data signal description | |
KR100771574B1 (en) | A method for indexing a plurality of digital information signals | |
Jaimes et al. | Conceptual framework for indexing visual information at multiple levels | |
Lew et al. | Content-based multimedia information retrieval: State of the art and challenges | |
US6411724B1 (en) | Using meta-descriptors to represent multimedia information | |
Zlatintsi et al. | COGNIMUSE: A multimodal video database annotated with saliency, events, semantics and emotion with application to summarization | |
Snoek et al. | Multimodal video indexing: A review of the state-of-the-art | |
Xie et al. | Event mining in multimedia streams | |
Benitez et al. | MediaNet: A multimedia information network for knowledge representation | |
Troncy et al. | Multimedia semantics: metadata, analysis and interaction | |
Vassiliou | Analysing film content: A text-based approach | |
Sebe et al. | Personalized multimedia retrieval: the new trend? | |
Shih | Distributed multimedia databases: Techniques and Applications | |
Sikos et al. | The Semantic Gap | |
Amaria et al. | Kolyang,“A survey on multimedia ontologies for a semantic annotation of cinematographic resources for the web of data,” | |
Ionescu et al. | Video genre categorization and representation using audio-visual information | |
Del Bimbo | Issues and directions in visual information retrieval | |
Di Bono et al. | WP9: A review of data and metadata standards and techniques for representation of multimedia content | |
Smith | MPEG-7 multimedia content description standard | |
Luo et al. | Integrating multi-modal content analysis and hyperbolic visualization for large-scale news video retrieval and exploration | |
Salway | Video Annotation: the role of specialist text | |
Manzato et al. | Supporting multimedia recommender systems with peer-level annotations | |
Del Bimbo | Semantics-based retrieval by content | |
Gagnon et al. | ERIC7: an experimental tool for Content-Based Image encoding and Retrieval under the MPEG-7 standard | |
Benitez et al. | Extraction, description and application of multimedia using MPEG-7 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
17P | Request for examination filed |
Effective date: 20011228 |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL |
|
AX | Request for extension of the european patent |
Free format text: AL;LT;LV;MK;RO;SI |
|
RBV | Designated contracting states (corrected) |
Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE |
|
A4 | Supplementary search report drawn up and despatched |
Effective date: 20080221 |
|
17Q | First examination report despatched |
Effective date: 20080521 |
|
APBK | Appeal reference recorded |
Free format text: ORIGINAL CODE: EPIDOSNREFNE |
|
APBN | Date of receipt of notice of appeal recorded |
Free format text: ORIGINAL CODE: EPIDOSNNOA2E |
|
APBR | Date of receipt of statement of grounds of appeal recorded |
Free format text: ORIGINAL CODE: EPIDOSNNOA3E |
|
APAF | Appeal reference modified |
Free format text: ORIGINAL CODE: EPIDOSCREFNE |
|
APAF | Appeal reference modified |
Free format text: ORIGINAL CODE: EPIDOSCREFNE |
|
APBT | Appeal procedure closed |
Free format text: ORIGINAL CODE: EPIDOSNNOA9E |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
|
18D | Application deemed to be withdrawn |
Effective date: 20150106 |