WO1999005865A1 - Accessing images based on their content - Google Patents

Accessing images based on their content

Info

Publication number
WO1999005865A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
video segment
accessing
vector
keyframe
Prior art date
Application number
PCT/US1998/015063
Other languages
English (en)
Inventor
Francis Quek
Original Assignee
The Board Of Trustees Of The University Of Illinois
Priority date
Filing date
Publication date
Application filed by The Board Of Trustees Of The University Of Illinois filed Critical The Board Of Trustees Of The University Of Illinois
Publication of WO1999005865A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/76 Television signal recording
    • H04N 5/91 Television signal processing therefor

Definitions

  • The field of the invention relates to imaging and more particularly to video imaging.
  • Video comprises at least 30 pictures per second and has the potential of being a rich and compelling source of information.
  • Many challenges have to be met before this vision can be realized.
  • Much research has gone into technologies and standards for video compression, synchronization between video and sound, transport of video over networks and other communication channels and operating system issues of the storage and retrieval of video media on demand. In our research, we address the challenge of how video may be accessed and manipulated in an interactive fashion, and how video may be segmented by content to provide conceptual-level access to the data.
  • Video has to be packaged in a way that makes it amenable for search and rapid browsing.
  • Video is capable of generating vast amounts of raw footage—and such footage is by far the most voluminous and ever growing data trove available.
  • Given the sheer volume of video data it is desirable to automate the majority of the task of video segmentation. Given the variability in the semantic content of video, it is necessary to facilitate user interaction with the segmentation process.
  • the sequential nature of video also makes its access cumbersome.
  • Anyone who owns a video camera is acquainted with the frustration of tapes gathering dust on the shelf because it is too time consuming to sift through the hours of video.
  • Our research addresses two challenges: Video Access, Interaction and Manipulation, and Content-Based Video Segmentation.
  • The former addresses the higher-level system needs for handling video not as a stream of video frames but as conceptual objects based on content.
  • The latter addresses the challenge of automating the production of these content-based video objects.
  • A method and apparatus are provided for accessing a video segment of a plurality of video frames.
  • The method includes the steps of segmenting the plurality of video frames into a plurality of video segments based upon semantic content and designating a frame of each video segment of the plurality of segments as a keyframe and as an index to the segment.
  • The method further includes the steps of ordering the keyframes and placing at least a portion of the ordered keyframes in an ordered display with a predetermined location of the ordered display defining a selected location.
  • A keyframe may be designated as a selected keyframe.
  • The ordered keyframes may be precessed through the ordered display until the selected keyframe occupies the selected location.
  • FIG. 1 is a block diagram of a content-based video access system in accordance with an embodiment of the invention;
  • FIG. 2 depicts the shot hierarchical organization of the system of FIG. 1;
  • FIG. 3 depicts a shot hierarchy data model of the system of FIG. 1;
  • FIG. 4 depicts a multiply-linked representation for video access of the system of FIG. 1;
  • FIG. 5 depicts a summary representation of a standard keyframe and filmstrip representation of the system of FIG. 1;
  • FIG. 6 depicts a keyframe-annotation box representation of the system of FIG. 1;
  • FIG. 7 depicts a multiresolution animated representation showing shot ordering in boxed arrows of the system of FIG. 1;
  • FIG. 8 depicts an animation sequence in ordered mode of the system of FIG. 1;
  • FIG. 9 depicts an animation sequence in straight line mode of the system of FIG. 1;
  • FIG. 10 depicts a multiresolution animated browser with a magnified view of the system of FIG. 1;
  • FIG. 11 depicts a screen setup for scanning video frames of the system of FIG. 1;
  • FIG. 12 depicts a total search time and animation time versus animation speed plot of the system of FIG. 1;
  • FIG. 13 depicts histograms showing time clustering for different animation speeds of the system of FIG. 1;
  • FIG. 14 depicts a hierarchical shot representation control panel of the system of FIG. 1;
  • FIG. 15 depicts a VCR-like control panel of the system of FIG. 1;
  • FIG. 16 depicts a timeline representation control panel of the system of FIG. 1;
  • FIG. 17 depicts a hierarchy of events inherent in the processing of video frames by the system of FIG. 1;
  • FIG. 18 depicts VCM output of a pan-right sequence showing the local vector field as fine lines and a dominant vector of a video frame processed by the system of FIG. 1;
  • FIG. 19 depicts a four-quadrant approach to zoom detection of the system of FIG. 1;
  • FIG. 20 depicts VCM output of a zoom-in sequence showing the local vector field as fine lines and a dominant vector of a video frame processed by the system of FIG. 1;
  • FIG. 21 depicts VCM output of a pan-zoom sequence showing the local vector field as fine lines and a dominant vector of a video frame processed by the system of FIG. 1;
  • FIG. 22 depicts a method of creating an ncm for a given interest point of a simulated video frame processed by the system of FIG. 1;
  • FIG. 23 depicts a vector field obtained using ADC from a video frame processed by the system of FIG. 1;
  • FIG. 24 depicts a vector field obtained using VCM from a video frame processed by the system of FIG. 1;
  • FIG. 25 depicts a global vcm computed for a vertical pan sequence of a video frame processed by the system of FIG. 1;
  • FIG. 26 depicts a vertical camera pan of a video frame analyzed using VCM by the system of FIG. 1;
  • FIG. 27 depicts a vector field obtained without using temporal prediction by the system of FIG. 1;
  • FIG. 28 depicts a vector field obtained using temporal prediction by the system of FIG. 1;
  • FIG. 29 depicts an example of vector clustering combining a moving object and camera pan by the system of FIG. 1;
  • FIG. 30 depicts hand movements detected using VCM and clustering by the system of FIG. 1;
  • FIG. 31 depicts a vector field obtained from a noisy frame using ADC correlation by the system of FIG. 1;
  • FIG. 32 depicts a vector field obtained from the same noisy frame using VCM by the system of FIG. 1;
  • FIG. 33 depicts a pure zoom sequence analyzed with VCM by the system of FIG. 1;
  • FIG. 34 depicts a vector field obtained for a pan and zoom sequence by the system of FIG. 1;
  • FIG. 35 depicts a vector field obtained for a camera rotating about its optical axis by the system of FIG. 1.
  • FIG. 1 is a block diagram of a video processing system 10, generally, that may be used to segment a video file based upon a semantic content of the frames. Segmentation based upon semantics allows the system 10 to function as a powerful tool in organizing and accessing the video data.
  • MAR: Multiresolution Animated Representation
  • MMI: man-machine interface
  • FIG. 2 shows the shot architecture of our system that may be displayed on a display 16 of the system 10.
  • FIG. 2 shows a portion of the MAR (to be described in more detail later) at level 0 along with subgroups of the MAR at levels 1 and 2. In the figure, each shot at the highest level (we call this level 0) is numbered starting from 1.
  • Shot 2 is shown to have four subshots which are numbered 2.1 to 2.4. These shots are said to be in level 1 (as are shots 6.1 to 6.3).
  • Shot 2.3 in turn contains three subshots, 2.3.1 to 2.3.3. In our organization, shot 2 spans its subshots. In other words, all the video frames in shots 2.1 to 2.4 are also frames of shot 2. Similarly, all the frames of shots 2.3.1 to 2.3.3 are also in shot 2.3. Hence the same frame may be contained in a shot in each level of the hierarchy. One could, for example, select shot 2 and begin playing the video from its first frame. The frames would play through shot 2, and eventually enter shot 3. If one begins playing from the first frame of shot 2.2, the video would play through shot 2.4 and proceed to shot 3.
  • The system is situated at one level at a time, and only the shots at that level appear in the visual summary representation (to be described later).
  • For example, if the current frame is within one of the subshots 2.3.1 to 2.3.3, the system would be in level 2 although the current frame is also in shot 2.3 and shot 2 in levels 1 and 0 respectively.
  • This organization may, for example, be used to break a video archive of a courtroom video into the opening statements, a series of witnesses, closing statements, jury instructions, and verdict at the highest level.
  • a witness testimony unit may be broken into direct and cross-examination sessions at the next level, and may be further broken into a series of attorney-witness exchanges, etc.
  • FIG. 3 shows the data model which makes up our shot database.
  • Each shot defines its first and final frame within the video (represented by F and L respectively) and its resources for visual summary (the keyframe and filmstrip representations represented by the K) .
  • Each shot unit is linked to its predecessor and successor in the shot sequence.
  • Each shot may also be comprised of a list of subshots. This allows the shots to be organized in a recursive hierarchy.
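  • A minimal sketch of this shot data model (hypothetical class and attribute names, not from the original disclosure):

    class Shot:
        """One shot: first/last frame (F, L), visual-summary resources (K),
        predecessor/successor links, and an optional list of subshots."""
        def __init__(self, first_frame, last_frame, keyframe=None, filmstrip=None):
            self.first_frame = first_frame      # F: first frame index in the video
            self.last_frame = last_frame        # L: final frame index in the video
            self.keyframe = keyframe            # K: keyframe resource for the visual summary
            self.filmstrip = filmstrip          # filmstrip resource for the visual summary
            self.prev = None                    # predecessor in the shot sequence
            self.next = None                    # successor in the shot sequence
            self.subshots = []                  # recursive hierarchy of subshots

        def add_subshot(self, sub):
            """Append a subshot; a parent shot spans all frames of its subshots."""
            if self.subshots:
                self.subshots[-1].next = sub
                sub.prev = self.subshots[-1]
            self.subshots.append(sub)

        def contains(self, frame):
            return self.first_frame <= frame <= self.last_frame

    # Example: shot 2 spans frames 300-899 and has two subshots (2.1 and 2.2).
    shot2 = Shot(300, 899)
    shot2.add_subshot(Shot(300, 499))
    shot2.add_subshot(Shot(500, 899))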
  • Interface consistency and metaphor realism are critical to helping the user form a good mental model of both the system (the MAR) and the data (e.g., the shots and subshots of the MAR).
  • Our system features a multiply-linked representation set that provides this consistency and realism. By maintaining this realism and consistency, we provide a degree of tangibleness to the system and video data that allows the user to think of our representational units as 'real' objects. The consistency among the representational components also permits the user to maintain a degree of 'temporal situatedness' through the video elements displayed.
  • The system is made up of four major interlinked representational components:
  • The system interface showing these components is shown in FIG. 4. While all these components involve detailed design and implementation, the key to making them consistent and real is that they are multiply-linked.
  • When a video segment is selected, the hierarchical video shot representation, the VCR-like video screen display, and the timeline representation are updated to show the video segment in each of their representations.
  • The timeline moves to the temporal displacement of the selected shot, the VCR-like video screen displays the first frame in the shot, and the hierarchical representation indicates where the shot is in the shot hierarchy.
  • The fundamental summary element in our system is the keyframe. It has been shown that this simple representation is a surprisingly effective way to provide the user with a sense of the content of a video sequence. Our key contribution is in the way such keyframes are organized and presented within the context of a video access and browsing environment. We provide three summary organizations: the standard keyframe presentation, the keyframe-annotation box representation, and the multiresolution animated representation (MAR).
  • FIG. 5 shows the standard keyframe representation. Each shot is summarized by its keyframe, and the keyframes are displayed in a scrollable table. The keyframe representing the "current shot” is highlighted with a boundary. A shot can be selected as the current shot from this interface by selecting its keyframe with the computer pointing device (e.g., a mouse) .
  • the position of the current shot in the shot hierarchy appears in the shot hierarchy representation
  • the first frame of the shot appears as the "current frame” in the display of the VCR-like interface
  • the timeline representation is updated to show the offsets of the first frame of the shot in the video data.
  • the video can be played in the VCR display by activating the "Play" button on the VCR-like interface.
  • the current keyframe highlight boundary blinks.
  • the frame displayed in the VCR display is the current frame.
  • the next shot becomes the current shot, and the current shot highlight boundary is updated.
  • Beneath the keyframe window is a subwindow showing the annotation information of the current shot.
  • This current frame annotation panel contains the classification of the shot and a textual description that includes a set of keywords.
  • the classification and keywords are dependent on the domain to which the system is applied. For example, for a video database for court room archival, the classification may be "prosecution questioning", "testimony", etc. If editing is enabled, the user may modify the annotation content of the current shot .
  • a filmstrip presentation of any shot may be activated by clicking on its keyframe with the keyboard shift key depressed (other key combinations or alternate mouse buttons may be chosen for this function) .
  • the filmstrip representation may be similarly activated from all the visual summary modes.
  • Filmstrip presentation of the current shot may be as shown in FIG. 5. This filmstrip shows every nth frame at a lower resolution (we use the 10th frame in our implementation) in a horizontally scrollable window. The user can select any frame in the filmstrip and make it the current frame. All other representation components are updated accordingly (i.e., the VCR-like display will jump to the frame and enter pause mode, and the timeline representation will be updated).
  • the filmstrip representation provides higher temporal resolution access into the video and is particularly useful for editing the shot hierarchy.
  • FIG. 6 shows the visual summary mode that is particularly useful for browsing and updating annotation content.
  • The keyframes of the shots are displayed beside their annotation panel subwindows in a scrollable window.
  • the annotation panel is identical to the current frame annotation panel in the standard keyframe representation.
  • the keyframes behave exactly the same way as in the standard keyframe representation (i.e., selection of the current shot, activation of the filmstrip representation of any shot, and the interlinking with the other representation components). While this representation does not permit the user to see as many keyframes, its purpose is to let the user browse and edit the shot annotations .
  • Annotation information from neighboring shots is available, and one does not have to make a particular shot the current shot to see its annotation or to edit the annotation content.
  • FIG. 7 shows the layout of our MAR.
  • the purpose of using the MAR is to permit more keyframes to fit within the same screen real-estate.
  • Our current implementation shown in FIG. 7 may present 75 keyframes in a 640x480-pixel window: 72 low resolution keyframes displayed at 64x48 pixels, two intermediate resolution keyframes displayed at 128x96 pixels, and one high resolution keyframe (representing the current shot) displayed at 256x192 pixels. This is in contrast with the standard keyframe representation, which displays 40 keyframes at 128x96 pixels and takes more than twice the screen real-estate (a 750x610 pixel window). Even at the lowest resolution, the thumbnail images provide sufficient information for the user to recognize their contents.
  • the animation control panel above the MAR allows the user to select the animation mode and speed. We shall discuss the different path modes in the next sections. The user is also able to select either the "real image” or "box animation” modes from this panel. The former mode animates the thumbnail images while the latter replaces them with empty boxes during animation.
  • box animation mode may be necessary for machines with less computational capacity. We shall also discuss the need for being able to set the animation speed in a later section.
  • FIG. 7 shows the ordering of the shots in the MAR in the boxed arrows .
  • the highest resolution keyframe in the center is always the current shot keyframe .
  • When the cursor is over a keyframe, a highlighting boundary appears around the keyframe (see FIG. 10). This indicates that the corresponding shot will be selected if the mouse selection button is depressed.
  • When a keyframe is selected, the interface animates so that the selected frame moves and expands to the current frame location.
  • The other keyframes move to their respective new locations according to their shot order.
  • FIG. 8 shows the ordered mode animation.
  • The keyframes scroll along the path determined by the shot order shown in FIG. 7, expanding as they reach the center keyframe location and contracting as they pass the center. This animation proceeds until the selected keyframe arrives at the current keyframe location at the center of the interface.
  • Ordered mode animation is useful for understanding the shot order in the browser and for scanning the keyframes by scrolling through the keyframe database. Ordered mode animation is impractical for random shot access because of the excessive time taken in the animation.
  • In accordance with our overall system design, one can access the filmstrip representation in exactly the same way as in the other visual summary modes.
  • The filmstrip, of course, functions the same way no matter which visual summary mode is active when it is launched.
  • the current frame annotation panel is also accessible in the MAR as with the standard keyframe representation.
  • the second animation mode is the straight line animation illustrated in FIG. 9.
  • the system determines the final location of every keyframe within the shot database and animates the motion of each keyframe to its destination in a straight line. If the resolution of the keyframe changes between its origin and destination, the keyframe grows or shrinks in a linear fashion along its animated path. As is evident in the animation sequence shown in FIG. 9, new keyframes move into and some keyframes move off the browser panel to maintain the shot order.
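  • A minimal sketch of this straight-line animation, assuming linear interpolation of both position and size over a fixed number of steps; the coordinates and thumbnail sizes below are illustrative only:

    def animate_straight_line(origin, destination, size0, size1, steps=30):
        """Yield (x, y, w, h) for each animation step, endpoints included."""
        (x0, y0), (x1, y1) = origin, destination
        (w0, h0), (w1, h1) = size0, size1
        for i in range(steps + 1):
            t = i / steps
            yield (x0 + t * (x1 - x0), y0 + t * (y1 - y0),
                   w0 + t * (w1 - w0), h0 + t * (h1 - h0))

    # Example: a thumbnail grows from 64x48 to 256x192 as it moves to the center slot.
    for x, y, w, h in animate_straight_line((16, 400), (192, 144), (64, 48), (256, 192)):
        pass  # redraw the keyframe at (x, y) with size (w, h)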
  • the advantage of this animation mode is its speed. As will be seen later, the trade-off achieved between animation time and visual search time determines the total time it takes to access a particular shot through the interface.
  • While thumbnail images generally provide sufficient information to select the required shot, we have found that it is sometimes convenient to be able to see a higher resolution display of a keyframe before selecting the shot. This may be characterized as a magnifying glass feature. Being able to see a higher resolution display of a keyframe is especially important if the shot keyframes are very similar or where the keyframes are complex. If the mouse pointer stays over a particular keyframe for a specified period (2 seconds in one implementation), a magnified view of the keyframe appears. This magnified view disappears once the cursor leaves the keyframe.
  • FIG. 10 shows a highlighted keyframe under the cursor and a magnified display of the same keyframe.
  • The MAR design rests on two assumptions. The first is that the resolution-for-screen-real-estate tradeoff gives the user an advantage in viewing greater scope without seriously impairing her ability to recognize the content of the keyframe.
  • the second assumption is that the animation assists the user to select a sequence of shots.
  • the scope advantage of the first assumption is self-evident since one can indeed see more shot keyframes for any given window size using the multiresolution format. Whether the reduction in resolution proves to be a serious impediment is harder to determine. Thumbnail representations are used by such programs as Adobe's Photoshop which uses a much lower resolution image as image file icons on the Macintosh computer. The subjects who have tested our system demonstrated that they can easily recognize the images in the low resolution thumbnails.
  • We tested the second assumption by conducting a response time experiment which will be detailed in this section. The goal of the experiment was to determine if animation improves user performance, and how different animation speeds may impact selection time.
  • FIG. 11 shows the screen layout of our experiment.
  • Each subject was presented with the interface and a window showing five target keyframes .
  • A window showing the list of targets is displayed above the MAR selection window. Since our experiment tests the situation in which the user knows which images she is interested in, the same target sequence of five keyframes was used throughout the experiment. The subjects were required to select the five target images displayed in order from left to right. The keyframe to be selected is highlighted in the target list window. If the subject makes a correct selection, the interface functions as designed (i.e., the animation occurs). Erroneous selections were recorded. The interface does not animate when an error is made. We shall call each such sequence of five selections a selection set.
  • the experiment was performed by five subjects. Each subject was required to perform 8 sets of selections at each of five animation speeds. Hence each subject made 40 selections per animation speed yielding a total of 32 between selection intervals which were recorded. This constitutes a total of 200 selections and 128 time interval measurements per subject.
  • the subjects rested after each selection set.
  • The fastest animation speed was the 'jump' animation in which the keyframes moved instantaneously to their destinations.
  • At animation speed 1, the keyframes moved to their destinations in 30 steps.
  • Animation speeds 2 to 4 correspond to 45, 60, and 75 animation steps respectively.
  • the actual times for each animation are shown in animation time histograms in FIG. 13.
  • FIG. 12 plots the average animation, search and total selection times against animation speed for all subjects.
  • The plots show that the total time for selection decreases at animation speed 1 (from 2.87 sec to 2.36 sec) and increases steadily thereafter (2.78 sec, 3.35 sec and 3.67 sec for animation speeds 2, 3 and 4 respectively).
  • the break-even point for animation appears to be around animation speed 2.
  • Our plots also show that the search time decreases from animation speed 0 to 2. Thereafter, increased animation speed seems to have little effect on search time. The animation time, however, increases steadily from speed 0 to 4.
  • A single-factor ANOVA analysis of the results shows that these averages are reliable (see Table
  • the histograms for each of the measurements for the different animation speeds in FIG. 13 give us an insight to what is happening.
  • selection time is the sum of an animation time, visual search time, and motor action time (time for the user to move the cursor to the target and make the selection) .
  • With animation, we observe that subjects move the cursor during tracking before the animation terminates. This actually permits faster selection with animation.
  • subjects move the cursor as a pointer in tandem with visual search (even at animation time 0 or jump mode) so that it is not easy to separate pure visual search and motion time.
  • the histogram of the total selection time for animation speed 0 (this is the same as search time since animation time is zero) in FIG. 13 is more broadly distributed than for the other animation speeds.
  • Animation speed 1 (average of 1.5 sec) was optimal for the subjects who participated in the experiments. We caution, however, that this may not be generalizable to the general population. Our subjects were all between 24 and 35 years of age. It is likely that people from other age groups will have a different average optimal animation speed. Furthermore, for our subjects, animation speeds above animation speed 2 (average of 2.24 sec) did not appreciably decrease search time while it increased total selection time. We observe that at higher animation times, the subjects appear to be waiting for the animation to cease before making their selections. This delay may lead to frustration or a loss of concentration in users of the system if the animation time is too long. For this reason, we have made the animation speed selectable by the user.
  • The Hierarchical Video Shot Representation panel is designed to allow the user to navigate, to view, and to organize video in our hierarchical shot model shown in FIG. 2. It comprises two panels for shot manipulation, a panel which shows the ancestry of the current shot and permits navigation through the hierarchy, a panel for switching among visual summary modes, and a panel that controls the shot display in the visual summary.
  • the hierarchical shot representation is tied intimately to the other representational components. It shows the hierarchical context of the current shot in the visual summary which is in turn ganged to the VCR-like display and the timeline representation.
  • The shot editing panel permits manipulation of the shot sequence at the hierarchical level of the current shot. It allows the user to split the current shot at the current frame position. Since the user can make any frame of the filmstrip representation of the visual summary the current frame, the filmstrip representation along with the VCR-like control panel are useful tools in the shot manipulation process. As illustrated by the graphic icon, the new shot created by shot splitting becomes the current shot. In the same panel, shots may be merged into the current shot and groups of shots (marked in the annotation visual summary interface) may be merged into a single shot.
  • This panel also allows the user to create new shots by setting the first and last frames in the shot and capturing its keyframe.
  • the "Set Start” and “Set End” buttons are pressed, the current frame (displayed in the VCR display) becomes the start and end frames respectively of the new shot.
  • the default keyframe is the first frame in the new shot, but the user can select any frame within the shot by pausing the VCR display at that shot and activating the "Grab Keyframe” button.
  • the second shot manipulation panel is designed for subshot editing. As is obvious from the bottom icons in this panel, it permits the deletion of a subshot sequence (the subshot data remains within the first and last frames of the supershot, which is the current shot) .
  • the "promote subshot” button permits the current shot to be replaced by its subshots, and the "group to subshot” permits the user to create a new shot as the current shot and make all shots marked in the annotation visual summary interface its subshots.
  • the subshot navigation panel in FIG. 14 permits the user to view the ancestry of the current shot and to navigate the hierarchy.
  • the "Down” button in this panel indicates that the current shot has subshots, and the user can descend the hierarchy by clicking on the button. If the current shot has no subshots, the button becomes blank (i.e., the icon disappears) . Since the current shot in the figure (labeled as "Ingrid Bergman") is a top level shot, the "Up” button (above the "down” button) is left blank. The user can also navigate the hierarchy from the visual summary window. In our implementation, if a shot contains subshots, the user can click on the shot keyframe with the right mouse button to descend one level into the hierarchy. The user can ascend the hierarchy by clicking on any keyframe with the middle mouse button. These button assignments are, however, arbitrary and can be replaced by any mouse and key-chord combinations .
  • The hierarchical shot representation panel also permits the user to hide the current shot or a series of marked shots in the visual summary display. This makes it easier for the user to view a greater temporal span of the video in the video summary window, at the expense of access to some shots in the sequence.
  • the hide feature can, of course, be switched off to reveal all shots.
  • FIG. 15 shows the VCR-like control panel that provides a handle to the data using the familiar video metaphor.
  • The user can operate the video just as though it were a video tape or video disk while remaining situated in the shot hierarchy.
  • As the video is played, one might think of the image in the VCR display as the current frame.
  • As the current frame advances (or reverses, jumps, fast forwards/reverses, etc.) across shot boundaries, all the other representation components respond accordingly to maintain situatedness.
  • the user can also loop the video through the current shot and jump to the current shot boundaries .
  • When the current frame is changed from any other representational component (e.g., by selecting a shot in the visual summary, selecting a frame in a filmstrip representation, navigating the shot hierarchy in the hierarchical shot representation, or changing the temporal offset in the timeline representation), the VCR display will jump to the new current frame and the VCR panel will be ready to play the video from that point.
  • FIG. 16 shows the timeline representation of the current video state.
  • the lower timeline shows the offset of the current frame in the entire video.
  • the vertical bars represent the shot boundaries in the video, and the numeric value shows the percent offset of the current frame into the video .
  • the slider is strongly coupled to the other representational components.
  • The location of the current frame is tracked on the timeline. Any change in the current frame by interaction with either the visual summary representation or the hierarchical shot representation is reflected by the timeline as well.
  • the timeline also serves as a slider control. As the slider is moved, the current frame is updated in the VCR-like representation, and the current shot is changed in the visual summary and hierarchical shot representations .
  • the timeline on the top shows the offset of the current frame in the video sequence of the currently active hierarchical level. It functions in exactly the same way as the global timeline.
  • In the example shown, the current frame is within the first shot in the highest level of the hierarchy, offset 5% into the global video sequence.
  • The same frame is in the eleventh (last) subshot of shot 1 in the highest level, and is offset 96% into the subshots of shot 1.
  • semantic content of a video sequence may include any symbolic or real information element in a frame (or frames) of a video sequence. Semantic content, in fact, may be defined on any of a number of levels.
  • segmentation based upon semantic content may be triggered based upon detection of a scene change. Segmentation may also be triggered based upon camera pans or zooms, or upon pictorial or audio changes among frames of a video sequence.
  • segmentation need not be perfect for it to be useful.
  • Suppose a tourist takes six hours of video on a trip. If a system is able to detect every time the scene changed (i.e., the camera went off and on) and divided the video into shots at scene change boundaries, we would have fewer than 200 scene changes if the average time the camera stayed on was about 2 minutes (360 minutes of footage at roughly 2 minutes per shot yields about 180 shots).
  • the tourist would be able to browse this initial shot segmentation using our interface. She would be able to quickly find the shot she took at the Great Wall of China by just scanning the keyframe database. Given that she has a sense of the temporal progression of her trip, this database would prove useful with scene change-based segmentation.
  • the video would now be accessible as higher level objects rather than as a sequence of image frames that must be played. She could further organize the archive of her trip using the hierarchical editing capability of our system.
  • video may be segmented by detecting video events which serve as cues that the semantic content has changed.
  • In the scene change detection example, one may use some discontinuity in the video content as such a cue.
  • FIG. 17 shows the event hierarchy that one might detect in video data.
  • Domain events are strongly dependent on the domain in which the video is taken.
  • In a courtroom video, for example, the task may be to detect each time a different person is speaking, and when different witnesses enter and leave the witness box.
  • In a general tourist and home video domain, one might apply a 'drama model' and detect every time a person is added to the scene. One may then locate a specific person by having the system display the keyframes of all 'new actor' shots. The desired person will invariably appear.
  • Camera/photography events are essentially events of perspective change in the video.
  • The two that we detect are camera pans and zooms.
  • We have already discussed scene change events.
  • a scene change is a discontinuity in the video stream where the video preceding and succeeding the discontinuity differ abruptly in location and time.
  • Examples of scene change events in video include cuts, fades, dissolves and wipes. Cuts, by far the most common in raw video footage, involve abrupt scene changes in the video. The other modes may be classified as forms of more gradual scene change.
  • Some scene changes are inherently undetectable. For example, one may cut a video stream and restart it viewing exactly the same video content. It is not possible to detect such changes from the video alone. Hence, we modify our definition to include the constraint that scene change events must include significant alteration in the video content before and after the scene change event.
  • Boreczky and Rowe provide a good review and evaluation of various algorithms to detect scene changes. They compare five different scene boundary detection algorithms using four different types of video: television programs, news programs, movies, and television commercials.
  • The five scene boundary detection algorithms evaluated were global histograms, region histograms, running histograms, motion compensated pixel differences, and discrete cosine transform coefficient differences.
  • Their test data had 2507 cuts and 506 gradual transitions. They concluded that global histograms, region histograms, and running histograms performed well overall. The more complex algorithms actually performed more poorly. In their evaluation, running histograms produced the best results using the criterion of 'least number of transitions missed', detecting above 93% of the cuts and 88% of the gradual transitions.
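  • For illustration only, a minimal global-histogram cut detector in the spirit of the algorithms reviewed above; the bin count and threshold are assumed values, not figures from that evaluation:

    import numpy as np

    def gray_histogram(frame, bins=64):
        """Normalized gray-level histogram of a 2-D uint8 frame."""
        hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
        return hist / hist.sum()

    def detect_cuts(frames, threshold=0.35):
        """Return indices i where the L1 histogram difference between
        frame i-1 and frame i exceeds the threshold (candidate cuts)."""
        cuts, prev = [], gray_histogram(frames[0])
        for i in range(1, len(frames)):
            cur = gray_histogram(frames[i])
            if np.abs(cur - prev).sum() > threshold:
                cuts.append(i)
            prev = cur
        return cuts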
  • Camera or photography events are related to the change of camera perspective.
  • camera pans and zooms Such perspective changes can be detected by examining the motion field in the video sequence.
  • When the camera pans, we expect the video to exhibit a translation field in the video data in the opposite direction from the pan.
  • When the camera zooms in or out, we expect a divergent or convergent field about the camera focal axis, respectively.
  • Pan-zoom combinations are dominated in the vector field by the pan effect.
  • VCM: Vector Coherence Mapping
  • The dominant vector computed is denoted V_p.
  • the dominant vector may indicate either a global translation field or a dominant object motion in the frame. One may distinguish between the two by noting that the global translation field is distributed across the entire image while the dominant object motion is likely to be clustered. Further, a true translation field is likely to persist across several frames .
  • We detect global translation fields by taking the average of the dot products between all scene vectors and the dominant vector.
  • L_pan is the likelihood that the field belongs to a camera pan at time t.
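  • One plausible form of this likelihood, assuming the local flow vectors and the dominant vector are normalized to unit length (the exact expression is not reproduced here), is:

    $L_{pan}(t) = \frac{1}{n}\sum_{i=1}^{n} \hat{v}_i \cdot \hat{V}_p$

    where the $\hat{v}_i$ are the $n$ local flow vectors in the frame at time $t$ and $\hat{V}_p$ is the unit dominant vector.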
  • FIG. 18 shows the output of VCM for a pan of a computer workbench. The local vectors are shown as fine lines and the dominant translational vector is shown in the thick line in the center of the image.
  • FIG. 19 illustrates our approach for zoom detection. Since we assume pure zooms to converge or diverge from the optical axis of the camera, we detect a zoom by dividing all computed vectors into four quadrants Q1 to Q4. In each quadrant, we take a 45° unit vector in that quadrant, V1 to V4 respectively, and compute the dot product between each vector in the quadrant and that unit vector. Our likelihood measure for zoom is the average of the absolute values of these dot products.
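  • Under the same assumptions (unit-normalized local vectors, exact expression not reproduced here), a plausible form of the zoom likelihood is:

    $L_{zoom}(t) = \frac{1}{n}\sum_{q=1}^{4}\;\sum_{\hat{v}_i \in Q_q}\left|\hat{v}_i \cdot V_q\right|$

    where $n$ is the total number of vectors and $V_q$ is the 45° unit vector of quadrant $Q_q$.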
  • FIG. 20 shows the output of VCM for a zoom of the back of a computer workbench. the local vectors are shown as fine lines and the dominant translational vector is shown in the thick line in the center of the image.
  • Table 2 and Table 3 show the first four pan and zoom likelihoods for the pan and zoom sequences of FIG. 18 and FIG. 20 respectively. As can be seen, the pan likelihood is consistently high for the pan sequence and the zoom likelihood is high for the zoom sequence.
  • FIG. 21 shows the VCM output for a pan-zoom sequence.
  • The pan and zoom likelihood results are tabulated in Table 3. As we predicted, the pan effect dominates that of the zoom in the vector field.
  • The pan likelihood values are consistently above 80% throughout the sequence. Since we are interested only in video segmentation, it is not important that the system distinguish between pan-zooms and pure pans.
  • Vector Coherence Mapping (VCM)
  • VCM is a completely parallel voting algorithm in "vector parameter space.” The voting is distributed with each vector being influenced by elements in its neighborhood. Since the voting takes place in vector space, it is relatively immune to noise in image space. Our results show that VCM functions under both synthetic and natural (e.g., motion blur) conditions.
  • The fuzzy combination process provides a handle by which high-level constraint information is used to guide the correlation process. Hence, no thresholds need to be applied early on in the algorithm, and there are no non-linear decisions (and consequently errors) to propagate to later processes.
  • VCM extends the state of the art in feature-based optical flow computation.
  • The algorithm is straightforward and easily implementable in either serial or parallel computation.
  • VCM is able to compute dominant translation fields in video sequence and multiple vector fields representing various combinations of moving cameras and moving objects.
  • the algorithm can compute good vector fields from noisy image sequences. Our results on real image sequences evidence this robustness.
  • VCM has good noise immunity.
  • Unlike other approaches which use such techniques as M-estimators to enforce robustness, the robustness of VCM lies in the fact that correlation errors owing to noise occur in image space, and have little support in the parameter space of the vectors.
  • VCM provides a framework for the implementation of various constraints using fuzzy image processing. A number of other constraints may be added within this paradigm (e.g., color, texture, and other model-based constraints) .
  • VCM is an effective way for generating flow fields from video data. researchers can use the algorithm to produce flow fields that serve as input into dynamic flow recognition problems like video segmentation and gesture analysis .
  • Barron et al. provide a good review of optical flow techniques and their performance. We shall adopt their taxonomy of the field to put our work in context. They classify optical flow computation techniques into four classes. The first of these, pioneered by Horn and Schunck, computes optical fields using the spatial-temporal gradients in an image sequence by the application of an image flow equation. The second class performs "region-based matching" by explicit correlation of feature points and computing coherent fields by maximizing some similarity measure. The third class comprises "energy-based" methods which extract image flow fields in the frequency domain by the application of "velocity-tuned" filters. The fourth class comprises "phase-based" methods which extract optical flow using the phase behavior of band-pass filters. Barron et al. include zero-crossing approaches such as that due to Hildreth under this category. Under this classification, our approach falls under the second (region-based matching) category.
  • region or feature-based correlation approaches involve three computational stages: pre-processing to obtain a set of trackable regions, correlation to obtain an initial set of correspondences, and the application of some constraints (usually using calculus of variations to minimize deviations from vector field smoothness, or using relaxation labeling to find optimal set of disparities) to obtain the final flow field.
  • The kinds of features selected in the first stage are often related to the domain in which the tracking occurs. Essentially, good features to be tracked should have good localization properties and must be reliably detectable. Shi and Tomasi provide an evaluation of the effectiveness of various features. They contend that the texture properties that make features unique are also the ones that enhance tracking. Tracking systems have been presented that use corners, local texture measures, mixed edge, region and textural features, local spatial frequency along 2 orthogonal directions, and a composite of color, pixel position and spatiotemporal intensity gradient.
  • Simple correlation using any feature type typically results in a very noisy vector field.
  • Such correlation is usually performed using such techniques as template matching, absolute difference correlation (ADC) , and the sum of squared differences (SSD) .
  • a key trade-off in correlation is the size of the correlation region or template. Larger regions yield higher confidence matches while smaller ones are better for localization.
  • Zheng and Chellappa apply a weighted correlation that assigns greater weights to the center of the correlation area to overcome this problem.
  • a further reference also claims that by applying subpixel matching estimation and using affine predictions of image motion given previous ego-motion estimates, they can compute good ego-motion fields without requiring post processing to smooth the field.
  • constraints include rigid-body assumptions, spatial field coherence, and temporal path coherence. These constraints may be enforced using such techniques as relaxation labeling, greedy vector exchange and competitive learning for clustering. These algorithms are typically iterative and converge on a consistent coherent solution.
  • VCM performs a voting process in vector parameter space and biases this voting by likelihood distributions that enforce the spatial and temporal constraints.
  • VCM is similar to the Hough-based approaches. The difference is that in VCM, the voting is distributed and the constraints enforced on each vector are local to the region of the vector. Furthermore, in VCM the correlation and constraint enforcement functions are integrated in such a way that the constraints "guide" the correlation process through the likelihood distribution.
  • the Hough methods apply a global voting space.
  • One reference for example, first computes the set of vectors and estimates the parameters of dominant motion in an image using such a global Hough space. To track multiple objects, one reference divides the image into patches and computes parameters in each patch, and applies M-estimators to exclude outliers from the Hough voting.
  • VCM has good noise immunity. Unlike other approaches which use such techniques as M- estimators to enforce robustness, the robustness of VCM lies in the fact that correlation errors owing to noise occur in image space, and have little support in the parameter space of the vectors .
  • Let P_t be the set of interest points detected in image I_t at time t. These may be computed by any suitable interest operator. Since VCM is feature-agnostic, we apply a simple image gradient operator and select high gradient points.
  • For each interest point p_j, a normal correlation map (ncm) N(p_j) is computed by correlating a small template around p_j in I_t against a search area in the next frame I_{t+1}.
  • Equation (3) defines a spatial weighting function W_t(p_j) that falls off with the distance max(|x' - x_j|, |y' - y_j|) from p_j, governed by parameters k_1 and k_2.
  • The vcm implements a voting scheme by which neighborhood point correlations affect the vector v_j at point p_j.
  • A global vcm is computed corresponding to the dominant translation occurring in the frame (e.g., due to camera pan).
  • A vcm can be computed for ANY point in image I_t, whether or not it is an interest point.
  • A vcm built in this way can be used to estimate optical flow at any point, so a dense optic flow field can be computed.
  • A sigmoidal function S is applied to the ncm: H(p_j) = S(α, T_w, N(p_j)) (7), where T_w is a threshold and α controls the steepness of the sigmoidal function.
  • Given vcm(p_j) and N(p_j), we can then apply a fuzzy-AND operation of the two maps, where ⊗ denotes pixel-wise multiplication.
  • This scatter template T is fuzzy-ANDed with N(p_j) to obtain a new temporal ncm N'(p_j):
  • N'(p_j) = N(p_j) ⊗ T (11)
  • where ⊗ denotes pixel-wise multiplication. This applies the highest weight to the area of the ncm around the predicted position of the point.
  • The motion detector is a motion-sensitive edge detector.
  • the motion detector provides strong evidence for a moving object in a region that does not have an existing vector. This is similar to case 1, except that it applies only to the region of interest, and not to the whole image.
  • If equation 13 yields no suitable match for a vector being tracked, three situations may be the cause: (1) rapid acceleration/deceleration pushed the point beyond the search region; (2) the point has been occluded; or (3) an error occurred in previous tracking.
  • The Vector Coherence Mapping algorithm implementation consists of three main parts:
  • To ensure an even distribution of the flow field across the frame, we subdivide it into 16x16 subwindows and pick the two points with the highest gradient from each subwindow (their gradients have to be above a certain threshold), which gives 600 interest points in our implementation. If a given 16x16 subwindow does not contain any pixel with a high enough gradient magnitude, no pixels are chosen from that subwindow. Instead, more pixels are chosen from other subwindows, so the number of interest points remains constant.
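  • A minimal sketch of this interest-point selection step, assuming 16x16-pixel subwindows and an illustrative gradient threshold; the rebalancing of points from empty subwindows is omitted for brevity:

    import numpy as np

    def select_interest_points(frame, win=16, per_window=2, grad_thresh=20.0):
        """Return (row, col) interest points: the strongest-gradient pixels of each
        win x win subwindow whose gradient magnitude exceeds grad_thresh."""
        gy, gx = np.gradient(frame.astype(float))
        mag = np.hypot(gx, gy)
        points = []
        h, w = mag.shape
        for r in range(0, h - win + 1, win):
            for c in range(0, w - win + 1, win):
                block = mag[r:r + win, c:c + win]
                # strongest gradient pixels first
                for idx in np.argsort(block, axis=None)[::-1][:per_window]:
                    br, bc = np.unravel_index(idx, block.shape)
                    if block[br, bc] > grad_thresh:
                        points.append((r + br, c + bc))
        return points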
  • The initial set (the feature point array) of 600 feature points is computed only for the first frame of the analyzed sequence of frames, and then the set is updated during vector tracking.
  • At least one permanent feature point array may be maintained over the entire sequence of frames.
  • The array is filled with the initial set of feature points found for the first frame of the sequence. After calculating vcm's for all feature points from the permanent array, the array may be updated.
  • The 5x5 region around each point p_j in the current frame is correlated against a 65x49 area of the next frame, centered at the coordinates of p_j, as shown in FIG. 22.
  • The resulting 65x49 array serves as N(p_j).
  • The hot spot found in N(p_j) could be a basis for vector computation, but the vector field obtained is usually noisy (see FIG. 23). This is precisely the result of the ADC process alone.
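  • A minimal sketch of this ncm computation using absolute difference correlation (ADC); it assumes gray-level frames and that p_j lies far enough from the image borders:

    import numpy as np

    def compute_ncm(cur, nxt, pj, tmpl=5, search_w=65, search_h=49):
        """ADC response of the 5x5 template around pj (row, col) in `cur`
        against a 65x49 search area of `nxt` centered on pj.
        Higher values mean better matches."""
        r, c = pj
        t2, sw2, sh2 = tmpl // 2, search_w // 2, search_h // 2
        template = cur[r - t2:r + t2 + 1, c - t2:c + t2 + 1].astype(float)
        ncm = np.zeros((search_h, search_w))
        for dy in range(-sh2, sh2 + 1):
            for dx in range(-sw2, sw2 + 1):
                rr, cc = r + dy, c + dx
                patch = nxt[rr - t2:rr + t2 + 1, cc - t2:cc + t2 + 1].astype(float)
                # negate the absolute-difference sum so better matches score higher
                ncm[dy + sh2, dx + sw2] = -np.abs(template - patch).sum()
        return ncm - ncm.min()   # non-negative map; the peak is the best ADC match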
  • The vcm for a given feature point is created according to equations 2 and 4. For efficiency, our implementation considers only the N(p_k)'s of the points p_k within a 65x65 neighborhood of p_j when computing vcm(p_j). Each vcm is then normalized. The vector v_j is computed as the vector starting at the center of the vcm and ending at the coordinates of the peak value of the vcm. If the maximal value in the vcm is smaller than a certain threshold, the hot spot is considered to be too weak, the whole vcm is reset to 0, the vector related to p_j is labeled UNDEFINED, and a new interest point is selected as detailed above.
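  • A simplified sketch of this vcm step, assuming uniform weighting of the neighboring ncm's (the weighting function of equation 3 is omitted) and an illustrative peak threshold:

    import numpy as np

    UNDEFINED = None

    def compute_vcm(pj, points, ncms, radius=32, peak_thresh=0.5):
        """pj: (row, col); points: list of (row, col); ncms: dict point -> 2-D ncm.
        Returns (vcm, (dx, dy)) or (vcm, UNDEFINED) when the hot spot is too weak."""
        acc, count = None, 0
        for pk in points:
            if abs(pk[0] - pj[0]) <= radius and abs(pk[1] - pj[1]) <= radius:
                m = ncms[pk]
                m = m / m.max() if m.max() > 0 else m    # normalize each contributing ncm
                acc = m.copy() if acc is None else acc + m
                count += 1
        if acc is None:
            return None, UNDEFINED
        peak = np.unravel_index(np.argmax(acc), acc.shape)
        if acc[peak] < peak_thresh * count:              # hot spot too weak
            return acc, UNDEFINED
        cy, cx = acc.shape[0] // 2, acc.shape[1] // 2
        return acc, (peak[1] - cx, peak[0] - cy)         # flow vector: map center to peak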
  • Feature drift arises because features in real space do not always appear at integer positions of an image pixel grid. While some attempts to solve this problem by subpixel estimation have been described, we explicitly allow feature locations to vary by integral pixel displacements. We want to avoid assigning a vector v_j to p_j if it does not correspond to a high correlation in N(p_j). Hence, we inspect N(p_j) (the ADC response) to see if the value corresponding to the vcm(p_j) hot spot is above the threshold T_w (see equation 7). Secondly, to improve the tracking accuracy for subsequent frames, we want to ensure that the matched point p_j at time t+1 is also a feature point.
  • FIG. 24 shows VCM algorithm performance. It shows the same frame as FIG. 23.
  • Dominant translation may be computed according to equation 6. Since we do not want any individual ncm to dominate the computation, they are normalized before summing. Hence, the dominant motion is computed based on the number of vectors pointing in a certain direction, and not on the quality of match which produced these vectors.
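  • One plausible form of equation 6, under the assumption that each ncm is normalized by its own maximum before summing (the exact expression is not reproduced here), is:

    $\mathrm{vcm}_{global}(\mathbf{d}) = \sum_{j} \frac{N(p_j)[\mathbf{d}]}{\max_{\mathbf{d}'} N(p_j)[\mathbf{d}']}, \qquad V_p = \arg\max_{\mathbf{d}} \; \mathrm{vcm}_{global}(\mathbf{d})$

    where $\mathbf{d}$ ranges over the candidate displacements of the correlation search area.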
  • FIG. 25 shows an example image of a global vcm computed for the pan sequence shown in FIG. 26.
  • The hot spot is visible in the lower center of FIG. 25. This corresponds to the global translation vector shown as a stronger line in the middle of the image in FIG. 26.
  • The intensity distribution in FIG. 25 is evidence of an aperture problem existing along the arm and dress boundaries (long edges with relatively low texture). These appear as high intensity ridges in the vcm. In the VCM voting scheme, such edge points still contribute to the correct vector value, and the hot spot is still well defined.
  • Temporal coherence may now be considered. As discussed above, we compute two likelihood maps for each feature point using equations 8 and 13 to compute the spatial and spatial-temporal likelihood maps, respectively.
  • the fact that the scatter template is applied to ncm's and not only vcm's allows the neighboring points' temporal prediction to affect each other.
  • a given point's movement history affects predicted positions of its neighbors. This way, when there is a new feature point in some area and this point does not have ANY movement history, its neighbors (through temporally weighted ncm's) can affect the predicted position of that point.
  • FIG. 27 shows the vector field obtained without temporal prediction.
  • FIG. 28 shows the vector field obtained for the same data with temporal prediction. Without temporal prediction, we can see many false vectors between objects as they pass each other. Temporal prediction solves this problem.
  • The vectors can be clustered according to three features: origin location, direction, and magnitude. The importance of each feature used during clustering can be adjusted. It is also possible to cluster vectors with respect to only one or two of the features. We use a one-pass clustering method; an example of vector clustering is shown in FIG. 29.
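  • A sketch of one-pass (leader) clustering over these three features; the feature weights and the distance threshold are illustrative assumptions rather than values from the text:

    import math

    def cluster_vectors(vectors, w_origin=1.0, w_dir=50.0, w_mag=1.0, max_dist=60.0):
        """One-pass clustering of flow vectors given as (x, y, dx, dy).
        Returns a list of clusters; each cluster is a list of vectors."""
        clusters, centers = [], []
        for x, y, dx, dy in vectors:
            ang, mag = math.atan2(dy, dx), math.hypot(dx, dy)
            best, best_d = None, None
            for i, (cx, cy, cang, cmag) in enumerate(centers):
                # weighted distance over origin, wrapped direction and magnitude
                dang = abs(math.atan2(math.sin(ang - cang), math.cos(ang - cang)))
                d = (w_origin * math.hypot(x - cx, y - cy)
                     + w_dir * dang + w_mag * abs(mag - cmag))
                if best_d is None or d < best_d:
                    best, best_d = i, d
            if best is not None and best_d < max_dist:
                clusters[best].append((x, y, dx, dy))
                n = len(clusters[best])
                cx, cy, cang, cmag = centers[best]
                # incremental update of the cluster center
                centers[best] = (cx + (x - cx) / n, cy + (y - cy) / n,
                                 cang + math.atan2(math.sin(ang - cang),
                                                   math.cos(ang - cang)) / n,
                                 cmag + (mag - cmag) / n)
            else:
                clusters.append([(x, y, dx, dy)])
                centers.append((x, y, ang, mag))
        return clusters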
  • FIG. 23 shows a noisy vector field obtained from the ADC response (ncm's) on a hand motion video sequence.
  • The VCM algorithm applied to the same sequence produced the smooth vector field shown in FIG. 24.
  • FIG. 26 presents the performance of VCM on a video sequence with an up-panning camera, and where the aperture problem is evident.
  • FIG. 25 shows the global vcm computed on a frame in the sequence.
  • the bold line in the center of FIG. 26 shows the correct image vector corresponding to an upward pan.
  • FIG. 28 shows the results of the VCM algorithm for a synthetic image sequence in which two identical objects move in opposite directions at 15 pixels per frame.
  • the correct field computed in FIG. 28 shows the efficacy of the temporal cohesion. Without this constraint, most of the vectors were produced by false correspondences between the two objects (FIG. 27) .
  • This experiment also shows that VCM is not prone to boundary oversmoothing .
  • FIG. 29 shows the vector fields computed for a video sequence with a down-panning camera and a moving hand.
  • the sequence contains significant motion blur.
  • VCM and vector clustering algorithms extracted two distinct vector fields with no visible boundary oversmoothing .
  • In FIG. 30, the subject is gesturing with both hands and nodding his head.
  • Three distinct motion fields were correctly extracted by the VCM algorithm.
  • FIGs. 31 and 32 show the efficacy of the VCM algorithm on noisy data.
  • The video sequence was corrupted with uniform random additive noise to give an S/N ratio of 21.7 dB.
  • FIG. 31 shows the result of ADC correlation (i.e., using the ncm's alone).
  • FIG. 32 shows the vector field computed by the VCM algorithm. The difference in vector field quality is easily visible .
  • FIGs. 33, 34 and 35 show analysis of video sequences with various camera motions .
  • the zoom-out sequence resulted in the anticipated convergent field (FIG. 33) .
  • Fig. 34 shows the vector field for a combined up-panning and zooming sequence.
  • FIG. 35 shows the rotating field obtained from a camera rotating about its optical axis .
  • the algorithm features a voting scheme in vector parameter space, making it robust to image noise.
  • The spatial and temporal coherence constraints are applied using a fuzzy image processing technique by which the constraints bias the correlation process. Hence, the algorithm does not require the typical iterative second-stage process of constraint application.
  • VCM is capable of extracting vector fields out of image sequences with significant synthetic and real noise (e.g., motion blur) . It produced good results on videos with multiple independent or composite (e.g., moving camera with moving object) motion fields.
  • Our method performs well for aperture problems and permits the extraction of vector fields containing sharp discontinuities with no discernible over-smoothing effects.
  • The technology described in this patent application facilitates a family of applications that involve the organization and access of video data. The commonality of these applications is that they require segmentation of the video into semantically significant units, the ability to access, annotate, refine, and organize these video segments, and the ability to access the video data segments in an integrated fashion. Following are a number of examples of such applications.
  • One application of the techniques described above may be to video databases of sporting events . Professionals and serious amateurs of organized sports often need to study video of sporting events.
  • a game may be organized into halves, series and plays.
  • Each series (or drive) may be characterized by the roles of the teams (offense or defense) , the distance covered, the time consumed, number of plays, and the result of the drive.
  • Each play may be described by the kind of play (passing, rushing, or kicking) , the outcome, the down, and the distance covered.
  • the segmentation may be obtained by analysis of the image flow fields obtained by an algorithm like our VCM to detect the most atomic of these units (the play) .
  • the other units may be built up from these units.
  • VCM facilitates the application of various constraints in the computation of the vector fields.
  • vector fields may be computed for the movement of players on each team.
  • the team on offense (apart from the man-in-motion) must be set for a whole second before the snap of the ball. In regular video this is at least 30 frames.
  • the defensive team is permitted to move.
  • This detection may also be used to obtain the direction, duration, and distance of the play. The fact that the ground is green with yard markers will also be useful in this segmentation.
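  • As an illustration only (not part of the original disclosure), a play-start detector of this kind might threshold the per-frame motion energy of the flow field, looking for a burst of motion after roughly 30 quiet frames; all names and threshold values below are assumptions:

    def detect_play_starts(motion_energy, low=0.5, high=3.0, set_frames=30):
        """motion_energy: per-frame mean flow-vector magnitude (e.g. from VCM).
        Flags frames where a burst of motion follows at least `set_frames`
        consecutive low-motion frames (the offense being 'set' for about a second)."""
        starts, quiet = [], 0
        for i, e in enumerate(motion_energy):
            if e < low:
                quiet += 1                 # roughly static field: teams set
            else:
                if quiet >= set_frames and e > high:
                    starts.append(i)       # sudden activity after the set period: candidate snap
                quiet = 0
        return starts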
  • Specialized event detectors may be used for kickoffs, and to track the path of the ball in pass plays. What is important is that a set of domain event detectors may be fashioned for such a class of video. The outcome of this detection is a shot-subshot hierarchy reflecting the game, half, series, and play hierarchy of football. Once the footage has been segmented, our interface permits the refinement of the segmentation using the multiply-linked interface representation. Each play may be annotated, labeled, and modified interactively. Furthermore, since the underlying synchronization of all the interface components is time, the system may handle multiple video viewpoints (e.g. endzone view, blimp downward view, press box view) . Each of these views may be displayed in different keyframe representation windows.
  • The multiresolution representation is particularly useful because it optimizes the use of screen real estate and so permits a user to browse shots from different viewpoints simultaneously.
  • The animation of the keyframes in each MAR is synchronized so that when one selects a keyframe from one window to be the centralized current shot, all MAR representations will centralize the concomitant keyframes.
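A rough sketch of that time-keyed synchronization follows; the data layout, function name, and example values are assumed for illustration only.

```python
import bisect

def centralize_all(views, selected_time):
    """views: view name -> list of (start_frame, keyframe_id) sorted by start_frame.
    Because time is the common key, selecting a shot in one window centres, in every
    window, the keyframe whose segment covers the same instant."""
    centred = {}
    for name, keyframes in views.items():
        starts = [start for start, _ in keyframes]
        idx = max(bisect.bisect_right(starts, selected_time) - 1, 0)
        centred[name] = keyframes[idx][1]
    return centred

# e.g. selecting frame 4620 in the press-box window also centres the
# end-zone keyframe covering frame 4620:
views = {"press_box": [(0, "pb_0"), (4500, "pb_12")],
         "end_zone":  [(0, "ez_0"), (4510, "ez_9")]}
print(centralize_all(views, 4620))   # {'press_box': 'pb_12', 'end_zone': 'ez_9'}
```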
  • The same set of interfaces may be used to view and study the resulting organized video. Another application may be in a court of law.
  • Video may serve as legal courtroom archives either in conjunction with or in lieu of stenographically generated records.
  • The domain events to be detected in the video are the transitions between witnesses and the identity of the speaker (judge, lawyer, or witness).
  • Between witnesses, the witness-box is vacant.
  • A witness-box camera may be set up to capture the vacant witness-box before the proceedings and provide a background template from which occupants may be detected by a simple image-subtraction change detection.
  • 'Witness sessions' may be defined as the time segments between witness-box vacancies.
  • Witness changes must occur in one of these transitions.
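A minimal sketch of this change-detection step, assuming grayscale frames as NumPy arrays and thresholds of our own choosing (the function names are likewise hypothetical):

```python
import numpy as np

def witness_box_occupied(frame, background, diff_thresh=25, area_frac=0.02):
    """Compare the witness-box region against the empty-box template captured before
    the proceedings; the box is deemed occupied when the fraction of strongly changed
    pixels exceeds `area_frac`."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return (diff > diff_thresh).mean() > area_frac

def witness_sessions(occupied_per_frame):
    """Collapse per-frame occupancy booleans into contiguous 'witness sessions';
    witness changes can only occur in the vacant gaps between sessions."""
    sessions, start = [], None
    for t, occupied in enumerate(occupied_per_frame):
        if occupied and start is None:
            start = t
        elif not occupied and start is not None:
            sessions.append((start, t - 1))
            start = None
    if start is not None:
        sessions.append((start, len(occupied_per_frame) - 1))
    return sessions
```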
  • Witness identity may be checked by an algorithm that locates the face and compares facial features. Since we are interested only in witness change, almost any face recognizer will be adequate (all we need is to determine whether the current 'witness session' face is the same as the one in the previous session).
  • A standard first-order approach that compares the face width, face length, mouth width, nose width, and the distance between the eyes and the nostrils, each as a ratio to the eye separation, comes immediately to mind.
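For illustration, such a first-order comparison might be coded as below; the measurement keys and the tolerance are assumptions, not claimed values.

```python
FEATURES = ("face_width", "face_length", "mouth_width",
            "nose_width", "eye_to_nostril_distance")

def same_witness(face_a, face_b, tolerance=0.12):
    """face_a / face_b: dicts of facial measurements in pixels, including
    'eye_separation'.  Every measurement is expressed as a ratio to the eye
    separation, which makes the comparison insensitive to image scale."""
    for key in FEATURES:
        ratio_a = face_a[key] / face_a["eye_separation"]
        ratio_b = face_b[key] / face_b["eye_separation"]
        if abs(ratio_a - ratio_b) / max(ratio_a, ratio_b) > tolerance:
            return False
    return True
```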
  • The lawyer may be identified by tracking her in the video. Speaker identification may be achieved by detecting the frequency of lip movements and correlating it with the location of sound RMS power-amplitude clusters in the audio track.
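The correlation step might be sketched as follows; the sampling rate, window size, and the use of a Pearson correlation are illustrative choices of our own rather than part of the disclosure.

```python
import numpy as np

def rms_per_frame(audio, samples_per_frame=1470):   # ~44.1 kHz audio / 30 fps video
    """Windowed RMS power of the audio track, one value per video frame."""
    n = len(audio) // samples_per_frame
    samples = np.asarray(audio[:n * samples_per_frame], dtype=float)
    return np.sqrt((samples.reshape(n, samples_per_frame) ** 2).mean(axis=1))

def likely_speaker(lip_motion_by_person, audio_rms):
    """lip_motion_by_person: person id -> per-frame lip-movement magnitude.
    The speaker is taken to be the person whose lip activity best tracks the
    audio RMS power envelope over the same frames."""
    scores = {}
    for person, lips in lip_motion_by_person.items():
        n = min(len(lips), len(audio_rms))
        scores[person] = np.corrcoef(np.asarray(lips[:n], dtype=float),
                                     audio_rms[:n])[0, 1]
    return max(scores, key=scores.get)
```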
  • The multiple video streams may be represented in the interface as different keyframe windows. This allows us to organize, annotate, and access the multiple video streams within the semantic structure of the courtroom proceedings. This may be a hierarchy of the (possibly cross-session) testimonies of particular witnesses, direct and cross examinations, witness sessions, question-and-witness-response alternations, and individual utterances by courtroom participants. Hence the inherent structure, hierarchy, and logic of the proceedings may be made accessible from the video content.
  • Home video is another application.
  • One of the greatest impediments to the wider use of home video is the difficulty of accessing potentially many hours of video in a convenient way.
  • A particularly useful domain event in home video is the 'new actor detector'.
  • The most significant moving objects in most typical home videos are people.
  • The same head-detector described earlier for witness detection in our courtroom example may be used to determine whether a new vector cluster entering the scene represents a new individual.
  • Home videos can then be browsed to find people by viewing the new-actor keyframes the way one browses a photograph album.
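A schematic of this browsing aid, in which `detect_heads` and `same_face` stand for the head detector and the first-order face comparison sketched earlier (both are hypothetical helpers, not functions defined in the patent):

```python
def new_actor_keyframes(frames, detect_heads, known_faces, same_face):
    """Whenever a detected head fails to match any face seen so far, the frame
    index is recorded as a 'new actor' keyframe, so the video can be browsed
    by people much like a photograph album.  `known_faces` is extended in place."""
    keyframes = []
    for t, frame in enumerate(frames):
        for face in detect_heads(frame):
            if not any(same_face(face, known) for known in known_faces):
                known_faces.append(face)
                keyframes.append(t)
    return keyframes
```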
  • The techniques described herein may also be applied to special-event videos. Certain special-event videos are common enough to warrant the development of dedicated domain detectors for them. An example might be that of formal weddings. Wedding videos are taken by either professional or amateur photographers, and one can imagine the application of our technology to produce annotated keepsakes of such video.
  • The techniques described above also have application in the business environment. A variety of business applications could benefit from our technology. We describe two kinds of video as representative of these applications. First, business meetings could benefit from video processing. Business meetings (when executed properly) exhibit an organizational structure that may be preserved and accessed by our technology. Depending on the kind of meeting being considered, video segments may be classified as moderator-speaking, member-speaking, voting, and presentation.
  • A camera may be trained on the moderator to locate all of her utterances. These may be correlated with the RMS power peaks in the audio stream. The same process will detect presentations to the membership. Members who speak or rise to speak may be detected by analyzing the image flow fields or by image-change detection, and correlated with the audio RMS peaks as discussed earlier. If the moderator sits with the participants at a table, she will be detected as a speaking member. Since the location of each member is a by-product of speaker detection, it is trivial to label the moderator once the video has been segmented.
  • This process will provide a speaker-wise decomposition of the meeting that may be presented in our multiply-linked interface technology.
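As a toy illustration of that decomposition (the labels and record structure are assumed), contiguous runs of the same detected speaker can be collapsed into sub-shots for the editor:

```python
def speakerwise_segments(speaker_per_frame):
    """speaker_per_frame: per-frame labels such as 'moderator', 'member_3',
    'presentation', or 'vote'.  Returns contiguous sub-shots ready to be grouped
    under agenda items in the hierarchical editor."""
    segments = []
    for t, speaker in enumerate(speaker_per_frame):
        if segments and segments[-1]["speaker"] == speaker:
            segments[-1]["end"] = t          # extend the current sub-shot
        else:
            segments.append({"speaker": speaker, "start": t, "end": t})
    return segments
```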
  • A user may enhance the structure in our hierarchical editor and annotator to group sub-shots under agenda items, proposals and their discussion, and amendments and their discussion. If multiple cameras are used, these may be accessed as synchronized multi-view databases, as in our previous examples.
  • The product will be a set of video meeting minutes that can be reviewed using our interaction technology. Private copies of these minutes may be further organized and annotated by individuals for their own reference. Certain areas of marketing may also benefit from the embodiments described above. Mirroring the success of desktop publishing in the 1980s, we anticipate immense potential in the production of marketing and business videos.
  • A marketer may use our interaction technology to further organize and annotate the video. These video segments may further be resequenced to produce a marketing video. Home buyers may view a home of interest using our multiply-linked interaction technology to see different aspects of the home. This random-access capability will make home comparison faster and more effective.
  • A specific embodiment of a method and apparatus for providing content-based video access according to the present invention has been described for the purpose of illustrating the manner in which the invention is made and used. It should be understood that the implementation of other variations and modifications of the invention and its various aspects will be apparent to one skilled in the art, and that the invention is not limited by the specific embodiments described. Therefore, it is contemplated to cover by the present invention any and all modifications, variations, or equivalents that fall within the true spirit and scope of the basic underlying principles disclosed and claimed herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This invention relates to a method and apparatus (Figure 1) for accessing a video segment of a plurality of video frames. The method comprises segmenting the plurality of video frames into a plurality of video segments based upon semantic content and designating a frame of each segment of the plurality of segments as a keyframe and as an index to that segment. The method further comprises ordering the keyframes and placing at least a portion of the ordered keyframes in an ordered display in which a predetermined location defines a selected location. A keyframe may be designated as a selected keyframe. The ordered keyframes may be advanced across the ordered display until the selected keyframe occupies the selected location.
PCT/US1998/015063 1997-07-22 1998-07-22 Acces a des images sur la base de leur contenu WO1999005865A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US5335397P 1997-07-22 1997-07-22
US60/053,353 1997-07-22

Publications (1)

Publication Number Publication Date
WO1999005865A1 true WO1999005865A1 (fr) 1999-02-04

Family

ID=21983629

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1998/015063 WO1999005865A1 (fr) 1997-07-22 1998-07-22 Acces a des images sur la base de leur contenu

Country Status (1)

Country Link
WO (1) WO1999005865A1 (fr)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0955599A2 (fr) * 1998-05-07 1999-11-10 Canon Kabushiki Kaisha Automated video interpretation system
US6340971B1 (en) * 1997-02-03 2002-01-22 U.S. Philips Corporation Method and device for keyframe-based video displaying using a video cursor frame in a multikeyframe screen
EP1251515A1 (fr) * 2001-04-19 2002-10-23 Koninklijke Philips Electronics N.V. Method and system for selecting a position in an image sequence
US9779774B1 (en) 2016-07-22 2017-10-03 Microsoft Technology Licensing, Llc Generating semantically meaningful video loops in a cinemagraph
US10728443B1 (en) 2019-03-27 2020-07-28 On Time Staffing Inc. Automatic camera angle switching to create combined audiovisual file
CN112347303A (zh) * 2020-11-27 2021-02-09 上海科江电子信息技术有限公司 Media audio-visual information stream monitoring and supervision data sample and labeling method therefor
US10963841B2 (en) 2019-03-27 2021-03-30 On Time Staffing Inc. Employment candidate empathy scoring system
US11023735B1 (en) 2020-04-02 2021-06-01 On Time Staffing, Inc. Automatic versioning of video presentations
US11127232B2 (en) 2019-11-26 2021-09-21 On Time Staffing Inc. Multi-camera, multi-sensor panel data extraction system and method
CN113449662A (zh) * 2021-07-05 2021-09-28 北京科技大学 Dynamic object detection method and apparatus based on multi-frame feature aggregation
US11144882B1 (en) 2020-09-18 2021-10-12 On Time Staffing Inc. Systems and methods for evaluating actions over a computer network and establishing live network connections
US11423071B1 (en) 2021-08-31 2022-08-23 On Time Staffing, Inc. Candidate data ranking method using previously selected candidate data
US11727040B2 (en) 2021-08-06 2023-08-15 On Time Staffing, Inc. Monitoring third-party forum contributions to improve searching through time-to-live data assignments
US11907652B2 (en) 2022-06-02 2024-02-20 On Time Staffing, Inc. User interface and systems for document creation

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4612569A (en) * 1983-01-24 1986-09-16 Asaka Co., Ltd. Video editing viewer
US4698664A (en) * 1985-03-04 1987-10-06 Apert-Herzog Corporation Audio-visual monitoring system
US5179449A (en) * 1989-01-11 1993-01-12 Kabushiki Kaisha Toshiba Scene boundary detecting apparatus
JPH08163479A (ja) * 1994-11-30 1996-06-21 Canon Inc Video retrieval method and apparatus
US5537530A (en) * 1992-08-12 1996-07-16 International Business Machines Corporation Video editing by locating segment boundaries and reordering segment sequences
US5778108A (en) * 1996-06-07 1998-07-07 Electronic Data Systems Corporation Method and system for detecting transitional markers such as uniform fields in a video signal

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4612569A (en) * 1983-01-24 1986-09-16 Asaka Co., Ltd. Video editing viewer
US4698664A (en) * 1985-03-04 1987-10-06 Apert-Herzog Corporation Audio-visual monitoring system
US5179449A (en) * 1989-01-11 1993-01-12 Kabushiki Kaisha Toshiba Scene boundary detecting apparatus
US5537530A (en) * 1992-08-12 1996-07-16 International Business Machines Corporation Video editing by locating segment boundaries and reordering segment sequences
JPH08163479A (ja) * 1994-11-30 1996-06-21 Canon Inc Video retrieval method and apparatus
US5778108A (en) * 1996-06-07 1998-07-07 Electronic Data Systems Corporation Method and system for detecting transitional markers such as uniform fields in a video signal

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
UEDA H., MIYATAKE T., YOSHIZAWA S.: "IMPACT: AN INTERACTIVE NATURAL-MOTION-PICTURE DEDICATED MULTIMEDIA AUTHORING SYSTEM.", HUMAN FACTORS IN COMPUTING SYSTEMS: REACHING THROUGH TECHNOLOGY, CHI CONFERENCE PROCEEDINGS, 27 April 1991 (1991-04-27), pages 343 - 350, XP002914568 *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6340971B1 (en) * 1997-02-03 2002-01-22 U.S. Philips Corporation Method and device for keyframe-based video displaying using a video cursor frame in a multikeyframe screen
US6516090B1 (en) 1998-05-07 2003-02-04 Canon Kabushiki Kaisha Automated video interpretation system
EP0955599A3 (fr) * 1998-05-07 2000-02-23 Canon Kabushiki Kaisha Automated video interpretation system
EP0955599A2 (fr) * 1998-05-07 1999-11-10 Canon Kabushiki Kaisha Automated video interpretation system
CN100346420C (zh) * 2001-04-19 2007-10-31 皇家菲利浦电子有限公司 基于关键帧的播放位置的选择方法和系统
WO2002086897A1 (fr) * 2001-04-19 2002-10-31 Koninklijke Philips Electronics N.V. Method and system for keyframe-based playback position selection
EP1251515A1 (fr) * 2001-04-19 2002-10-23 Koninklijke Philips Electronics N.V. Method and system for selecting a position in an image sequence
US9779774B1 (en) 2016-07-22 2017-10-03 Microsoft Technology Licensing, Llc Generating semantically meaningful video loops in a cinemagraph
US11457140B2 (en) 2019-03-27 2022-09-27 On Time Staffing Inc. Automatic camera angle switching in response to low noise audio to create combined audiovisual file
US10728443B1 (en) 2019-03-27 2020-07-28 On Time Staffing Inc. Automatic camera angle switching to create combined audiovisual file
US11961044B2 (en) 2019-03-27 2024-04-16 On Time Staffing, Inc. Behavioral data analysis and scoring system
US10963841B2 (en) 2019-03-27 2021-03-30 On Time Staffing Inc. Employment candidate empathy scoring system
US11863858B2 (en) 2019-03-27 2024-01-02 On Time Staffing Inc. Automatic camera angle switching in response to low noise audio to create combined audiovisual file
US11127232B2 (en) 2019-11-26 2021-09-21 On Time Staffing Inc. Multi-camera, multi-sensor panel data extraction system and method
US11783645B2 (en) 2019-11-26 2023-10-10 On Time Staffing Inc. Multi-camera, multi-sensor panel data extraction system and method
US11861904B2 (en) 2020-04-02 2024-01-02 On Time Staffing, Inc. Automatic versioning of video presentations
US11184578B2 (en) 2020-04-02 2021-11-23 On Time Staffing, Inc. Audio and video recording and streaming in a three-computer booth
US11636678B2 (en) 2020-04-02 2023-04-25 On Time Staffing Inc. Audio and video recording and streaming in a three-computer booth
US11023735B1 (en) 2020-04-02 2021-06-01 On Time Staffing, Inc. Automatic versioning of video presentations
US11144882B1 (en) 2020-09-18 2021-10-12 On Time Staffing Inc. Systems and methods for evaluating actions over a computer network and establishing live network connections
US11720859B2 (en) 2020-09-18 2023-08-08 On Time Staffing Inc. Systems and methods for evaluating actions over a computer network and establishing live network connections
CN112347303A (zh) * 2020-11-27 2021-02-09 上海科江电子信息技术有限公司 Media audio-visual information stream monitoring and supervision data sample and labeling method therefor
CN113449662A (zh) * 2021-07-05 2021-09-28 北京科技大学 Dynamic object detection method and apparatus based on multi-frame feature aggregation
US11727040B2 (en) 2021-08-06 2023-08-15 On Time Staffing, Inc. Monitoring third-party forum contributions to improve searching through time-to-live data assignments
US11966429B2 (en) 2021-08-06 2024-04-23 On Time Staffing Inc. Monitoring third-party forum contributions to improve searching through time-to-live data assignments
US11423071B1 (en) 2021-08-31 2022-08-23 On Time Staffing, Inc. Candidate data ranking method using previously selected candidate data
US11907652B2 (en) 2022-06-02 2024-02-20 On Time Staffing, Inc. User interface and systems for document creation

Similar Documents

Publication Publication Date Title
Rav-Acha et al. Making a long video short: Dynamic video synopsis
Luo et al. Towards extracting semantically meaningful key frames from personal video clips: from humans to computers
Ju et al. Summarization of videotaped presentations: automatic analysis of motion and gesture
Borgo et al. State of the art report on video‐based graphics and video visualization
Pritch et al. Nonchronological video synopsis and indexing
EP1955205B1 (fr) Method and system for producing a video synopsis
US8818038B2 (en) Method and system for video indexing and video synopsis
US7594177B2 (en) System and method for video browsing using a cluster index
Chen et al. Personalized production of basketball videos from multi-sensored data under limited display resolution
CA2761187C (fr) Systemes et procedes de production autonome de videos a partir de donnees multi-detectees
Chen et al. An autonomous framework to produce and distribute personalized team-sport video summaries: A basketball case study
Chen et al. Visual storylines: Semantic visualization of movie sequence
US7904815B2 (en) Content-based dynamic photo-to-video methods and apparatuses
JP2009539273A (ja) ビデオクリップからのキーフレーム候補の抽出
WO2008094600A1 (fr) Présentation simultanée de segments vidéo permettant une compréhension de fichier vidéo rapide
WO1999005865A1 (fr) Acces a des images sur la base de leur contenu
Li et al. Structuring lecture videos by automatic projection screen localization and analysis
Borgo et al. A survey on video-based graphics and video visualization.
Zhang Content-based video browsing and retrieval
Wang et al. Taxonomy of directing semantics for film shot classification
Rui et al. A unified framework for video browsing and retrieval
Aner-Wolf et al. Video summaries and cross-referencing through mosaic-based representation
Zhang et al. AI video editing: A survey
Niu et al. Real-time generation of personalized home video summaries on mobile devices
Zhang Video content analysis and retrieval

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CA JP SG US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA