WO2003084249A9 - Methods for summarizing video through mosaic-based shot and scene clustering - Google Patents

Methods for summarizing video through mosaic-based shot and scene clustering

Info

Publication number
WO2003084249A9
WO2003084249A9 (PCT/US2003/009704, US0309704W)
Authority
WO
WIPO (PCT)
Prior art keywords
mosaic
int
matrix
val
clusters
Prior art date
Application number
PCT/US2003/009704
Other languages
French (fr)
Other versions
WO2003084249A1 (en)
Inventor
Aya Aner
John Kender
Original Assignee
Univ Columbia
Aya Aner
John Kender
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Columbia, Aya Aner, John Kender filed Critical Univ Columbia
Priority to AU2003226140A priority Critical patent/AU2003226140A1/en
Publication of WO2003084249A1 publication Critical patent/WO2003084249A1/en
Publication of WO2003084249A9 publication Critical patent/WO2003084249A9/en

Classifications

    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G11B27/034 Electronic editing of digitised analogue information signals, e.g. audio or video signals on discs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/738 Presentation of query results
    • G06F16/739 Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/24 Character recognition characterised by the processing or recognition method
    • G06V30/248 Character recognition characterised by the processing or recognition method involving plural approaches, e.g. verification by template match; Resolving confusion among similar patterns, e.g. "O" versus "Q"
    • G06V30/2504 Coarse or fine approaches, e.g. resolution of ambiguities or multiscale approaches
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B27/19 Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
    • G11B27/28 Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N9/00 Details of colour television systems
    • H04N9/64 Circuits for processing colour signals
    • H04N9/74 Circuits for processing colour signals for obtaining special effects

Definitions

  • This invention relates to systems and methods for hierarchical representation of video, which is based on physical locations and camera positions, and more particularly to a method using mosaics for the representation and comparison of shots in a video sequence.
  • An object of the present invention is to provide a technique for summarizing video including temporal and non-temporal representations.
  • Another object of the present invention is to provide a technique for summarizing video that uses mosaic representation of video in order to cluster shots by physical settings.
  • A further object of the present invention is to provide a technique for summarizing video that is efficient and accurate.
  • A still further object of the present invention is to provide a technique for comparing a plurality of videos to identify repeating and unique physical settings for the plurality of videos, such as a television series (e.g., a situation comedy) including a plurality of episodes.
  • a method for summarizing a video comprising a plurality of consecutive frames, the method comprising the steps of dividing the plurality of consecutive frames into a plurality of sequences of consecutive frames; dividing the plurality of sequences of consecutive frames into a plurality of scenes; determining representative shots for these scenes; preparing a mosaic representation for each representative shot; comparing each of the mosaics in the video; and clustering the mosaics into the physical settings in which the frames were photographed.
  • the step of preparing a mosaic representation of each shot may include determining a reference frame for each shot. This reference frame is identified automatically, as described below.
  • the step of preparing a mosaic representation of each shot comprises computing a transformation, such as an affine transformation, between each pair of successive sampled frames in the shot and then using this transformation to project the frames into the image plane of the chosen reference frame.
  • the step of comparing each of the mosaics may include performing a first alignment, also referred to herein as a coarse alignment, for a pair of the mosaics.
  • Each mosaic is divided into a plurality of strips, and each pair of strips, including one strip from each mosaic of the pair of mosaics, is compared.
  • the step of comparing the strips corresponds to determining a vertical alignment, e.g., by determining the best diagonal in a distance matrix, such as block-to-block distance matrix B[k,l].
  • the step of performing an alignment of the pairs of mosaics may further comprise performing a horizontal alignment, e.g., by determining the best diagonal in a second distance matrix S[i,j], for each of the pairs of mosaics.
  • a further step in comparing the mosaics may include retaining a subset of the mosaics.
  • the step of retaining a subset of mosaics may include, for pairs of mosaics, determining a threshold based on a distance value determined for each of the pairs of the mosaics. Pairs of the mosaics having distance values less than or equal to the threshold are retained as representing potentially common physical areas, and pairs of mosaics having distance values greater than the threshold are discarded as not representing common physical areas, i.e., being "dissimilar."
  • the step of comparing the mosaics may further comprise, for pairs of mosaics, performing a second alignment, also referred to as a finer alignment.
  • the step of performing the second alignment of the pairs of mosaics may include cropping the mosaics based on parameters determined during the step of performing the first alignment of the pairs of the mosaics.
  • the step of performing the second alignment of the pairs of mosaics may comprise dividing each of the mosaics, as cropped above, into a plurality of finer strips (compared with the coarse alignment stage) and comparing a pair of strips including a strip from each mosaic of the pair of mosaics.
  • the step of performing an alignment of the pairs of mosaics further comprises determining a vertical alignment of each of the pairs of strips and a horizontal alignment of each of the pairs of mosaics.
  • the method may include dividing the plurality of consecutive frames into a plurality of scenes, and preparing one or more mosaic representations for each scene.
  • the technique may further include comparing a pair of scenes by determining the distance value between the scenes in the pair of scenes. This step of determining the distance value may comprise determining the minimum distance between pairs of mosaics including one mosaic from each of the pairs of scenes.
  • a further step may include clustering each of the distance values of the pairs of scenes into a matrix arranged by physical settings. For comparing different videos, such as different episodes in a series, physical settings that are repeatedly found in each episode may be identified, and physical settings that are unique to an episode may be identified.
  • FIG. 1 illustrates the hierarchical representation of a video in accordance with the present invention.
  • FIG. 2 illustrates a representation of several key frames in accordance with the present invention.
  • FIG. 3 illustrates a plurality of mosaics generated from a plurality of shots in accordance with the present invention.
  • FIG. 4 illustrates the HSI color space.
  • FIG. 5 illustrates a representation of key frames and mosaics from a first shot in accordance with the present invention.
  • FIG. 6 illustrates a representation of key frames and mosaics from a second shot in accordance with the present invention.
  • FIG. 7 illustrates a representation of a first difference matrix in accordance with the present invention.
  • FIG. 8 illustrates a representation of a first difference matrix in accordance with the present invention, using strips having different dimensions.
  • FIG. 9 illustrates a representation of a second distance matrix in accordance with the present invention.
  • FIG. 10 illustrates an exemplary plot for performing outlier analysis in accordance with a preferred embodiment of the present invention.
  • FIG. 11 illustrates the first and second mosaics and a first stage in a finer stage analysis in accordance with a preferred embodiment of the present invention.
  • FIG. 12 illustrates a distance matrix used in a second stage of a finer stage analysis in accordance with a preferred embodiment of the present invention.
  • FIG. 13 illustrates a plurality of frames and a mosaic created from the frames in accordance with the present invention.
  • FIG. 14 illustrates a plurality of screen shots from a video.
  • FIG. 15 illustrates a comparison of mosaics prepared from screen shots similar to those of FIG. 14, in accordance with the present invention.
  • FIG. 16 illustrates a plurality of screen shots from another video.
  • FIG. 17 illustrates a clustering of mosaics prepared from screen shots similar to those of FIG. 16, in accordance with the present invention.
  • FIG. 18 illustrates another comparison of mosaics prepared from screen shots similar to those of FIG. 16, in accordance with the present invention.
  • FIG. 19 illustrates a plurality of camera locations and associated shots taken from a physical setting.
  • FIGS. 20-23 illustrate transformation analysis for a plurality of shot types in accordance with the present invention.
  • FIG. 24 illustrates a technique for determining the dissimilarity between two scenes in accordance with the present invention.
  • FIG. 25 illustrates a similarity graph for clustering scenes from a video in accordance with the invention.
  • FIG. 26 illustrates the representation of physical settings with corresponding scenes in accordance with the invention.
  • FIGS. 27-29 illustrate similarity graphs for three episodes of a program in accordance with the present invention.
  • FIGS. 30-32 illustrate the dendrograms for clustering scenes of three episodes of a program in accordance with the present invention.
  • FIG. 33 illustrates a comparison of the episodes represented in FIGS. 27 and 30, 28 and 31, and 29 and 32, respectively, in accordance with the present invention.
  • FIG. 34 illustrates a similarity graph for a fourth episode of a program in accordance with the present invention.
  • FIG. 35 illustrates the dendrogram for clustering scenes of the fourth episode of FIG. 34 of a program in accordance with the present invention.
  • FIGS. 36-39 illustrate screen shots of a browser for use with the technique in accordance with the present invention.
  • The same reference numerals and characters, unless otherwise stated, are used to denote like features, elements, components or portions of the illustrated embodiments.
  • While the subject invention will now be described in detail with reference to the figures, it is done so in connection with the illustrative embodiments. It is intended that changes and modifications can be made to the described embodiments without departing from the true scope and spirit of the subject invention as defined by the appended claims.
  • a video summarization technique is described herein, which, given a video sequence, generates a semantic hierarchical representation 10, illustrated in FIG. 1 with a tree-like representation.
  • This representation becomes more compact at each level of the tree.
  • The bottom level, e.g., frames 12, contains all of the individual frames of the video, while the highest level, e.g., physical settings 14, may have only 5-6 representative images.
  • the next two levels of the tree represent a temporal segmentation of the video into shots 16 and scenes 18.
  • the highest level, physical settings 14, represents a novel abstraction of video and is based on a non-temporal representation, as will be described in greater detail herein.
  • the first temporal segmentation of the video is a segmentation into shots 16, presented in the second level of the tree.
  • Shots are defined herein as sequences of consecutive frames taken from the same camera or recording device.
  • Many algorithms are known in the art for shot boundary detection, in order to divide the frames 12 into a plurality of shots 16 (e.g., A. Aner and J. R. Kender, "A Unified Memory-Based Approach to Cut, Dissolve, Key Frame and Scene Analysis," IEEE ICIP, 2001; R. Lienhart, "Comparison Of Automatic Shot Boundary Detection Algorithms," Proc. of SPIE).
  • a second temporal segmentation is the segmentation of the shots 16 into scenes 18.
  • the definition of a "scene,” as used herein, is a collection of consecutive shots, which are related to each other by the same semantic context. For example, in the exemplary embodiment, consecutive shots which were taken in the same physical location and that describe an event or a story which is related in context to that physical location are grouped together into one scene.
  • Methods of scene boundary detection known in the art include, e.g., A. Hanjalic, R.L. Lagendijk, and J. Biemond, "Automated High-Level Movie Segmentation For Advanced Video Retrieval Systems," IEEE Transactions on Circuits and Systems for Video Technology, Volume 9, June 1999; and J.R. Kender and B.L. Yeo, "Video Scene Segmentation via Continuous Video Coherence," IEEE Conference on Computer Vision and Pattern Recognition, 1998.
  • The term "physical settings" refers to groups of scenes 18, such that each group takes place in the same location.
  • These settings were found to be well-defined in many television programs.
  • the methods described herein were applied to "situation comedies,” or "sitcoms.”
  • In one such program, two physical settings are the main character's apartment and a diner. These physical settings occur many times during episodes of this program, and are therefore directly related to the main "theme" of the program.
  • each episode also has 2-3 physical settings which are typically unique to that episode. These physical settings can be determined by comparing physical settings across different episodes of the same sitcom program. These special settings may be used to infer the main "plots" of each episode, and are therefore useful for representing the content of the episode.
  • The methods described herein capture the highest level of a video using a scene-based representation. This allows for efficient browsing and indexing of video data, and solves the problems usually caused by a frame-based representation.
  • Mosaics are used for representing shots, and also (due to the concept of establishing shots or pan and zoom shots) for scene representation.
  • the mosaics are used to spatially cluster scenes or shots, depending on the video genre.
  • the background information is typically used, since this information is the most relevant for the comparison process of shots.
  • the background information is further used for shot and scene clustering and for gathering general information about the whole video sequence.
  • A complete representation of a single shot, however, also involves foreground information. An example of a single-shot representation is the "synopsis mosaic," presented in M. Irani and P. Anandan, "Video Indexing Based On Mosaic Representations," Proceedings of the IEEE, Volume 86, 1998.
  • FIG. 2 shows a collection of key-frames 20, 22, 24, 26, and 28 taken from a panning shot. (In the exemplary embodiment, the key frames are selected by hand, although automatic key-frame generation may alternatively be used.)
  • In the panning shot, an actress is tracked walking across a room, beginning at the right side of the room and ending at the left side. Thus frame 28 is followed by frame 26, etc., ending at frame 20.
  • a mosaic 30 of the panning shot discussed above is shown in FIG. 2. The whole room is visible and even though the camera has changed its zoom, the focal length changes are eliminated in the mosaic.
  • Mosaic construction in this manner is appropriate when (a) the 3D physical scene is relatively planar; or (b) the camera motion is relatively slow; or (c) the relative distance between the surface elements in the 3D plane is relatively small compared with their distance to the camera.
  • Cameras are mostly static, and most camera motion is either translation and rotation about a main axis, or a zoom. Since the physical setting is limited in size, both camera movement and camera positioning are constrained.
  • Cameras are positioned horizontally, i.e., scenes and objects viewed by the cameras will be situated parallel to the horizon.
  • the method described herein incorporates a novel use of mosaics for shot comparison and for hierarchical representation of video.
  • The construction of color mosaics herein uses the technique of gray-level mosaic construction described in M. Irani, P. Anandan, J. Bergen, R. Kumar, and S. Hsu, "Efficient Representation Of Video Sequences And Their Applications," Signal Processing: Image Communication, Volume 8, 1996, which is incorporated by reference in its entirety herein.
  • However, this technique provides no capability for color video, and accordingly significant changes have been made to it, as discussed below.
  • a first step is the generation of affine transformations between successive frames in the sequence (in the exemplary embodiment, sampling was performed at 6 frames/sec).
  • One of the frames is then chosen as the "reference frame,” that is, as the basis of the coordinate system, such that the mosaic plane will be this frame's plane.
  • This frame is selected automatically using the method below, and illustrated in FIGS. 20-23.
  • This reference frame is "projected” into the mosaic using an identity transformation. The rest of the transformations are mapped to this coordinate system and are used to project the frames into the mosaic plane. It is noted that although the true transformations between the frames in some shot sequences are projective and not affine, only affine transformations are computed. (This may result in some local distortions in the mosaic, but prevents the projective distortion (illustrated by mosaic 32 in FIG. 2)).
  • each pixel in the mosaic is determined by the median value of all of the pixels that were mapped into it. Simply taking the median of each channel may result in colors which might not have existed in the original frames, and it is desirable to use only true colors.
  • the frames are converted to gray-level images while maintaining corresponding pointers from each gray-level pixel to its original color value.
  • For each mosaic pixel, an array is formed of all gray-level values from the different frames that were mapped onto it; the median gray-level value of that array is determined, and then the corresponding color value for that pixel is used in the mosaic.
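  • A minimal sketch of this per-pixel median selection (NumPy assumed; the Rec. 601 luminance weights are an illustrative choice of gray-level conversion):

```python
import numpy as np

def median_color_pixel(color_samples):
    """
    Pick the color whose gray-level value is the median of all samples mapped
    onto one mosaic pixel, so that only colors which actually occurred in the
    frames appear in the mosaic.
    color_samples: list of (B, G, R) tuples from the frames warped onto this pixel.
    """
    samples = np.asarray(color_samples, dtype=np.float32)
    # Luminance-style gray level (Rec. 601 weights), computed per sample.
    gray = 0.114 * samples[:, 0] + 0.587 * samples[:, 1] + 0.299 * samples[:, 2]
    median_index = np.argsort(gray)[len(gray) // 2]   # index of the median gray value
    return tuple(samples[median_index].astype(np.uint8))
```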
  • Outlier rejection, as known in the art and described in M. Irani, P. Anandan, J. Bergen, R. Kumar, and S. Hsu, "Efficient Representation Of Video Sequences And Their Applications," Signal Processing: Image Communication, Volume 8, 1996, incorporated by reference herein, is used both to improve the accuracy of the affine transformations constructed between frames and to detect and segment out all of the moving objects from the mosaic. This results in "clear" mosaics, e.g., mosaic 30, where only the background of the scene is visible, as shown in FIG. 2. A subsequent stage in the method is the comparison of mosaics.
  • The mosaics 40, 42 and 44 in FIGS. 3(a)-3(c) show the physical scene from similar angles (since they were generated from shots taken by cameras located close to one another).
  • the mosaics 46 and 48 in FIGS. 3(d)-3(e) were generated from shots that were taken from cameras located in totally different locations than the ones used in FIGS. 3(a)-3(c), and therefore show different parts of the physical scene. It would appear to be easier to cluster mosaics of FIGS. 3(a)-3(c) together into one group and the mosaics of FIGS. 3(d)-3(e) together in a second group, based solely on these image properties. However, in order to cluster mosaics 40-48 of FIGS. 3(a)-3(e) into one single cluster, information is obtained from other scenes where similar camera locations were used.
  • Each of the mosaics is divided into smaller regions to look for similarities in consecutive relative regions.
  • the properties of the mosaics that are being compared include visual similarities, or other features, such as texture.
  • assumptions about the camera viewpoint and placement are made to more efficiently analyze the images. Such assumptions may include noting the horizontal nature of the mosaics. Camera locations and movements are limited due to physical set considerations, which causes the topological order of the background to be constant throughout the mosaics. Therefore, the corresponding regions only have horizontal displacements, rather than more complex perspective changes.
  • the approach of rubber-sheet matching is used, which takes into account the topological distortions among the mosaics, and the rubber-sheet transformations between two mosaics of a same physical scene.
  • the comparison process is done in a coarse to fine manner.
  • a first step is to coarsely detect areas in each mosaic-pair which correspond to the same spatial area. It is typically required that sufficient portions of the mosaics match in order to determine them as similar. For example, since the mosaic is either bigger or has the same size as the original video frame, the width of the corresponding areas detected for a matching mosaic pair should be not less than approximately the original frame width.
  • The height of this area should be at least 2/3 of the original frame height (the upper part of the mosaic is typically used in this method). This requirement is motivated by cinematography rules, known in the art, concerning the focus on active parts of the frame (see, e.g., D. Arijon, Grammar of the Film Language, Silman-James Press, 1976).
  • the mosaics are divided into relatively wide vertical strips (60 pixels wide) and these strips are compared by distance matching, as will be described below.
  • An approximate common region in the two mosaics is determined by coarsely aligning a sequence of vertical strips in one mosaic with a sequence of vertical strips in the second mosaic.
  • the coarse detection stage is first performed in order to identify candidate matching regions in every mosaic pair. If no corresponding regions are found, the mosaics are determined to be "dissimilar.” In cases where candidate matching regions are found, a threshold is used, which is determined from matching mosaic-pairs as described in greater detail below, in order to discard mosaic pairs with poor match scores. Subsequently, a more restricted matching process is applied on the remaining cropped mosaic pairs.
  • narrower strips are used to finely verify similarities and to generate match scores for each mosaic-pair. This step of dividing the mosaic into smaller regions for comparison is useful since global color matches might occur across different settings, but usually not in different relative locations within them.
  • The technique for defining the distance measure between image regions is explained herein.
  • A color space based on hue, chrominance (saturation) and lightness (luminance) is used in order to be robust to lighting changes; such changes correspond mainly to a variation along the intensity axis.
  • An advantage of this color space is that it is close to the human perception of colors.
  • the HSI color space 60 in polar coordinates, illustrated in FIG. 4, is used, as is known in the art.
  • The intensity channel 62 is computed as luminance (instead of the brightness average of RGB, as is typically used).
  • Hue H 64 represents the impression related to the dominant wavelength of the color stimulus.
  • Saturation S 66 expresses the relative color purity (amount of white light in the color). Hues are determined by their angular location on this wheel. Saturation, or the richness of color, is defined as the distance perpendicular to the intensity axis. Hue and saturation taken together are called the chromaticity coordinates (polar system). Colors near the central axis have low saturation and look pastel. Colors near the surface of the cone have high saturation.
  • the HSI space forces non-uniform quantization when constructing histograms and does not capture color similarities as well as CIELab color space, for example.
  • the appropriateness of any such quantization can be easily validated by converting the quantized HSI values back to RGB space and inspecting the resulting color-quantized images.
  • This procedure allows for tuning the quantization and predicting the results of the three-dimensional HSI histogram difference computations. In the exemplary embodiment, uniform quantization was used for hue, with 18 values. Since both saturation and intensity were found to behave poorly for small values, a non-uniform quantization was used for both. For saturation, a threshold was empirically chosen; for pixels with saturation values below this threshold (e.g., for grays), the hue values were ignored.
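  • A minimal sketch of such a quantized HSI histogram (NumPy assumed; the bin counts and the saturation threshold below are illustrative placeholders rather than the empirically chosen values):

```python
import numpy as np

# Illustrative quantization parameters: 18 uniform hue bins as described above;
# the saturation/intensity bin counts and the low-saturation threshold are
# hypothetical stand-ins for the empirically tuned values.
HUE_BINS = 18
SAT_THRESHOLD = 0.1
SAT_BINS = 3
INT_BINS = 3

def quantized_hsi_histogram(hsi_block):
    """
    Build a coarse HSI histogram for one block.
    hsi_block: float array of shape (h, w, 3) with H in [0, 360), S and I in [0, 1].
    Pixels with very low saturation are treated as gray: their hue is ignored and
    they fall into separate bins indexed only by intensity.
    """
    h = hsi_block[..., 0].ravel()
    s = hsi_block[..., 1].ravel()
    i = hsi_block[..., 2].ravel()

    hist = np.zeros(HUE_BINS * SAT_BINS * INT_BINS + INT_BINS)
    chromatic = s >= SAT_THRESHOLD

    hue_idx = np.minimum((h[chromatic] / 360.0 * HUE_BINS).astype(int), HUE_BINS - 1)
    sat_idx = np.minimum((s[chromatic] * SAT_BINS).astype(int), SAT_BINS - 1)
    int_idx = np.minimum((i[chromatic] * INT_BINS).astype(int), INT_BINS - 1)
    np.add.at(hist, hue_idx * SAT_BINS * INT_BINS + sat_idx * INT_BINS + int_idx, 1)

    gray_idx = np.minimum((i[~chromatic] * INT_BINS).astype(int), INT_BINS - 1)
    np.add.at(hist, HUE_BINS * SAT_BINS * INT_BINS + gray_idx, 1)

    return hist / max(hist.sum(), 1)   # normalize so blocks of equal size compare directly
```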
  • The matrix D(i, j) refers to an N x M matrix where each entry represents a distance measure.
  • This matrix is treated as a rectangular grid, and the best diagonal path is then searched for within that grid. For example, if P{(s,t) → (k,l)} represents a path from node (s,t) to node (k,l) of length L, then its weight W(P) is defined by the average weight of its nodes: W(P) = (1/L) Σ_{(i,j) ∈ P} D(i,j).
  • The best diagonal path P with the minimum weight of all diagonals is searched for, that also satisfies the following constraints: 1. Length(P) ≥ T_Length; 2. Slope(P) ≤ T_Slope. (A sketch of this diagonal search is given after the discussion below.)
  • The constraints T_Length and T_Slope specify thresholds for minimum diagonal length and maximum slope value, respectively.
  • The first constraint is determined by the width and height of the original frames in the sequence, which determine the mosaic's size (e.g., 352 x 240 pixels in the exemplary embodiment), since it is required that sufficient portions of the mosaics will match. For example, for a frame size of 352 x 240 pixels, the width of the strips in the coarse stage (described below) was set to 60 pixels, and T_Length was set to 5 for the horizontal alignment. Thus the width of the matched region is at least 300 pixels.
  • the second constraint relates to the different scales and angles of the generated mosaics and to different camera placement. If the matched mosaics are generated from shots both taken from the same location and at the same focal length, then the diagonal will have a slope of 45°. Yet, if one shot was taken from a different angle or with a different zoom, then the slope changes.
  • The scale difference could be as large as 2:1, resulting in a slope of approximately 26°. Therefore, diagonals with slopes varying between 25° and 45° in both directions are examined (allowing either the first mosaic or the second mosaic to be wider). Intervals of 5° are used, resulting in a total of 9 possible slopes.
  • The values along this diagonal are interpolated. Bilinear interpolation is one alternative, but it has certain drawbacks, e.g., it is time-consuming. Experiments have shown that nearest-neighbor interpolation gives satisfactory results. With the use of indices and look-up tables, this technique was implemented.
  • the nearest neighbor algorithm was modified and reordered to save computation time but also provide more flexibility on imposing additional slope and length constraints.
  • (The term "entries" refers to the indexes of the matrix. An entry in a two-dimensional array, i.e., a matrix, is the pair of numbers (i, j) where i is the row number and j is the column number.)
  • the transformation of the distance matrix of a specified slope is modeled into a matrix of scale 1:1.
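  • A minimal sketch of the diagonal search (NumPy assumed; for brevity only diagonals anchored at the left edge of the matrix are examined, and the slope set mirrors the 25°-45° range in both directions):

```python
import numpy as np

def best_diagonal(D, min_length=5, slopes_deg=range(25, 50, 5)):
    """
    Search a distance matrix D (N x M) for the diagonal path of minimum average
    value over slopes between 25 and 45 degrees in both directions, sampling D
    with nearest-neighbour interpolation along each candidate diagonal.
    Returns (best_average, start, end) or None if no admissible diagonal exists.
    """
    n, m = D.shape
    ratios = [np.tan(np.radians(a)) for a in slopes_deg]          # rows advanced per column
    ratios += [1.0 / r for r in ratios if not np.isclose(r, 1.0)] # mirror for the other direction

    best = None
    for ratio in ratios:
        for i0 in range(n):                  # simplification: diagonals start at the left edge
            path = []
            i, j = float(i0), 0
            while j < m and int(round(i)) < n:
                path.append((int(round(i)), j))   # nearest-neighbour row index
                i += ratio
                j += 1
            if len(path) < min_length:            # enforce the minimum-length constraint
                continue
            avg = float(np.mean([D[r, c] for r, c in path]))
            if best is None or avg < best[0]:
                best = (avg, path[0], path[-1])
    return best
```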
  • a coarse horizontal alignment of two consecutive strip-sequences in a mosaic-pair is performed in order to detect a common physical area.
  • In FIG. 5, several key frames 72, 74, 76, 78 and 80 of a first shot are used to generate the mosaic 82.
  • In FIG. 6, key frames 82, 84, 86, 88 and 90 of a second shot are used to generate the mosaic 92.
  • An exemplary strip s_i 94 in mosaic 82 and strip s_j 96 in mosaic 92 comprise a mosaic strip-pair.
  • The width of the strips 94 and 96 is set to be 60 pixels each, since no more than about 5-6 vertical segments are needed, and a finer comparison stage is subsequently performed (described in greater detail below).
  • Each strip s_i 94 and s_j 96 is further divided into blocks of size 60 x 60, and a block-to-block distance matrix B[k,l] 100 is generated for each pair of strips, as illustrated in FIG. 7.
  • Each entry 102 in this matrix 100 is the histogram difference between two blocks, block b_k from the first strip s_i 94 and block b_l from the second strip s_j 96: B[k,l] = Diff(b_k, b_l),
  • where Diff(b_k, b_l) is the histogram distance defined above.
  • The best diagonal 104 (the thin white diagonal line) is located (as discussed above) in the distance matrix B[k,l] and its start and end points are recorded. The value of this diagonal 104 is chosen as the distance measure between the two strips: S[i,j] = min_{d ∈ diags} (1/|d|) Σ_{(k,l) ∈ d} B[k,l],
  • where diags is the set of all allowable diagonals in the matrix B[k,l], and each diagonal d ∈ diags is given by a set of pairs of indices (k,l). The average of the values along the best diagonal of the strip-to-strip matrix, in turn, defines the distance between the two mosaics.
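  • A minimal sketch of the strip-to-strip distance (NumPy assumed; it reuses best_diagonal() from the sketch above, and the L1 histogram difference stands in for the HSI histogram distance Diff):

```python
import numpy as np

def histogram_difference(h1, h2):
    """L1 distance between two normalized block histograms (one plausible choice for Diff)."""
    return float(np.abs(h1 - h2).sum())

def strip_distance(blocks_i, blocks_j):
    """
    Distance between two vertical strips, each given as a list of block
    histograms (one per 60x60 block, top to bottom). Builds the block-to-block
    matrix B[k, l] and takes the value of its best diagonal as the strip distance.
    """
    B = np.array([[histogram_difference(bk, bl) for bl in blocks_j] for bk in blocks_i])
    # The vertical overlap only needs to cover roughly 2/3 of the frame height,
    # so a shorter minimum diagonal length is used here than in the horizontal stage.
    result = best_diagonal(B, min_length=3)
    if result is None:
        return np.inf, None, None          # no admissible vertical alignment
    avg, start, end = result
    return avg, start, end                 # start/end are recorded for later cropping
```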
  • FIGS. 8(a)-8(b) illustrate the procedure for comparing mosaics having different dimensions.
  • The method described herein uses a coarse alignment of a sequence of k strips with a sequence of up to 2k strips, therefore allowing the scale difference between the two mosaics to vary between 1:1 and 2:1.
  • Examples of cropped regions 116 and 118 from mosaics 110 and 112, respectively, are shown in FIG. 8, used to create a block-to-block distance matrix 114 (the procedure for cropping is described below). This allows for matching mosaics which were generated from shots taken at different focal lengths.
  • Although the exemplary embodiment supported a scale difference as large as 2:1, it is contemplated that routine modifications would permit larger scale differences.
  • the strip-to-strip distance matrix S 120 is graphically displayed in FIG. 9.
  • the entry 122 in matrix 120 is the result from the two strip comparison of FIG. 7, i.e., the distance between the two strips 94 and 96.
  • the height 124 of this block matrix S 120 is the same as width 95 of the mosaic 82 in FIG. 5 and its width 126 is the width 97 of the mosaic 92 in FIG. 6, such that each block 128 represents the strip-to-strip difference between the two mosaics 82 and 92.
  • the distance values of the diagonal paths found for each pair are checked. It is expected that mosaic-pairs from different settings or with no common physical area will have high distance values, therefore they are discarded according to a threshold.
  • The threshold is determined by sampling several mosaic-pair distances which have common physical areas, and therefore would have the lowest distance values. The highest of these sampled values is selected as the threshold. This sampling method yielded accurate results in the exemplary embodiment after inspecting all mosaic-pair distances, as shown in graph 200 in FIG. 10. Mosaic-pairs with common physical areas (left of the vertical line 202) are separated from the rest of the mosaic-pairs (right of the vertical line 202).
  • the sampled pairs appear on the left-most side (left of the leftmost vertical line 206).
  • The maximum distance value among mosaic-pairs with common background is indicated by the horizontal line 204, which rejects mosaic pairs known to be from different physical areas, although it may permit a few false positive matches. If a diagonal path distance value exceeds this threshold, it is determined that the two mosaics do not match. If a diagonal path is found that is below the threshold, a match is found, and the method continues to the finer step, described below. After discarding mosaic-pairs which had diagonal paths with large distance values, the measure of their closeness is refined in order to detect false positives and to more accurately determine physical background similarity.
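  • A minimal sketch of this threshold selection (the sampled distances are assumed to come from mosaic pairs known to share a common physical area):

```python
def matching_threshold(sampled_matching_distances):
    """
    Derive the rejection threshold from a handful of sampled mosaic-pair
    distances that are known to share a common physical area: the highest of
    these sampled values is used as the threshold.
    """
    return max(sampled_matching_distances)

def is_candidate_match(distance, threshold):
    """Mosaic pairs whose best-diagonal distance exceeds the threshold are discarded."""
    return distance <= threshold
```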
  • the recorded start and end points of the diagonals from all comparison stages are used to crop each mosaic such that only the corresponding matching areas are left:
  • the start and end points of the diagonal in the S matrix are used to set the vertical borders and the new widths of the cropped mosaics.
  • every pair of strips is inspected along the diagonal path found in S, to determine which parts of the strips were used to give the best match. Since for every strip pair, the start and end points of the best diagonals in its corresponding B matrix were recorded, the average of these values is used to set the horizontal border.
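  • A minimal sketch of the cropping step (indices are in strip/block units; the 60-pixel strip width of the coarse stage is assumed):

```python
def crop_to_matched_region(mosaic, col_range, row_range, strip_width=60):
    """
    Crop a mosaic to the region covered by the best diagonal found in the
    coarse stage. col_range is the (start, end) strip index from the S-matrix
    diagonal, setting the vertical borders; row_range is the averaged
    (start, end) block index from the B matrices along that diagonal, setting
    the horizontal border.
    """
    c0, c1 = col_range
    r0, r1 = row_range
    return mosaic[r0 * strip_width:(r1 + 1) * strip_width,
                  c0 * strip_width:(c1 + 1) * strip_width]
```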
  • the finer stage only the cropped parts of the mosaics are compared by applying a similar method to the one used in the coarse stage.
  • the mosaic 82 is cropped to the region 130 in FIG. 11.
  • the mosaic 92 is cropped to the region 132.
  • the cropped mosaics 130 and 132 are displayed in FIG. 12.
  • The cropped mosaic 130 is also rotated to better present the graphical representation of the new S distance matrix 134 between the two cropped mosaics 130 and 132. Thinner strips (20 pixels wide) are used in this stage and the scale difference between the two mosaics is also taken into account. In certain circumstances, one mosaic may be wider than the other. Assuming that the cropped mosaics cover the same regions of the physical setting, the narrower cropped mosaic is divided into K thin strips (20 pixels wide); the best match will be a one-to-one match with K wider strips of the wider mosaic, where each strip pair covers the exact physical area. Let α > 1 be the "width ratio" between the two mosaics.
  • Histograms are re-computed for 20 x 20 blocks of the narrower cropped mosaic, and for 20α x 20α blocks of the wider mosaic.
  • the best path in the new distance matrix should have a slope of 45°.
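  • A minimal sketch of the finer-stage comparison (NumPy assumed; strip_histograms() is a hypothetical helper returning one histogram per block of a strip, and strip_distance() and best_diagonal() are reused from the sketches above):

```python
import numpy as np

def finer_stage_distance(narrow_mosaic, wide_mosaic, strip_width=20):
    """
    Finer comparison of two cropped mosaics. The narrower mosaic is cut into
    thin strips of strip_width pixels; the wider one into strips scaled by the
    width ratio alpha, so that a correct match corresponds to a 45-degree
    diagonal in the resulting strip-to-strip matrix.
    """
    alpha = wide_mosaic.shape[1] / narrow_mosaic.shape[1]        # width ratio >= 1
    wide_width = int(round(strip_width * alpha))

    narrow_strips = [narrow_mosaic[:, x:x + strip_width]
                     for x in range(0, narrow_mosaic.shape[1] - strip_width + 1, strip_width)]
    wide_strips = [wide_mosaic[:, x:x + wide_width]
                   for x in range(0, wide_mosaic.shape[1] - wide_width + 1, wide_width)]

    S = np.array([[strip_distance(strip_histograms(a, strip_width),
                                  strip_histograms(b, wide_width))[0]
                   for b in wide_strips] for a in narrow_strips])
    result = best_diagonal(S, slopes_deg=[45])   # expect a 1:1 alignment at this stage
    return result[0] if result else np.inf
```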
  • Matching in the finer stage is less computationally expensive than the matching in the coarse stage.
  • the method of matching only corresponding regions in the mosaic addresses the problem of matching shots taken from cameras in different positions, where different parts of the background are visible.
  • the problem of camera zoom and angle is addressed in the finer stage, which allows accurate matching of cropped mosaics of the same setting but with different scale and angle.
  • a first example of the mosaic-based representation is the representation and comparison of shots from sports videos. Since information is needed from every shot, the coarse comparison stage was used to save computation time. This stage provided the capability to distinguish between shots showing wide side-views of the basketball court ("court shots"), close-up shots of players, shots taken from above the basket, and shots showing the field from a close side-view. Clustering mosaics of basketball shots led to a more meaningful shot categorization than clustering key-frames of the same shots.
  • An example of a mosaic 300 generated from a panned and zoomed court shot is shown in FIG. 13, along with the corresponding key frames 302, 304, and 306.
  • Preliminary results from clustering basketball videos allowed for the classification of shots and for the determination of temporal relations between clusters.
  • Filtering rules, corresponding to video grammar editing rules, were manually determined to extract events of human interest. For example, in a video of a European basketball game, all foul basket-shots were detected and used as bookmarks in a quick browsing tool shown in FIG. 14.
  • Foul penalty throws were characterized by a three-part grammatical sequence: a court shot 320, followed by a close-up shot 322 and a shot from above the basket 324.
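  • A minimal sketch of detecting this three-part pattern over the temporally ordered sequence of shot cluster labels (the label strings are illustrative):

```python
def find_foul_shot_bookmarks(shot_labels):
    """
    Scan the temporally ordered sequence of shot cluster labels for the
    three-part pattern used to characterize foul penalty throws: a court shot,
    followed by a close-up, followed by an above-the-basket shot.
    Returns the indices of the court shots that start each detected sequence.
    """
    pattern = ("court", "close_up", "above_basket")
    bookmarks = []
    for i in range(len(shot_labels) - len(pattern) + 1):
        if tuple(shot_labels[i:i + len(pattern)]) == pattern:
            bookmarks.append(i)
    return bookmarks

# Example: labels assigned by the clustering stage to the shots of a game segment.
labels = ["court", "close_up", "above_basket", "court", "court", "close_up"]
print(find_foul_shot_bookmarks(labels))   # -> [0]
```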
  • the results 330 from the shot clustering (in the coarse stage) are shown in FIG. 15.
  • the following approach was used to cluster the mosaics: First a mosaic was generated for each shot.
  • A dendrogram 400, representing the clustering result of shots 402 (from the video of FIG. 16) during several minutes of a game, is shown in FIG. 17.
  • the upper cluster 410 represents close-up shots
  • the lower cluster 412 represents court shots.
  • clusters could be further divided into smaller categories, such as court/audience close-ups and whole/right/left court shots.
  • In the first quarter of one of the first games of the season, 32 sequences of a court shot followed by a close-up shot were detected, which were good candidates for representing field goals.
  • Of these, 18 were regular field goals, 7 were assaults, 2 occurred just before a time-out, and the remaining 5 instances showed the missed field goals of a well-known NBA player.
  • (This video was of popular basketball player Michael Jordan's first game returning to play after retirement, which may explain why the cameras switched to close-ups even though Jordan missed.) All of these instances serve as "interesting" events in the game.
  • Screen captures of this video player for a foul shot bookmark are shown in FIG. 16.
  • The preliminary clustering results 420, which separated field shots from close-up shots, are shown in FIG. 18. This figure shows the first-stage (coarse) comparison of shots from that game.
  • The top-left cluster 422 along the main diagonal 424 represents court shots which cluster together in this stage; the following cluster 426 represents various close-ups of basketball players and the coach.
  • EXAMPLE 2 A second example utilizes the mosaic-based representation to generate a compact hierarchical representation of video. This representation is prepared using the sitcom video genre, although it is contemplated that this approach may be used for other video genres.
  • The video sequences are first divided into shots using a shot transition detection technique described in A. Aner and J. R. Kender, "A Unified Memory-Based Approach to Cut, Dissolve, Key Frame and Scene Analysis," IEEE ICIP, 2001, which is incorporated by reference herein. These shots are further divided into scenes using the method described in J.R. Kender and B.L. Yeo, "Video Scene Segmentation via Continuous Video Coherence," IEEE Conference on Computer Vision and Pattern Recognition, 1998, incorporated by reference herein.
  • the hierarchical tree representation is illustrated in FIG. 1. Shots and scenes are represented with mosaics, which are used to cluster scenes according to physical location.
  • The hierarchical representation in FIG. 1 illustrates the new level of abstraction of video, which forms the top level of the tree-like representation and is referred to herein as the physical setting 14.
  • This high-level semantic representation not only forms a very compact representation for long video sequences, but also allows for efficiently comparing different videos (e.g., different episodes of the same sitcom). By analyzing the comparison results, the method makes it possible to infer the main theme of the sitcom as well as the main plots of each episode.
  • a scene is a collection of consecutive shots, related to each other by some spatial context, which could be an event, a group of characters engaged in some activity, or a physical location.
  • a scene in a sitcom typically occurs in a specific physical location, and this location is usually repeated throughout the episode.
  • Some physical locations are characteristic of the specific sitcom, and repeat in almost all of its episodes. Therefore, it is advantageous to describe scenes in sitcoms by their physical location and use these physical settings to generate summaries of sitcoms and to compare different episodes of the same sitcom.
  • The novel technique described herein uses shots that have the most information about the physical scene. These shots are selected by automatically detecting static, panning and zoom shots: the first shot of the scene, the establishing shot, is selected, along with all shots that have a large pan or zoom. The process of automatically detecting these shots is described below.
  • An example of a scene in a sitcom is shown in FIG. 19 for physical setting 500. Shots 502, 504, 506 that were photographed from different cameras 512, 514, 516, respectively, are typically very different visually, even though they all belong to the same scene and were taken at the same physical setting.
  • a "good shot” is a pan shot, as shown in FIG. 2, described above.
  • Another example is a zoom shot. A zoomed out portion of the shot will most likely show a wider view of the background, and thereby expose more parts of the physical setting.
  • Detecting "good shots” is done by examining the registration transformations computed between consecutive frames in the shot. By analyzing those transformations, shots are classified as panning (or tracking), zoomed-in, zoomed-out or stationary. This classification is done automatically, as follows: For each shot, the affine transformations that were previously computed in the mosaic construction stage, are consecutively applied on an initial square This is coded in matlab, provided in the appendix hereto as the routine: "check_affine.m” Each transformed quadrilateral was then measured for size and distance from a previous quadrilateral. Static shots are shown in FIG. 20. Pan shots were determined measuring distances between the quadrilaterals (FIG.
  • zoom shots were determined by measuring varying size of quadrilaterals (FIG. 22), and parallax was determined both by size and scale and by measuring the quality of the affine transformation computed (checking accumulated error and comparing to inverse transformations), as illustrated in FIG. 23.
  • the initial square is size is 1 Ox 10 and it resides in the middle of a generated figure. If the shot has N sampled frames, then their N corresponding computed affine transformations are applied - first on the first square, and then on the resulting quadrilaterals. As a result. N quadrilaterals are generated which are represented in different colors/tones to differentiate between them when looking at the generated figure.
  • the axes in the figure's graph represent the dimensions of the space in which the quadrilaterals reside. For example, if the shot had a large pan or zoom, then this space becomes very large. If the shot was static, then this space stays relatively small.
  • Each square in FIGS. 20-23 represents a sampled frame.
  • the colors/tones in FIGS. 20-23 are selected from a fixed array of 7 colors/tones.
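  • A minimal Python sketch in the spirit of the appendix routine check_affine.m (the thresholds and the size/displacement measures are illustrative, not the values used in the exemplary embodiment):

```python
import numpy as np

def apply_affine(A, pts):
    """Apply a 2x3 affine matrix to an array of 2D points (N x 2)."""
    A = np.asarray(A, dtype=float)
    return pts @ A[:, :2].T + A[:, 2]

def classify_shot(affines, pan_thresh=20.0, zoom_thresh=1.5):
    """
    Classify a shot as 'static', 'pan' or 'zoom' by consecutively applying its
    frame-to-frame affine transforms to an initial 10x10 square and measuring
    how the resulting quadrilaterals move and change size.
    """
    square = np.array([[0, 0], [10, 0], [10, 10], [0, 10]], dtype=float)
    quads = [square]
    for A in affines:
        quads.append(apply_affine(A, quads[-1]))

    centers = np.array([q.mean(axis=0) for q in quads])
    sizes = np.array([np.linalg.norm(q[1] - q[0]) for q in quads])  # edge length as a size proxy

    displacement = np.linalg.norm(centers[-1] - centers[0])
    scale_change = sizes[-1] / sizes[0]

    if scale_change > zoom_thresh or scale_change < 1.0 / zoom_thresh:
        return "zoom"   # whether this is zoom-in or zoom-out depends on the
                        # direction convention of the stored transforms
    if displacement > pan_thresh:
        return "pan"
    return "static"
```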
  • the reference frame for the mosaic plane is chosen accordingly. This is performed automatically according to the following determined classifications: For panning and stationary shots, the middle frame is chosen. For zoomed-in shots the first frame is chosen. For zoomed-out shots, the last frame is chosen. A threshold is derived experimentally to allow for the selection of shots with significant zoom or pan.
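  • A minimal sketch of this reference-frame selection rule (assuming the shot classification distinguishes zoom-in from zoom-out):

```python
def choose_reference_frame(shot_type, num_frames):
    """
    Choose the reference frame index for the mosaic plane: middle frame for
    panning and stationary shots, first frame for zoomed-in shots, last frame
    for zoomed-out shots.
    """
    if shot_type == "zoom_in":
        return 0
    if shot_type == "zoom_out":
        return num_frames - 1
    return num_frames // 2
```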
  • the first interior shot of each scene is selected to be an R-mosaic (for a static scene it is the only R-mosaic).
  • Many indoor scenes also have an exterior shot preceding the interior "establishing shot," which photographs the building from the outside. This exterior shot may be detected because it does not cluster with the rest of the following shots into their scene, and also does not cluster with the previous shots into the preceding scene. Instead, it is determined as a unique cluster of its own. These unique clusters are detected and disregarded when constructing a scene representation.
  • FIG. 24 illustrates the procedure by which this is performed.
  • For scene i 600, the following mosaics have been created: mosaic 1 602, mosaic 2 604, mosaic 3 606, and mosaic 4 608.
  • For scene j 610, the following mosaics have been created: mosaic 1 612 and mosaic 2 614.
  • The distances, or dissimilarities, between mosaics are computed as discussed above, and are represented in FIG. 24 as distances 616. (For example, the distance 618 is computed between mosaic 1 602 of scene i 600 and mosaic 1 612 of scene j 610.)
  • the distance between each pair of mosaics is computed according to the equations (l)-(4) above, in which equation (1), when applied on the strip-to-strip distance matrix, defines the final distance between the mosaic pair.
  • A scene difference matrix is constructed in which each entry (i, j) corresponds to the difference measure between scene i and scene j.
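  • A minimal sketch of building the scene difference matrix (mosaic_distance stands for the mosaic-pair distance described above):

```python
import numpy as np

def scene_difference_matrix(scene_mosaics, mosaic_distance):
    """
    Build the scene difference matrix: entry (i, j) is the minimum distance over
    all mosaic pairs taken one from scene i and one from scene j.
    scene_mosaics: list of lists of mosaics (or mosaic descriptors), one list per scene.
    mosaic_distance: callable implementing the mosaic-pair distance.
    """
    n = len(scene_mosaics)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = min(mosaic_distance(a, b)
                    for a in scene_mosaics[i] for b in scene_mosaics[j])
            D[i, j] = D[j, i] = d
    return D
```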
  • An example is shown in FIG. 25, in which the entries in the scene difference matrix were arranged manually (for display purposes) so that they correspond to the 5 physical settings of this episode. (The physical settings are automatically determined by the clustering algorithm. However, their naming is done manually.)
  • On the left is the scene list 650 in temporal order, and on the right is the similarity graph 652 for the scene clustering results, in matrix form, in which darker regions represent higher similarity (low distance scores). For example, there are 13 scenes in the episode (as indicated in list 650).
  • scenes 1, 3, 7, 12 took place in the setting which was marked as “Apartment 1,” i.e., the cluster 654.
  • Scenes 2, 6, 9, 11 took place in “Apartment 2,” i.e., the cluster 656.
  • Scenes 5, 10 took place in "Coffee Shop,” i.e., cluster 658.
  • Scenes 4, 13 took place in "Bedroom 1,” i.e., cluster 660, and scene 8 took place in "Bedroom 2.”
  • the scene clustering process typically results in 5-6 physical settings, there are often about 1-6 scenes in each physical setting cluster, and about 1- 6 mosaics representing each scene.
  • Shots of scenes in the same physical setting, and sometimes even within the same scene, are filmed using cameras in various locations which show different parts of the background. Therefore, two mosaics of the same physical setting might not even have any corresponding regions. In order to practice the technique, the representation of a physical setting preferably includes all parts of the background which are relevant to that setting. Therefore, if there is not a single mosaic which represents the whole background, several mosaics which together cover the whole background are used.
  • the results of the matching algorithm's finer stage which recognizes corresponding regions in the mosaics, are used to determine a "minimal covering set" of mosaics for each physical setting.
  • This set is approximated by clustering all the representative mosaics of all the scenes of one physical setting and choosing a single mosaic to represent each cluster.
  • This single mosaic is the centroid of the cluster, i.e., it is the mosaic which has the best average match value to the rest of the mosaics in that cluster.
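  • A minimal sketch of this centroid selection (pairwise_distance is assumed to hold the finer-stage distances between the mosaics of one physical setting):

```python
import numpy as np

def cluster_centroid(mosaic_indices, pairwise_distance):
    """
    Pick the representative mosaic of one cluster: the mosaic with the best
    (lowest) average match value to the other mosaics in the cluster.
    pairwise_distance[(a, b)] holds the distance between mosaics a and b,
    keyed by the sorted index pair.
    """
    def avg_distance(m):
        others = [x for x in mosaic_indices if x != m]
        if not others:
            return 0.0
        return float(np.mean([pairwise_distance[tuple(sorted((m, x)))] for x in others]))

    return min(mosaic_indices, key=avg_distance)
```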
  • FIG. 26 illustrates the hierarchical representation of a single episode, of scenes and physical settings, and the images 712-736 are sampled key frames from the 13 scenes of that episode.
  • the physical settings are represented by mosaics. There are five physical settings (Apartment 1 702, Apartment 2 704, Coffee Shop 706, Bedroom 1 708, and Bedroom 2 710) and 13 scenes 712-736.
  • Frames 712, 716, 724, and 734 correspond to physical setting "Apartment 1" represented by mosaic 702; frames 714, 722, 728, and 732 correspond to physical setting "Apartment 2" represented by mosaic 704; frames 720 and 730 correspond to physical setting "Coffee Shop" represented by mosaic 706; frames 718 and 736 correspond to physical setting "Bedroom 1" represented by mosaic 708; and frame 726 corresponds to physical setting "Bedroom 2" represented by mosaic 710.
  • Table 1 illustrates the compactness of the representation method using settings, scenes, shots, and frames of a single episode.
  • In this episode there are 5 physical settings, each represented by a single R-Mosaic (the R-Mosaics are referred to by their corresponding shot number) and comprising 1-4 scenes.
  • Each scene is represented by only 1-4 R-Mosaics, and has 11-26 shots, i.e., approximately 2400- 11100 frames.
  • FIGS. 30-32 illustrate dendrograms generated using tools as described in Peter Kleiweg, "Data Clustering," online publication, http://odur.let.rug.nl/~kleiweg/clustering/clustering.html, as are known in the art.
  • The program described therein reads a difference file and creates a clustering represented by a dendrogram.
  • This program is an implementation of seven different clustering algorithms, which are described in Anil K. Jain and Richard C. Dubes, Algorithms for Clustering Data, Prentice-Hall, 1988.
  • FIG. 33 represents the results of an inter-video comparison of the physical settings in the three episodes.
  • The order of the entries in the original scene difference matrices was manually arranged to reflect the order of the reappearance of the same physical settings. Dark blocks represent good matches (low difference scores).
  • In FIG. 28, there are 5 main clusters representing the 5 different scene locations. For example, the first cluster along the main diagonal is a 4x4 square representing scenes from Apartment 1.
  • Lines join matching physical settings, which are common settings in most episodes of this sitcom: e.g., line 794 joins the "Apartment 1" setting represented by 760, 770, and 784 in episodes 1, 2, and 3, respectively; line 796 joins the "Apartment 2" settings 772 and 782; and line 798 joins the "Coffee Shop" settings 762, 774, and 780.
  • Peaks are defined herein as settings unique to the episode. For example, in episode 1 (represented by the portion 750 of FIG. 33) there are three main plots involving activities in a dance class 764, a jail 766, and an airport 768. Most episodes of sitcoms have been found to involve two or three plots; anecdotal feedback from human observers of the sitcoms suggests that people relate the plot to the unusual setting in which it occurs. That is, what makes a video unique is the use of settings which are unusual with respect to the library.
  • FIGS. 34-35 Another example is shown for a fourth episode, represented in FIGS. 34-35.
  • There are six physical settings in this episode, i.e., "Apartment 1," "Apartment 2," "Coffee Shop," "Party," "Tennis Game," and "Boss's Dinner."
  • The clustering of scenes into physical settings was not as straightforward as for the previous episodes discussed above and illustrated in FIGS. 27-32. This is due to the fact that the setting of "Apartment 1" was not presented in the same manner, since its scenes took place either in the kitchen or in the living room, but not in both.
  • the setting of "Apartment 1" includes mosaics of both the living room and the kitchen, causing two different settings of this episode to be combined together. More specifically, the "Apartment 1" setting cluster already contains mosaics that match both scene 1 and scenes 7, 9, and 12 from the new episode.
  • Example 2 demonstrates how the non-temporal level of abstraction of a single video could be verified and corrected by semantic inference to other videos. Scenes and settings that otherwise would not have been grouped together are related by a type of "learning" from the previously detected "physical setting" structure of other episodes. For the video genre which was used in example 2, sitcoms, the physical setting structure is well defined and it is straightforward to distinguish between them, as discussed above.
  • The scene dissimilarity measure is used to determine the accuracy of the physical settings detection. Different clustering methods would result in the same physical settings cluster structure as long as the scene distance matrix has the correct values. For example, for episode 4 discussed above with reference to FIGS. 34-35, the inter-video comparison of physical settings would correct the clustering results for the first setting of "Apartment 1," but the clustering threshold was not as pronounced as in the first three episodes. Depending on this threshold, for large values scene 11 could be wrongly clustered with scenes 14, 2, and 4 (see FIG. 35), and for small values scene 15 would not be clustered with scenes 5, 8, 10 and 13, as it should be (see FIG. 35).
  • Several clustering methods were used (all available from the web site referenced above), which performed similarly. Among them were "single link" (the distance between clusters is defined as the minimum distance between their elements), "complete link" (the distance between clusters is defined as the maximum distance between their elements), "group average" (the distance between clusters is defined as the average distance between their elements), and "weighted average" (the distance between clusters is defined as a weighted average of the distances between their elements, where the weight of each element is set according to the number of times that element participated in the cluster combination step of the clustering process). Since the maximum number of scenes encountered in sitcoms was 15, there are at most 15 elements to cluster, causing every clustering algorithm to run fast.
  • a representative user interface for the invention described herein is a video browser, which utilizes the proposed tree-like hierarchical structure as represented in FIG. 1, above.
  • the browser uses the video data gathered from all levels of representation of the video.
  • The MPEG video format of each episode is used for viewing video segments of the original video.
  • At the shot level, the browser uses a list of all marked shots for each episode, including the start and end frame of each shot.
  • At the scene level, it uses a list of all marked scenes for each episode, including the start and end shots of each scene.
  • a list of representative shots is kept, and their corresponding image mosaics are used for display within the browser.
  • At the physical setting level, it uses a list of all detected physical settings for each episode, with their corresponding hand-labeled descriptions (e.g., "Apartment 1," "Coffee Shop," etc.).
  • Each physical setting has a single representative image mosaic, used for display.
  • a representative browser is illustrated in FIG. 36, and is implemented in Java.
  • the main menu is displayed as a table-like summary in a single window 850.
  • Each row 852, 854, and 856 represents an episode of the specified sitcom.
  • the columns 858-878 represent different physical settings that were determined during the clustering phase of scenes for all episodes, as discussed above.
  • Each cell (i,j) in the table is either empty (e.g., empty region 880 corresponding to setting "Apartment 2" 860 and episode 854) or displays a representative mosaic for setting j, taken from episode i.
  • the order of columns from left to right is organized from the most common, i.e., "Apartment 1" 852, to the non-common settings, i.e., "Bar” 878.
  • the first three columns represent common settings which repeat in almost every episode of the specific sitcom.
  • the rest of the columns are generally unique for every episode.
  • the user can immediately recognize the main plots for each episode by looking for non-empty cells in the row of that episode starting from the first column of unique settings, e.g., starting at the fourth column 862 in FIG. 36.
  • the main plots involve scenes taking place in settings "Bedroom 1" 864 and "Bedroom 2" 866.
  • the user can left-click on the representative mosaics for these settings, which displays a window 882 of a short list of scene mosaics that correspond to those settings (usually one or two) as illustrated in FIG. 37.
  • left-clicking on a mosaic in window 882 will enlarge and display the mosaic in window 884 of FIG. 37, and double-clicking on the representative mosaic for each scene will start playing the video from the beginning of that scene in window 886 of FIG. 38.
  • the temporal representation of each episode is also accessed from the main menu 850 and is used for fast browsing of the episode.
  • a window 882 of a list of all scene mosaics belonging to that episode appears (FIG. 39).
  • Each scene on the list shown in window 882 is represented by a single mosaic 886 and it is optionally expanded by left-clicking into a window of a list of representative mosaics (shots) for that scene.
  • the fast browsing is performed by scanning the scenes in order and only playing the relevant video segments from chosen scenes by double-clicking on them, as shown in FIG. 38.
  • the browser discussed herein has the advantage of being both hierarchical, displaying semantically oriented video summaries of videos in a non-temporal tree-like fashion, as well as semantically relating different episodes of the same sitcom to each other.
  • the mosaic-based scene comparison method is not confined to the genre of situation comedies alone. In news videos, for example, it could allow classifying broadcasts from well-known public places. It could also allow classification of different sports videos such as basketball, soccer, hockey and tennis according to the characteristics of the play field.
  • the methods described herein may serve as a useful tool for content-based video access.
  • Video sequences are represented in their multiple levels of organization in the following structure: frames, shots, scenes, settings and themes.
  • the proposed mosaic-based approach allows direct identification of both clusters of scenes (settings) within a video sequence and similar settings across different video sequences, and serves as a useful indexing tool.
  • the comparison by alignment technique is useful to more general image retrieval applications.
  • the technique described herein incorporates spatial information by applying a coarse alignment between the images. It is robust to occluding objects and will match images for which only partial regions match (e.g., the top-left region of one image matches the bottom-left region of a second image).
  • void main ( ) This function reads in a data file which lists shots in a specified directory. For each shot, it reads in the mosaic image generated for this shot, applies a median filter on this image (this is done by applying the function "MedianFilterRGB", which will be described in detail in image.c) and saves the new image into a new image file under that same shot directory.
  • #define DIRNAME "D:\aya\users\friends3\"
  • #define MAX_MOS_HEIGHT 1000
  • void main() { char line[256], filename[256], *token; int i, j; int mosaics[MAX_SHOT_NUM];
  • R = AllocMatrix(MAX_MOS_HEIGHT, MAX_MOS_WIDTH);
  • G = AllocMatrix(MAX_MOS_HEIGHT, MAX_MOS_WIDTH);
  • R1 = AllocMatrix(MAX_MOS_HEIGHT, MAX_MOS_WIDTH);
  • G1 = AllocMatrix(MAX_MOS_HEIGHT, MAX_MOS_WIDTH);
  • hist.h This is a header file that lists all the functions which are implemented in hist.c and will be explained there. It also defines some constants that set the size of the histogram and the histogram structure (three-dimensional histogram) used throughout the program. #ifndef _HIST_H
  • HIST * FillCummHistFromRGBArray1 (int sizeY, int sizeU, int sizeV, unsigned char **R, unsigned char **G, unsigned char **B, int startX, int endX, int startY, int endY);
  • HIST * FillHistFromRGBArray2 (int sizeI, int sizeH, int sizeS, unsigned char **R, unsigned char **G, unsigned char **B, int startX, int endX, int startY, int endY);
  • HIST* AllocHist (int size) Function to allocate space for the histogram structure. All three dimensions of the histogram are set to be of the same size.
  • HIST* AllocHist2 (int sizeY, int sizeU, int sizeV) Function to allocate space for the histogram structure. Each dimension of the histogram is set according to the size specified by the parameters to the function (sizeY, sizeU, sizeV).
  • void FreeHist (HIST *h) : Function to free the space allocated for the histogram structure.
  • HIST * FillHistFromRGBArray1 (int sizeY, int sizeU, int sizeV, unsigned char **R, unsigned char **G, unsigned char **B, int startX, int endX, int startY, int endY) : Function that takes an image and computes its RGB color histogram (three-dimensional histogram in RGB color space). It returns a histogram structure containing the histogram values.
  • HIST * FillHistFromRGBArray2 (int sizeI, int sizeH, int sizeS, unsigned char **R, unsigned char **G, unsigned char **B, int startX, int endX, int startY, int endY) : Function that takes an image and computes its HSI color histogram (three-dimensional histogram in HSI color space). It returns a histogram structure containing the histogram values.
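  • As an illustration only, a three-dimensional color histogram of the kind returned by these functions can be allocated and filled as sketched below; the structure layout, the bin count and the [row][column] indexing convention are assumptions of this sketch and differ from the HIST structure and constants defined in hist.h.

    #include <stdlib.h>

    #define BINS 8   /* illustrative: 8 bins per channel, not the constants of hist.h */

    typedef struct {
        int     size;   /* bins per dimension */
        double *val;    /* size*size*size bin counters */
    } HIST3D;

    /* Allocate a cubic histogram with `size` bins per dimension. */
    HIST3D *AllocHist3D(int size)
    {
        HIST3D *h = (HIST3D *) malloc(sizeof(HIST3D));
        h->size = size;
        h->val  = (double *) calloc((size_t) size * size * size, sizeof(double));
        return h;
    }

    /* Fill the histogram from the R, G and B channel arrays of an image region
       bounded by [startX, endX] x [startY, endY] (inclusive). */
    HIST3D *FillRGBHist3D(unsigned char **R, unsigned char **G, unsigned char **B,
                          int startX, int endX, int startY, int endY)
    {
        HIST3D *h = AllocHist3D(BINS);
        int x, y, r, g, b;
        for (y = startY; y <= endY; y++)
            for (x = startX; x <= endX; x++) {
                r = R[y][x] * BINS / 256;   /* uniform quantization per channel */
                g = G[y][x] * BINS / 256;
                b = B[y][x] * BINS / 256;
                h->val[(r * BINS + g) * BINS + b] += 1.0;
            }
        return h;
    }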
  • Gval = G[x][y] / 255.0;
  • Bval = B[x][y] / 255.0;
  • Bval = 0.058*Xn - 0.118*Yn + 0.896*Zn;
  • sumRGB = Rval + Gval + Bval;
  • Rval = Rval / sumRGB;
  • Gval = Gval / sumRGB;
  • image.h This is a header file that lists all the functions which are implemented in image.c and will be explained there.
  • void WritePPM (char *path, unsigned char **R, unsigned char **G, unsigned char **B, int Width, int Height) : Function to write a color image into a file using "ppm" format.
  • void WritePGM (char *path, unsigned char **im, int Width, int Height) : Function to write a gray-level image into a file using "pgm” format.
  • void ReadPPM (char *FileName, int *Width, int *Height, unsigned char ***Rarr, unsigned char ***Garr, unsigned char ***Barr) : Function to read a color image from a file stored in "ppm" format. It assumes that space has already been allocated for the three color channels (R,G,B), and stores the values in them.
  • unsigned char ** AllocMatrix (int rows, int cols) : Function to allocate space for a two-dimensional array of unsigned characters - each such two-dimensional array is used throughout the program to store a single color channel of an image.
  • void FreeMatrix (unsigned char **m) : Function to free the memory allocated to a two dimensional array (described above).
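  • One plausible implementation of this pair of functions, shown only as a sketch, allocates the image plane as a single contiguous block plus an array of row pointers; this is consistent with calls such as free(clusters[0]); free(clusters); appearing later in the listing, but the exact layout used by the original AllocMatrix is an assumption.

    #include <stdlib.h>

    /* Allocate a rows x cols matrix of unsigned char as one contiguous block
       with a separate array of row pointers. */
    unsigned char **AllocMatrixExample(int rows, int cols)
    {
        unsigned char **m   = (unsigned char **) malloc(rows * sizeof(unsigned char *));
        unsigned char  *blk = (unsigned char *)  malloc((size_t) rows * cols);
        int i;
        for (i = 0; i < rows; i++)
            m[i] = blk + (size_t) i * cols;   /* row i starts i*cols bytes into the block */
        return m;
    }

    /* Free a matrix allocated as above: the data block, then the row pointers. */
    void FreeMatrixExample(unsigned char **m)
    {
        free(m[0]);
        free(m);
    }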
  • MedianFilterRGB (int filter_size, int Width, int Height, unsigned char **Rin, unsigned char **Gin, unsigned char **Bin, unsigned char **Rout, unsigned char **Gout, unsigned char **Bout) : Function that takes an image and applies a median filter to that image.
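  • For illustration, a median filter of the kind described here can be applied independently to each color channel. The sketch below is not the Appendix routine: it filters a single channel with an odd filter_size window and simply copies border pixels unchanged, which is one common convention and an assumption here.

    #include <stdlib.h>

    /* Compare two unsigned char values for qsort. */
    static int cmp_uchar(const void *a, const void *b)
    {
        return (int) (*(const unsigned char *) a) - (int) (*(const unsigned char *) b);
    }

    /* Median-filter one channel: every interior pixel is replaced by the median
       of its filter_size x filter_size neighborhood. filter_size must be odd
       and at most 7 for the fixed window buffer below. */
    void MedianFilterChannel(int filter_size, int Width, int Height,
                             unsigned char **in, unsigned char **out)
    {
        int half = filter_size / 2;
        int x, y, dx, dy, n;
        unsigned char win[49];

        for (y = 0; y < Height; y++)
            for (x = 0; x < Width; x++) {
                if (y < half || y >= Height - half || x < half || x >= Width - half) {
                    out[y][x] = in[y][x];          /* leave the border untouched */
                    continue;
                }
                n = 0;
                for (dy = -half; dy <= half; dy++)
                    for (dx = -half; dx <= half; dx++)
                        win[n++] = in[y + dy][x + dx];
                qsort(win, (size_t) n, 1, cmp_uchar);
                out[y][x] = win[n / 2];
            }
    }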
  • Type = getc(FilePtr); SizeCount++;
  • NextChar = getc(FilePtr); SizeCount++; while (((NextChar < '0')
  • TmpNum = (TmpNum * 10) + (NextChar - 48);
  • SizeCount += strlen(Comment);
  • TmpNum = (TmpNum * 10) + (NextChar - 48);
  • TmpNum = (TmpNum * 10) + (NextChar - 48);
  • HeaderSize = SizeCount;
  • fclose(FilePtr);
  • LineBuf = (unsigned char *) malloc((*Width) * 3);
  • R = AllocMatrix(*Height, *Width);
  • G = AllocMatrix(*Height, *Width);
  • Type = getc(FilePtr); SizeCount++;
  • NextChar = getc(FilePtr); SizeCount++; while (((NextChar < '0')
  • SizeCount += strlen(Comment);
  • TmpNum = (TmpNum * 10) + (NextChar - 48);
  • NextChar = getc(FilePtr);
  • NextChar = getc(FilePtr); SizeCount++; }
  • TmpNum = (TmpNum * 10) + (NextChar - 48);
  • NextChar = getc(FilePtr);
  • HeaderSize = SizeCount - 1; fclose(FilePtr);
  • Rout[i][j] = Rin[i][j];
  • Gout[i][j] = Gin[i][j];
  • Bout[i][j] = Bin[i][j];
  • Rout[i][j] = Rin[i][j];
  • Gout[i][j] = Gin[i][j];
  • Bout[i][j] = Bin[i][j];
  • RGB2Gray (unsigned char **R, unsigned char **G, unsigned char **B, int Width, int Height)
  • val = (unsigned char) (0.299 * R[i][j] + 0.587 *
  • bilinear_interp.c This file first defines several structures used by the functions implemented in it. This includes some one-dimensional arrays that will be used repeatedly throughout the code, hence there is no need to allocate and free them each time. It also defines two two-dimensional arrays, LUT and rev-LUT, which are used as a Look-Up-Table (and a Reverse Look-Up-Table) by some of the functions described below. double round (double x) : Function to compute the rounded value of a floating point number. int MaxSubDiag (int slope_index, int slope_dir, int Xstart, int Xend, int Ystart, int Yend, double **Mat, int Rows, int
  • This function is used by the following function "GetBestDiagVal". Given a matrix, the start and end points of a long diagonal, and a slope, this function uses the LUT to retrieve the values along this diagonal, stores them into a one-dimensional array, and then finds the maximal sub-sequence within this array. This is the code that modifies the nearest neighbor algorithm to save computation time, as described above.
  • double GetDirectBestDiagVal (double **Mat, int Rows, int Cols, int Limit, int *BestStartX, int *BestEndX, int *BestStartY, int *BestEndY) : A more efficient version of the code in "GetBestDiagVal", which generates the vector on which "MaxDirectSubDiag" operates instead of sending all the data to MaxSubDiag - and thus saves computation time.
  • double GetDirectBestDiagValLimited (double **Mat, int Rows, int Cols, int Limit, int *BestStartX, int *BestEndX, int
  • int MaxSubDiag (int slope_index, int slope_dir, int Xstart, int Xend, int Ystart, int Yend, double **Mat, int Rows, int Cols, int Limit, double *DiagVal, int *BestStartX, int *BestEndX, int
  • DiagLen = MaxSubDiag(0, 1, 0, Cols-2, 1, Rows-1, Mat, Rows, Cols,
  • scenes_frames_str.c The routine void main ( ) implements the coarse mosaic-matching algorithm, used both for sitcoms and for sports broadcasts. The algorithm is described in detail above. For sitcoms, it reads in a list of scenes of a specified episode, and a list of all mosaics representing each scene (this is the list of representative shots of each scene). It then computes the scene-to-scene distance matrix, and also generates the images that describe the strip-to-strip distance matrix between each pair of mosaics. (For sports broadcasts, it generates a shot-to-shot distance matrix by reading in a list of all shots of the sports sequence and computing the distance between each pair of mosaics.)
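  • The sketch below illustrates the scene-to-scene distance computation described above, taking the distance between two scenes to be the minimum mosaic-to-mosaic distance over all pairs of their representative mosaics; the function MosaicDist and the array bounds are hypothetical placeholders, not names from the Appendix.

    #define MAX_SCENES  16
    #define MAX_MOSAICS  8

    /* Hypothetical placeholder for the mosaic-to-mosaic distance produced by the
       coarse/fine matching described above. */
    extern double MosaicDist(int mosaicA, int mosaicB);

    /* D[i][j] becomes the smallest distance between any representative mosaic of
       scene i and any representative mosaic of scene j. */
    void SceneDistanceMatrix(int numScenes,
                             const int numMosaics[MAX_SCENES],
                             const int mosaics[MAX_SCENES][MAX_MOSAICS],
                             double D[MAX_SCENES][MAX_SCENES])
    {
        int i, j, a, b;
        double best, d;

        for (i = 0; i < numScenes; i++)
            for (j = 0; j < numScenes; j++) {
                best = -1.0;
                for (a = 0; a < numMosaics[i]; a++)
                    for (b = 0; b < numMosaics[j]; b++) {
                        d = MosaicDist(mosaics[i][a], mosaics[j][b]);
                        if (best < 0.0 || d < best)
                            best = d;
                    }
                D[i][j] = best;
            }
    }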
  • HIST * (*FillHistFunc) (int, int, int, unsigned char **, unsigned char **, unsigned char **, int, int, int, int); void (*FreeHistFunc) (HIST *); double (*HistDiffFunc) (HIST *, HIST *); unsigned char **R1; unsigned char **G1; unsigned char **B1; unsigned char **R; unsigned char **G; unsigned char **B; unsigned char **orgR1; unsigned char **orgG1; unsigned char **orgB1; unsigned char **orgR; unsigned char **orgG; unsigned char **orgB; unsigned char **bigR; unsigned char **bigG; unsigned char
  • R = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH);
  • G = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH);
  • B = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH);
  • R1 = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH);
  • G1 = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH);
  • B1 = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH);
  • HTab1[k/block_dim] = FillHistFunc (SIZE_OF_Y, SIZE_OF_U, SIZE_OF_V, R1, G1, B1, 0, STRIP_HEIGHT-1, k, k+step);
  • HTab1[W1-1] = FillHistFunc (SIZE_OF_Y, SIZE_OF_U, SIZE_OF_V, R1, G1, B1, 0, STRIP_HEIGHT-1, Width-WidthGap, Width-1);
  • Width2 = Width;
  • W2 = Width/block_dim;
  • WidthGap = Width % block_dim; if (WidthGap > MIN_LAST_STRIP_WIDTH)
  • HTab2[l/block_dim] = FillHistFunc (SIZE_OF_Y, SIZE_OF_U, SIZE_OF_V, R, G, B, 0, STRIP_HEIGHT-1, l, l+step);
  • HTab2[W2-1] = FillHistFunc (SIZE_OF_Y, SIZE_OF_U, SIZE_OF_V, R, G, B, 0, STRIP_HEIGHT-1, Width-WidthGap, Width-1);
  • bigG[l][k] = orgG1[Height1-1-k][l];
  • bigB[l][k] = orgB1[Height1-1-k][l];
  • FreeMatrix(Y); free(clusters[0]); free(clusters);
  • FreeMatrix(bigG); FreeMatrix(bigB);
  • scenes_strips_str.c The routine void main() implements the coarse and fine mosaic-matching algorithm, used for sitcoms. The algorithm is described in detail above. It reads in a list of scenes of a specified episode, and a list of all mosaics representing each scene (this is the list of representative shots of each scene). It then computes the scene-to-scene distance matrix, and also generates the images that describe the strip-to-strip distance matrix between each pair of mosaics.
  • G = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH);
  • B = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH);
  • R1 = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH);
  • G1 = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH);
  • B1 = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH);
  • HTab1 = (HIST ***) malloc (MAX_BLOCK_HIST_HEIGHT*sizeof(HIST **));
  • HSTab2[k] = HSTab2[k-1] + MAX_BLOCK_HIST_WIDTH;
  • Y = AllocMatrix (2000, 2000);
  • W1 = Width1/block_dim;
  • H1 = Height1/block_dim;
  • W2 = Width2/block_dim;
  • H2 = Height2/block_dim;
  • WidthGap = Width2 % block_dim; if (WidthGap > MIN_LAST_STRIP_WIDTH)
  • dist = GetDirectBestDiagVal (best, l,
  • GlobalPath = [GlobalPath 'friends2_shots\'];
  • ShotsEnd = Shots(:, 2);
  • Path = [GlobalPath, 'shot', num2str(ShotIndex), '\']; eval(['cd ', Path, ';']);
  • T = reshape(T, 3, 3);
  • Bord_top = min(Bord_top, Y(3));
  • Bord_top = min(Bord_top, Y(4));
  • Bord_bottom = max(Bord_bottom, Y(1));
  • Bord_bottom = max(Bord_bottom, Y(2));
  • Bord_left = min(Bord_left, X(1));
  • Bord_left = min(Bord_left, X(4));
  • Bord_right = max(Bord_right, X(2));
  • Bord_right = max(Bord_right, X(3)); fill(X,Y,this_color); end clear TRANS;
  • Border = [Bord_top, Bord_bottom, Bord_left, Bord_right]; save 'aff.txt' Border -ASCII; print -dpsc aff.ps; saveas(gcf, 'aff.jpg'); hold off; end

Abstract

A method for summarizing a video comprising a plurality of consecutive frames is provided comprising the steps of dividing said plurality of consecutive frames into a plurality of shots (16). A mosaic representation for each shot is prepared (32). Color mosaic representations are prepared from color video. Each of the mosaic representations in the video may be compared using a novel alignment technique that incorporates a coarse matching step and a fine matching step (40-48). The mosaics are clustered into physical settings in which said frames were photographed. The efficient mosaic-based scene representation allows fast clustering of scenes into physical settings, as well as further comparison of physical settings across videos.

Description

METHODS FOR SUMMARIZING VIDEO THROUGH MOSAIC-BASED SHOT AND SCENE CLUSTERING
SPECIFICATION
STATEMENT OF GOVERNMENT RIGHT The present invention was made in part with support of the National
Science Foundation, contract no. 9812026. Accordingly, the United States Government may have certain rights to this invention.
CLAIM FOR PRIORITY TO RELATED APPLICATIONS This application claims the benefit of U.S. Provisional Patent
Application serial No. 60/368,092, filed on March 27, 2002, entitled "Video Summaries Through Mosaic-Based Shot and Scene Clustering," which is hereby incorporated by reference in its entirety herein.
COMPUTER PROGRAM LISTING
A computer program listing is submitted in the Appendix and is incorporated by reference in its entirety herein.
COPYRIGHT NOTICE A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by any one of the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.
BACKGROUND OF THE INVENTION
Field Of The Invention
This invention relates to systems and methods for hierarchical representation of video, which is based on physical locations and camera positions, and more particularly to a method using mosaics for the representation and comparison of shots in a video sequence.
Background
Advances in computer processing power, storage capacity and network bandwidth suggest that it may be possible to browse and index long collections of videos, just as search engines allow browsing over documents. For example, many existing video indexing and browsing applications rely on frames for visual representation (mainly of shots). Shots have been characterized to allow indexing and browsing, usually using key-frames as the basic representation tool (See, e.g., M. Yeung and B. Liu, "Efficient Matching and Clustering of Video Shots," IEEE International Conference on Image Processing, 1995). Information from raw frames is usually sufficient for low-level video analysis such as shot boundary detection. An example of further segmentation of video sequences into scenes is shown in J.R. Kender and B.L. Yeo, "Video Scene Segmentation Via Continuous Video Coherence," IEEE Conference on Computer Vision and Pattern Recognition, 1998. In order to detect scene boundaries, a memory-based model may be employed, which also uses color histograms of the raw frames and performs sufficiently accurately. Another reference is S. Uchihashi et al., "Video Manga: Generating Semantically Meaningful Video Summaries," ACM Multimedia, 1999, which uses frames to represent video.
However, frame, shot and scene indexing may not be sufficient at this level. Studies have indicated that viewers tend to recall and value videos at a higher level of organization than current methods allow. For example, a scene is determined, by definition, by its physical surroundings (4 to 10 cameras or camera placements are typical in a physical location) and it is usually hard to observe the background in isolated key-frames using methods such as those listed above. Therefore, key-frames are not suitable for accurate comparison of long video sequences. It is therefore desirable to use mosaics to represent shots. Unlike M. Irani and P. Anandan, "Video Indexing Based on Mosaic Representations," Proceedings of the IEEE, Volume 86, 1998, where the mosaics were used to index frames, a method is needed which provides a technique for comparing and clustering scenes in order to create a higher-level semantic representation of a video.
SUMMARY OF THE INVENTION
An object of the present invention is to provide a technique for summarizing video including temporal and non-temporal representations.
Another object of the present invention is to provide a technique for summarizing video that uses mosaic representation of video in order to cluster shots by physical settings.
A further object of the present invention is to provide a technique for summarizing video that is efficient and accurate.
A still further object of the present invention is to provide a technique for comparing a plurality of videos to identify repeating and unique physical settings for the plurality of videos, such as a television series, such as a situation comedy, including a plurality of episodes.
These and other objects of the invention, which will become apparent with reference to the disclosure herein, are accomplished by a method for summarizing a video comprising a plurality of consecutive frames, the method comprising the steps of dividing the plurality of consecutive frames into a plurality of sequences of consecutive frames; dividing the plurality of sequences of consecutive frames into a plurality of scenes; determining representative shots for these scenes; preparing a mosaic representation for each representative shot; comparing each of the mosaics in the video; and clustering the mosaics into the physical settings in which the frames were photographed.
According to a preferred embodiment of the invention, the step of preparing a mosaic representation of each shot may include determining a reference frame for each shot. This reference frame is identified automatically, as described below. The step of preparing a mosaic representation of each shot comprises computing a transformation, such as an affine transformation, between each pair of successive sampled frames in the shot and then using this transformation to project the frames into the image plane of the chosen reference frame.
The step of comparing each of the mosaics may include performing a first alignment, also referred to herein as a coarse alignment, for a pair of the mosaics. Each mosaic is divided into a plurality of strips, and each pair of strips, including one strip from each mosaic of the pair of mosaics, is compared. The step of comparing the strips corresponds to determining a vertical alignment, e.g., by determining the best diagonal in a distance matrix, such as block-to-block distance matrix B[k,l]. The step of performing an alignment of the pairs of mosaics further may further comprise performing a horizontal alignment, e.g., by determining the best diagonal in a second distance matrix S[i,j], for each of the pairs of mosaics.
A further step in comparing the mosaics may include retaining a subset of the mosaics. The step of retaining a subset of mosaics may include, for pairs of mosaics, determining a threshold based on a distance value determined for each of the pairs of the mosaics. Pairs of the mosaics having distance values less than or equal to the threshold are retained as representing potentially common physical areas, and pairs of mosaics having distance values greater than the threshold are discarded as not representing common physical areas, i.e., being "dissimilar."
In the preferred embodiment, the step of comparing the mosaics may further comprise, for pairs of mosaics, performing a second alignment, also referred to as a finer alignment. The step of performing the second alignment of the pairs of mosaics may include cropping the mosaics based on parameters determined during the step of performing the first alignment of the pairs of the mosaics. The step of performing the second alignment of the pairs of mosaics may comprise dividing each of the mosaics, as cropped above, into a plurality of finer strips (compared with the coarse alignment stage) and comparing a pair of strips including a strip from each mosaic of the pair of mosaics. The step of performing an alignment of the pairs of mosaics further comprises determining a vertical alignment of each of the pairs of strips and a horizontal alignment of each of the pairs of mosaics.
For comparing a plurality of scenes in a video, the method may include dividing the plurality of consecutive frames into a plurality of scenes, and preparing one or more mosaic representations for each scene. The technique may further include comparing a pair of scenes by determining the distance value between the scenes in the pair of scenes. This step of determining the distance value may comprise determining the minimum distance between pairs of mosaics including one mosaic from each of the pairs of scenes. A further step may include clustering each of the distance values of the pairs of scenes into a matrix arranged by physical settings. For comparing different videos, such as different episodes in a series, physical settings that are repeatedly found in each episode may be identified, and physical settings that are unique to an episode may be identified.
In accordance with the invention, the objects as described above have been met, and the need in the art for a technique for providing a higher-level representation of video has been satisfied.
BRIEF DESCRIPTION OF THE DRAWINGS
Further objects, features and advantages of the invention will become apparent from the following detailed description taken in conjunction with the accompanying figures showing illustrative embodiments of the invention, in which
FIG. 1 illustrates the hierarchical representation of a video in accordance with present invention.
FIG. 2 illustrates a representation of several key frames in accordance with the present invention. FIG. 3 illustrates a plurality of mosaics generated from a plurality of shots in accordance with the present invention.
FIG. 4 illustrates the HSI color space.
FIG. 5 illustrates a representation of key frames and mosaics from a first shot in accordance with the present invention. FIG. 6 illustrates a representation of key frames and mosaics from a second shot in accordance with the present invention.
FIG. 7 illustrates a representation of a first difference matrix in accordance with the present invention.
FIG. 8 illustrates a representation of a first difference matrix in accordance with the present invention, using strips having different dimensions.
FIG. 9 illustrates a representation of a second distance matrix in accordance with the present invention.
FIG. 10 illustrates an exemplary plot for performing outlier analysis in accordance with a preferred embodiment of the present invention. FIG. 11 illustrates the first and second mosaics and a first stage in a finer stage analysis in accordance with a preferred embodiment of the present invention. FIG. 12 illustrates a distance matrix used in a second stage of a finer stage analysis in accordance with a preferred embodiment of the present invention.
FIG. 13 illustrates a plurality of frames and a mosaic created from the frames in accordance with the present invention. FIG. 14 illustrates a plurality of screen shots from a video.
FIG. 15 illustrates a comparison of mosaics prepared from screen shots similar to those of FIG 14, in accordance with the present invention.
FIG. 16 illustrates a plurality of screen shots from another video.
FIG. 17 illustrates a clustering of mosaics prepared from screen shots similar to those of FIG 16, in accordance with the present invention.
FIG. 18 illustrates another comparison of mosaics prepared from screen shots similar to those of FIG 16, in accordance with the present invention.
FIG. 19 illustrates a plurality of camera locations and associated shots taken from a physical setting. FIGS. 20-23 illustrate transformation analysis for a plurality of shot types in accordance with the present invention.
FIG. 24 illustrates a technique for determining the dissimilarity between two scenes in accordance with the present invention.
FIG. 25 illustrates a similarity graph for clustering scenes from a video in accordance with the invention.
FIG. 26 illustrates the representation of physical settings with corresponding scenes in accordance with the invention.
FIGS. 27-29 illustrate similarity graphs for three episodes of a program in accordance with the present invention. FIGS. 30-32 illustrate the dendrograms for clustering scenes of three episodes of a program in accordance with the present invention.
FIG. 33 illustrates a comparison the episodes represented in FIGS. 27 and 30, 28 and 31, and 29 and 32, respectively, in accordance with the present invention. FIG. 34 illustrates a similarity graph for a fourth episode of a program in accordance with the present invention. FIG. 35 illustrates the dendrogram for clustering scenes of the fourth episode of FIG. 34 of a program in accordance with the present invention.
FIGS. 36-39 illustrates screen shots of a browser for use with the technique in accordance with the present invention. Throughout the figures, the same reference numerals and characters, unless otherwise stated, are used to denote like features, elements, components or portions of the illustrated embodiments. Moreover, while the subject invention will now be described in detail with reference to the figures, it is done so in connection with the illustrative embodiments. It is intended that changes and modifications can be made to the described embodiments without departing from the true scope and spirit of the subject invention as defined by the appended claims.
DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS
A video summarization technique is described herein, which, given a video sequence, generates a semantic hierarchical representation 10, illustrated in FIG. 1 with a tree-like representation. This representation becomes more compact at each level of the tree. For example, the bottom level, e.g., frames 12, is composed from approximately 50,000 image frames in an exemplary embodiment. The highest level, e.g., physical settings 14, may have 5-6 representative images. The next two levels of the tree represent a temporal segmentation of the video into shots 16 and scenes 18. The highest level, physical settings 14, represents a novel abstraction of video and is based on a non-temporal representation, as will be described in greater detail herein.
The first temporal segmentation of the video is a segmentation into shots 16, presented in the second level of the tree. In an exemplary embodiment, there were approximately 300 shots for approximately 50,000 frames. A "shot" is defined herein as a sequence of consecutive frames taken from the same camera or recording device. Many algorithms are known in the art for shot boundary detection, in order to divide the frames 12 into a plurality of shots 16 (e.g., A. Aner and J. R. Kender, "A Unified Memory-Based Approach to Cut, Dissolve, Key Frame and Scene Analysis," IEEE ICIP, 2001; R. Lienhart, "Comparison Of Automatic Shot Boundary Detection Algorithms," Proc. of SPIE Vol. 3656 Storage and Retrieval for Image and Video Databases VII, 1999; and J. Oh, K.A. Hua, and N. Liang, "Scene Change Detection In A MPEG Compressed Video Sequence," SPIE Conference on Multimedia Computing and Networking, 2000.).
A second temporal segmentation, represented as the third level in the tree illustrated in FIG. 1, is the segmentation of the shots 16 into scenes 18. The definition of a "scene," as used herein, is a collection of consecutive shots, which are related to each other by the same semantic context. For example, in the exemplary embodiment, consecutive shots which were taken in the same physical location and that describe an event or a story which is related in context to that physical location are grouped together into one scene. Methods of scene boundary detection known in the art include, e.g., A. Hanjalic, R.L. Lagendijk, and J. Biemond, "Automated High-Level Movie Segmentation For Advanced Video Retrieval Systems," IEEE Transactions on Circuits and Systems for Video Technology, Volume 9, June 1999; J.R. Kender and B.L. Yeo, "Video Scene Segmentation Via Continuous Video Coherence," IEEE Conference on Computer Vision and Pattern Recognition, 1998; and M. Yeung and B.L. Yeo, "Time-Constrained Clustering For Segmentation Of Video Into Story Units," International Conference on Pattern Recognition, 1996. In the exemplary embodiment, the method described in J.R. Kender and B.L. Yeo, described above and incorporated by reference in its entirety herein, was used to organize the video into 13-15 scenes.
Another level is the non-temporal representation of the video into segments referred to herein as "physical settings" 14. As used herein, physical settings refers to groups of scenes 18, such that each group takes place in the same location. In the exemplary embodiment, there were 5-6 physical settings per "episode" (e.g., half-hour long video). These segments were well-defined in many television programs. In the exemplary embodiment, the methods described herein were applied to "situation comedies," or "sitcoms." For example, in the "Seinfeld" program, two physical settings are the main character's apartment and a diner. These physical settings occur many times during episodes on this program, and are therefore directly related to the main "theme" of the program. In contrast, it was found that each episode also has 2-3 physical settings which are typically unique to that episode. These physical settings can be determined by comparing physical settings across different episodes of the same sitcom program. These special settings may be used to infer the main "plots" of each episode, and are therefore useful for representing the content of the episode.
The methods described herein capture the highest level of a video, using scene-based representation. It allows for efficient browsing and indexing of video data, and solves the problems usually caused by a frame-based representation.
The methods described herein are implemented on a computer that runs the C and Matlab routines attached in the Appendix hereto to analyze video images that have been digitized. A user interface has been implemented in Java, as will be described in greater detail below.
Mosaics, as will be discussed in greater detail below, are used for representing shots, and also (due to the concept of establishing shots or pan and zoom shots) for scene representation. The mosaics are used to spatially cluster scenes or shots, depending on the video genre. To create the mosaics, only the background information of the frames is typically used, since this information is the most relevant for the comparison process of shots. The background information is further used for shot and scene clustering and for gathering general information about the whole video sequence. A complete representation of a single shot, however, also involves foreground information. Examples of single shot representations are the "synopsis mosaic," presented in M. Irani and P. Anandan, "Video Indexing Based On Mosaic Representations," Proceedings of the IEEE, volume 86, 1998. or the "dynamic mosaic," presented in M. Irani, P. Anandan, J. Bergenand, R. Kumar, and S. Hsu, "Efficient Representation Of Video Sequences And Their Applications," Signal processing: Image Communication, Volume 8, 1996; and motion trajectories, shown in M. Gelgon, P. Bouthemy, "Determining A Structured Spatio-Temporal
Representation Of Video Content For Efficient Visualisation And Indexing," ECCV, Vol. 1, 1998.
An early stage of the procedure, representing shots by mosaics, is described in greater detail herein. Since, in many video genres, the main interest is typically the interaction between the characters, the physical location of the scene is typically visible in all of the frames only as background. The characters usually appear in the middle area of the frame, and consequently the background information may only be retrieved from the borders of the frame. (Prior art approaches, e.g., J, Oh, K.A. Hua, and N. Liang, "Scene Change Detection In A MPEG Compressed Video Sequence," SPIE Conference on, Multimedia Computing and Networking, 2000, segmented out the middle area, and used color information of the borderlines of the frames for shot boundary detection. However, when key- frames are used to represent shots and scenes and to compare them , the information extracted from a single frame (or key frame) is not sufficient. Most of the physical location information is lost since most of it was concealed by the actors and only a fraction of the frame is used.) FIG. 2 shows a collection of key-frames 20, 22, 24, 26, and 28 taken from a panning shot, (hi the exemplary embodiment, the key frames are selected by hand, although automatic key-frames generation may alternatively be used.) In the panning shot, an actress is tracked walking across a room, beginning at the right side of the room and ending at the left side. Thus frame 28 is followed by frame 26, etc., ending at frame 20. The entire background is visible throughout the shot; however the actress appears in the middle of all key-frames, thereby blocking significant parts of the background. This effect may be worse in close-up shots, in which half of the frame is occluded by an actor. Accordingly, the problem of backgrounds being blocked is solved by the use of mosaics to represent shots. A mosaic 30 of the panning shot discussed above is shown in FIG. 2. The whole room is visible and even though the camera has changed its zoom, the focal length changes are eliminated in the mosaic.
In order to construct color mosaics using either projective or affine models of camera motion, and in order to use mosaics to represent temporal video segments such as shots, the method described herein provides the best results when the following conditions are met:
1. For the video genres in which the mosaic-based representation is used, either (a) the 3D physical scene is relatively planar; or (b) the camera motion is relatively slow; or (c) the relative distance between the surface elements in the 3D plane is relatively small compared with their distance to the camera. 2. Cameras are mostly static, and most camera motion is either translation and rotation about a main axis, or a zoom. Since the physical setting is limited in size, both camera's movement and varying positioning are restrained.
3. Cameras are positioned horizontally, e.g., scenes and objects viewed by the cameras will be situated parallel to the horizon.
4. Cameras are placed indoors, so that lighting changes are not as significant as in outdoor scenes.
The method described herein incorporates a novel use of mosaics for shot comparison and for hierarchical representation of video. The construction of color mosaics herein uses the technique of gray level mosaic construction methods described in M. Irani, P. Anandan, J. Bergenand, R. Kumar, and S. Hsu, "Efficient Representation Of Video Sequences And Their Applications," Signal processing: Image Communication, Volume 8, 1996, which is incorporated by reference in its entirety herein. However, this technique provides no capability for color video, and accordingly significant changes have been made to this technique, as discussed below. A first step is the generation of affine transformations between successive frames in the sequence (in the exemplary embodiment, sampling was performed at 6 frames/sec). One of the frames is then chosen as the "reference frame," that is, as the basis of the coordinate system, such that the mosaic plane will be this frame's plane. This frame is selected automatically using the method below, and illustrated in FIGS. 20-23. This reference frame is "projected" into the mosaic using an identity transformation. The rest of the transformations are mapped to this coordinate system and are used to project the frames into the mosaic plane. It is noted that although the true transformations between the frames in some shot sequences are projective and not affine, only affine transformations are computed. (This may result in some local distortions in the mosaic, but prevents the projective distortion (illustrated by mosaic 32 in FIG. 2)).
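By way of illustration only, the chaining of the frame-to-frame affine transformations toward the chosen reference frame can be written as in the sketch below; the 3 x 3 matrix representation (with a fixed last row of 0 0 1), the direction of composition, and the restriction to frames at or after the reference frame are assumptions of this sketch and not part of the Appendix code.

    #include <string.h>

    typedef struct { double m[3][3]; } Affine;   /* last row assumed to be 0 0 1 */

    /* C = A * B, standard 3x3 matrix product. */
    static Affine Compose(const Affine *A, const Affine *B)
    {
        Affine C;
        int i, j, k;
        for (i = 0; i < 3; i++)
            for (j = 0; j < 3; j++) {
                C.m[i][j] = 0.0;
                for (k = 0; k < 3; k++)
                    C.m[i][j] += A->m[i][k] * B->m[k][j];
            }
        return C;
    }

    /* Given T[k], the affine mapping of frame k+1 into frame k (k = 0..n-2),
       fill toRef[k], the mapping of frame k into the reference frame `ref`.
       Frames before the reference would need the inverse transforms, so this
       sketch only handles frames at or after the reference frame. */
    void ChainToReferenceExample(const Affine T[], int n, int ref, Affine toRef[])
    {
        Affine identity;
        int k;
        memset(&identity, 0, sizeof(identity));
        identity.m[0][0] = identity.m[1][1] = identity.m[2][2] = 1.0;
        toRef[ref] = identity;                /* reference frame projects with identity */
        for (k = ref + 1; k < n; k++)
            toRef[k] = Compose(&toRef[k - 1], &T[k - 1]);
    }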
The value of each pixel in the mosaic is determined by the median value of all of the pixels that were mapped into it. Simply taking the median of each channel may result in colors which might not have existed in the original frames, and it is desirable to use only true colors. As a first step, the frames are converted to gray-level images while maintaining corresponding pointers from each gray-level pixel to its original color value. For each mosaic-pixel, an array is formed of all values from different frames that were mapped onto it, the median gray-level value of that array is determined, and then the corresponding color value for that pixel is used in the mosaic. Outlier rejection, as known in the art described in M. Irani, P. Anandan, J. Bergen, R. Kumar, and S. Hsu, "Efficient Representation Of Video Sequences and their Applications," Signal processing: Image Communication, Volume 8, 1996, and incorporated by reference herein, is used to both improve the accuracy of the affine transformations constructed between frames as well as to detect and segment out all of the moving objects from the mosaic. This improves the accuracy of the affine transformations constructed between frames and results in "clear" mosaics, e.g., mosaic 30, where only the background of the scene is visible, as shown in FIG. 2. A subsequent stage in the method is the comparison of mosaics. In video genres where a physical setting appears several times, as is the case for the video genres tested in the exemplary embodiments described herein - sitcoms and basketball games - the video is often photographed from different viewpoints and at different zoom levels, and sometimes also in different lighting conditions. Therefore, the mosaics generated from these shots are of different size and shape, and the same physical setting will appear different across several mosaics. Moreover, different parts of the same physical scene are visible in different mosaics, since not every shot covers the whole scene location. An example of different mosaics generated from shots of the same scene is shown in FIGS. 3(a)-3(e). The mosaics 40, 42, 44 in FIGS. 3(a)-3(c), respectively, show the physical scene from similar angles (since they were generated from shots taken by cameras located close to one another). The mosaics 46 and 48 in FIGS. 3(d)-3(e) were generated from shots that were taken from cameras located in totally different locations than the ones used in FIGS. 3(a)-3(c), and therefore show different parts of the physical scene. It would appear to be easier to cluster mosaics of FIGS. 3(a)-3(c) together into one group and the mosaics of FIGS. 3(d)-3(e) together in a second group, based solely on these image properties. However, in order to cluster mosaics 40-48 of FIGS. 3(a)-3(e) into one single cluster, information is obtained from other scenes where similar camera locations were used. Each of the mosaics is divided into smaller regions to look for similarities in consecutive relative regions. The properties of the mosaics that are being compared include visual similarities, or other features, such as texture. When comparing mosaics generated from certain video genres, assumptions about the camera viewpoint and placement are made to more efficiently analyze the images. Such assumptions may include noting the horizontal nature of the mosaics. Camera locations and movements are limited due to physical set considerations, which causes the topological order of the background to be constant throughout the mosaics.
Therefore, the corresponding regions only have horizontal displacements, rather than more complex perspective changes. In order to compare mosaics, the approach of rubber-sheet matching is used, which takes into account the topological distortions among the mosaics, and the rubber-sheet transformations between two mosaics of the same physical scene. The comparison process is done in a coarse-to-fine manner. (Further details of rubber-sheet matching are described in R.C. Gonzalez and R.E. Woods, Digital Image Processing, Addison Wesley, 1993, chapter/section 5.9, pages 296-297, incorporated by reference herein.) Since mosaics of common physical scenes may cover different parts of the scene, a first step is to coarsely detect areas in each mosaic-pair which correspond to the same spatial area. It is typically required that sufficient portions of the mosaics match in order to determine them as similar. For example, since the mosaic is either bigger or has the same size as the original video frame, the width of the corresponding areas detected for a matching mosaic pair should be not less than approximately the original frame width. The height of this area should be at least 2/3 of the original frame height (the upper part of the mosaic is typically used in this method). This requirement is motivated by cinematography rules, known in the art, concerning the focus on active parts of the frame (see, e.g., D. Arijon, Grammar of the Film Language, Silman-James Press, 1976).
After smoothing noise in the mosaics with a 5 x 5 median filter (choosing color median in a manner as described above), the mosaics are divided into relatively wide vertical strips (60 pixels wide) and these strips are compared by distance matching, as will be described below. An approximate common region in the two mosaics is determined by coarsely aligning a sequence of vertical strips in one mosaic with a sequence of vertical strips in the second mosaic. The coarse detection stage is first performed in order to identify candidate matching regions in every mosaic pair. If no corresponding regions are found, the mosaics are determined to be "dissimilar." In cases where candidate matching regions are found, a threshold is used, which is determined from matching mosaic-pairs as described in greater detail below, in order to discard mosaic pairs with poor match scores. Subsequently, a more restricted matching process is applied on the remaining cropped mosaic pairs.
According to the restricted matching process, narrower strips are used to finely verify similarities and to generate match scores for each mosaic-pair. This step of dividing the mosaic into smaller regions for comparison is useful since global color matches might occur across different settings, but usually not in different relative locations within them.
The technique for defining the distance measure between image regions is explained herein. To address changes in light intensity, which cause variations along all three axes in RGB space, a color space based on hue, chrominance (saturation) and lightness (luminance) is used. Such changes correspond mainly to a variation along the intensity axis. An advantage of this color space is that it is close to the human perception of colors. In the exemplary embodiment, the HSI color space 60 in polar coordinates, illustrated in FIG. 4, is used, as is known in the art. The intensity channel I 62 is computed as luminance (instead of the brightness average of RGB, as is typically used). Hue H 64 represents the impression related to the dominant wavelength of the color stimulus. Saturation S 66 expresses the relative color purity (amount of white light in the color). Hues are determined by their angular location on this wheel. Saturation, or the richness of color, is defined as the distance perpendicular to the intensity axis. Hue and saturation taken together are called the chromaticity coordinates (polar system). Colors near the central axis have low saturation and look pastel. Colors near the surface of the cone have high saturation.
The HSI space forces non-uniform quantization when constructing histograms and does not capture color similarities as well as the CIELab color space, for example. However, the appropriateness of any such quantization can be easily validated by converting the quantized HSI values back to RGB space and inspecting the resulting color-quantized images. This procedure allows for tuning the quantization and for predicting the results of the three-dimensional HSI histogram difference computations. In the exemplary embodiment, uniform quantization was used for hue values, and 18 values were used. Since both saturation and intensity were found to behave poorly for small values, a non-uniform quantization was used for both. For saturation, a threshold was empirically chosen; for pixels with saturation values below this threshold (e.g., for grays), the hue values were ignored. Saturation values above this threshold were equally divided into 5 bins. For intensity, another threshold was empirically chosen; for pixels with intensity values below this threshold (e.g., for black), saturation values were ignored. Intensity values above this threshold were equally divided into 2 bins. After determining the appropriate quantization, a simple L1 norm between the HSI color histograms was used. It is contemplated that a quadratic form distance may be used as an alternative. Application of the HSI histogram-based comparison on small regions of the mosaics handled comparison of mosaics showing the same physical scene but from varying angles. The histogram generation is implemented in hist.c, in function "FillHistFromRGBArray2".
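A small sketch of this quantization and of the L1 histogram distance follows. It is illustrative only: the text above states that the saturation and intensity thresholds were chosen empirically, so the specific threshold values, the bin layout and the clamping used here are assumptions.

    #include <math.h>

    #define H_BINS 18                 /* uniform hue quantization, as described above */
    #define S_BINS  6                 /* bin 0 = "gray", bins 1..5 above the threshold */
    #define I_BINS  3                 /* bin 0 = "black", bins 1..2 above the threshold */
    #define SAT_THRESH 0.12           /* assumed values; the text says the thresholds  */
    #define INT_THRESH 0.08           /* were chosen empirically                       */

    typedef struct { double bin[H_BINS][S_BINS][I_BINS]; } HsiHist;

    /* Map one pixel (h in degrees, s and i in [0,1]) to a histogram cell. */
    static void QuantizeHSI(double h, double s, double i, int *hb, int *sb, int *ib)
    {
        if (i < INT_THRESH) {                 /* too dark: ignore saturation and hue */
            *ib = 0; *sb = 0; *hb = 0;
        } else if (s < SAT_THRESH) {          /* nearly gray: ignore hue */
            *ib = 1 + (int) (((i - INT_THRESH) / (1.0 - INT_THRESH)) * 2.0);
            if (*ib > 2) *ib = 2;
            *sb = 0; *hb = 0;
        } else {
            *ib = 1 + (int) (((i - INT_THRESH) / (1.0 - INT_THRESH)) * 2.0);
            if (*ib > 2) *ib = 2;
            *sb = 1 + (int) (((s - SAT_THRESH) / (1.0 - SAT_THRESH)) * 5.0);
            if (*sb > 5) *sb = 5;
            *hb = (int) (h / 360.0 * H_BINS) % H_BINS;
        }
    }

    /* L1 distance between two (already normalized) histograms. */
    double HistL1Dist(const HsiHist *a, const HsiHist *b)
    {
        double d = 0.0;
        int h, s, i;
        for (h = 0; h < H_BINS; h++)
            for (s = 0; s < S_BINS; s++)
                for (i = 0; i < I_BINS; i++)
                    d += fabs(a->bin[h][s][i] - b->bin[h][s][i]);
        return d;
    }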
The technique used for finding the best diagonal is explained herein. All comparison stages (coarse and fine) are based on the same method of finding the best diagonal in a distance matrix, which corresponds to finding horizontal or vertical alignment. Exemplary illustrations of such distance matrices are illustrated in FIGS. 6-7, below. According to this technique, the matrix D(i, j) refers to an N x M matrix where each entry represents a distance measure. This matrix is treated as a rectangular grid, and the best diagonal path is then searched for within that grid. For example, if P{(s,t) → (k,l)} represents a path from node (s,t) to node (k,l) of length L, then its weight is defined by the average weight of its nodes:
WP{(s,t) → (k,l)} = (1/L) ∑(i,j)∈P D(i,j).   (1)
The best diagonal path P with the minimum weight of all diagonals is searched for, that also satisfies the following constraints: 1. Length(P) ≥ Tlength.
2. Slope(P) ≤ Tslope. The constraints Tlength and Tslope specify thresholds for minimum diagonal length and maximum slope value, respectively. The first constraint is determined by the width and height of the original frames in the sequence, which determine the mosaic's size (e.g., 352 x 240 pixels in the exemplary embodiment), since it is required that sufficient portions of the mosaics will match. For example, for a frame size of 352 x 240 pixels, the width of the strips in the coarse stage (described below) was set to 60 pixels, and Tlength was set to 5 for the horizontal alignment. Thus the width of the matched region is at least 300 pixels.
The second constraint relates to the different scales and angles of the generated mosaics and to different camera placement. If the matched mosaics are generated from shots both taken from the same location and at the same focal length, then the diagonal will have a slope of 45°. Yet, if one shot was taken from a different angle or with a different zoom, then the slope changes. The scale difference could be as large as 2:1, resulting in a slope of approximately 26°. Therefore, diagonals with slopes that vary between 25° and 45° in both directions are examined (allowing either the first mosaic to be wider or the second mosaic to be wider). Intervals of 5° are used, resulting in a total of 9 possible slopes. However, in order to determine the weight of each diagonal, the values along this diagonal are interpolated. Bilinear interpolation is one alternative, but it has certain drawbacks, e.g., it is time consuming. Experiments have proved that nearest neighbor interpolation gives satisfactory results. This technique was implemented with the use of indices and look-up tables.
The nearest neighbor algorithm was modified and reordered to save computation time and also to provide more flexibility in imposing additional slope and length constraints. Within the grid described above, there are multiple diagonals of the same slope. For each slope, values are interpolated along each diagonal by determining which entries should be used and which entries to repeat or to skip. (As used herein, "entries" refers to the indexes of the matrix. An entry in a two-dimensional array, a matrix, is the pair of numbers (i,j) where i is the row number and j is the column number.) Instead of repeating computation for each diagonal in each slope, the transformation of the distance matrix of a specified slope is modeled into a matrix of scale 1:1. By using this approach, it is only necessary to search for the best sub-sequence along the main diagonal, and the diagonals parallel to it. For each slope a list of interpolated indices is generated which is used for all diagonals of that slope, and these lists are then stored in a look-up table. Subsequently, all diagonals are scanned and for each diagonal a vector of values is generated using the interpolated indices from the look-up table. Since the slopes are symmetrical, look-up tables are generated for 25°, 30°, 35°, and 40°, which are applied once on rows and once on columns. The next step is to find the best subsequence along each diagonal. Instead of a brute-force search along the vector V[i], which holds all the interpolated diagonal values, a partial sum vector P[i] = V[0] + V[1] + ... + V[i] is computed. Any sum of a subsequence V[j] ... V[k] can then be computed as P[k] - P[j-1], where P[-1] = 0. (This interpolation is performed in bilinear_interp.c; more specifically, the main program written in "scenes_strips_strips_crop.c" calls the function "GetDirectBestDiagVal" written in bilinear_interp.c. This function, along with the 'service' function it calls, "MaxDirectSubDiag", implements the interpolation. These routines are provided in the Appendix.)
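As an illustration of this partial-sum computation (a sketch only, not the Appendix routine), the minimum-average subsequence of length at least a given minimum along a vector of interpolated diagonal values can be found as follows.

    /* Find the contiguous subsequence of V[0..n-1] of length at least minLen with
       the smallest average value. P must have room for n+1 entries; the distance
       values are assumed to be non-negative. Each range sum costs O(1) thanks to
       the partial sums. Returns the best average and its end points. */
    double BestAverageSubsequence(const double *V, int n, int minLen,
                                  double *P, int *bestStart, int *bestEnd)
    {
        double bestAvg = -1.0, avg;
        int i, j;

        P[0] = 0.0;                                /* P[i+1] = V[0] + ... + V[i] */
        for (i = 0; i < n; i++)
            P[i + 1] = P[i] + V[i];

        for (i = 0; i + minLen <= n; i++)          /* i = start index */
            for (j = i + minLen - 1; j < n; j++) { /* j = end index (inclusive) */
                avg = (P[j + 1] - P[i]) / (double) (j - i + 1);
                if (bestAvg < 0.0 || avg < bestAvg) {
                    bestAvg = avg;
                    *bestStart = i;
                    *bestEnd = j;
                }
            }
        return bestAvg;
    }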
The coarse matching procedure will now be described. A coarse horizontal alignment of two consecutive strip-sequences in a mosaic-pair is performed in order to detect a common physical area. In FIG. 5, several key frames 72, 74, 76, 78 and 80 of a first shot are used to generate the mosaic 82. In FIG. 6, key frames 82, 84, 86, 88 and 90 of a second shot are used to generate the mosaic 92. An exemplary strip si 94 in mosaic 82 and strip sj 96 in mosaic 92 comprise a mosaic strip-pair. The width of the strips 94 and 96 is set to be 60 pixels each, since no more than about 5-6 vertical segments are needed, and a finer comparison stage is subsequently performed (described in greater detail below). In order to align two strip sequences, a distance matrix S[i,j] is generated, in which each entry corresponds to the distance between a strip si from one mosaic and a strip sj in the second mosaic: S[i,j] = Diff(si, sj). (2)
Where Diff(si, sj) is the difference measure between the strips (which will be discussed below). An example of matrix S[i, j] is shown in FIG. 8, in which each gray level block corresponds to an entry in the matrix. Finding two strip sequences in the two mosaics that have "good" alignment, and therefore define a common physical area, corresponds to finding a "good" diagonal in the matrix S[i,j], as will be explained in greater detail below.
The two strips, e.g., si 94 and sj 96, as shown in FIG. 7, are compared in order to determine whether they cover the same physical location. In comparing the two strips, the technique does not assume that the strips cover the same vertical areas, but it may be assumed that they both have overlapping regions. Therefore, in order to detect vertical alignment, each strip si 94 and sj 96 is further divided into blocks of size 60 x 60, and a block-to-block distance matrix B[k,l] 100 is generated for each pair of strips, as illustrated in FIG. 7. Each entry 102 in this matrix 100 is the histogram difference between two blocks: block bk from the first strip si 94 and a block bl from the second strip sj 96:
B[k,l] = Diff(bk, bl).   (3)
The calculation of Diff(bk,bl) is the histogram distance defined above. The best diagonal 104 (the thin white diagonal line) is located (as discussed above) in the distance matrix B[k,l] and its start and end points are recorded. The value of this diagonal 104 is chosen as the distance measure between the two strips:
Diff(si, sj) = min over d ∈ diags of (1/|d|) ∑(k,l)∈d B[k,l].   (4)
The term diags is the set of all allowable diagonals in the matrix B[k,l], and each diagonal d ∈ diags is given by a set of pairs of indices (k,l). The average of the values along the diagonal defines the distance between the two mosaics.
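The sketch below illustrates equation (4) for the simplest case of 45-degree diagonals only; the full method, as described above, also examines interpolated slopes between 25 and 45 degrees, and the minimum-length constraint stands in for the Tlength threshold discussed earlier.

    /* Distance between two strips in the spirit of equation (4), restricted here
       to 45-degree diagonals: the lowest average over any diagonal run of at
       least minLen entries of the block-to-block matrix B[rows][cols].
       (The partial-sum trick shown earlier could replace the innermost loop.) */
    double StripDistance45(double **B, int rows, int cols, int minLen)
    {
        double best = -1.0, sum, avg;
        int offset, r0, c0, maxLen, start, len, k;

        for (offset = -(rows - minLen); offset <= cols - minLen; offset++) {
            r0 = offset < 0 ? -offset : 0;     /* top-left cell of this diagonal */
            c0 = offset > 0 ?  offset : 0;
            maxLen = (rows - r0 < cols - c0) ? rows - r0 : cols - c0;
            for (start = 0; start + minLen <= maxLen; start++)
                for (len = minLen; start + len <= maxLen; len++) {
                    sum = 0.0;
                    for (k = 0; k < len; k++)
                        sum += B[r0 + start + k][c0 + start + k];
                    avg = sum / (double) len;
                    if (best < 0.0 || avg < best)
                        best = avg;
                }
        }
        return best;
    }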
FIGS. 8(a)-8(b) illustrate the procedure for comparing mosaics having different dimensions. The method described herein uses a coarse alignment of a sequence of A: strips with a sequence of up to 2k strips, therefore allowing the scale difference between the two mosaics to vary between 1 : 1 and to be as large as 2: 1. An example of cropped region 116 and 118 from mosaics 110 and 112, respectively, are shown in FIG. 8 to create a block-to-block distance matrix 114 (the procedure for cropping is described below). This allows for matching mosaics which were generated from shots taken from different focal length. Although the exemplary embodiment supported a scale difference as large as 2:1, it is contemplated that routine modifications would permit larger scale differences. The technique described above and illustrated in FIG. 7 is continued with the next step of creating the strip-to-strip distance matrix S. For the mosaics 82 and 92, the strip-to-strip distance matrix S 120, as discussed above, is graphically displayed in FIG. 9. The entry 122 in matrix 120 is the result from the two strip comparison of FIG. 7, i.e., the distance between the two strips 94 and 96. The height 124 of this block matrix S 120 is the same as width 95 of the mosaic 82 in FIG. 5 and its width 126 is the width 97 of the mosaic 92 in FIG. 6, such that each block 128 represents the strip-to-strip difference between the two mosaics 82 and 92.
Next, the best diagonal in the matrix of strip-to-strip differences S[i, j] 120 is found, and the start and end points are recorded. The area of interest is represented by a thin diagonal line 130 across the matrix 120 along which entries have low values (corresponding to good similarity) in FIG. 9.
Once all mosaic pairs, e.g., mosaics 82 and 92, are processed, the distance values of the diagonal paths found for each pair are checked. It is expected that mosaic-pairs from different settings or with no common physical area will have high distance values; therefore they are discarded according to a threshold. The threshold is determined by sampling several mosaic-pair distances which have common physical areas, and which therefore would have the lowest distance values. The highest of these sampled values is selected as the threshold. This sampling method yielded accurate results in the exemplary embodiment after inspecting all mosaic-pair distances, as shown in graph 200 in FIG. 10. Mosaic-pairs with common physical areas (left of the vertical line 202) are separated from the rest of the mosaic-pairs (right of the vertical line 202). The sampled pairs appear on the left-most side (left of the leftmost vertical line 206). The maximum distance value among mosaic-pairs with common background is the horizontal line 204, which rejects mosaic pairs known to be from different physical areas, although it may permit a few false positive matches. If a diagonal path distance value exceeds this threshold, it is determined that the two mosaics do not match. If a diagonal path is found that is below the threshold, a match is declared, and the method continues to the finer step, described below. After discarding mosaic-pairs whose diagonal paths had large distance values, the measure of closeness of the remaining pairs is refined in order to detect false positives and to more accurately determine physical background similarity. The recorded start and end points of the diagonals from all comparison stages are used to crop each mosaic such that only the corresponding matching areas are left: the start and end points of the diagonal in the S matrix are used to set the vertical borders and the new widths of the cropped mosaics. In order to determine the new heights, every pair of strips along the diagonal path found in S is inspected to determine which parts of the strips were used to give the best match. Since, for every strip pair, the start and end points of the best diagonal in its corresponding B matrix were recorded, the average of these values is used to set the horizontal border. In the finer stage, only the cropped parts of the mosaics are compared, by applying a method similar to the one used in the coarse stage. The mosaic 82 is cropped to the region 130 in FIG. 11.
Similarly, the mosaic 92 is cropped to the region 132. The cropped mosaics 130 and 132 are displayed in FIG. 12. There, the cropped mosaic 130 is also rotated, to better present the graphical representation of the new S distance matrix 134 between the two cropped mosaics 130 and 132. Thinner strips (20 pixels wide) are used in this stage, and the scale difference between the two mosaics is also taken into account. In certain circumstances, one mosaic may be wider than the other. Assuming that the cropped mosaics cover the same regions of the physical setting, the narrower cropped mosaic is divided into K thin strips (20 pixels wide); the best match will be a one-to-one match with K wider strips of the wider mosaic, where each strip pair covers exactly the same physical area. Let α > 1 be the "width ratio" between the two mosaics.
Histograms are re-computed over 20 x 20 blocks for the narrower cropped mosaic, and over 20α x 20α blocks for the wider mosaic. The best path in the new distance matrix should therefore have a slope of 45°. For the finer stage of comparison, the same approach is used: a vertical alignment is first performed using the block-to-block B matrix, and the horizontal alignment is then performed using the strip-to-strip distance S matrix.
Matching in the finer stage is less computationally expensive than the matching in the coarse stage. First, only a subset of the mosaic pairs are re-matched. Second, having adjusted for the mosaic widths, only diagonals parallel to the main diagonal need to be checked. Third, since the cropped mosaics cover the same regions, only complete diagonals rather than all possible sub-diagonals need to be checked. Therefore, values are computed for the main diagonal and its adjacent parallel diagonals. These diagonals are automatically known to satisfy the diagonal length and boundary constraints. These operations restrict the total number of strip and block comparisons to be performed and greatly lower computational time, even though there are more strips per mosaic. This final verification of mosaic match values is a relatively fast process.
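To make this restriction concrete, the sketch below evaluates only the main diagonal of the fine-stage strip-to-strip matrix and its nearby parallels; the number of neighboring diagonals examined (max_offset) is an illustrative parameter, not a value taken from the exemplary embodiment.

    #include <float.h>

    /* Fine-stage verification sketch: S is the strip-to-strip distance matrix
     * of the two cropped, width-adjusted mosaics (K strips each after adjusting
     * for the width ratio).  Only the main diagonal and the diagonals parallel
     * to it within max_offset are averaged; the smallest average is the refined
     * mosaic-pair distance.                                                     */
    double FineStageDist(double **S, int K, int max_offset)
    {
        int off, t, i, j, len;
        double sum, best = DBL_MAX;
        for (off = -max_offset; off <= max_offset; off++) {
            sum = 0.0;
            len = 0;
            for (t = 0; t < K; t++) {
                i = t;
                j = t + off;                 /* entry on a diagonal parallel to  */
                if (j < 0 || j >= K)         /* the main diagonal                */
                    continue;
                sum += S[i][j];
                len++;
            }
            if (len > 0 && (sum / len) < best)
                best = sum / len;
        }
        return best;
    }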
The method of matching only corresponding regions in the mosaic, instead of comparing the whole mosaic, addresses the problem of matching shots taken from cameras in different positions, where different parts of the background are visible. The problem of camera zoom and angle is addressed in the finer stage, which allows accurate matching of cropped mosaics of the same setting but with different scale and angle.
EXAMPLE 1
A first example of the mosaic-based representation is the representation and comparison of shots from sports videos. Since information is needed from every shot, only the coarse comparison stage was used, to save computation time. This stage provided the capability to distinguish between shots showing wide side-views of the basketball court ("court shots"), close-up shots of players, shots taken from above the basket, and shots showing the field from a close side-view. Clustering mosaics of basketball shots led to a more meaningful shot categorization than clustering key-frames of the same shots. An example of a mosaic 300 generated from a panned and zoomed court shot is shown in FIG. 13, along with its corresponding key frames 302, 304, 306.
Preliminary results from clustering basketball videos allowed for the classification of shots and the determination of temporal relations between clusters. Filtering rules, corresponding to video grammar editing rules, were manually determined to extract events of human interest. For example, in a video of a European basketball game, all foul basket-shots were detected and used as bookmarks in a quick browsing tool shown in FIG. 14. Foul penalty throws were characterized by a three-part grammatical sequence: a court shot 320, followed by a close-up shot 322 and a shot from above the basket 324. The results 330 from the shot clustering (in the coarse stage) are shown in FIG. 15. The following approach was used to cluster the mosaics: first, a mosaic was generated for each shot. Then all mosaic pairs of all shots were compared to generate a shot-to-shot distance matrix. The comparison method used to generate the distance between each pair of mosaics used only the coarse stage described above. The resulting distance matrix (shown in FIG. 15) was given as input to a clustering algorithm (the same one used to cluster scenes, described below). The clustering algorithm generated a dendrogram (similar to FIG. 17) which shows how well the shots cluster together. The clusters in region 332, about 2/3 of the matrix, are all court shots. The following region 334 represents a cluster of shots taken from above the basket, used for foul penalty shots. The rest of the data (region 336) are from various close-up shots.
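A minimal sketch of this pipeline is given below, in which the helper MosaicDistCoarse is a hypothetical stand-in for the coarse-stage mosaic comparison described above; the resulting symmetric matrix is what is passed to the clustering program.

    #include <stdlib.h>

    /* Hypothetical coarse-stage comparison of the mosaics of shots a and b;
     * it stands in for the strip- and block-matching described above.        */
    double MosaicDistCoarse(int shot_a, int shot_b);

    /* Build the M-by-M shot-to-shot distance matrix handed to the clustering
     * algorithm.                                                              */
    double **BuildShotDistanceMatrix(int M)
    {
        int i, j;
        double **D = (double **) malloc(M * sizeof(double *));
        for (i = 0; i < M; i++)
            D[i] = (double *) calloc(M, sizeof(double));
        for (i = 0; i < M; i++)
            for (j = i + 1; j < M; j++) {
                D[i][j] = MosaicDistCoarse(i, j);   /* each pair compared once */
                D[j][i] = D[i][j];                  /* the matrix is symmetric */
            }
        return D;
    }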
In another example, an NBA basketball video was analyzed, and it was discovered that basket goal throws were usually characterized by a court shot 360 followed by a close-up shot 362 showing the player who made the goal (FIG. 16). (The frames illustrated in FIG. 16 were taken from the same video in which the mosaic 300 and key frames 302, 304, and 306 are illustrated in FIG. 13.) These instances were easily detected after the mosaic clustering in accordance with the invention was used as input to classify court shots and close-up shots. A dendrogram 400 representing the clustering result of shots 402 during several minutes of a game is shown in FIG. 17. The upper cluster 410 represents close-up shots, and the lower cluster 412 represents court shots. These clusters could be further divided into smaller categories, such as court/audience close-ups and whole/right/left court shots. In one example, in the first quarter of one of the first games of the season, 32 sequences of a court shot followed by a close-up shot were detected, which were good candidates for representing field goals. Of these, 18 were regular field goals, 7 were assaults, 2 occurred just before a time-out, and the remaining 5 instances showed the missed field goals of a well-known NBA player. (This video was of popular basketball player Michael Jordan's first game returning to play after retirement, which may explain why cameras switched to close-ups even though Jordan missed.) All of these instances serve as "interesting" events in the game. They became bookmarks for an augmented video player which allows regular viewing of the game as well as skipping from one bookmark to the other. Screen captures of this video player for a foul shot bookmark are shown in FIG. 16. The preliminary clustering results 420 which separated field shots from close-up shots are shown in FIG. 18. This figure shows the first-stage (coarse) comparison of shots from that game. The top-left cluster 422 along the main diagonal 424 represents court shots, which cluster together in this stage; the following cluster 426 represents various close-ups of basketball players and the coach.
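The bookmark extraction then reduces to a scan over the per-shot cluster labels. The sketch below, with assumed label constants, records the index of the court shot that starts each court-shot/close-up pair as a candidate field-goal bookmark; the label values and array layout are illustrative only.

    #define COURT_SHOT   0   /* assumed cluster labels; illustrative values only */
    #define CLOSEUP_SHOT 1

    /* labels[i] is the cluster label assigned to shot i.  Every court shot that
     * is immediately followed by a close-up shot is recorded as a candidate
     * field-goal bookmark; the number of bookmarks found is returned.           */
    int FindGoalBookmarks(const int *labels, int num_shots, int *bookmarks)
    {
        int i, n = 0;
        for (i = 0; i + 1 < num_shots; i++)
            if (labels[i] == COURT_SHOT && labels[i + 1] == CLOSEUP_SHOT)
                bookmarks[n++] = i;
        return n;
    }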
Further analysis of the mosaics distinguishes between different categories of field goal throws and labels each with its corresponding team. This is done by adjusting the mosaic comparison method, which generally aligns sub-mosaics: it declares as a good match the alignment of a left (right) court mosaic, in which only the left (right) basket is visible, with a whole-court mosaic, in which both the left and right baskets are visible. By forcing the matching stage to align whole mosaics instead of sub-mosaics, a left (right) field mosaic is not matched with a whole field mosaic. After separating whole, left and right field shots, the information about basket goals becomes more complete. For close-ups occurring immediately after whole-court shots, simple motion analysis of the whole-court shots (which are panning shots) infers which side of the court was visible at the end of the shot, and therefore which team made the basket goal.
EXAMPLE 2

A second example utilizes the mosaic-based representation to generate a compact hierarchical representation of video. This representation is prepared for the sitcom video genre, although it is contemplated that this approach may be used for other video genres. The video sequences are first divided into shots using a shot transition detection technique described in A. Aner and J. R. Kender, "A Unified Memory-Based Approach to Cut, Dissolve, Key Frame and Scene Analysis," IEEE ICIP, 2001, which is incorporated by reference herein. These shots are further grouped into scenes using the method described in J. R. Kender and B. L. Yeo, "Video Scene Segmentation via Continuous Video Coherence," IEEE Conference on Computer Vision and Pattern Recognition, 1998, incorporated by reference herein. The hierarchical tree representation is illustrated in FIG. 1. Shots and scenes are represented with mosaics, which are used to cluster scenes according to physical location. The hierarchical representation in FIG. 1 illustrates the new level of abstraction of video, which constitutes the top level of the tree-like representation and is referred to herein as the physical setting 14. This high-level semantic representation not only forms a very compact representation for long video sequences, but also allows different videos (e.g., different episodes of the same sitcom) to be compared efficiently. By analyzing the comparison results, the method allows the main theme of the sitcom, as well as the main plots of each episode, to be inferred.
As discussed above, a scene is a collection of consecutive shots, related to each other by some spatial context, which could be an event, a group of characters engaged in some activity, or a physical location. For example, a scene in a sitcom typically occurs in a specific physical location, and this location is usually repeated throughout the episode. Some physical locations are characteristic of the specific sitcom and repeat in almost all of its episodes. Therefore, it is advantageous to describe scenes in sitcoms by their physical location and to use these physical settings to generate summaries of sitcoms and to compare different episodes of the same sitcom. This property is not confined to the genre of sitcoms and could be employed for other video data that has the hierarchical structure of scenes and that is constrained by production economics, formal structure, and/or human perceptive limits to re-use its physical settings. In order to represent a scene, the novel technique described herein uses the shots that have the most information about the physical scene. These shots are selected by automatically detecting static, panning and zoom shots: the first shot of the scene, the establishing shot, is selected, along with all shots that have a large pan or zoom; the process of automatically detecting these shots is described below. Once these shots are determined, their corresponding mosaics are chosen as the representative mosaics of the scene, or in short notation, "R-Mosaics." An example of a scene in a sitcom is shown in FIG. 19 for physical setting 500. Shots 502, 504, 506 that were photographed from different cameras 512, 514, 516, respectively, are typically very different visually, even though they all belong to the same scene and were taken at the same physical setting. One example of a "good shot" is a pan shot, as shown in FIG. 2, described above. Another example is a zoom shot: a zoomed-out portion of the shot will most likely show a wider view of the background, and thereby expose more parts of the physical setting. Detecting "good shots" is done by examining the registration transformations computed between consecutive frames in the shot. By analyzing those transformations, shots are classified as panning (or tracking), zoomed-in, zoomed-out or stationary. This classification is done automatically, as follows: for each shot, the affine transformations that were previously computed in the mosaic construction stage are consecutively applied to an initial square. This is coded in Matlab, provided in the appendix hereto as the routine "check_affine.m". Each transformed quadrilateral was then measured for size and distance from the previous quadrilateral. Static shots are shown in FIG. 20. Pan shots were determined by measuring distances between the quadrilaterals (FIG. 21), zoom shots were determined by measuring the varying size of the quadrilaterals (FIG. 22), and parallax was determined both by size and scale and by measuring the quality of the computed affine transformation (checking accumulated error and comparing to inverse transformations), as illustrated in FIG. 23. For each shot, the initial square is of size 10 x 10 and it resides in the middle of a generated figure. If the shot has N sampled frames, then their N corresponding computed affine transformations are applied, first on the initial square and then on the resulting quadrilaterals. As a result,
N quadrilaterals are generated, which are represented in different colors/tones to differentiate between them when looking at the generated figure. The axes in the figure's graph represent the dimensions of the space in which the quadrilaterals reside. For example, if the shot had a large pan or zoom, then this space becomes very large; if the shot was static, then this space stays relatively small. Each square in FIGS. 20-23 represents a sampled frame. The colors/tones in FIGS. 20-23 are selected from a fixed array of 7 colors/tones. Once shots are classified, the reference frame for the mosaic plane is chosen accordingly. This is performed automatically according to the determined classifications, as follows: for panning and stationary shots, the middle frame is chosen; for zoomed-in shots, the first frame is chosen; for zoomed-out shots, the last frame is chosen. A threshold is derived experimentally to allow for the selection of shots with significant zoom or pan. For some shots, further processing is needed, since some shots have significant parallax motion. These shots were segmented into several parts and separate mosaics were constructed for each part. However, some scenes are mostly static, without pan or zoom shots. Moreover, sometimes the physical setting is not visible in the R-Mosaic because the shot had a pan in a close-up form. For these scenes, a shot which best represents the scene has to be chosen. Most scenes in sitcoms (and all the static scenes that were processed in the example) have an interior "establishing shot," following the basic rules of film editing. This property also holds for many other video genres, since the use of an "establishing shot" appears necessary for human perception of a change of scene. It is a wide-angle shot (a "full-shot" or "long-shot"), photographed for the purpose of identifying the location or setting or for introducing the characters participating in that scene. An example of an establishing shot 502, shot from camera location 512, is illustrated in FIG. 19. Therefore, the first interior shot of each scene is selected to be an R-Mosaic (for a static scene it is the only R-Mosaic). Many indoor scenes also have an exterior shot preceding the interior "establishing shot," which photographs the building from the outside. Such a shot may be detected because it does not cluster with the rest of the following shots into their scene, and also does not cluster with the previous shots into the preceding scene. Instead, it is determined to be a unique cluster of its own. These unique clusters are detected and disregarded when constructing a scene representation. In the example, using this approach led to up to six R-Mosaics per scene. Once R-Mosaics are chosen, they may be used to compare and cluster scenes. FIG. 24 illustrates the procedure by which this is performed. For scene i 600, the following mosaics have been created: mosaic 1 602, mosaic 2 604, mosaic 3 606, and mosaic 4 608. Similarly, for scene j 610, the following mosaics have been created: mosaic 1 612 and mosaic 2 614. The distances, or dissimilarities, between mosaics are computed as discussed above, and are represented in FIG. 24 as distances 616. (For example, the distance 618 is computed between mosaic 1 602 of scene i 600 and mosaic 1 612 of scene j 610.) The distance between each pair of mosaics is computed according to equations (1)-(4) above, in which equation (1), when applied to the strip-to-strip distance matrix, defines the final distance between the mosaic pair.
The dissimilarity between two scenes is determined as the minimum dissimilarity between any of their R-Mosaics, as illustrated in FIG. 24, and represented below:

Distance(Scene i, Scene j) = min {distances between all R-Mosaic pairs of the two scenes}. (5)
This is due to the fact that different shots in a scene might contain different parts of the background; when attempting to match scenes that share the same physical location, the technique needs only to find at least one pair of shots (mosaics), one from each scene, that shows the same part of the background.
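A minimal sketch of equation (5) is given below, assuming a helper MosaicPairDist (a hypothetical name) that returns the precomputed distance between one R-Mosaic of each scene:

    #include <float.h>

    /* Hypothetical lookup: distance between R-Mosaic a of scene i and R-Mosaic b
     * of scene j, computed with equations (1)-(4).                              */
    double MosaicPairDist(int scene_i, int a, int scene_j, int b);

    /* Equation (5): the dissimilarity of two scenes is the minimum dissimilarity
     * over all pairs of their representative mosaics (n_i and n_j R-Mosaics).   */
    double SceneDist(int scene_i, int n_i, int scene_j, int n_j)
    {
        int a, b;
        double d, best = DBL_MAX;
        for (a = 0; a < n_i; a++)
            for (b = 0; b < n_j; b++) {
                d = MosaicPairDist(scene_i, a, scene_j, b);
                if (d < best)
                    best = d;
            }
        return best;
    }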
After determining the distance measure between each pair of scenes within an episode, a scene difference matrix is constructed in which each entry (i,j) corresponds to the difference measure between scene i and scene j. An example is shown in FIG. 25, in which the entries in the scene difference matrix were arranged manually (for display purposes) so that they correspond to the 5 physical settings of this episode. (The physical settings are automatically determined by the clustering algorithm; however, their naming is done manually.) On the left is the scene list 650 in temporal order, and on the right is the similarity graph 652 for the scene clustering results, in matrix form, in which darker regions represent higher similarity (low distance scores). For example, there are 13 scenes in the episode (as indicated in list 650). More specifically, scenes 1, 3, 7, 12 took place in the setting which was marked as "Apartment 1," i.e., the cluster 654. Scenes 2, 6, 9, 11 took place in "Apartment 2," i.e., the cluster 656. Scenes 5, 10 took place in "Coffee Shop," i.e., cluster 658. Scenes 4, 13 took place in "Bedroom 1," i.e., cluster 660, and scene 8 took place in "Bedroom 2."
Although the scene clustering process typically results in 5-6 physical settings, there are often about 1-6 scenes in each physical setting cluster, and about 1-6 mosaics representing each scene. Ideally, for the purposes of display and user interface, it is preferable to choose a single mosaic to represent each physical setting. However, this is not always possible. Shots of scenes in the same physical setting, and sometimes even within the same scene, are filmed using cameras in various locations which show different parts of the background. Therefore, two mosaics of the same physical setting might not even have any corresponding regions. In order to practice the technique, the representation of a physical setting preferably includes all parts of the background which are relevant to that setting. Therefore, if there is not a single mosaic which represents the whole background, several mosaics which together cover the whole background are used.
The results of the matching algorithm's finer stage, which recognizes corresponding regions in the mosaics, are used to determine a "minimal covering set" of mosaics for each physical setting. This set is approximated by clustering all the representative mosaics of all the scenes of one physical setting and choosing a single mosaic to represent each cluster. This single mosaic is the centroid of the cluster, i.e., the mosaic which has the best average match value to the rest of the mosaics in that cluster. This may be achieved in two stages: in the first stage, separate clusters are identified, where each cluster corresponds to mosaics which show corresponding areas in the background of the particular physical setting; in the second stage, a distance matrix between the mosaics of each cluster is computed, and the mosaic with the minimal average distance from all the other mosaics is chosen as the centroid of that cluster. FIG. 26 illustrates the hierarchical representation of a single episode, of scenes and physical settings; the images 712-736 are sampled key frames from the 13 scenes of that episode. The physical settings are represented by mosaics. There are five physical settings (Apartment 1 702, Apartment 2 704, Coffee Shop 706, Bedroom 1 708, and Bedroom 2 710) and 13 scenes 712-736. The relationship between them is shown by the lines 740, 742, 744, 746, 748, 750: frames 712, 716, 724, and 734 correspond to physical setting "Apartment 1" represented by mosaic 702; frames 714, 722, 728, and 732 correspond to physical setting "Apartment 2" represented by mosaic 704; frames 720 and 730 correspond to physical setting "Coffee Shop" represented by mosaic 706; frames 718 and 736 correspond to physical setting "Bedroom 1" represented by mosaic 708; and frame 726 corresponds to physical setting "Bedroom 2" represented by mosaic 710. For display purposes, only one mosaic from the largest cluster is chosen to represent each setting.
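The centroid choice in the second stage may be sketched as follows, assuming the cluster's pairwise mosaic distances have already been collected into a small matrix; this is an illustration of the selection rule, not the appendix code:

    #include <float.h>

    /* Centroid selection sketch: D is the n-by-n distance matrix restricted to
     * the mosaics of one cluster.  The mosaic whose average distance to all the
     * other mosaics in the cluster is smallest is returned as the centroid.     */
    int ClusterCentroid(double **D, int n)
    {
        int i, j, best_i = 0;
        double sum, avg, best_avg = DBL_MAX;
        for (i = 0; i < n; i++) {
            sum = 0.0;
            for (j = 0; j < n; j++)
                if (j != i)
                    sum += D[i][j];
            avg = (n > 1) ? sum / (n - 1) : 0.0;
            if (avg < best_avg) {
                best_avg = avg;
                best_i = i;
            }
        }
        return best_i;
    }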
Table 1 illustrates the compactness of the representation method using settings, scenes, shots, and frames of a single episode. In this episode, there are 5 physical settings, each represented by a single R-Mosaic (the R-Mosaics are referred to by their corresponding shot number) and having 1-4 scenes. Each scene is represented by only 1-4 R-Mosaics, and has 11-26 shots, i.e., approximately 2400-11100 frames.
TABLE 1
The results of the mosaic-based scene clustering and the construction of physical settings for three episodes of the same sitcom are shown in the similarity graphs of FIGS. 27-29. FIGS. 30-32 illustrate dendrograms generated using tools as described in Peter Kleiweg, "Data Clustering," online publication, http://odur.let.rug.nl/~kleiweg/clustering/clustering.html, as are known in the art. (The program described therein reads a difference file and creates a clustering represented by a dendrogram. This program is an implementation of seven different clustering algorithms, which are described in Anil K. Jain and Richard C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, NJ, 1988, which is incorporated by reference in its entirety herein.) It is understood that other clustering techniques are known in the art and are also applicable for clustering the distance values calculated herein. Each cluster in each dendrogram represents a different physical setting.
(The episode represented by the dendrogram in FIG. 30 corresponds to the same episode represented by the similarity graph in FIG. 27; FIGS. 31 and 28, and FIGS. 32 and 29, are similarly related.) FIG. 33 represents the results of an inter-video comparison of the physical settings in the three episodes. With continued reference to FIGS. 27-29, the order of the entries in the original scene difference matrices was manually arranged to reflect the order of the reappearance of the same physical settings. Dark blocks represent good matches (low difference scores). As can be seen in FIG. 28, there are 5 main clusters representing the 5 different scene locations. For example, the first cluster along the main diagonal is a 4x4 square representing scenes from "Apartment 1." The false similarity values outside this square are due to actual physical similarities shared by "Apartment 1" and "Apartment 2," for example, light brown kitchen cabinets. Nevertheless, this did not affect the scene clustering results, as can be seen in the corresponding dendrogram of FIG. 31, generated by applying a weighted-average clustering algorithm to the original scene difference matrix on the left.
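For illustration, a compact single-link variant of such a clustering step is sketched below; it is a generic example of the family of algorithms cited, not the cited program itself, and the merge threshold is an assumed parameter.

    #include <float.h>

    /* Single-link agglomerative clustering over an n-by-n scene distance matrix
     * D: every scene starts as its own cluster, and the two clusters whose
     * closest members are nearest are merged repeatedly until the smallest
     * inter-cluster distance exceeds the threshold.  label[i] receives the
     * cluster id of scene i.                                                    */
    void SingleLinkCluster(double **D, int n, double threshold, int *label)
    {
        int i, j, bi, bj;
        double best;
        for (i = 0; i < n; i++)
            label[i] = i;                        /* start with singleton clusters */
        for (;;) {
            best = DBL_MAX;
            bi = -1;
            bj = -1;
            for (i = 0; i < n; i++)
                for (j = 0; j < n; j++)
                    if (label[i] != label[j] && D[i][j] < best) {
                        best = D[i][j];          /* single link: closest members  */
                        bi = label[i];
                        bj = label[j];
                    }
            if (bi < 0 || best > threshold)
                break;                           /* nothing close enough to merge */
            for (i = 0; i < n; i++)
                if (label[i] == bj)
                    label[i] = bi;               /* merge cluster bj into bi      */
        }
    }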
When grouping information from several episodes of the same sitcom, repeating physical settings are detected. It is often the case that each episode has 2-3 settings which are unique to that episode, and 2-3 more settings which are common and recur in other episodes. The clustering information from three episodes of the same sitcom was summarized, and similarities across the videos based on the physical settings were computed. The distance between two physical settings is defined to be the minimum distance between their R-Mosaics, as discussed above in reference to FIG. 24. In order to compare the three long video sequences, each 40K frames long, the 5 physical settings of the first episode are compared with the 5 physical settings of the second episode and the 6 physical settings of the third episode. The results are illustrated in FIG. 33, in which physical settings 760, 762, 764, 766, and 768 are from episode 1 (portion 750 of FIG. 33); physical settings 770, 772, 774, 776, 778 are from episode 2 (portion 752 of FIG. 33); and settings 780, 782, 784, 786, 790, and 792 are from episode 3 (portion 754 of FIG. 33). Lines join matching physical settings, which are common settings in most episodes of this sitcom: e.g., line 794 joins the "Apartment 1" setting represented by 760, 770, and 784 in episodes 1, 2, and 3, respectively; line 796 joins the "Apartment 2" settings 772 and 782; and line 798 joins the "Coffee Shop" settings 762, 774, and 780.
Comparing settings across episodes leads to a higher-level contextual identification of main plots. "Plots" are defined herein as settings unique to the episode. For example, in episode 1 (represented by the portion 750 of FIG. 33), there are three main plots involving activities in a dance class 764, a jail 766, and an airport 768. Most episodes of sitcoms have been found to involve two or three plots; anecdotal feedback from human observers of the sitcoms suggests that people relate the plot to the unusual setting in which it occurs. That is, what makes a video unique is the use of settings which are unusual with respect to the library.
Another example is shown for a fourth episode, represented in FIGS. 34-35. There are six physical settings in this episode, e.g., "Apartment 1," "Apartment 2," "Coffee Shop," "Party," "Tennis Game," and "Boss's Dinner." In this episode, the clustering of scenes into physical settings was not as straightforward as for the previous episodes discussed above and illustrated in FIGS. 27-32. This is due to the fact that the setting of "Apartment 1" was not presented in the same manner, since its scenes took place either in the kitchen or in the living room, but not in both. Mosaics which were generated from shots in scenes that took place in the kitchen of "Apartment 1" have no overlap with mosaics which were generated from shots in scenes that took place in the living room of "Apartment 1." This causes the scenes of "Apartment 1" to be separated into two clusters, as shown in FIG. 34: scene 1 belongs to the living room cluster 802, and scenes 7, 9, and 12 belong to the kitchen cluster 804. The clustering also does not separate scene 11 well from the rest of the scenes, since this new setting took place in an apartment with walls and windows of colors similar to those appearing in the setting "Coffee Shop," into which it is loosely clustered. However, when these settings are compared with the settings of a number of different episodes, the division of "Apartment 1" into two separate settings is corrected. In the context of all the episodes of this sitcom, the setting of "Apartment 1" includes mosaics of both the living room and the kitchen, causing the two different settings of this episode to be combined together. More specifically, the "Apartment 1" setting cluster already contains mosaics that match both scene 1 and scenes 7, 9, and 12 from the new episode. Example 2 thus demonstrates how the non-temporal level of abstraction of a single video can be verified and corrected by semantic inference to other videos: scenes and settings that otherwise would not have been grouped together are related by a type of "learning" from the previously detected "physical setting" structure of other episodes. For the video genre which was used in Example 2, sitcoms, the physical setting structure is well defined and it is straightforward to distinguish between settings, as discussed above. Some features of sitcoms which improve results are the cost considerations which limit the number of different sets used for each sitcom, as well as the very pronounced colors which are often used in different settings, probably to catch viewer attention. As discussed above, the scene dissimilarity measure is used to determine the accuracy of the physical setting detection. Different clustering methods would result in the same physical setting cluster structure as long as the scene distance matrix has the correct values. For example, for the fourth episode discussed above with reference to FIGS. 34-35, the inter-video comparison of physical settings would correct the clustering results for the first setting of "Apartment 1," but the clustering threshold was not as pronounced as in the first three episodes. Depending on this threshold, for large values scene 11 could be wrongly clustered with scenes 14, 2, and 4 (see FIG. 35), and for small values scene 15 would not be clustered with scenes 5, 8, 10 and 13, as it should be (see FIG. 35).
The complexity of the scene clustering method discussed herein is low. Since all mosaic pairs are matched, if there are M mosaics in an episode, then M(M-1)/2 "coarse" match stages will be performed, after which only several mosaic pairs will be matched in the "fine" stage of analysis. In the examples discussed herein, the number of such pairs was on the order of O(2M). Once the scene distance matrix is constructed, the physical settings are determined using any clustering algorithm known in the art. Several clustering methods were used (all available from the web site referenced above), which performed similarly. Among them were "single link" (the distance between clusters is defined to be the minimum distance between their elements), "complete link" (the distance between clusters is instead defined to be the maximum distance between their elements), "group average" (the distance between clusters is defined to be the average distance between their elements), and "weighted average" (the distance between clusters is defined to be a weighted average distance between their elements, where the weight of each element is set according to the number of times the element participated in the cluster combination step of the clustering process). Since the maximum number of scenes encountered in sitcoms was 15, there are up to 15 elements to cluster, so every clustering algorithm runs quickly. Finally, when comparing physical settings across episodes, there are only 5-6 settings in each episode, each represented by no more than 3 mosaics, which also makes the comparison process very efficient.

A representative user interface for the invention described herein is a video browser, which utilizes the proposed tree-like hierarchical structure as represented in FIG. 1, above. The browser uses the video data gathered from all levels of representation of the video. At the frame level, the MPEG video format of each episode is used for viewing video segments of the original video. At the shot level, it uses a list of all marked shots for each episode, including the start and end frame of each shot. At the scene level, it uses a list of all marked scenes for each episode, including the start and end shots of each scene. For each scene, a list of representative shots is kept, and their corresponding image mosaics are used for display within the browser. At the physical setting level, it uses a list of all detected physical settings for each episode, with their corresponding hand-labeled descriptions (e.g., "Apartment 1," "Coffee Shop," etc.). Each physical setting has a single representative image mosaic, used for display.
A representative browser is illustrated in FIG. 36, and is implemented in Java. The main menu is displayed as a table-like summary in a single window 850. Each row 852, 854, and 856 represents an episode of the specified sitcom. The columns 858-878 represent the different physical settings that were determined during the clustering phase of scenes for all episodes, as discussed above. Each cell (i,j) in the table is either empty (e.g., the empty region 880 corresponding to setting "Apartment 2" 860 and episode 854) or displays a representative mosaic for setting j, taken from episode i. The order of columns from left to right is organized from the most common setting, i.e., "Apartment 1" 858, to the non-common settings, i.e., "Bar" 878. In the example, the first three columns represent common settings which repeat in almost every episode of the specific sitcom. The rest of the columns are generally unique to a single episode. In this manner, the user can immediately recognize the main plots for each episode by looking for non-empty cells in the row of that episode, starting from the first column of unique settings, e.g., starting at the fourth column 862 in FIG. 36. For example, in episode 852, the main plots involve scenes taking place in settings "Bedroom 1" 864 and "Bedroom 2" 866. In order to confirm the main plots quickly, the user can left-click on the representative mosaics for these settings, which displays a window 882 containing a short list of scene mosaics that correspond to those settings (usually one or two), as illustrated in FIG. 37. If further detail is needed, left-clicking on a mosaic in window 882 will enlarge and display the mosaic in window 884 of FIG. 37, and double-clicking on the representative mosaic for each scene will start playing the video from the beginning of that scene in window 886 of FIG. 38. The temporal representation of each episode is also accessed from the main menu 850 and is used for fast browsing of the episode. By left-clicking on a certain episode name, such as episode 854 of FIG. 36, a window 882 with a list of all scene mosaics belonging to that episode appears (FIG. 39). Each scene on the list shown in window 882 is represented by a single mosaic 886 and may optionally be expanded, by left-clicking, into a window with a list of representative mosaics (shots) for that scene. Fast browsing is performed by scanning the scenes in order and playing only the relevant video segments from chosen scenes by double-clicking on them, as shown in FIG. 38.
The browser discussed herein has the advantage of being hierarchical, displaying semantically oriented video summaries in a non-temporal tree-like fashion, as well as semantically relating different episodes of the same sitcom to each other.
It will be understood that the foregoing is only illustrative of the principles of the invention, and that various modifications can be made by those skilled in the art without departing from the scope and spirit of the invention. The mosaic-based scene comparison method is not confined to the genre of situation comedies alone. In news videos, for example, it could allow classifying broadcasts from well-known public places. It could also allow classification of different sports videos such as basketball, soccer, hockey and tennis according to the characteristics of the play field. The methods described herein may serve as a useful tool for content-based video access. Video sequences are represented in their multiple levels of organization in the following structure: frames, shots, scenes, settings and themes. The proposed mosaic-based approach allows direct identification of both clusters of scenes (settings) within a video sequence and similar settings across different video sequences, and serves as a useful indexing tool.
Moreover, the comparison by alignment technique is useful to more general image retrieval applications. In contrast to many image retrieval techniques that use global color features of single images, the technique described herein incorporates spatial information by applying a coarse alignment between the images. It is robust to occluding objects and will match images for which only partial regions match (e.g., the top-left region of one image matches the bottom-left region of a second image).
APPENDIX
A computer program listing is submitted in the Appendix, and is incorporated by reference in its entirety herein. A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.
median_mosaics.c: void main(): This function reads in a data file which lists shots in a specified directory. For each shot, it reads in the mosaic image generated for this shot, applies a median filter on this image (this is done by applying the function "MedianFilterRGB", which will be described in detail in image.c) and saves the new image into a new image file under that same shot directory.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "..\image\image.h"

#define FILTER_SIZE 5
#define MAX_SHOT_NUM 200
#define DIRNAME "D:\\aya\\users\\friends3\\"
#define MAX_MOS_WIDTH 2000
#define MAX_MOS_HEIGHT 1000

void main()
{
    char line[256], filename[256], *token;
    int i, j;
    int mosaics[MAX_SHOT_NUM];
    FILE *invec;
    int Width, Height, shots_num = 0;
    unsigned char **R, **G, **B, **R1, **G1, **B1;

    for (i=0; i<MAX_SHOT_NUM; i++)
        mosaics[i] = 0;
    sprintf(filename, "%sdata.txt", DIRNAME);
    invec = fopen(filename, "r");
    while ((fgets(line, 255, invec)) != NULL) {
        token = strtok(line, " ");
        while (token != NULL) {
            mosaics[shots_num++] = atoi(token);
            token = strtok(NULL, " ");
        }
    }
    fclose(invec);
    printf("There are %d shots\n", shots_num);

    R = AllocMatrix(MAX_MOS_HEIGHT, MAX_MOS_WIDTH);
    G = AllocMatrix(MAX_MOS_HEIGHT, MAX_MOS_WIDTH);
    B = AllocMatrix(MAX_MOS_HEIGHT, MAX_MOS_WIDTH);
    R1 = AllocMatrix(MAX_MOS_HEIGHT, MAX_MOS_WIDTH);
    G1 = AllocMatrix(MAX_MOS_HEIGHT, MAX_MOS_WIDTH);
    B1 = AllocMatrix(MAX_MOS_HEIGHT, MAX_MOS_WIDTH);

    for (i=0; i<shots_num; i++) {
        sprintf(filename, "%smos\\Original\\Mos_color_med%d.ppm", DIRNAME, mosaics[i]);
        ReadPPMAllocated(filename, &Width, &Height, &R, &G, &B);
        sprintf(filename, "%smos\\Mos_median%d.ppm", DIRNAME, mosaics[i]);
        MedianFilterRGB(FILTER_SIZE, Width, Height, R, G, B, R1, G1, B1);
        WritePPM(filename, R1, G1, B1, Width, Height);
    }
}
hist.h: This is a header file that lists all the functions which are implemented in hist.c and will be explained there. It also defines some constants that set the size of the histogram, and the histogram structure (a three-dimensional histogram) used throughout the program.

#ifndef _HIST_H
#define _HIST_H

#define HIST_SIZE 16
#define SIZE_OF_Y 3
#define SIZE_OF_U 18
#define SIZE_OF_V 6
#define SIZE_OF_R 16
#define SIZE_OF_G 16
#define SIZE_OF_B 16

typedef struct {
    double ***yuv;
    int num_of_bins;
    int am_bins_Y;
    int am_bins_U;
    int am_bins_V;
} HIST;

HIST* AllocHist(int size);
HIST* AllocHist2(int sizeY, int sizeU, int sizeV);
void FreeHist(HIST *);
void FillHist(HIST *hist, unsigned char **pixels, unsigned char *Y);
void ZerofizeHist(HIST *);
double GetDistHistHist(HIST *h1, HIST *h2);
double FullHistDiffL1(HIST *h1, HIST *h2);
double FullHistDiffL2(HIST *h1, HIST *h2);
HIST * FillCummHistFromRGBArray1(int sizeY, int sizeU, int sizeV,
    unsigned char **R, unsigned char **G, unsigned char **B,
    int startX, int endX, int startY, int endY);
HIST * FillHistFromRGBArray2(int sizeI, int sizeH, int sizeS,
    unsigned char **R, unsigned char **G, unsigned char **B,
    int startX, int endX, int startY, int endY);

#endif
hist.c: Includes the following routines:

double FullHistDiffL1(HIST *h1, HIST *h2): Function to compute the L1 norm between two histograms.

double FullHistDiffL2(HIST *h1, HIST *h2): Function to compute the L2 norm between two histograms.

void ZerofizeHist(HIST *hist): Initialization function which sets all elements of the histogram to zero.

HIST* AllocHist(int size): Function to allocate space for the histogram structure. All three dimensions of the histogram are set to be of the same size.

HIST* AllocHist2(int sizeY, int sizeU, int sizeV): Function to allocate space for the histogram structure. Each dimension of the histogram is set according to the size specified by the parameters to the function (sizeY, sizeU, sizeV).

void FreeHist(HIST *h): Function to free the space allocated for the histogram structure.

HIST * FillHistFromRGBArray1(int sizeY, int sizeU, int sizeV, unsigned char **R, unsigned char **G, unsigned char **B, int startX, int endX, int startY, int endY): Function that takes an image and computes its RGB color histogram (a three-dimensional histogram in RGB color space). It returns a histogram structure containing the histogram values.

HIST * FillHistFromRGBArray2(int sizeI, int sizeH, int sizeS, unsigned char **R, unsigned char **G, unsigned char **B, int startX, int endX, int startY, int endY): Function that takes an image and computes its HSI color histogram (a three-dimensional histogram in HSI color space). It returns a histogram structure containing the histogram values.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "hist.h"

static double two_pi = 2*3.1415926535;

double FullHistDiffL1(HIST *h1, HIST *h2)
{
    int i, j, k;
    double sum = 0;
    double h1_val, h2_val;
    for (i=0; i<h1->am_bins_Y; i++)
        for (j=0; j<h1->am_bins_U; j++)
            for (k=0; k<h1->am_bins_V; k++) {
                h1_val = h1->yuv[i][j][k];
                h2_val = h2->yuv[i][j][k];
                sum += fabs(h1_val - h2_val);
            }
    return sum / 2;
}

double FullHistDiffL2(HIST *h1, HIST *h2)
{
    int i, j, k;
    double sum = 0;
    double h1_val, h2_val;
    for (i=0; i<h1->am_bins_Y; i++)
        for (j=0; j<h1->am_bins_U; j++)
            for (k=0; k<h1->am_bins_V; k++) {
                h1_val = h1->yuv[i][j][k];
                h2_val = h2->yuv[i][j][k];
                sum += (h1_val - h2_val) * (h1_val - h2_val);
            }
    sum = sqrt(sum);
    return sum;
}
void ZerofizeHist(HIST *hist)
{
    int i, j, k;
    for (i=0; i<hist->am_bins_Y; i++)
        for (j=0; j<hist->am_bins_U; j++)
            for (k=0; k<hist->am_bins_V; k++)
                hist->yuv[i][j][k] = 0;
    hist->num_of_bins = 0;
}
HIST* AllocHist(int size)
{
    HIST *retval;
    int i, j;
    retval = (HIST*) calloc(1, sizeof(HIST));
    retval->am_bins_Y = size;
    retval->am_bins_U = size;
    retval->am_bins_V = size;
    retval->yuv = (double***) calloc(size, sizeof(double**));
    if (retval->yuv == NULL) {
        fprintf(stderr, "can't ellocate yuv histogram\n");
        exit(2);
    }
    for (i=0; i<size; i++) {
        if ((retval->yuv[i] = (double**) calloc(size, sizeof(double*))) == NULL) {
            fprintf(stderr, "can't ellocate yuv histogram\n");
            exit(2);
        }
        for (j=0; j<size; j++)
            if ((retval->yuv[i][j] = (double*) calloc(size, sizeof(double))) == NULL) {
                fprintf(stderr, "can't ellocate yuv histogram\n");
                exit(2);
            }
    }
    ZerofizeHist(retval);
    return retval;
}
HIST* AllocHist2(int sizeY, int sizeU, int sizeV)
{
    HIST *retval;
    int i, j;
    retval = (HIST*) calloc(1, sizeof(HIST));
    retval->am_bins_Y = sizeY;
    retval->am_bins_U = sizeU;
    retval->am_bins_V = sizeV;
    retval->yuv = (double***) calloc(sizeY, sizeof(double**));
    if (retval->yuv == NULL) {
        fprintf(stderr, "can't ellocate yuv histogram\n");
        exit(2);
    }
    for (i=0; i<sizeY; i++) {
        if ((retval->yuv[i] = (double**) calloc(sizeU, sizeof(double*))) == NULL) {
            fprintf(stderr, "can't ellocate yuv histogram\n");
            exit(2);
        }
        for (j=0; j<sizeU; j++)
            if ((retval->yuv[i][j] = (double*) calloc(sizeV, sizeof(double))) == NULL) {
                fprintf(stderr, "can't ellocate yuv histogram\n");
                exit(2);
            }
    }
    ZerofizeHist(retval);
    return retval;
}
void FreeHist(HIST *h)
{
    int i, j;
    if (h == NULL) {
        fprintf(stderr, "h==NULL in FreeHist\n");
        exit(1);
    }
    if (h->yuv == NULL) {
        fprintf(stderr, "h->yuv==NULL in FreeHist\n");
    } else {
        for (i=0; i<h->am_bins_Y; i++) {
            for (j=0; j<h->am_bins_U; j++)
                free(h->yuv[i][j]);
        }
        free(h->yuv);
    }
    free(h);
    h = NULL;
}
HIST * FillHistFromRGBArray1(int sizeY, int sizeU, int sizeV,
                             unsigned char **R, unsigned char **G, unsigned char **B,
                             int startX, int endX, int startY, int endY)
{
    int x, y, i, j, k, count = 0;
    unsigned char indU, indV, indY;
    double Rval, Gval, Bval;
    double sumRGB;
    double Xn, Yn, Zn;
    HIST *hist = AllocHist2(sizeY, sizeU, sizeV);

    for (x=startX; x<=endX; x++)
        for (y=startY; y<=endY; y++) {
            if (!R[x][y] && !G[x][y] && !B[x][y]) {
                continue;
            }
            Rval = R[x][y] / 255.0;
            Gval = G[x][y] / 255.0;
            Bval = B[x][y] / 255.0;
            Xn = 0.490*Rval + 0.320*Gval + 0.200*Bval;
            Yn = 0.177*Rval + 0.813*Gval + 0.011*Bval;
            Zn = 0.000*Rval + 0.010*Gval + 0.990*Bval;
            Rval = 1.910*Xn - 0.533*Yn - 0.288*Zn;
            Gval = -0.985*Xn + 2.000*Yn - 0.028*Zn;
            Bval = 0.058*Xn - 0.118*Yn + 0.896*Zn;
            sumRGB = Rval + Gval + Bval;
            Rval = Rval / sumRGB;
            Gval = Gval / sumRGB;
            Bval = Bval / sumRGB;
            indY = (unsigned char)(Bval * (sizeY));
            indU = (unsigned char)(Rval * (sizeU));
            indV = (unsigned char)(Gval * (sizeV));
            if (indY < 0) indY = 0;
            if (indU < 0) indU = 0;
            if (indV < 0) indV = 0;
            if (indY > sizeY-1) indY = sizeY-1;
            if (indU > sizeU-1) indU = sizeU-1;
            if (indV > sizeV-1) indV = sizeV-1;
            count++;
            hist->yuv[indY][indU][indV] += 1.0;
        }
    hist->num_of_bins = count;
    if (count) {
        for (i=0; i<hist->am_bins_Y; i++)
            for (j=0; j<hist->am_bins_U; j++)
                for (k=0; k<hist->am_bins_V; k++)
                    hist->yuv[i][j][k] /= count;
    }
    return hist;
}
HIST * FillHistFromRGBArray2(int sizeI, int sizeH, int sizeS,
                             unsigned char **R, unsigned char **G, unsigned char **B,
                             int startX, int endX, int startY, int endY)
{
    int x, y, i, j, k, count = 0;
    unsigned char indI, indH, indS;
    double Rval, Gval, Bval;
    double min_val, h, s, v, dark_limit;
    HIST *hist = AllocHist2(sizeI, sizeH, sizeS);

    for (x=startX; x<=endX; x++)
        for (y=startY; y<=endY; y++) {
            if (!R[x][y] && !G[x][y] && !B[x][y]) {
                continue;
            }
            Rval = R[x][y] / 255.0;
            Gval = G[x][y] / 255.0;
            Bval = B[x][y] / 255.0;
            v = 0.299*Rval + 0.587*Gval + 0.114*Bval;
            min_val = min(Rval, Gval);
            min_val = min(min_val, Bval);
            s = (v - min_val) / v;
            min_val = 2 * sqrt((Rval-Gval)*(Rval-Gval) + (Rval-Bval)*(Gval-Bval));
            h = acos((2*Rval - Gval - Bval) / min_val);
            if (Bval/v > Gval/v)
                h = two_pi - h;
            h /= two_pi;
            if (s <= 0.08) {
                indI = indS = 0;
                indH = (unsigned char)(v * 4);
                if (indH > 3) indH = 3;
            }
            else {
                s -= 0.08;
                indS = (unsigned char)((sizeS-1) * (s/0.92)) + 1;
                indH = (unsigned char)(h * sizeH);
                dark_limit = 0.5 / sizeI;
                if (v < dark_limit)
                    indI = 0;
                else {
                    v -= dark_limit;
                    indI = (unsigned char)((v / (1 - dark_limit)) * (sizeI-1) + 1);
                }
            }
            if (indI < 0) indI = 0;
            if (indH < 0) indH = 0;
            if (indS < 0) indS = 1;
            if (indI > sizeI-1) indI = sizeI-1;
            if (indH > sizeH-1) indH = sizeH-1;
            if (indS > sizeS-1) indS = sizeS-1;
            hist->yuv[indI][indH][indS] += 1.0;
            count++;
        }
    hist->num_of_bins = count;
    if (count) {
        for (i=0; i<hist->am_bins_Y; i++)
            for (j=0; j<hist->am_bins_U; j++)
                for (k=0; k<hist->am_bins_V; k++)
                    hist->yuv[i][j][k] /= count;
    }
    return hist;
}
image.h: This is a header file that lists all the functions which are implemented in image.c and will be explained there.

#ifndef _IMAGE_H
#define _IMAGE_H

void WritePGM(char *path, unsigned char **im, int Width, int Height);
void WritePPM(char *path, unsigned char **R, unsigned char **G, unsigned char **B, int Width, int Height);
void ReadPPM(char *FileName, int *Width, int *Height, unsigned char ***R, unsigned char ***G, unsigned char ***B);
void ReadPPMAllocated(char *FileName, int *Width, int *Height, unsigned char ***R, unsigned char ***G, unsigned char ***B);
unsigned char ** AllocMatrix(int rows, int cols);
void FreeMatrix(unsigned char **m);
unsigned char ** RGB2Gray(unsigned char **R, unsigned char **G, unsigned char **B, int Width, int Height);
void MedianFilterRGB(int filter_size, int Width, int Height,
    unsigned char **Rin, unsigned char **Gin, unsigned char **Bin,
    unsigned char **Rout, unsigned char **Gout, unsigned char **Bout);

#endif
image.c: Includes the following routines:

void WritePPM(char *path, unsigned char **R, unsigned char **G, unsigned char **B, int Width, int Height): Function to write a color image into a file using the "ppm" format.

void WritePGM(char *path, unsigned char **im, int Width, int Height): Function to write a gray-level image into a file using the "pgm" format.

void ReadPPM(char *FileName, int *Width, int *Height, unsigned char ***Rarr, unsigned char ***Garr, unsigned char ***Barr): Function to read a color image from a file stored in "ppm" format. It allocates space for the three color channels (R,G,B), then stores the values in them.

void ReadPPMAllocated(char *FileName, int *Width, int *Height, unsigned char ***R, unsigned char ***G, unsigned char ***B): Function to read a color image from a file stored in "ppm" format. It assumes that space has already been allocated for the three color channels (R,G,B), and stores the values in them.

unsigned char ** AllocMatrix(int rows, int cols): Function to allocate space for a two-dimensional array of unsigned characters; each such two-dimensional array is used throughout the program to store a single color channel of an image.

void FreeMatrix(unsigned char **m): Function to free the memory allocated to a two-dimensional array (described above).

void MedianFilterRGB(int filter_size, int Width, int Height, unsigned char **Rin, unsigned char **Gin, unsigned char **Bin, unsigned char **Rout, unsigned char **Gout, unsigned char **Bout): Function that takes an image and applies a median filter on that image. This operation smoothes the image: each pixel in the image gets the median value of its neighbors instead of its own original value. The size of such a neighborhood is passed to the function as a parameter.

unsigned char ** RGB2Gray(unsigned char **R, unsigned char **G, unsigned char **B, int Width, int Height): Function to convert a color image (three channels R,G,B) into a gray-scale single-channel image. This function is used by the previous function "MedianFilterRGB".
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "image.h"

void WritePPM(char *path, unsigned char **R, unsigned char **G, unsigned char **B, int Width, int Height)
{
    int x, y, index = 0;
    FILE *fp;
    if ((fp = fopen(path, "wb")) == NULL) {
        fprintf(stderr, "WritePPM: Could not open file %s\n", path);
        return;
    }
    fprintf(fp, "P6\n");
    fprintf(fp, "%d\n", Width);
    fprintf(fp, "%d\n", Height);
    fprintf(fp, "255\n");
    for (y=0; y<Height; y++) {
        for (x=0; x<Width; x++) {
            if ((fputc(R[y][x], fp) == EOF) ||
                (fputc(G[y][x], fp) == EOF) ||
                (fputc(B[y][x], fp) == EOF)) {
                fprintf(stderr, "WritePPM: Could not write to file %s\n", path);
                return;
            }
        }
    }
    fclose(fp);
}
void WritePGM(char *path, unsigned char **im, int Width, int Height)
{
    FILE *fp;
    int y, x;
    if ((fp = fopen(path, "wb")) == NULL) {
        fprintf(stderr, "WritePGM: Could not open file %s\n", path);
        exit(1);
    }
    if (im == NULL) {
        fprintf(stderr, "WritePGM: im==NULL\n");
        exit(1);
    }
    fprintf(fp, "P5\n");
    fprintf(fp, "%d\n", Width);
    fprintf(fp, "%d\n", Height);
    fprintf(fp, "255\n");
    for (y=0; y<Height; y++)
        for (x=0; x<Width; x++)
            fputc(im[y][x], fp);
    fclose(fp);
}

void ReadPPM(char *FileName, int *Width, int *Height, unsigned char ***Rarr, unsigned char ***Garr, unsigned char ***Barr)
{
    int y, x, index = 0;
    FILE *FilePtr;
    char NextChar;
    int SizeCount = 0;
    char Type;
    char Comment[80];
    int TmpNum = 0;
    int HeaderSize, MaxValue;
    unsigned char *LineBuf, *InPixelPtr;
    unsigned char **R, **G, **B;

    if ((FilePtr = fopen(FileName, "r")) == NULL) {
        fprintf(stderr, "ReadPPM: Could not open file %s\n", FileName);
        exit(1);
    }
    NextChar = getc(FilePtr);
    if (NextChar != 'P') {
        fprintf(stderr, "ReadPPM: %s not PxM file\n", FileName);
        exit(1);
    }
    SizeCount++;
    Type = getc(FilePtr);
    SizeCount++;
    NextChar = getc(FilePtr);
    SizeCount++;
    while ((NextChar < '0') || (NextChar > '9')) {
        if (NextChar == '#') {
            fgets(Comment, 80, FilePtr);
            SizeCount += strlen(Comment);
        }
        NextChar = getc(FilePtr);
        SizeCount++;
    }
    while ((NextChar >= '0') && (NextChar <= '9')) {
        TmpNum = (TmpNum * 10) + (NextChar - 48);
        NextChar = getc(FilePtr);
        SizeCount++;
    }
    *Width = TmpNum;
    while ((NextChar < '0') || (NextChar > '9')) {
        if (NextChar == '#') {
            fgets(Comment, 80, FilePtr);
            SizeCount += strlen(Comment);
        }
        NextChar = getc(FilePtr);
        SizeCount++;
    }
    TmpNum = 0;
    while ((NextChar >= '0') && (NextChar <= '9')) {
        TmpNum = (TmpNum * 10) + (NextChar - 48);
        NextChar = getc(FilePtr);
        SizeCount++;
    }
    *Height = TmpNum;
    while ((NextChar < '0') || (NextChar > '9')) {
        if (NextChar == '#') {
            fgets(Comment, 80, FilePtr);
            SizeCount += strlen(Comment);
        }
        NextChar = getc(FilePtr);
        SizeCount++;
    }
    TmpNum = 0;
    while ((NextChar >= '0') && (NextChar <= '9')) {
        TmpNum = (TmpNum * 10) + (NextChar - 48);
        NextChar = getc(FilePtr);
        SizeCount++;
    }
    MaxValue = TmpNum;
    HeaderSize = SizeCount;
    fclose(FilePtr);

    if ((FilePtr = fopen(FileName, "rb")) == NULL) {
        fprintf(stderr, "ReadPPM: Could not open file %s\n", FileName);
        exit(1);
    }
    if (fseek(FilePtr, HeaderSize, SEEK_SET) != 0) {
        fprintf(stderr, "ReadPPM: Could not read header of %s\n", FileName);
        fclose(FilePtr);
        exit(1);
    }
    LineBuf = (unsigned char *) malloc((*Width) * 3);
    R = AllocMatrix(*Height, *Width);
    G = AllocMatrix(*Height, *Width);
    B = AllocMatrix(*Height, *Width);
    for (y=0; y<*Height; y++) {
        if (fread(LineBuf, 1, (*Width)*3, FilePtr) != (size_t)(*Width)*3) {
            fprintf(stderr, "ReadPPM: Could not read Line of %s\n", FileName);
            free(LineBuf);
            fclose(FilePtr);
            exit(1);
        }
        InPixelPtr = LineBuf;
        for (x=0; x<(*Width); x++) {
            R[y][x] = *InPixelPtr++;
            G[y][x] = *InPixelPtr++;
            B[y][x] = *InPixelPtr++;
        }
    }
    free(LineBuf);
    fclose(FilePtr);
    *Rarr = R;
    *Garr = G;
    *Barr = B;
}

void ReadPPMAllocated(char *FileName, int *Width, int *Height, unsigned char ***R, unsigned char ***G, unsigned char ***B)
{ int y, x, index = 0;
FILE *FilePtr; char NextChar; int SizeCount=0; char Type; char Comment [80] ; int TmpNum = 0; int HeaderSize, MaxValue; unsigned char *LineBuf, *InPixelPtr; if ((FilePtr = fopen (FileName, "r")) == NULL) { fprintf (stderr, "ReadPPM: Could not open file %s\n", FileName) ; exit (1) ; } NextChar = getc (FilePtr) ; if (NextChar != 'P') { fprintf (stderr, "ReadPPM: %s not PxM file\n" , FileName); exit (1) ;
} SizeCount++;
Type = getc (FilePtr) ; SizeCount++;
NextChar = getc (FilePtr) ; SizeCount++; while (( (NextChar < '0') || (NextChar > '9')))
{ if (NextChar == ' # ' )
{ fgets (Comment, 80, FilePtr);
SizeCount += strlen (Comment) ;
}
NextChar = getc (FilePtr) ; SizeCount++ ; } while ( (NextChar >= '0') && (NextChar <= '9') )
{
TmpNum = (TmpNum * 10) + (NextChar-48) ; NextChar = getc (FilePtr) ;
SizeCount++;
}
*Width = TmpNum; while ( ( (NextChar < ' 0 ' ) | | (NextChar > ' 9 ' ) ) ) { if (NextChar == '#')
{ char Comment [80] ; fgets (Comment, 80, FilePtr); SizeCount += strlen (Comment) ;
}
NextChar = getc (FilePtr) ; SizeCount++; }
TmpNum = 0; while ( (NextChar >= '0 ' ) && (NextChar <= ' 9 ' )
{
TmpNum = (TmpNum * 10) + (NextChar-48) ; NextChar = getc (FilePtr) ;
SizeCount++;
}
*Height = TmpNum; while ( (NextChar < '0') || (NextChar > '9'))
{ if (NextChar == '#')
{ char Comment [80] ; fgets (Comment , 80 , FilePtr) ; SizeCount += strlen (Comment) ;
}
NextChar = getc (FilePtr) ; SizeCount++;
}
TmpNum = 0; while ( (NextChar >= ' 0 ' ) && (NextChar <= ' 9 ' ) ) {
TmpNum = (TmpNum * 10) + (NextChar-48) ; NextChar = getc (FilePtr) ; SizeCount++;
} MaxValue = TmpNum; while ( (NextChar == ' ') || (NextChar == 10) || (NextChar == 13))
{ NextChar = getc (FilePtr) ;
SizeCount++; }
HeaderSize = SizeCount-1; fclose (FilePtr) ;
if ((FilePtr = fopen (FileName, "rb")) == NULL) { fprintf (stderr, "ReadPPM: Could not open file %s\n", FileName) ; exit (1) ;
} if (fseek (FilePtr, HeaderSize, SEEK_SET) != 0)
{ fprintf (stderr, "ReadPPM: Could not read header of %s\n" ,
FileName) ; fclose (FilePtr) ; exit (1) ;
}
LineBuf = (unsigned char *) malloc ( (*Width) *3) ; for (y=0; y<*Height; y++)
{ if (fread(LineBuf, 1, (*Width)*3, FilePtr) != (size_t
) (*Width) *3)
{ fprintf (stderr, "ReadPPM: Could not read Line of
%s\n" , FileName) ; free (LineBuf) ; fclose (FilePtr) ; exit (1) ;
}
InPixelPtr = LineBuf; for (x=0; x<(*Width); x++) {
(*R) [y] [x] = *InPixelPtr++;
(*G) [y] [x] = *InPixelPtr++; ( *B) [y] [x] = *InPixelPtr++ ;
} } free (LineBuf ) ; f close (FilePtr) ;
}
unsigned char ** AllocMatrix (int rows , int cols)
{ int loc_ind; unsigned char **t; if ( (t = (unsigned char **) malloc (rows*sizeof (unsigned char *) ) )==NULL) { fprintf (stderr, "Can' t allocate array [%d,%d]\n", rows, cols) ; exit (1) ;
} if ((t[0] = (unsigned char *) malloc (rows*cols) ) ==NULL) { fprintf (stderr, "Can' t allocate array [%d, %d] \n" , rows, cols) ; exit (1) ;
} for (loc_ind=1; loc_ind<rows;loc_ind++) t [loc_ind] = t [loc_ind-1] + cols ; return t ; } void FreeMatrix (unsigned char **m)
{ free (m[0]) ; free (m) ; }
static int *vect; static int **ptr_vect;
void MedianFilterRGB (int filter_size, int Width, int Height, unsigned char **Rin, unsigned char **Gin, unsigned char **Bin, unsigned char **Rout, unsigned char **Gout, unsigned char **Bout) { unsigned char **Y = RGB2Gray(Rin, Gin, Bin, Width, Height) ; int step = filter_size / 2; int size = filter_size*filter_size; int i, j, k, l, t, ind; int *temp_ptr; static int med_init = 0; if ( !med_init) { vect = (int *) malloc (sizeof (int) * size); ptr_vect = (int **) malloc (sizeof (int *) * size) ; ptr_vect [0] = (int *) malloc (sizeof (int) * size) ; for (i=1;i<size;i++) ptr_vect [i] = ptr_vect [i-1] + 1; med_init=1; } for (i=step; i<Height-step; i++) for (j=step; j<Width-step; j++) { ind=0; for (k=-step;k<=step;k++) for (l=-step; l<=step;l++) { ptr_vect [ind] = &vect [ind] ; vect [ind++] = Y[i+k] [j+l];
} /* insertion sort of the pointers by the gray values they point to */ for (k=1;k<size;k++) { temp_ptr = ptr_vect [k] ; t = k-1; while ( (t>=0) && (*ptr_vect [t] > *temp_ptr) ) { ptr_vect [t+1] = ptr_vect [t] ; t-- ;
} ptr_vect [t+1] = temp_ptr;
} ind = (ptr_vect [size/2] - vect) ; k = ind / filter_size; if (k)
l = ind % k; else
l = 0; k -= step; l -= step; Rout [i] [j] = Rin[i+k] [j+l] ; Gout [i] [j] = Gin[i+k] [j+l] ; Bout[i] [j] = Bin[i+k] [j+l] ; } /* copy the image borders through unchanged */ for (i=0; i<step;i++) for (j=0; j<Width; j++) {
Rout[i] [j] = Rin[i] [j] ; Gout[i] [j] = Gin[i] [j] ; Bout[i] [j] = Bin[i] [j] ;
} for (i=0; i<Height;i++) for (j=0; j<step; j++) {
Rout[i] [j] = Rin[i] [j] ; Gout[i] [j] = Gin[i] [j] ; Bout[i] [j] = Bin[i] [j] ;
} for (i=Height-step; i<Height; i++) for (j=0; j<Width;j++) {
Rout[i] [j] = Rin[i] [j] ;
Gout[i] [j] = Gin[i] [j] ;
Bout[i] [j] = Bin[i] [j] ;
} for (i=0 ; i<Height ; i++) for (j=Width-step; j<Width;j++) { Rout[i] [j] = Rin[i] [j] ; Gout[i] [j] = Gin[i] [j] ; Bout[i] [j] = Bin[i] [j] ;
} } unsigned char ** RGB2Gray (unsigned char **R, unsigned char **G, unsigned char **B, int Width, int Height)
{ unsigned char val, **Y; int i , j ;
Y = AllocMatrix (Height, Width); for (i=0; i<Height;i++) for (j=0; j<Width; j++) { val = (unsigned char) (0.299 * R[i] [j] + 0.587 *
G[i] [j] + 0.114 * B[i] [j]) ; val = (val < 0) ? 0 : val; val = (val > 255) ? 255 : val; Y[i] [j] = val; } return Y;
}
bilinear_interp.c: This file first defines several structures used by the functions implemented in it. This includes some one-dimensional arrays that are used repeatedly throughout the code, so there is no need to allocate and free them each time. It also defines two two-dimensional arrays, LUT and rev_LUT, which are used as a Look-Up Table (and a Reverse Look-Up Table) by some of the functions described below.

double round (double x): Function to compute the rounded value of a floating-point number.

int MaxSubDiag (int slope_index, int slope_dir, int Xstart, int Xend, int Ystart, int Yend, double **Mat, int Rows, int Cols, int Limit, double *DiagVal, int *BestStartX, int *BestEndX, int *BestStartY, int *BestEndY): This function is used by the following function "GetBestDiagVal". Given a matrix, the start and end points of a long diagonal, and a slope, it uses the LUT to retrieve the values along this diagonal, stores them in a one-dimensional array, and then finds the best sub-sequence within this array. This is the code that modifies the nearest-neighbor algorithm to save computation time, as described above.

double GetBestDiagVal (double **Mat, int Rows, int Cols, int Limit, int *BestStartX, int *BestEndX, int *BestStartY, int *BestEndY): This function uses a list of pre-determined slopes (slopes varying between 25 and 60 degrees in jumps of 5) and a pre-computed LUT to get the actual entries of the matrix for each slope. It goes over all possible slopes, hence scans all possible sub-diagonals in the given matrix, and chooses the sub-diagonal with the lowest average value. It returns this value along with the diagonal's start and end points.

double GetBestDiagValLimited (double **Mat, int Rows, int Cols, int Limit, int *BestStartX, int *BestEndX, int *BestStartY, int *BestEndY): Function very similar to the previous one, except that it only checks the sub-diagonals along the main diagonals of the given matrix. This function is used in the "finer" step of the mosaic-matching algorithm, described above.

int MaxDirectSubDiag (int vector_length, int Limit, double *DiagVal, int *BestStart, int *BestEnd): More efficient version of the code in "MaxSubDiag" which operates directly on a vector instead of having to generate it from a given matrix.

double GetDirectBestDiagVal (double **Mat, int Rows, int Cols, int Limit, int *BestStartX, int *BestEndX, int *BestStartY, int *BestEndY): More efficient version of the code in "GetBestDiagVal" which generates the vector on which "MaxDirectSubDiag" operates instead of sending all the data to MaxSubDiag, and thus saves computation time.

double GetDirectBestDiagValLimited (double **Mat, int Rows, int Cols, int Limit, int *BestStartX, int *BestEndX, int *BestStartY, int *BestEndY): More efficient version of the code in "GetBestDiagValLimited" which computes the best sub-diagonal directly.
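The common core of these routines is a one-dimensional search: given the values sampled along a candidate diagonal, find the contiguous run of at least Limit entries with the lowest average value, which a prefix-sum array makes cheap to evaluate for every start position and length. The following stand-alone sketch illustrates only that inner search; the function name min_avg_window, the fixed-size prefix buffer and the example data are illustrative assumptions, not part of the appendix code.

#include <stdio.h>

/* Sketch (assumed helper, not the patent's code): find the contiguous window
   of length >= min_len with the lowest average in vals[0..n-1]; returns that
   average and reports the window through *best_start / *best_end. */
static double min_avg_window(const double *vals, int n, int min_len,
                             int *best_start, int *best_end)
{
    double prefix[256];                 /* running sums; the sketch assumes n <= 256 */
    double best = 1e9;
    int i, len;

    prefix[0] = vals[0];
    for (i = 1; i < n; i++)
        prefix[i] = prefix[i - 1] + vals[i];

    for (len = min_len; len <= n; len++)            /* every allowed window length */
        for (i = 0; i + len - 1 < n; i++) {         /* every start position */
            double sum = prefix[i + len - 1] - (i ? prefix[i - 1] : 0.0);
            double avg = sum / len;
            if (avg < best) {
                best = avg;
                *best_start = i;
                *best_end = i + len - 1;
            }
        }
    return best;
}

int main(void)
{
    double v[] = {0.9, 0.8, 0.1, 0.2, 0.15, 0.7, 0.9};
    int s, e;
    double a = min_avg_window(v, 7, 3, &s, &e);
    printf("best window [%d..%d], average %.3f\n", s, e, a);   /* [2..4], 0.150 */
    return 0;
}

In the appendix this search runs once per candidate diagonal and slope, which is why the listing keeps the partial sums in small static buffers instead of allocating them each time.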
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#define MAX_SIZE 500
static double rad2deg = 180/3.1415926535; static int LUT [5] [MAX_SIZE] ; static int rev_LUT[5] [MAX_SIZE] ; static double vector [MAX_SIZE] ; static int indeces_vector [MAX_SIZE] ; static double partial_sum_vector [MAX_SIZE] ; static double pi = 3.1415926535; static double quarter_pi = 3.1415926535/4; static double step = 3.1415926535/36; double round (double x)
{ double floor_x = floor (x) ; if ( (x - floor_x) > 0.5) return ceil (x) ; else return floor_x ; }
int MaxSubDiag (int slope_index, int slope_dir, int Xstart,int Xend,int Ystart,int Yend, double **Mat, int Rows, int Cols, int Limit, double *DiagVal, int *BestStartX, int *BestEndX, int
*BestStartY, int *BestEndY)
{ int i=Ystart,j=Xstart, index, count=0; double val, BestVal = 10000; int vector_length, len, BestLen, Bestlnd; while (i != Yend && j != Xend) { if (slope_dir ==1) { index = LUT [slope_index] [j ] ; if (rev_LUT [slope_index] [index] ==j && index<Cols && i<Rows) vector [count++] = Mat [i] [index] ;
} else{ index = LUT [slope_index] [i] ; if (rev_LUT [slope_index] [index] ==i && index<Rows && j <Cols) vector [count++] = Mat [index] [j ] ;
} i++ ; j ++ ;
} vector_length = count; for (i=1; i<vector_length; i++) partial_sum_vector [i] =0 ; partial_sum_vector [0] = vector [0] ; for (i=1; i<vector_length; i++) partial_sum_vector [i] = partial_sum_vector [i-1] + vector [i] ; for (len = Limit-1; len < vector_length; len++) { val = partial_sum_vector [len] / (len+1); if (val < BestVal) { BestVal = val; BestLen = len; Bestlnd = 0; } for (i=1; i<vector_length-len; i++) { val = (partial_sum_vector [i+len] - partial_sum_vector [i-1] ) / (len+1); if (val < BestVal) { BestVal = val;
BestLen = len; Bestlnd = i; } } } i=Ystart; j=Xstart; count=0 ; while (i != Yend && j != Xend) { if (slope_dir ==1) { index = LUT [slope_index] [j] ; if (rev_LUT [slope_index] [index] ==j && index<Cols && i<Rows) { if (count == Bestlnd) { *BestStartY = i; *BestStartX = index; } if (count == (Bestlnd+BestLen) ) { *BestEndY = i; *BestEndX = index;
} count++;
} } else{ index = LUT [slope_index] [i] ; if (rev_LUT [slope index] [index] ==i && index<Rows && j<Cols) { if (count == Bestlnd) {
*BestStartY = index; *BestStartX = j ;
} if (count == (Bestlnd+BestLen) ) {
*BestEndY = index; *BestEndX = j ; } count++;
} } i++; j++; }
*DiagVal = BestVal; return BestLen+1; }
double GetBestDiagVal (double **Mat, int Rows, int Cols, int Limit, int *BestStartX, int *BestEndX, int *BestStartY, int *BestEndY) { int Xstart, Xend, Ystart, Yend; int DiagLen, BestDiagLen; int colsLimit, rowsLimit, slope_dir; static int slope_setup = 0; int i,j, val, slope_index; double DiagVal, BestDiagVal = 1000; double slope, tan_slope, dvall, dval2 ; int SX,EX,SY,EY, max_val; if ( ! slope_setup) { slope = quarter_pi-step; for (i=1; i<5; i++) { tan_slope = tan(slope); for (j=0; j<MAX_SIZE; j++) rev_LUT [i] [j] = -1; for (j=0;j<MAX_SIZE;j++) { val = (int) (round (tan_slope*j )) ;
LUT[i] [j] = val; if (rev_LUT[i] [val] < 0) rev_LUT[i] [val] = j ; else{ dvall = tan_slope* (j -1) ; dval2 = tan_slope*j ; if (fabs (val-dvall) > fabs (val-dval2) ) rev_LUT[i] [val] = j ;
} } slope -= step;
} slope_setup=l ;
} slope = quarter_pi+step; for (slope_index=0; slope_index<5; slope_index++) { slope -= step; tan_slope = tan ( slope) ; slope_dir = 1 ; colsLimit = rev_LUT [slope_index] [Limit] ; rowsLimit = Limit ; Ystart=0 ; max_val = max (rev_LU [slope_index] [Cols -1] , Rows - 1) ; Xend = max_val ; Yend = max_val ; for (Xstart=0 ; Xstart < Xend-colsLimit ; Xstart++) { DiagLen = MaxSubDiag (slope_index, slope_dir,
Xstart , Xend, Ystart , Yend, Mat , Rows , Cols ,
Limit , &DiagVal , &SX, &EX, &SY, &ΞY) ; if (DiagVal < BestDiagVal) { BestDiagVal = DiagVal ;
BestDiagLen = DiagLen; *BestStartX = LUT [slope_index] [SX] ; *BestEndX = LUT [slope_index] [EX] ; *BestStartY = SY; *BestΞndY = EY;
} Yend- - ;
}
Xstart=0; max_val = max(rev_LUT [slope_index] [Cols-2] , Rows-1);
Xend = max_val; Yend = max_val+l; for (Ystart=l; Ystart < Rows-rowsLimit; Ystart++) { DiagLen = MaxSubDiag (slope_index, slope_dir, Xstart,Xend, Ystart, Yend, Mat, Rows, Cols,
Limit, &DiagVal, &SX, &ΞX, &SY, &EY) ; if (DiagVal < BestDiagVal) { BestDiagVal = DiagVal; BestDiagLen = DiagLen;
*BestStartX = LUT [slope_index] [SX] ; *BestEndX = LUT [slope_index] [EX] ; *BestStartY = SY; *BestEndY = EY; }
Xend--;
} if ( !slope_index) continue; slope_dir = -1; rowsLimit = rev_LUT [slope_index] [Limit] ; colsLimit = Limit; Ystart=0; max_val = max(rev_LUT [slope_index] [Rows-1], Cols-1); Yend = max_val; Xend = max_val; for (Xstart=0; Xstart < Cols-colsLimit; Xstart++) { DiagLen = MaxSubDiag (slope_index, slope_dir, Xstart,Xend, Ystart, Yend, Mat, Rows, Cols,
Limit, &DiagVal, &SX, &EX, &SY, &EY) ; if (DiagVal < BestDiagVal) { BestDiagVal = DiagVal; BestDiagLen = DiagLen; *BestStartX = SX; *BestEndX = EX;
*BestStartY = LUT [slope_index] [SY] ;
*BestEndY = LU [slope_index] [EY] ;
} Yend- - ;
}
Xstart=0 ; max_val = ma (rev_LUT [slope_index] [Rows-1], Cols-2); Yend = max_val; Xend = max_val-l; for (Ystart=l; Ystart < Rows-rowsLimit; Ystart++) { DiagLen = MaxSubDiag (slope_index, slope_dir, Xstart, Xend, Ystart, Yend, Mat, Rows, Cols,
Limit, &DiagVal, &SX, &EX, &SY, &ΞY) ; if (DiagVal < BestDiagVal) { BestDiagVal = DiagVal; BestDiagLen = DiagLen; *BestStartX = SX; *BestEndX = EX;
*BestStartY = LUT [slope_index] [SY] ; *BestEndY = LUT [slope_index] [EY] ;
}
Xend-- ; }
}
return BestDiagVal; } double GetBestDiagValLimited (double **Mat, int Rows, int Cols, int Limit, int *BestStartX, int *BestEndX, int *BestStartY, int *BestEndY)
{ int DiagLen, BestDiagLen; double DiagVal, BestDiagVal = 1000; DiagLen = MaxSubDiag (0, 1, 0, Cols-1, 0, Rows-1, Mat, Rows, Cols, Limit, &DiagVal,BestStartX, BestEndX, BestStartY, BestEndY); if (DiagVal < BestDiagVal) { BestDiagVal = DiagVal; BestDiagLen = DiagLen; }
DiagLen = MaxSubDiag (0, 1, 1, Cols-1, 0,Rows-2, Mat, Rows, Cols, Limit, &DiagVal,BestStartX, BestEndX, BestStartY, BestEndY); if (DiagVal < BestDiagVal) { BestDiagVal = DiagVal; BestDiagLen = DiagLen;
}
DiagLen = MaxSubDiag(0, 1, 0, Cols-2, 1, Rows-1, Mat, Rows, Cols,
Limit, &DiagVal,BestStartX, BestEndX, BestStartY, BestEndY); if (DiagVal < BestDiagVal) { BestDiagVal = DiagVal;
BestDiagLen = DiagLen;
} return BestDiagVal; } int MaxDirectSubDiag (int vector_length, int Limit, double *DiagVal, int *BestStart, int *BestEnd) { double val, BestVal = 10000; int i,len, BestLen, Bestlnd, count=0;
for (i=1; i<vector_length; i++) partial_sum_vector [i] =0; partial_sum_vector [0] = vector [0] ; for (i=1; i<vector_length; i++) partial_sum_vector [i] = partial_sum_vector [i-1] + vector [i] ; for (len = Limit-1; len < vector_length; len++) { val = partial_sum_vector [len] / (len+1); if (val < BestVal) { if ( (indeces_vector [len] -indeces_vector [0] +1) >=
Limit) {
BestVal = val; BestLen = len; Bestlnd = 0; }
} for (i=l; i<vector_length-len; i++) { val = (partial_sum_vector [i+len] - partial_sum_vector [i-l] ) / (len+1); if (val < BestVal) { if ( (indeces_vector [i+len] - indeces_vector [i] +1) >= Limit) {
BestVal = val; BestLen = len; Bestlnd = i;
} } }
} *BestStart = Bestlnd;
*BestEnd = Bestlnd+BestLen;
*DiagVal = BestVal; return BestLen+1; } double GetDirectBestDiagVal (double **Mat, int Rows, int Cols, int Limit, int *BestStartX, int *BestEndX, int *BestStartY, int *BestEndY)
{ int Xstart, Ystart; int DiagLen, BestDiagLen; static int slope_setup = 0; int i,j, count, i_val, j_val, val, slope_dir, slope_index; double DiagVal, BestDiagVal = 1000; double slope, tan_slope; int SX,EX; if ( ! slope_setup) { for (j=0 ; j<MAX_SIZE; j++) LUT [0] [j] = j ; slope = quarter_pi-step ; for (i=1; i<5 ; i++) { tan_slope = tan (slope); for (j=0;j<MAX_SIZE; j++) { val = (int) (round (tan_slope*j ) ) ; LUT [i] [j] = val ; } slope -= step;
} slope_setup=l; } slope = quarter_pi+step; for (slope_index=0 ; slope_index<5; slope_index++) { slope -= step; slope_dir = 1; Ystart=0 ; for (Xstart=0; Xstart < Cols-1; Xstart++) { count=0; j_val=Xstart; i_val=Ystart; while (1){ indeces_vector [count] =i_val; vector [count++] =Mat [i_val] [j_val] ; j_val++; i_val=LUT [slope_index] [j_val-Xstart] ; if (i_val> (Rows-1) || j_val> (Cols-1) ) break;
} if ( (i_val-Ystart+l) >=Limit && (j_val-
Xstart+1) >=Limit) {
DiagLen = MaxDirectSubDiag (count, Limit, &DiagVal, &SX, &EX) ; if (DiagVal < BestDiagVal) { BestDiagVal = DiagVal; BestDiagLen = DiagLen; *BestStartX = SX+Xstart; *BestEndX = EX+Xstart;
*BestStartY = LUT [slope_index] [SX] ; *BestΞndY = LU [slope_index] [EX] ;
} } }
Xstart=0; for (Ystart=1; Ystart < Rows-1; Ystart++) { count=0; j_val=Xstart; i_val=Ystart; while (1) { indeces_vector [count] =i_val ; vector [count++] =Mat [i_val] [j_val] ; j_val++; i_val=LUT [slope_index] [j_val] +Ystart ; if (i_val> (Rows-1) || j_val> (Cols-1) ) break;
} if ( (i_val-Ystart+l) >=Limit && (j_val- Xstart+1) >=Limit) { DiagLen = MaxDirectSubDiag (count, Limit, &DiagVal, &SX, &EX) ; if (DiagVal < BestDiagVal) { BestDiagVal = DiagVal; BestDiagLen = DiagLen;
*BestStartX = SX; *BestEndX = EX; *BestStartY = LUT [slope_index] [SX] +Ystart; *BestEndY =
LUT [slope_index] [EX] +Ystart;
} }
} if ( !slope_index) continue; slope_dir = -1; Ystart=0; for (Xstart=0; Xstart < Cols-1; Xstart++) { count=0; j_val=Xstart; i_val=Ystart ; while (1) { indeces_vector [count] =j_val; vector [count++] =Mat [i_val] [j_val] ; i_val++; j_val=LUT [slope_index] [i_val] +Xstart; if (i_val> (Rows-1) || j_val> (Cols-1) ) break;
} if ( (i_val-Ystart+l) >=Limit && (j_val- Xstart+1) >=Limit) {
DiagLen = MaxDirectSubDiag (count, Limit, &DiagVal, &SX, &EX) ; if (DiagVal < BestDiagVal) { BestDiagVal = DiagVal; BestDiagLen = DiagLen;
*BestStartX =
LUT [slope_index] [SX]+Xstart;
*BestEndX =
LUT [slope_index] [EX] +Xstart; *BestStartY = SX;
*BestEndY = EX; } } } Xstart=0; for (Ystart=l; Ystart < Rows-1; Ystart++) { count=0; j_val=Xstart; i_val=Ystart; while (1){ indeces_vecto [count] =j_val ; vector [count++] =Mat [i_val] [j_val] ; i_val++; j_val=LUT [slope_index] [i_val -Ystart] ; if (i_val> (Rows-1) | | j_val> (Cols-1) ) break;
} if ( (i_val-Ystart+l) >=Limit && (j_val- Xstart+1) >=Limit) { DiagLen = MaxDirectSubDiag (count,
Limit, &DiagVal, &SX, &EX) ; if (DiagVal < BestDiagVal) { BestDiagVal = DiagVal; BestDiagLen = DiagLen; *BestStartX = LUT [slope_index] [SX] ; *BestEndX = LUT [slope_index] [EX] ; *BestStartY = SX+Ystart ; *BestEndY = EX+Ystart ; } } } }
return BestDiagVal ; }
double GetDirectBestDiagValLimited (double **Mat, int Rows, int Cols, int Limit, int *BestStartX, int *BestEndX, int *BestStartY, int *BestEndY) { int BestDiagLen; int count, i_val, j_val; double DiagVal, BestDiagVal = 1000; double dval; int end_i, end_j , slope_index, slope_dir; if (Rows<Cols) { slope_dir=1; dval = atan2 (Rows, Cols) ; dval *= rad2deg; dval = 45 - dval; slope_index = (int) (dval / 5) ; dval -= slope_index*5; if ( (5.0-dval) > dval) slope_index++;
} else{ slope_dir=-1; dval = atan2 (Cols, Rows) ; dval *= rad2deg; dval = dval - 45; slope_index = (int) (dval / 5) ; dval -= slope_index*5; if ( (5.0-dval) > dval) slope_index++;
} if (slope_index > 4) slope_index=4; if (slope_dir==1) { count=0 ; j_val=0; i_val=0; DiagVal=0; while (1) {
DiagVal += Mat [i_val] [j_val] ; count++; end_i = i_val ; end_j = j_val ; j_val++; i_val=LUT [slope_index] [j_val] ; if (i_val> (Rows-1) || j_val> (Cols-1) ) break; } if ( (i_val+1) >=Limit && (j_val+1) >=Limit) { DiagVal /= count; if (DiagVal < BestDiagVal) {
BestDiagVal = DiagVal;
BestDiagLen = count;
*BestStartX = 0;
*BestEndX = end_j ;
*BestStartY = 0;
*BestEndY = end_i;
} } count=0; j_val=1; i_val=0;
DiagVal=0; while (1) {
DiagVal += Mat [i_val] [j_val] ; count++; end_i = i_val ; end_j = j_val ; j_val++; i_val=j_val-1; i_val=LUT [slope_index] [j_val-1] ; if (i_val> (Rows-1) || j_val> (Cols-1) ) break;
} if ( (i_val+1) >=Limit && j_val>=Limit) { DiagVal /= count; if (DiagVal < BestDiagVal) { BestDiagVal = DiagVal; BestDiagLen = count; *BestStartX = 1; *BestEndX = end_j ; *BestStartY = 0; *BestEndY = end_i;
}
} count=0; j_val=0; i_val=l;
DiagVal=0; while (1) {
DiagVal += Mat [i_val] [j_val] ; count++; end_i = i_val ; end_j = j_val ; j_val++; i_val=LUT [slope_index] [j_val] +1 ; if (i_val> (Rows -1) | | j_val> (Cols-1) break;
} if ( (i_val-l+l) >=Limit && (j_val+l) >=Limit) { DiagVal /= count; if (DiagVal < BestDiagVal) { BestDiagVal = DiagVal; BestDiagLen = count; *BestStartX = 0; *BestΞndX = end_j ; *BestStartY = 1; *BestEndY = end_i; } } } else{ count=0 j j_val=0; i_val=0; DiagVal=0; while (1) {
DiagVal += Mat [i_val] [j_val] ; count++; end_i = i_val; end_j = j_val; i_val++; j_val=LUT [slope_index] [i_val] ; if (i_val> (Rows-1) || j_val> (Cols-1) break;
} if (i_val+l) >=Limit && (j_val+l) >=Limit) { DiagVal /= count; if (DiagVal < BestDiagVal) { BestDiagVal = DiagVal; BestDiagLen = count; *BestStartX = 0; *BestEndX = end_j ; *BestStartY 0; *BestEndY end_i ;
}
} count=0; j_val=l; i__val=0;
DiagVal=0; while (1) {
DiagVal += Mat [i_val] [j_val] ; count++; end_i = i_val; end_j = j_val; i_val++; j_val=LUT [slope_index] [i_val] +1; if (i_val> (Rows-1) || j_val> (Cols-1) ) break;
} if ( (i_val+l) >=Limit && j_val>=Limit) { DiagVal /= count; if (DiagVal < BestDiagVal) { BestDiagVal = DiagVal; BestDiagLen = count;
*BestStartX = 1;
*BestEndX = end_j ;
, *BestStartY = 0;
*BestEndY = end_i;
} } count=0; j_val=0; i_val=l;
DiagVal=0; while (1) {
DiagVal += Mat [i_val] [ j_val] ; count++; end_i = i_val ; end_j = j_val ; i_val++ ; j_val=LU [slope_index] [i_val-l] ; if (i_val> (Rows-1) | | j_val> (Cols -1) ) break;
} if (i_val>=Limit && (j_val+l) >=Limit) { DiagVal /= count; if (DiagVal < BestDiagVal) { BestDiagVal = DiagVal; BestDiagLen = count; *BestStartX = 0; *BestEndX = end_j ;
*BestStartY = 1; *BestEndY = end_i;
} } } return BestDiagVal; }
scenes_frames_str.c: The routine void main ( ) implements the coarse mosaic-matching algorithm, used both for sitcoms and for sports broadcasts. The algorithm is described in detail above. For sitcoms, it reads in a list of scenes of a specified episode, and a list of all mosaics representing each scene (this is the list of representative shots of each scene). It then computes the scene-to-scene distance matrix, and also generates the images that describe the strip-to-strip distance matrix between each pair of mosaics. (For sports broadcasts, it generates a shot-to-shot distance matrix by reading in a list of all shots of the sports sequence and computing the distance between each pair of mosaics.)
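In outline, each pair of mosaics is compared by sliding the shorter sequence of strip histograms along the longer one and keeping the smallest accumulated strip distance, and the scene-to-scene entry is the minimum of these values over all pairs of representative mosaics. The sketch below restates that coarse comparison in a self-contained form; StripHist, strip_diff, mosaic_dist and scene_dist are placeholder names, and the toy histogram stands in for the HSI histograms and FullHistDiffL1 distance that the appendix code actually uses.

#include <float.h>
#include <math.h>
#include <stddef.h>

/* Placeholder strip descriptor: one small histogram per vertical strip. */
typedef struct { double bins[64]; } StripHist;

/* L1 distance between two strip histograms (stand-in for the HIST distance). */
static double strip_diff(const StripHist *a, const StripHist *b)
{
    double d = 0.0;
    for (size_t i = 0; i < 64; i++)
        d += fabs(a->bins[i] - b->bins[i]);
    return d;
}

/* Coarse mosaic-to-mosaic distance: slide the shorter strip sequence along the
   longer one and keep the best (smallest) accumulated strip distance. */
static double mosaic_dist(const StripHist *h1, int w1, const StripHist *h2, int w2)
{
    const StripHist *longer  = (w1 >= w2) ? h1 : h2;
    const StripHist *shorter = (w1 >= w2) ? h2 : h1;
    int wl = (w1 >= w2) ? w1 : w2;
    int ws = (w1 >= w2) ? w2 : w1;
    double best = DBL_MAX;

    for (int off = 0; off <= wl - ws; off++) {      /* every horizontal offset */
        double d = 0.0;
        for (int k = 0; k < ws; k++)
            d += strip_diff(&longer[off + k], &shorter[k]);
        if (d < best) best = d;
    }
    return best;
}

/* Scene-to-scene distance: minimum over all pairs of representative mosaics. */
static double scene_dist(const StripHist *const *scene_a, const int *widths_a, int na,
                         const StripHist *const *scene_b, const int *widths_b, int nb)
{
    double best = DBL_MAX;
    for (int a = 0; a < na; a++)
        for (int b = 0; b < nb; b++) {
            double d = mosaic_dist(scene_a[a], widths_a[a], scene_b[b], widths_b[b]);
            if (d < best) best = d;
        }
    return best;
}

Filling the full matrix then amounts to evaluating this distance for every pair of scenes and mirroring the result across the diagonal, as the listing below does with its matrix array.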
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <direct.h>
#include <math.h>
#include "..\image\hist.h"
#include "..\image\image.h"
#define BLOCK_DIM 60
#define STRIP_HEIGHT 240
#define DIRNAME "D:\\aya\\sitcoms\\friends2\\"
#define SCENE_NUM 13
#define MAX_SCENE_NUM 30
#define MAX_SHOT_NUM 400
#define MAX_RMOSAICS_IN_SCENE 10
#define MAX_MOS_WIDTH 2000
#define MAX_MOS_HEIGHT 1000
#define SCALE_HIST (MAX_MOS_WIDTH*MAX_MOS_HEIGHT)
#define MIN_LAST_STRIP_WIDTH (0.6*BLOCK_DIM)
#define _CREATE_EXAMPLES
//#define _SUMMARY
//#undef _CREATE_EXAMPLES
#undef _SUMMARY
HIST * (*FillHistFunc) (int , int , int , unsigned char **, unsigned char **, unsigned char **, int , int , int , int) ; void (*FreeHistFunc) (HIST *) ; double (*HistDiffFunc) (HIST *,HIST *) ; unsigned char **R1 ; unsigned char **G1 ; unsigned char **B1 ; unsigned char **R ; unsigned char **G ; unsigned char **B ; unsigned char **orgRl; unsigned char **orgGl; unsigned char **orgBl; unsigned char **orgR; unsigned char **orgG; unsigned char **orgB; unsigned char **bigR ; unsigned char **bigG ; unsigned char **bigB ; unsigned char **Y; void main()
{ int scale, ind_a, ind_b, count; int i,j,x,y, k, l;
FILE *fp; char line [256] , filename [256] ; char fl_filename [256] , f2_filename [256] ; int beg_scene [MAX_SCENE_NUM] , end_scene [MAX_SCENE_NUM] ; int beg_shot [MAX_SHOT_NUM] , end_shot [MAX_SHOT_NUM] ; int scenes_num=0 , shots_num=0 ; double strip_dist, dist, min_dist, max_dist, min_strip_dist, total_min; int ** mosaics; char *token; double **matrix, **clusters; int Width, Height, val; int Wl, Heightl, W2, Height2 , Widthl, Width2; int WidthGap;
HIST **HTabl, **HTab2; int block_dim = BLOCK_DIM, step = BLOCK_DIM - 1; double debug_max = 0; ///////////////////////////////////////////////////////////////////////////
// Now choose:
//FillHistFunc = FillHistFromRGBArrayl; // RGB
FillHistFunc = &FillHistFromRGBArray2 ; // HSI
FreeHistFunc = &FreeHist; HistDiffFunc = &FullHistDiffLl; ///////////////////////////////////////////////////////////////////////////
R = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH) ; G = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH) ; B = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH) ;
Rl = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH) ; Gl = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH) ; Bl = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH) ;
orgR = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH) ; orgG = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH) ; orgB = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH) ; orgRl = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH) ; orgGl = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH) ; orgBl = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH) ;
sprintf (filename, "%sscenes .txt" , DIRNAME) ; if ( (fp = fopen(filename, "r") ) ==NULL) { fprintf (stderr, "Can' t open %s \n" , filename) ; exit (0) ;
} while( (fgets(line, 255, fp) ) != NULL){ sscanf (line, " %d %d ", &beg_scene [scenes_num] , &end_scene [scenes_num] ) ; scenes_num++;
} fclose (fp) ; matrix = (double **) malloc (scenes_num*sizeof (double *) ) ; matrix [0] = (double *) malloc (scenes_num*scenes_num*sizeof (double) ) ; for (i=1; i<scenes_num;i++) matrix [i] = matrix [i-1] + scenes_num; for (i=0;i<scenes_num;i++) for (j=0; j<scenes_num; j++) matrix [i] [j] = 0;
sprintf (filename, "%sshots_skip5.txt" , DIRNAME) ; if ( (fp = fopen (filename, "r") )==NULL) { fprintf (stderr, "Can' t open %s \n" , filename) ; exit (0) ;
} while ( (fgets (line, 255, fp) ) != NULL){ sscanf (line, " %d %d ", &beg_shot [shots_num] , &end_shot [shots_num] ) ; shots_num++;
} fclose(fp); mosaics = (int **) malloc (scenes_num*sizeof (int *) ) ; mosaics [0] = (int * ) malloc (scenes_num*MAX_RMOSAICS_IN_SCENE*sizeof (int) ) ; for (i=l ; i<scenes_num; i++) mosaics [i] = mosaics [i-l] + MAX_RMOSAICS_IN_SCENE; for (i=0 ; i<scenes_num; i++) for ( j =0 ; j <MAX_RMOSAICS_IN_SCENE; j ++) mosaics [i] [j ] = 0 ; sprintf (filename , " %sscenes_mosaics . txt" , DIRNAME) ; if ( (fp = fopen (filename , "r" ) ) ==NULL) { fprintf (stderr, "Can ' t open %s \n" , filename) ; exit (0) ;
} for (i = 0 ; i<scenes_num,- i++) { count=0 ; fgets (line, 255, fp) ; token = strtok(line, " ") ; while (token != NULL) { count++ ; mosaics [i] [count] = atoi (token) ; token = strtok (NULL , " " ) ;
} mosaics [i] [0] =count ;
} printf ("there are %d scenes\n starting matrix... \n" , scenes_num) ; sprintf (fl_filename, "%s\\clusters" , DIRNAME) ; _mkdir (fl_filename) ; max_dist=0; total_min = 100000.0; for (i=0;i<scenes_num;i++) { for (j=0; j<scenes_num; j++) { printf (" (%d,%d) ",i,j); if (i>=j){ fprintf (fp , " %g " , matrix [j ] [i] ) ; continue ;
} min_dist = 100000 . 0 ; for (ind_a = 1 ; ind_a <= mosaics [i] [0] ; ind_a++) {
sprintf (fl_filename, "%smos\\Original\\Mos_color_med%d.ppm" ,DIRN AMΞ, mosaics [i] [ind_a] ) ;
ReadPPMAllocated (fl_filename, &Width, &Height, &orgRl, &orgGl, &orgBl) ; sprintf (fl_filename, " %smos\\Mos_median%d. ppm" , DIRNAME, mosaics [i] [ind_a] ) ;
ReadPPMAllocated(fl_filename, &Width, &Height, &R1, &G1, &B1) ;
Wl = Width/block_dim; Heightl = Height;
Widthl = Width; WidthGap = Width % block_dim; if (WidthGap > MIN_LAST_STRIP_WIDTH) W1++; HTabl = (HIST **) malloc (Wl*sizeof (HIST *) ) ; for (k=0;k< Width-WidthGap; k+=block_dim) {
HTabl [k/block_dim] = FillHistFunc (SIZE_0F_Y, SIZE_0F_U, SIZE_0F_V, Rl, Gl, Bl, 0, STRIP_HEIGHT-1, k, k+step) ;
} if (WidthGap > MIN_LAST_STRIP_WIDTH) {
HTabl [Wl-1] = FillHistFunc (SIZE_0F_Y, SIZE_OF_U,SIZE_OF_V, Rl, Gl, Bl, 0, STRIPJHEIGHT-1, Width-WidthGap, Width-1) ;
} for (ind_b = 1; ind_b <= mosaics [j] [0] ; ind_b++) { sprintf (f2_filename, " %smos\\θriginal\\Mos_color__med%d. ppm" , DIRN AME, mosaics [j] [ind_b] ) ;
ReadPPMAllocated(f2_filename, &Width, &Height, &orgR, &orgG, &orgB) ; sprintf (f2_filename, "%smos\\Mos_median%d.ppm" , DIRNAME, mosaics [j ] [ind_b] ) ; ReadPPMAllocated(f2_f ilename, &Width,
&Height, &R, &G, &B) ;
Height2 = Height;
Width2 = Width;
W2 = Width/block_dim ;
WidthGap = Width % block_dim; if (WidthGap > MIN_LAST_STRIP_WIDTH)
W2++;
HTab2 = (HIST **) malloc (W2*sizeof (HIST *)) ; for (1=0, -1< Width-WidthGap; l+=block_dim) {
HTab2 [l/block_dim] = FillHistFunc (SIZE_0F_Y, SIZE_OF_U, SIZΞ_0F_V, R, G, B, 0, STRIP_HΞIGHT-1, 1, 1+step) ;
} if (WidthGap > MIN_LAST_STRIP_WIDTH) { HTab2[W2-l] = FillHistFunc (SIZE_OF_Y, SIZE_OF_U, SIZE_0F_V, R, G, B, 0, STRIP_HEIGHT-1, Width-WidthGap, Width- 1) ;
}
#ifdef _SUMMARY if (Wl >= W2) { dist = 1000000; for (k=0;k<= Wl-W2; k++) { strip_dist = 0; for (l=0 ;l< W2; l++) { strip_dist += HistDiffFunc (HTabl [k+l] , HTab2 [l] ) ;
} if (strip_dist < dist) dist = strip_dist;
} }
else{ dist = 1000000; for (l=0;l<= W2-Wl; l++) { strip_dist = 0 ; for (k=0;k< Wl; k++) { strip_dist +=
HistDiffFunc (HTabl [k] , HTab2 [l+k] ) ;
} if (strip_dist < dist) dist = strip_dist;
} } if (dist<min_dist) min_dist = dist;
#endif // _SUMMARY
#ifdef _CREATE_EXAMPLES clusters = (double **) malloc (Wl*sizeof (double *) ) ; clusters [0] = (double *) malloc (Wl*W2*sizeof (double) ) ; for (k=1;k< Wl; k++) clusters [k] = clusters [k-1] + W2 ; max_dist = 0; min_strip_dist = 100000; for (k=0;k< Wl; k++) { for (l=0; l< W2; l++) { strip_dist =
HistDiffFunc (HTabl [k] , HTab2 [1] ) ; clusters [k] [1] = strip_dist; if (max_dist < strip_dist) max_dist = strip_dist ; if (debug_max < strip_dist ) debug__max = strip_dist ; if (min_strip_dist > strip_dist ) min_strip_dist = strip_dist ;
}
} scale = BL0CK_DIM; bigR = AllocMatri (Widthl + Height2 +BL0CK_DIM+1, Width2 + Heightl +BL0CK_DIM+1) ; bigG = AllocMatrix (Widthl + Height2 +BL0CK_DIM+1, Width2 + Heightl +BL0CK_DIM+1) ; bigB = AllocMatrix (Widthl + Height2 +BL0CK_DIM+1 , Width2 + Heightl +BL0CK_DIM+1) ; for (k=0; k < Widthl + Height2 +BL0CK_DIM+1; k++) for (1=0; l<Width2 + Heightl
+BL0CK_DIM+1; 1++) { bigR [k] [1] = 255 ; bigG [k] [1] = 255 ; bigB [k] [1] = 255 ;
} for (k=Heightl-l; k >= 0; k--) for (1=0; l<Widthl; 1++) { bigR[l] [k] = orgRl [Heightl-
1-k] [1] ; bigG[l] [k] = orgGl [Heightl- 1-k] [1] ; bigB[l] [k] = orgBl [Heightl- 1-k] [1] ;
} for (k=0; k<Height2; k++) for (1=0; 1< Width2 ; 1++) { bigR [k+Widthl+BL0CK_DIM/2 ] [1+Heightl] = orgR [k] [1] ; bigG [k+Widthl+BL0CK_DIM/2 ] [1+Heightl] = orgG [k] [1] ; bigB [k+Widthl+BL0CK_DIM/2 ] [1+Heightl] = orgB [k] [1] ;
} Y = AllocMatrix (Wl*scale, W2*scale) ; for (k=0;k< Wl; k++) for (l=0 ;l< W2; l++) { val = (unsigned char) ( (clusters [k] [l] ) / 2 * 255); if (val > 255) val=255; for (x=0;x<scale;x++) for (y=0;y<scale;y++) Y[k*scale+x] [l*scale+y] = val;
} for (k=0; k<Wl*scale; k++) for (1=0; 1< W2*scale; 1++) { bigR[k] [1+Heightl] = Y[k] [1]; bigG[k] [1+Heightl] = Ytk] [1]; bigB [k] [1+Heightl] = Y[k] [1]; }
FreeMatri (Y) ; free (clusters [0] ) ; free (clusters) ;
sprintf (f2_filename, "%sclusters\\comp_%d_%d.ppm" , DIRNAME, mosaics [i] [ind_a] ,mosaics [j] [ind_b] ) ;
WritePPM (f2_filename, bigR, bigG, bigB, Width2 + Heightl, Widthl + Height2) ; FreeMatrix (bigR) ;
FreeMatrix (bigG) ; FreeMatri (bigB) ;
#endif // CREATE EXAMPLES
for (k=0;k<W2;k++)
FreeHist (HTab2 [k] ) ; free (HTab2 ) ; } for (k=0;k<Wl;k++)
FreeHist (HTabl [k] ) ; free (HTabl) ; } if (max_dist < min_dist) max_dist = min_dist; if (total_min > min_dist) total_min = min_dist; matrix [i] [j] = matrix [j] [i] = min_dist;
} printf ("\n") ;
} printf ("Max value of clusters array: %lf (%g) \n", debug_max, debug_max) ; #ifdef _CREATE_EXAMPLES
FreeMatrix (R) ; FreeMatrix (G) ; FreeMatrix (B) ; FreeMatrix (Rl) ; FreeMatrix (Gl) ; FreeMatrix (Bl) ; free (clusters [0] ) ; free (clusters) ; free (mosaics [0] ) ; free (mosaics) ; free (matrix [0] ) ; free (matrix) ; exit (0) ;
#endif
#ifdef _SUMMARY // temp "print data" code for SCENE_NUM==13 : sprintf (filename, "%ssc.dif" , DIRNAME) ; if ( (fp = fopen (filename, "w") )==NULL) { fprintf (stderr, "Can't open %s for writing\n" , filename) ; exit (0) ;
} fprintf (fp, "#table size:\n%d\n#labels :\n" , scenes_num) ; for (i=l;i<=scenes_num;i++) { fprintf (fp, "scene_%d\n" , i) ;
} fprintf (fp, "\n") ; for (i=l;i<scenes_num;i++)
{ fprintf (fp, "#row %d:\n", i+1) ; for (j=0; j<i; j++) { fprintf (fp, "%g\n" , matrix [i] [j] ) ; }
} fclose (fp) ; clusters = (double **) malloc (scenes_num*sizeof (double *) ) ; clusters [0] = (double *) malloc (scenes_num*scenes_num*sizeof (double) ) ; for (i=1; i<scenes_num; i++) clusters [i] = clusters [i-1] + scenes_num;
/* Reorder the scene-to-scene distance matrix into the manually chosen cluster
   order, a fixed permutation of the scene indices. */
{
static const int perm[13] = {0, 2, 6, 11, 1, 5, 8, 10, 4, 9, 3, 12, 7};
for (i = 0; i < SCENE_NUM; i++)
    for (j = 0; j < SCENE_NUM; j++)
        clusters [i] [j] = matrix [perm[i]] [perm[j]] ;
}
scale = 20; max_dist = 255/ (max_dist-total_min) ;
Y = AllocMatrix (SCENE_NUM*scale, SCENE_NUM*scale) ; for (i=0; i<SCENE_NUM; i++) for (j=0; j<SCENE_NUM; j++) { if (i==j) val=0; else val = (unsigned char) (max_dist * (clusters [i] [j] - total_min)); for (x=0;x<scale;x++) for (y=0;y<scale;y++)
Y[i*scale+x] [j*scale+y] = val;
} sprintf (filename, "%sclusters.pgm" , DIRNAME) ; WritePGM (filename, Y, SCENE_NUM*scale, SCENE_NUM*scale) ; FreeMatrix (Y) ;
#endif // _SUMMARY
printf ("Program Done\n") ;
}
scenes_strips_str.c The routine void main() implements the coarse and fine mosaic-matching algorithm, used for sitcoms. The algorithm is described in detail above. It reads in a list of scenes of a specified episode, and a list of all mosaics representing each scene (this is the list of representative shots of each scene). It then computes the scene-to-scene distance matrix, and also generates the images that describe the strip-to-strip distance matrix between each pair of mosaics.
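For the finer comparison the mosaics are cut into a two-dimensional grid: BLOCK_DIM-wide vertical strips subdivided into BLOCK_DIM-high blocks, with a narrower last strip kept only when the leftover width exceeds half a block. The sketch below shows one plausible way to build such a grid; BlockHist, block_hist and build_grid are illustrative names, and the mean-gray descriptor is only a stand-in for the HIST structures the listing stores in HTabl and HTab2.

#include <stdlib.h>

#define BLOCK_DIM 60
#define MIN_LAST_STRIP_WIDTH (0.5 * BLOCK_DIM)

/* Toy block descriptor (stand-in for the appendix's HIST type). */
typedef struct { double mean; } BlockHist;

/* Mean gray value over the block [y0..y1] x [x0..x1]. */
static BlockHist block_hist(unsigned char **gray, int y0, int y1, int x0, int x1)
{
    BlockHist h;
    long sum = 0;
    int n = 0;
    for (int y = y0; y <= y1; y++)
        for (int x = x0; x <= x1; x++) { sum += gray[y][x]; n++; }
    h.mean = (double) sum / n;
    return h;
}

/* Build the grid of block descriptors for one mosaic: full BLOCK_DIM-wide strips
   plus, if the leftover width exceeds half a block, one narrower last strip
   (mirroring the WidthGap handling in the listing below). */
static BlockHist *build_grid(unsigned char **gray, int width, int height,
                             int *strips_out, int *rows_out)
{
    int rows = height / BLOCK_DIM;
    int strips = width / BLOCK_DIM;
    int gap = width % BLOCK_DIM;
    if (gap > MIN_LAST_STRIP_WIDTH) strips++;

    BlockHist *grid = malloc((size_t) rows * strips * sizeof *grid);
    for (int r = 0; r < rows; r++)
        for (int s = 0; s < strips; s++) {
            int x0 = s * BLOCK_DIM;
            int x1 = (s == strips - 1 && gap > MIN_LAST_STRIP_WIDTH)
                         ? width - 1 : x0 + BLOCK_DIM - 1;
            grid[r * strips + s] = block_hist(gray, r * BLOCK_DIM,
                                              r * BLOCK_DIM + BLOCK_DIM - 1, x0, x1);
        }
    *strips_out = strips;
    *rows_out = rows;
    return grid;
}

With two such grids, the vertical alignment of a strip pair is scored by the best sub-diagonal of their block-to-block distance matrix, and those scores in turn form the strip-to-strip matrix whose best sub-diagonal gives the horizontal alignment, as the listing below does with GetDirectBestDiagVal.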
#include <stdio.h>
#include <direct.h>
#include <stdlib.h> #include <string.h>
#include <math.h>
#include "..\image\hist.h"
#include "..\image\image.h"
#define BLOCK_DIM 60
#define S_BLOCK_DIM 20
#define MIN_DIAG_LEN_ST 3
#define MIN_DIAG_LEN_MT 5 #define MIN_CROPPED_HEIGHT (MIN_DIAG_LEN_ST*BLOCK_DIM)
#define MIN_CROPPED_WIDTH (MIN_DIAG_LEN_MT*BLOCK_DIM)
//#define DIRNAME "D:\\aya\\sitcoms\\friends2\\" //#define SCENE_NUM 13 #define DIRNAME "D:\\aya\\sitcoms\\friends3\\"
#define SCENE_NUM 14
#define MAX_RMOSAICS_IN_SCENE 10
#define MAX_MOS_WIDTH 2000 #define MAX_MOS_HEIGHT 1000
#define MAX_BLOCK_HIST_WIDTH (MAX_MOS_WIDTH/S_BLOCK_DIM + 5)
#define MAX_BLOCK_HIST_HEIGHT (MAX_MOS_HEIGHT/S_BLOCK_DIM + 5)
#define MIN_LAST_STRIP_WIDTH (0.5*BLOCK_DIM)
#define MIN_LAST_SMALL_STRIP_WIDTH (0.5*BLOCK_DIM)
#define PRE_COMPUTED_THRESH 0.5
double round (double x) ; double GetDirectBestDiagVal (double **Mat, int Rows, int Cols, int
Limit, int *BestStartX, int *BestEndX, int
*BestStartY, int *BestEndY) ; double GetDirectBestDiagValLimited (double **Mat, int Rows, int Cols, int Limit, int *BestStartX, int *BestEndX, int *BestStartY, int *BestEndY) ;
HIST * (*FillHistFunc) (int, int, int, unsigned char **, unsigned char **, unsigned char **, int, int, int, int) ; void (*FreeHistFunc) (HIST *) ; double (*HistDiffFunc) (HIST *,HIST *) ; static double rad2deg = 180/3.1415926535; unsigned char **Rl ; unsigned char **G1 ; unsigned char **B1 ; unsigned char **R ; unsigned char **G ; unsigned char **B ;
unsigned char **orgRl ; unsigned char **orgGl ; unsigned char **orgBl ; unsigned char **orgR; unsigned char **orgG; unsigned char **orgB; unsigned char **bigR ; unsigned char **bigG ; unsigned char **bigB ; unsigned char **Y; static double Xvals [200] ; static double Yvals[200]; static int Xinds [200] ; static int Yinds[200];
void main()
{ int ind_a, ind_b, count; int i, j, k, 1, x, y, k_indl, k_ind2;
FILE *fp; char line [256] , filename [256] ; char f l_filename [256] , f 2_f ilename [256] ; int scenes_num = SCENE_NUM; double min_dist, dist, dval, s_dist; unsigned char val; int ** mosaics; char *token; double **matrix, **clusters, **best; int Wl, Heightl, W2 , Height2, Widthl, Width2 , HI, H2; int WidthGap, HeightGap, WidthGapl, HeightGapl; HIST ***HTabl, ***HTab2; int block_dim = BL0CK_DIM, step = BL0CK_DIM - 1; int new_width, new_height; int start_i_ml, start_i_m2, end_i_ml, end_i_m2; int start_j__ml, start_j_m2, end_j_ml, end_j_m2; int start_i_vec [100] , end_i_vec [100] , start_j_vec [100] , end_j_vec [100] ; int **start_i_arr, **end_i_arr, **start_j_arr, **end_j_arr; int n_start , n_end; double dx, dy; int Zlen; int dummy; HIST ***HSTabl, ***HSTab2; int s_Heightl, s_Widthl, s_Height2, s_Width2 , SWl, SHI, SW2 , SH2; int strip_widthl, strip_width2, block_heightl, block_height2 ;
/////////////////////////////////////////////////////////////// //////////
// Now choose: //FillHistFunc = FillHistFromRGBArrayl; // RGB FillHistFunc = &FillHistFromRGBArray2 ; // HSI
FreeHistFunc = &FreeHist; HistDiffFunc = &FullHistDiffLl; /////////////////////////////////////////////////////////////////////////// matrix = (double **) malloc (scenes_num*sizeof (double *) ) ; matrix [0] = (double *) malloc (scenes_num*scenes_num*sizeof (double) ) ; for (i=1;i<scenes_num;i++) matrix [i] = matrix [i-1] + scenes_num; for (i=0;i<scenes_num;i++) for (j =0 ; j <scenes_num; j ++) matrix [i] [j] = 0;
mosaics = (int **) malloc (scenes_num*sizeof (int *) ) ; mosaics [0] = (int *) malloc (scenes_num*MAX_RMOSAICS_IN_SCENE*sizeof (int) ) ; for (i=l;i<scenes_num;i++) mosaics [i] = mosaics [i-l] + MAX_RMOSAICS_IN_SCENE; for (i=0;i<scenes_num,-i++) for (j =0 ; j <MAX_RMOSAICS_IN_SCENΞ ; j ++) mosaics [i] [j] = 0; sprintf (filename, "%sscenes_mosaics .txt" , DIRNAME) ; if ((fp = fopen (filename, "r") ) ==NULL) { fprintf (stderr, "Can' t open %s \n" , filename) ; exit (0) ; } for (i=0;i<scenes_num;i++) { count=0; fgets (line, 255, fp) ; token = strtok(line, " ") ; while (token != NULL) { count++; mosaics [i] [count] = atoi (token) ; token = strtok (NULL, " " ) ;
} mosaics [i] [0] =count ;
} printf ("there are %d scenes\n starting matrix... \n", scenes_num) ; R = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH)
G = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH) B = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH)
Rl = AllocMatrix (MAX_M0S_HEIGHT, MAX_M0S_WIDTH) ; Gl = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH) ; Bl = AllocMatrix (MAX MOS HEIGHT, MAX MOS WIDTH);
orgR = AllocMatri (MAX_MOS_HEIGHT, MAX_MOS_WIDTH) ; orgG = AllocMatri (MAX_MOS_HEIGHT, MAX_MOS_WIDTH) ; orgB = AllocMatrix (MAX_MOS_HEIGHT, MAX__MOS_WIDTH) ; orgRl = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH) orgGl = AllocMatrix (MAX_MOS_HΞIGHT, MAX_MOS_WIDTH) orgBl = AllocMatrix (MAX_MOS_HEIGHT, MAX_MOS_WIDTH)
HTabl = (HIST ***) malloc (MAX_BLOCK_HIST_HEIGHT*sizeof (HIST **));
HTabl [0] = (HIST **) malloc (MAX_BLOCK_HIST_HEIGHT*MAX_BLOCK_HIST_WIDTH*sizeof (HIST *) ) ; for (k=l;k<MAX_BLOCK_HIST_HΞIGHT; k++)
HTabl [k] = HTabl [k-1] + MAX_BLOCK_HIST_WIDTH; HTab2 = (HIST ***) malloc (MAX_BLOCK_HIST_HEIGHT*sizeof (HIST
* * ) ) ;
HTab2 [0] = (HIST * * ) malloc (MAX_BLOCK_HIST_HEIGHT*MAX_BLOCK_HIST_WIDTH* sizeof (HIST * ) ) ; for (k=l ; k<MAX_BLOCK_HIST_HEIGHT; k++) HTab2 [k] = HTab2 [k-1] + MAX_BLOCK_HIST_WIDTH;
// smaller version:
HSTabl = (HIST ***) malloc (MAX_BLOCK_HIST_HEIGHT*sizeof (HIST **) ) ; HSTabl [0] = (HIST **) malloc (MAX_BLOCK_HIST_HEIGHT*MAX_BLOCK_HIST_WIDTH*sizeof (HIST *) ) ; for (k=l;k<MAX_BLOCK_HIST_HEIGHT; k++)
HSTabl [k] = HSTabl [k-1] + MAX_BLOCK_HIST_WIDTH; HSTab2 = (HIST ***) malloc (MAX_BLOCK_HIST_HEIGHT*sizeof (HIST **));
HSTab2 [0] = (HIST **) malloc (MAX_BLOCK_HIST_HEIGHT*MAX__BLOCK__HIST_WIDTH*sizeof (HIST *) ) ; for (k=l;k<MAX_BLOCK_HIST_HΞIGHT; k++)
HSTab2 [k] = HSTab2 [k-1] + MAX_BLOCK_HIST_WIDTH;
Y = AllocMatrix (2000 , 2000 ) ;
Wl=100 ; W2=100 ; best = (double ** ) malloc (Wl*sizeof (double * ) ) ; best [0] = (double * ) malloc (Wl*W2*sizeof (double) ) ; for (k=1 ; k< Wl ; k++) best [k] = best [k-1] + W2 ; clusters = (double **) malloc (Wl*sizeof (double *) ) ; clusters [0] = (double *) malloc (Wl*W2*sizeof (double) ) ; for (k=1 ; k< Wl ; k++) clusters [k] = clusters [k-1] + W2 ; start_i_arr = (int ** ) malloc (Wl*sizeof (int * ) ) ; start_i_arr [0] = (int *) malloc (Wl*W2*sizeof (int) ) ; for (k=1 ; k< Wl ; k++) start_i_arr [k] = start_i_arr [k-1] + W2 ; end_i_arr = (int **) malloc (Wl*sizeof (int * ) ) ; end_i_arr [0] = (int * ) malloc (Wl*W2*sizeof (int) ) ; for (k=1 ; k< Wl; k++) end_i_arr [k] = end_i_arr [k-1] + W2 ;
_mkdir (fl_f ilename) ; for (i=0; i<scenes_num; i++) { for ( j=0; j<scenes_num; j++) { printf (" (%d,%d) ",i,j); if (i>=j) continue; //compute min distance between scene i and scene j : // Mosaics loop: min_dist = 10000; for (ind_a = 1; ind_a <= mosaics [i] [0]; ind_a++) {
sprintf (fl_filename, "%smos\\Original\\Mos_color_med%d.ppm" , DIRNAME, mosaics [i] [ind_a] ) ;
ReadPPMAllocated(fl_filename, &Widthl, &Heightl, &orgRl, &orgGl, &orgBl) ; sprintf (fl_filename, "%smos\\Mos_median%d.ppm" , DIRNAME, mosaics [i] [ind_a] ) ;
ReadPPMAllocated(fl_filename, &Widthl, SHeightl, &R1, &G1, &B1) ;
Wl = Widthl/block_dim; HI = Heightl/block__dim;
WidthGap = Widthl % block_dim; if (WidthGap > MIN__LAST_STRIP_WIDTH) W1++; for (1=0; 1< Hl*block_dim; l+=block_dim) for (k=0;k< Widthl-WidthGap; k+=block_dim)
HTabl [l/block_dim] [k/block_dim] = FillHistFunc (SIZE_0F_Y, SIZE_0F_U, SIZE_0F_V, Rl , Gl , Bl, 1, l+block_dim-l, k, k+step) ; if (WidthGap > MIN_LAST_STRIP_WIDTH) for (1=0,-1< Hl*block_dim; l+=block_dim)
HTabl [l/block_dim] [Wl-1] = FillHistFunc (SIZE_0F_Y, SIZE_0F_U, SIZE_0F_V, Rl, Gl, Bl, 1, l+block_dim-l, Widthl-WidthGap, Widthl-1) ; for (ind_b = 1; ind_b <= mosaics [j] [0] ; ind_b++) { sprintf (f2_filename, "%smos\\0riginal\\Mos_color_med%d.ppm" , DIRN AME, mosaics [ ] [ind_b] ) ; ReadPPMAllocated(f2_filename, &Width2 ,
&Height2, &orgR, &orgG, &orgB) ; sprintf (f2_filename, "%smos\\Mos_median%d. ppm" , DIRNAME, mosaics [j] [ind_b] ) ; ReadPPMAllocated(f2_f ilename, &Width2,
&Height2, &R, &G, &B)
W2 = Width2/block_dim;
H2 = Height2/block_dim;
WidthGap = Width2 % block_dim; if (WidthGap > MIN_LAST_STRIP_WIDTH)
W2 + +; for (1=0;1< H2*block_dim; l+=block_dim) for (k=0;k< Width2 -WidthGap; k+=block dim)
HTab2 [l/block_dim] [k/block_dim] = FillHistFunc (SIZE_0F_Y, SIZE_0F_U, SIZE_0F_V, R, G, B, 1, l+block_dim-l, k, k+step) ; if (WidthGap > MIN_LAST_STRIP_WIDTH) for (1=0, -1< H2*block_dim; l+=block_dim)
HTab2 [l/block_dim] [W2-1] = FillHistFunc (SIZE_0F_Y, SIZE_0F_U, SIZE_0F_V, R, G, B, 1, l+block_dim-l, Width2 -WidthGap, Width2-1) ; for (k_indl=0;k_indl<Wl; k_indl++) for (k_ind2=0;k_ind2<W2; k_ind2++) { for (k=0;k<Hl;k++) { for (1=0;1<H2;1++) { clusters [k] [1] = HistDif f Func (HTabl [k] [k_indl] , HTab2 [1] [k_ind2] ) ;
} } best [k_indl] [k_ind2] =
GetDirectBestDiagVal (clusters, HI, H2, MIN_DIAG_LEN_ST,
&start_j_ml, &end_j_ml, &start_i_ml, &end_i_ml) ; start_i_arr [k_indl] [k_ind2]
= start_i_ml; end_i_arr [k_indl] [k_ind2] = end_i_ml ; start_j_arr [k_indl] [k_ind2] = start_j_ml; end_j_arr [k_indl] [k_ind2] = end_j_ml;
} dist = GetDirectBestDiagVal (best, Wl,
W2, MIN_DIAG_LEN_MT, &start_j_m2 , &end_j_m2, &start_j_ml, &end_j_ml) ; if (dist > PRE_COMPUTED_THRESH) { continue; } if (end_j_ml-start_j_ml < 2) continue; if (end_j_m2-start_j_m2 < 2) continue; new_height = (end_j_ml-start_j_ml+1) *BLOCK_DIM; new_width = (end_j_m2-
start_j_m2+1)*BLOCK_DIM; if (new_height <= new_width) { dval = (double) new_width / new_height ; strip_widthl = S_BLOCK_DIM; strip_width2 =
(int) (S_BLOCK_DIM*dval) ;
} else { dval = (double) new_height / new_width; strip_width2 = S_BLOCK_DIM; strip_widthl = (int) (S_BLOCK_DIM*dval) ;
} dval = atan2 ( (end_j_ml-start_j_ml+1) , (end_j_m2-start_j_m2+1) ) ; dval *= rad2deg; dx = end_j_m2-start_j_m2+1; dy = end_j_ml-start_j_ml+1; if (dy>=dx) {
Zlen = dy; dx /= dy; dy = 1; } else{
Zlen = dx; dy /= dx; dx = 1;
}
Xvals [0] = Xinds [0] = start_j_m2 ;
Yvals [0] = Yinds [0] = start_j_ml ; for (k=1 ; k<Zlen; k++) {
Xvals [k] = Xvals [k-1] + dx ; Yvals [k] = Yvals [k-1] + dy ; Xinds [k] = round (Xvals [k] ) ; Yinds [k] = round (Yvals [k] ) ;
} for (k=0 ; k<100 ; k++) { start_i_vec [k] = 0 ; end_i_vec [k] =0 ; start_j_vec [k] =0 ; end_j_vec [k] =0 ; }
for (k=0; k<Zlen; k++) { k_indl = Yinds [k] ; k_ind2 = Xinds [k] ; start_i_vec [start_i_arr [k_indl] [k_ind2] ] ++; end_i_vec [end_i_arr [k_indl] [k_ind2] ] ++ ; start_j_vec [start_j_arr [k_indl] [k_ind2 ] ] ++; end_j_vec [end_j_arr [k_indl] [k_ind2 ] ] ++;
} count =-1 , • for (k=0 ; k<Hl ; k++) if (start_i_vec [k] > count) { start_i_ml = k; count = start_i_vec [k] ;
} count=-l; for (k=0;k<Hl; k++) if (end_i_vec [k] >= count) { end_i_ml = k; count = end_i_vec [k] ;
} count=-1; for (k=0;k<H2; k++) if (start_j_vec [k] > count) { start_i_m2 = k; count = start_j_vec [k] ;
} count=-l; for (k=0;k<H2; k++) if (end_j_vec [k] >= count) { end_i_m2 = k; count = end_j_vec [k] ;
} if (end_i_ml-start_i_ml+1 <
MIN_DIAG_LEN_ST) { n_end = end_i_ml; n_start = start_i_ml; for (k=0; k<Zlen; k++) { k_indl = Yinds [k] ; k_ind2 = Xinds [k] ; if (start_i_ml == start_i_arr [k_indl] [k_ind2] ) if (n_end < end_i_arr [k_indl] [k_ind2] ) n_end = end_i_arr [k_indl] [k_ind2] ;
} for (k=0; k<Zlen; k++) { k_indl = Yinds [k] ; k_ind2 = Xinds [k] ; if (end_i_ml == end_i_arr [k_indl] [k_ind2] ) if (n_start > start_i_arr [k_indl] [k_ind2] ) n_start = start_i_arr [k_indl] [k_ind2] ;
Check-affine.m: this function, in Matlab code, aids in computing the type of motion within each shot - static, panning, zoom-in or zoom-out.
GlobalPath = 'D:\aya\sitcoms\';
FigDim=100; SqDim=10; TransDim = 10; %% Choose Sitcoms:
GlobalPath = [GlobalPath ' friends2_shots\ ' ] ;
Base=100000; thresh=20;
Shots = load ( [GlobalPath, 'shots_skip5.txt']); ShotsBeg = Shots (: , 1) ;
ShotsEnd = Shots (: , 2);
ShotsNum = length (ShotsBeg) ; eval ( [ 'cd ' , GlobalPath, ' ; ' ] ) ; for ShotIndex=1: ShotsNum ShotIndex
Path = [GlobalPath, 'shot', num2str (Shotlndex) , ' \ ' ] ; eval( ['cd ' ,Path, ';']);
TRANS = load(' .aff ') ;
Num=length (TRANS ( : , 1) ) ;
TRANS = all2MATLAB ( [TransDim, TransDim], TRANS); colors = ['b' 'r' 'g' 'y' 'c' 'm' 'k']; num_of_colors = length (colors) ; bx = [-FigDim FigDim FigDim -FigDim] ; by = [FigDim FigDim -FigDim -FigDim] ; fill (bx, by, 'w') ; hold on;
X = [-SqDim SqDim SqDim -SqDim] ; Y = [SqDim SqDim -SqDim -SqDim] ; Bord_top = -SqDim;
Bord_bottom = SqDim; Bord_left = -SqDim; Bord_right = SqDim; fill(X,Y, 'k') ; for ind=l:Num this_color = colors (mod(ind-l,num_of_colors) +1) ; T = TRANS (ind, 1: 9) ;
T = reshape (T, 3, 3) ;
T = T ' ;
Z = T* [X;Y; [1 1 1 1]] ;
X = Z(1, :) ./Z(3, :) ; Y = Z(2, :) ./Z(3, :) ;
% update borders : Bord_top = min(Bord_top, Y(3));
Bord_top = min(Bord_top, Y(4));
Bord_bottom = max(Bord_bottom, Y(1)) ;
Bord_bottom = max(Bord_bottom, Y(2)) ; Bord_left = min(Bord_left, X(1)) ;
Bord_left = min(Bord_left, X(4)) ;
Bord_right = max(Bord_right, X(2));
Bord_right = max(Bord_right, X(3)); fill(X,Y,this_color) ; end clear TRANS;
% save image into file: eval(['! del /F aff.*']);
Border = [Bord_top, Bord_bottom, Bord_left, Bord_right] save 'aff.txt' Border -ASCII print -dpsc aff.ps saveas (gcf, ' aff . jpg' ) hold off; end
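The motion label itself can be read off the accumulated frame-to-frame transformation: a determinant of the linear part noticeably different from 1 indicates zoom-in or zoom-out, a dominant translation indicates panning, and a near-identity transform means a static shot. The Matlab script above visualizes the warped square and records its bounding box rather than thresholding; the C sketch below shows the corresponding decision rule, with enum names and thresholds that are illustrative assumptions, not values taken from the patent.

#include <math.h>
#include <stdio.h>

/* Hypothetical labels and thresholds, for illustration only. */
enum motion_type { MOTION_STATIC, MOTION_PAN, MOTION_ZOOM_IN, MOTION_ZOOM_OUT };

/* T is a 3x3 frame-to-frame transform in row-major order:
   [ a b tx ; c d ty ; g h 1 ]. */
static enum motion_type classify_motion(const double T[9])
{
    double scale = sqrt(fabs(T[0]*T[4] - T[1]*T[3]));  /* linear-part area change */
    double shift = sqrt(T[2]*T[2] + T[5]*T[5]);        /* translation magnitude   */

    if (scale > 1.05) return MOTION_ZOOM_IN;
    if (scale < 0.95) return MOTION_ZOOM_OUT;
    if (shift > 2.0)  return MOTION_PAN;
    return MOTION_STATIC;
}

int main(void)
{
    double pan[9] = {1, 0, 6.0, 0, 1, 0.5, 0, 0, 1};   /* mostly translation */
    printf("%d\n", classify_motion(pan));              /* prints 1 (MOTION_PAN) */
    return 0;
}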

Claims

CLAIMS What is claimed is:
1. A method for summarizing a video comprising a plurality of consecutive frames, the method comprising the steps of: a) dividing said plurality of consecutive frames into a plurality of sequences of consecutive frames; b) preparing a mosaic representation for each sequence of consecutive frames; c) comparing said mosaic representations by determining similarities between said mosaic representations; and d) clustering said mosaic representations into physical settings based upon said similarities between said mosaic representations.
2. The method as defined in claim 1, wherein a sequence of consecutive frames comprises a shot.
3. The method as defined in claim 1, wherein said step of preparing a mosaic representation of each sequence of consecutive frames comprises determining a reference frame for each shot.
4. The method as defined in claim 3, wherein said step of preparing a mosaic representation of each sequence of consecutive frames comprises computing motion transformations between successive frames in the sequence of consecutive frames and using the motion transformations to map each frame onto a reference image plane.
5. The method as defined in claim 3, wherein the step of preparing a mosaic representation comprises preparing a color mosaic representation.
6. The method as defined in claim 5, wherein the step of preparing a color mosaic representation comprises, for each color pixel in a frame having a color value, converting the color pixel to a gray-level pixel and maintaining pointers from the gray-level pixel to its respective color value.
7. The method as defined in claim 6, further comprising computing, for each pixel in the mosaic representation, a median gray-level for all pixels in the frames mapped to the reference image plane and using the respective color value corresponding to the median gray-level.
8. The method as defined in claim 1, wherein said step of comparing said mosaic representations comprises, for pairs of said mosaic representations, performing an alignment of said mosaic representations.
9. The method as defined in claim 8, wherein said step of performing an alignment of said mosaic representations comprises, for each pair of said mosaic representations, dividing each said mosaic representation into a plurality of strips and comparing a pair of strips including one strip from each respective mosaic representation.
10. The method as defined in claim 9, wherein said step of performing an alignment of said mosaic representations further comprises, for each said pair of strips, determining a vertical alignment.
11. The method as defined in claim 10, wherein the step of determining the vertical alignment of each of said pairs of strips comprises, for each said pair of strips, determining a best diagonal in a distance matrix and determining a distance value from said best diagonal.
12. The method as defined in claim 9, wherein said step of performing an alignment of said mosaic representations further comprises, for each said pair of mosaic representations, determining a horizontal alignment.
13. The method as defined in claim 12, wherein the step of determining the horizontal alignment of each of said pairs of mosaic representations comprises, for each of said pairs of mosaic representations, determining a best diagonal in a second distance matrix and determining a second distance value from the best diagonal.
14. The method as defined in claim 8, wherein said step of comparing said mosaic representations further comprises, retaining a subset of said mosaic representations.
15. The method as defined in claim 14, wherein said step of retaining a subset of said mosaic representations comprises, for pairs of said mosaic representations, determining a threshold distance value.
16. The method as defined in claim 15, wherein said step of comparing said mosaic representations further comprises retaining pairs of said mosaic representations having distance values less than or equal to said threshold.
17. The method as defined in claim 14, wherein said step of comparing said mosaic representations further comprises, for said pairs of mosaic representations, performing a second alignment.
18. The method as defined in claim 17, wherein said step of performing a second alignment of said pairs of mosaic representations comprises cropping each of said mosaic representations.
19. The method as defined in claim 18, wherein said step of performing a second alignment of said pairs of mosaic representations comprises dividing each of said mosaic representations into a plurality of strips and comparing a pair of strips including one from each respective mosaic representation.
20. The method as defined in claim 19, wherein each of said strips is narrower than said strips prepared in said step of performing said first alignment of said mosaic representations.
21. The method as defined in claim 19, wherein said step of performing a second alignment of said pairs of mosaic representations further comprises, for each said pair of strips, determining a vertical alignment.
22. The method as defined in claim 14, wherein said step of performing a second alignment of said pairs of mosaic representations further comprises, for each said pair of mosaic representations, determining a horizontal alignment.
23. The method as defined in claim 1, further comprising dividing said sequences of consecutive frames into a plurality of scenes, and preparing one or more mosaic representations for each scene.
24. The method as defined in claim 23, wherein the step of comparing said mosaic representations comprises comparing a pair of scenes by determining the distance value between said scenes in said pair of scenes.
25. The method as defined in claim 24, wherein the step of determining the distance value between said scenes in said pair of scenes comprises determining the minimum distance between pairs of mosaics including one mosaic from each of said scenes.
26. The method as defined in claim 25, further comprising clustering each of said distance values of pairs of scenes into a matrix arranged by physical settings.
27. The method as defined in claim 1, further comprising, for a plurality of videos, identifying the frequency with which physical settings appear in said plurality of videos.
28. The method as defined in claim 1, further comprising displaying representations of said physical settings of said video.
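The per-pixel median selection recited in claims 4-7 can be illustrated with a short sketch. The code below assumes that each mosaic pixel has already collected a list of contributions, one from every frame mapped onto the reference image plane, each carrying a gray level and a pointer back to its original color value. The Sample structure and the selectMedianColor function are illustrative names, not taken from the patent, and the sketch omits the motion-transformation step that produces the contributions.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One contribution to a mosaic pixel: the gray level used for the median
// computation, plus the original color value that gray level "points" back to.
struct Sample {
    uint8_t gray;      // gray-level version of the contributing pixel
    uint8_t r, g, b;   // original color value of that pixel
};

// Pick the color whose gray level is the median of all contributions mapped
// onto this mosaic pixel (illustrating claims 6-7).  Assumes at least one
// contribution; samples is taken by value because nth_element reorders it.
static Sample selectMedianColor(std::vector<Sample> samples) {
    const std::size_t mid = samples.size() / 2;
    std::nth_element(samples.begin(), samples.begin() + mid, samples.end(),
                     [](const Sample& a, const Sample& b) { return a.gray < b.gray; });
    return samples[mid];   // color value associated with the median gray level
}
```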
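Claims 10-13 score a pair of strips (and, at the next level, a pair of mosaics) by locating a best diagonal in a distance matrix and reading the pair's distance value off that diagonal. The following is a minimal sketch of such a diagonal search, assuming the distance matrix has already been filled in; the DiagonalResult type and bestDiagonal function are assumed names, and a practical implementation would likely also enforce a minimum overlap along the diagonal.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// D is an m x n distance matrix, D[i][j] being the distance between element i
// of the first strip (or mosaic) and element j of the second.  Each diagonal
// corresponds to one relative shift; the best diagonal yields both the
// alignment offset and the distance value for the pair.
struct DiagonalResult {
    int offset;       // shift (j - i) of the best diagonal
    double distance;  // mean distance along that diagonal
};

static DiagonalResult bestDiagonal(const std::vector<std::vector<double>>& D) {
    const int m = static_cast<int>(D.size());       // assumes a non-empty matrix
    const int n = static_cast<int>(D[0].size());
    DiagonalResult best{0, std::numeric_limits<double>::max()};

    for (int offset = -(m - 1); offset <= n - 1; ++offset) {
        double sum = 0.0;
        int count = 0;
        for (int i = 0; i < m; ++i) {
            const int j = i + offset;
            if (j < 0 || j >= n) continue;          // off the matrix for this shift
            sum += D[i][j];
            ++count;
        }
        if (count > 0 && sum / count < best.distance) {
            best = {offset, sum / count};
        }
    }
    return best;
}
```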
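Claims 14-22 describe a coarse-to-fine strategy: every pair of mosaics is first scored with a coarse strip alignment, only pairs whose distance is at or below a threshold are retained, and those pairs are re-aligned with narrower strips on cropped mosaics. The control flow might look like the sketch below; Mosaic, coarseDistance, and fineDistance are stand-ins for the patent's alignment steps, not its actual interfaces.

```cpp
#include <cstddef>
#include <vector>

// Coarse-to-fine retention of mosaic pairs.  The coarse pass uses wide strips;
// the fine pass, run only on retained pairs, uses narrower strips on cropped
// mosaics.  The distance functions are supplied by the caller in this sketch.
struct Mosaic { /* pixel data omitted in this sketch */ };
struct PairScore { std::size_t i, j; double distance; };

template <typename CoarseFn, typename FineFn>
std::vector<PairScore> compareAll(const std::vector<Mosaic>& mosaics,
                                  double threshold,
                                  CoarseFn coarseDistance,  // wide-strip alignment
                                  FineFn fineDistance) {    // narrow strips, cropped mosaics
    std::vector<PairScore> retained;
    for (std::size_t i = 0; i < mosaics.size(); ++i) {
        for (std::size_t j = i + 1; j < mosaics.size(); ++j) {
            const double d = coarseDistance(mosaics[i], mosaics[j]);
            if (d <= threshold) {
                // Second, finer alignment is performed only on retained pairs.
                retained.push_back({i, j, fineDistance(mosaics[i], mosaics[j])});
            }
        }
    }
    return retained;
}
```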
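Claims 23-27 operate at the scene level: the distance between two scenes is the minimum distance over all pairs of mosaics drawn one from each scene, and the resulting distance values are clustered into physical settings. A minimal sketch under those assumptions follows; the greedy threshold grouping at the end stands in for whatever clustering procedure the patent actually uses, and the scene representation (lists of mosaic indices into a precomputed distance matrix) is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

using DistanceMatrix = std::vector<std::vector<double>>;

// Illustrating claim 25: scene distance is the minimum distance over all
// pairs of mosaics, one mosaic taken from each scene.
static double sceneDistance(const std::vector<std::size_t>& sceneA,
                            const std::vector<std::size_t>& sceneB,
                            const DistanceMatrix& mosaicDist) {
    double best = std::numeric_limits<double>::max();
    for (std::size_t a : sceneA)
        for (std::size_t b : sceneB)
            best = std::min(best, mosaicDist[a][b]);
    return best;
}

// Illustrative grouping for claim 26: scenes whose pairwise distance falls
// below a threshold are greedily assigned the same physical-setting label.
static std::vector<int> clusterScenes(const std::vector<std::vector<std::size_t>>& scenes,
                                      const DistanceMatrix& mosaicDist,
                                      double threshold) {
    std::vector<int> label(scenes.size(), -1);
    int next = 0;
    for (std::size_t i = 0; i < scenes.size(); ++i) {
        if (label[i] == -1) label[i] = next++;
        for (std::size_t j = i + 1; j < scenes.size(); ++j) {
            if (label[j] == -1 &&
                sceneDistance(scenes[i], scenes[j], mosaicDist) <= threshold) {
                label[j] = label[i];
            }
        }
    }
    return label;   // label[s] = physical-setting cluster of scene s
}
```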
PCT/US2003/009704 2002-03-27 2003-03-27 Methods for summarizing video through mosaic-based shot and scene clustering WO2003084249A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003226140A AU2003226140A1 (en) 2002-03-27 2003-03-27 Methods for summarizing video through mosaic-based shot and scene clustering

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US36809202P 2002-03-27 2002-03-27
US60/368,092 2002-03-27

Publications (2)

Publication Number Publication Date
WO2003084249A1 WO2003084249A1 (en) 2003-10-09
WO2003084249A9 true WO2003084249A9 (en) 2004-02-19

Family

ID=28675443

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2003/009704 WO2003084249A1 (en) 2002-03-27 2003-03-27 Methods for summarizing video through mosaic-based shot and scene clustering

Country Status (2)

Country Link
AU (1) AU2003226140A1 (en)
WO (1) WO2003084249A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8089563B2 (en) * 2005-06-17 2012-01-03 Fuji Xerox Co., Ltd. Method and system for analyzing fixed-camera video via the selection, visualization, and interaction with storyboard keyframes
JP4697221B2 (en) 2007-12-26 2011-06-08 ソニー株式会社 Image processing apparatus, moving image reproduction apparatus, processing method and program therefor
US8824801B2 (en) 2008-05-16 2014-09-02 Microsoft Corporation Video processing
RU2583764C1 (en) 2014-12-03 2016-05-10 Общество С Ограниченной Ответственностью "Яндекс" Method of processing request for user to access web resource and server

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0976089A4 (en) * 1996-11-15 2001-11-14 Sarnoff Corp Method and apparatus for efficiently representing, storing and accessing video information
US5956026A (en) * 1997-12-19 1999-09-21 Sharp Laboratories Of America, Inc. Method for hierarchical summarization and browsing of digital video

Also Published As

Publication number Publication date
WO2003084249A1 (en) 2003-10-09
AU2003226140A1 (en) 2003-10-13

Similar Documents

Publication Publication Date Title
Aner et al. Video summaries through mosaic-based shot and scene clustering
US7760956B2 (en) System and method for producing a page using frames of a video stream
US8306334B2 (en) Methods of representing and analysing images
US20050228849A1 (en) Intelligent key-frame extraction from a video
KR100636910B1 (en) Video Search System
Kim et al. Efficient camera motion characterization for MPEG video indexing
US7889794B2 (en) Extracting key frame candidates from video clip
US20070182861A1 (en) Analyzing camera captured video for key frames
US20080019661A1 (en) Producing output video from multiple media sources including multiple video sources
JP5097280B2 (en) Method and apparatus for representing, comparing and retrieving images and image groups, program, and computer-readable storage medium
JP2006510072A (en) Method and system for detecting uniform color segments
US6904159B2 (en) Identifying moving objects in a video using volume growing and change detection masks
US20070030396A1 (en) Method and apparatus for generating a panorama from a sequence of video frames
JP2004508756A (en) Apparatus for reproducing an information signal stored on a storage medium
US20110038532A1 (en) Methods of representing and analysing images
WO2013056311A1 (en) Keypoint based keyframe selection
KR100862939B1 (en) Image recording and playing system and image recording and playing method
Aner-Wolf et al. Video summaries and cross-referencing through mosaic-based representation
WO2003084249A9 (en) Methods for summarizing video through mosaic-based shot and scene clustering
EP1640913A1 (en) Methods of representing and analysing images
Ciocca et al. Dynamic key-frame extraction for video summarization
JP3499729B2 (en) Method and apparatus for spatio-temporal integration and management of a plurality of videos, and recording medium recording the program
Aner et al. Mosaic-based clustering of scene locations in videos
Aner-Wolf et al. Beyond key-frames: The physical setting as a video mining primitive
Aner Video summaries and cross-referencing

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
COP Corrected version of pamphlet

Free format text: PAGES 1/24-24/24, DRAWINGS, REPLACED BY NEW PAGES 1/30-30/30; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP