WO2005072239A2 - Methods and systems for analyzing and summarizing video - Google Patents

Methods and systems for analyzing and summarizing video

Info

Publication number
WO2005072239A2
WO2005072239A2 (PCT/US2005/001859)
Authority
WO
WIPO (PCT)
Prior art keywords
board
content
frames
blocks
video
Prior art date
Application number
PCT/US2005/001859
Other languages
French (fr)
Other versions
WO2005072239A3 (en)
Inventor
John R. Kender
Tiecheng Liu
Original Assignee
The Trustees Of Columbia University In The City Of New York
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Trustees Of Columbia University In The City Of New York filed Critical The Trustees Of Columbia University In The City Of New York
Publication of WO2005072239A2 publication Critical patent/WO2005072239A2/en
Publication of WO2005072239A3 publication Critical patent/WO2005072239A3/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/738Presentation of query results
    • G06F16/739Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • G06F16/785Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content using colour or luminescence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • G06F16/786Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content using motion, e.g. object motion or camera motion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation

Definitions

  • the present invention relates to the field of instructional videos. More particularly, this invention relates to methods and systems for analyzing and summarizing instructional videos.
  • Two levels at which a video can be analyzed or understood include a syntactic level and an event level.
  • Analyzing on a syntactic level refers to video analysis based on syntactic (low-level image) structures of videos, rather than content.
  • the structures of videos are examined to determine the meaning and relationship between image objects. For example, assuming that a frame of video includes one or more sentences of text, a consideration of the text on a syntactic level includes reading and evaluating the video frame at a level of its letters, not at a level of the concepts being presented by the letters.
  • Understanding on an event level refers to an understanding based on video content, rather than how the video is captured or edited.
  • low-level image processing techniques are used to segment frames into board background blocks, board content blocks, and irrelevant blocks. These blocks are subsequently merged into four region types, namely, board background, board content, occlusion by instructor, and irrelevant regions.
  • the fluctuation of video content can be measured, where local maxima in the number of chalk pixels are used to define key frames.
  • Mosaiced images are also developed in certain embodiments to compensate for the fact that some video content may not appear in some frames due to occlusion by the instructor.
  • the concept of semantic teaching units is also presented, and these units are detected by a model based on the recognition of actions of instructors, and on the measurement of temporal duration and spatial location of board content.
  • the invention provides a method for processing a video of a board presentation, wherein the video comprises a plurality of board image frames, the method comprising the steps of: modeling the color distribution of the board; dividing the board image of at least some frames into a plurality of blocks; for the at least some frames, determining which blocks are board content blocks; detecting the amount of combined content in the content blocks for the at least some frames; and selecting key frames from the video based at least in part on the detection of the total number of content chalk pixels in the content blocks for the at least some frames.
  • the invention provides a system for processing a video of a board presentation, wherein the video comprises a plurality of board image frames, the system comprising: a projection surface that is adapted to display writing; at least one detection device that captures the video; and a processor coupled to the at least one detection device that models the color distribution of the board, divides the board image of at least some frames into a plurality of blocks, determines which blocks are board content blocks for the at least some frames, detects the total number of content chalk pixels in the content blocks for the at least some frames, and selects key frames from the video based at least in part on the detection of the total number of content chalk pixels in the content blocks for the at least some frames.
  • the invention provides a system for processing a video of a board presentation, wherein the video comprises a plurality of board image frames, the system comprising: a processor that: models the color distribution of the board; divides the board image of at least some frames into a plurality of blocks; determines which blocks are board content blocks for the at least some frames; detects the total number of content chalk pixels in the content blocks for the at least some frames; and selects key frames from the video based at least in part on the detection of the total number of content chalk pixels in the content blocks for the at least some frames.
  • FIG. 1 is a simplified illustration of one embodiment of a board-camera system in which certain embodiments of the present invention may be implemented;
  • FIG. 2 is a simplified flow chart illustrating the steps performed in determining the classification of a given block in accordance with the principles of certain embodiments of the present invention;
  • FIG. 3 is a simplified illustration showing one frame of video that has been segmented into different regions in accordance with the principles of certain embodiments of the present invention
  • FIG. 4 is a graph showing the fluctuation of the number of chalk or content pixels as measured for the first 4,500 frames of a sample video segment in accordance with the principles of certain embodiments of the present invention
  • FIG. 5 is a graph showing the fluctuation of the area of the occlusion by the instructor region in the board area as measured for the first 4,500 frames of a sample video segment in accordance with the principles of certain embodiments of the present invention.
  • FIG. 6 is a simplified flow chart illustrating the steps performed in determining the actions of the instructor in accordance with the principles of certain embodiments of the present invention.
  • Methods and systems are provided for analyzing and summarizing, for example, instructional videos of unrehearsed and unedited blackboard (board)-presentations. While the description below focuses on videos of board presentations in order to illustrate various principles of the present invention, it will be understood that the invention is also applicable to other types of videos. For example, the methods and systems described herein may be used to analyze and summarize instructional videos of computer presentations where successive hand- written slide images are projected onto a screen or wall, or instructional videos of marker board presentations where an instructor uses a marker, rather than chalk, to write on the board. Moreover, while it is assumed herein that instructional videos are produced by lightly trained individuals who do poor camerawork and little or no editing, the invention is not limited to this class of instructional videos.
  • FIG. 1 is a simplified illustration of one embodiment of a board-camera system 100 in which the present invention may be implemented.
  • system 100 includes a video camera 104 for capturing video of board 108 (including, e.g., chalk content 112).
  • camera 104 may be used to capture video of board 108 while an instructor (not shown) is presenting instructional material to an audience (e.g., students). While only a single camera 104 is shown in FIG. 1, it will be understood that the invention is not limited in this manner.
  • camera 104 may be any suitable type of video camera
  • board 108 may be any suitable type of board on which an instructor may write (e.g., with a marker, chalk, etc.).
  • camera 104 may be replaced by any suitable type of detection device that is capable of detecting writing on (or the projection of writing onto) board 108 or any other suitable type of surface (e.g., a wall, screen, etc.).
  • System 100 also includes a computer 116 and a monitor 120 connected thereto, and although not shown in FIG. 1, may further include one or more input devices (e.g., a keyboard and mouse) connected to computer 116 via connection section 124.
  • Computer 116 may be, for example, a DELL PRECISION computer with an INTEL PENTIUM processor (1.8 GHz) and 1 Gb of SDRAM memory.
  • Video to be analyzed and summarized according to the invention by computer 116 is provided by camera 104 (not necessarily while the video is in the process of being captured, assuming that camera 104 has some form of media on which to store video content prior to transfer to computer 116).
  • the video from camera 104 may be captured by computer 116 using any suitable type of video card 128. It will be understood that, in accordance with the principles of the present invention, the image processing described below may be occurring in a processor that is, for example, the central processing unit (CPU) of computer 116.
  • a board that is the central focus of an instructional video may have multiple (potentially sliding) panels, where an instructor may choose to alternate the use of these panels for writing instructional material. Additionally, there may be changes in lighting conditions (e.g., as seen when changing from one camera to another camera when multiple cameras are being used).
  • the camera(s) being used may be subject to frequent (and often unintended) motion, particularly when being controlled by a cameraman instead of being placed in a stationary position, or when the camera needs to zoom in to the board in order to maintain adequate resolution (at which time it is possible that only a portion of the board is being captured by the camera).
  • an instructor is obviously prone to movement, and is likely to occlude the board image at least from time to time.
  • There may also be spatial and temporal inconsistencies in the board background color, due to the wear of the board and the trail or residue due to the accumulation of partially erased chalk.
  • the methods and systems described in greater detail below for analyzing and summarizing videos may be used to overcome some or all of these difficulties in analyzing and summarizing videos, such as instructional videos.
  • the non-uniform color distribution on a board mentioned above (which may be the result of, for example, chalk residue) generally precludes the use of straight-forward models that assume board area to have a uniform background color. Therefore, in accordance with the principles of the present invention, the color distribution of board background is modeled, and this model is used to segment video content areas.
  • the color of each pixel is modeled as {r, g, b} = {r̄ + k·r′, ḡ + k·g′, b̄ + k·b′}, where {r̄, ḡ, b̄} represents the color of the board, {r′, g′, b′} represents the color of chalk, and k represents how much chalk has been left on that pixel (sometimes unintentionally through accumulated erasures).
  • for white chalk, r′ ≈ g′ ≈ b′, so the model simplifies to {r, g, b} = {r̄ + n, ḡ + n, b̄ + n}; n is modeled as a binomial distribution, and the real board color {r̄, ḡ, b̄} and the parameters of the distribution of n are estimated by processing a single frame iteratively in the manner set forth below.
  • a block-based process is used to estimate the parameters of the unchalked board background.
  • the frame is divided into blocks.
  • the frame is divided into blocks with size 16 by 16 pixels.
  • the threshold may be obtained by first making a coarse estimation of board color by searching the dominant color of the frame using a histogram
  • the average color of these blocks is clustered.
  • the main cluster is then obtained by iteratively discarding outliers (i.e., those blocks having a higher average color value) with regard to the determined average color of all these blocks having edge pixels with color intensities less than the predefined threshold. This iterative process may be repeated a certain number of times or until a threshold amount of outliers has been removed. All blocks in the main cluster are considered to be board background blocks, where the blocks with less standard deviation of color are those less affected by chalk dust or residue.
  • the block with the least color variance in this cluster is assumed to be the one least affected by chalk pixels, and the average color {r̄, ḡ, b̄} of that block is selected as an estimation of board background color. All pixels in board background blocks are used to estimate the average and standard deviation of k, i.e., μ and σ. It should also be noted that, since board color {r̄, ḡ, b̄} depends on lighting conditions, each time a change of camera, camera direction, camera zoom, etc. (a new shot) is detected, the board color is re-estimated.
  • the board background color {r̄, ḡ, b̄} and the value of chalk parameters μ and σ are estimated in successor frames of the same shot, and blocks of each frame are classified into one of three categories: board background blocks, board content blocks, and irrelevant blocks. While using a block-based process such as described in greater detail below provides potentially less accurate results when compared to using a pixel-based process (whereby each pixel is independently analyzed), such a process greatly reduces computation complexity and produces reasonable results based on empirical analyses.
  • FIG. 2 is a simplified flow chart illustrating the steps performed in determining the classification of a given block in accordance with the principles of the present invention.
  • the similarity of the block to the background block can be determined in other ways (e.g., the median could be used instead of the average, etc.).
  • a pixel color test is performed, whereby each pixel {r, g, b} in the block is tested.
  • a pixel may be considered to be a board background pixel if the following conditions are met: |r - r̄| < μ + aσ and |g - ḡ| < μ + aσ and |b - b̄| < μ + aσ (e.g., where a = 2).
  • the ratio of board background pixels to all pixels in a block is then calculated, and the ratio is compared to a predefined threshold value.
  • an edge point test is used to determine whether the number of edge points in the block is less than a predefined threshold. In general, background blocks will contain almost no edge points.
  • step 208 it is determined whether the block has passed (satisfied) all three of the above described tests (corresponding to steps 202-206). If the block has passed all three of these tests, it is determined to be a board background block at step 210.
  • step 212 If the block has not passed all three of the above described tests, it is determined at step 212 whether the block has passed (satisfies) the first two tests (corresponding to steps 202 and 204). If it has passed the first two tests, the block is determined to be a board content block at step 214.
  • the block is determined to be an irrelevant block at step 216 (e.g., a block that corresponds to portions of the board that are obstructed or otherwise not suitable for writing).
  • FIG. 3 shows one frame of video that has been segmented into different regions, where the region of board content has been determined, and the position of the instructor has been estimated.
  • the frame shown in FIG. 3 includes board region 304, region occluded by instructor 308, and irrelevant region 312.
  • Position 314 shown in FIG. 3 is the estimated center of mass of the instructor.
  • board region 304 shown in FIG. 3 generally includes both board content blocks (which includes teaching content, such as text 318) and board background blocks.
  • a pixel may be considered to be a chalk pixel if its color is greater than the board background color by a threshold amount.
  • the number of total chalk pixels in the board content region is determined, and this number is used as a heuristic measurement of content (given that changes in the number of chalk pixels represent changing video content).
  • a function is then obtained that relates the total number of chalk pixels to the frame number.
  • Local maxima in the function correspond to frames that contain more chalk pixels, and thus more video content. By searching these maxima and tracking the movements of camera and instructor, key frames can be extracted from the video (and combined to provide a summary of the video) in the manner set forth below.
  • camera motion is first detected prior to the detection of key frames. It is noted that a change of the value of the content function may be caused not only by the addition or removal of actual content, but also by camera motion or by camera zoom.
  • camera zoom-in may cause a significant increase in the number of chalk pixels even when the content on the board has not changed.
  • an instructor ceases to obstruct a content portion of the board, the number of chalk pixels detected may increase, even though the actual content on the board has not changed. Accordingly, comparing the values of content function on a whole video fails to provide an accurate indication of actual board content. Board content is most reliably detected within those static video segments with little or no detected camera changes.
  • a relatively fast approximation method is used to detect camera motion based on optical flow (where optical flow refers to an apparent motion in a frame based on pixel displacements).
  • optical flow refers to an apparent motion in a frame based on pixel displacements.
  • This equation relates the optic flow vector, the image spatial gradient (obtained by the Sobel operator), and the image temporal gradient (obtained by subtracting temporally adjacent frames).
  • the motion of a border area is then defined as the average motion vector of all pixels in that area. For example, for each frame I and a pixel at (x, y), the motion vector v is solved for the value that minimizes Σ_(x,y)∈S G(x,y)·(∇I(x,y,t)·v + I_t(x,y,t))², where S is the sampled pixel's 3-pixel-by-3-pixel neighborhood, and G(x,y) is a Gaussian smoothing function.
  • the motion of each of the four border areas is the average motion vector of all pixels in that area.
  • the local content maxima (which correspond to frames with more video content than other frames) are determined.
  • the frame with minimal instructor occlusion is chosen by measuring the area of regions occluded by the instructor in each frame.
  • FIGS. 4-5 show two results from the application of an algorithm (using, e.g., computer 116 shown in FIG. 1) in accordance with one embodiment of the invention to the extraction of board summary images from a 17-minute video segment of an instructional video taken in a real, non-instrumented classroom.
  • FIG. 4 shows the fluctuation of the number of chalk or content pixels (reflecting the change of video content) as measured for the first 4,500 frames of the video segment.
  • Marks 402, 404, 406, and 408 in FIG. 4 show approximate positions of extracted summary (key) frames.
  • FIG. 5 shows the fluctuation of the area of the occlusion by the instructor region in the board area, measured in terms of number of blocks occluded by the instructor (and includes marks 502, 504, 506, and 508 respectively corresponding to marks 402, 404, 406, and 408 of FIG. 4).
  • the significance of knowing this fluctuation can be seen in connection with marks 406 and 506 of FIGS. 4 and 5, respectively.
  • FIG. 4 it can be seen that there are frames immediately to the left of mark 406 that include a greater number of chalk or content pixels.
  • mark 406 is selected as representative of a key frame because it corresponds to a frame having lesser occlusion by the instructor as indicated by corresponding mark 506 of FIG. 5.
  • Mosaic images are made based on the results of frame region segmentation. For each segment of video without camera movement, image registration is used to get a mosaic image, where the occlusion caused by the instructor is effectively removed. To further simplify the process, detected content blocks are compared and registered. In each segment of video content without camera motion, the mosaiced image is compared with selected key frames. If this mosaic image is not of better quality than one of the already-selected key frames which also has no instructor occlusion, the mosaic image is discarded.
  • semantic mosaicing of slides is also provided.
  • Semantic mosaicing begins with the extraction of content areas, using a method such as described above.
  • the text lines and figures in each frame are detected.
  • the text lines are detected in each frame by first binarizing the content area. It will be understood that any suitable thresholding technique as known in the art may be used to binarize the content area of each frame. Since the handwritten text lines on the board tend to have their own skew angles, standard methods for finding lines of scanned printed text by calculating a global skew angle may not be adequate. Therefore, in accordance with the invention, a progressive refinement approach may be used to search for text lines in the board content area of each frame.
  • the progressive refinement approach for searching for text lines includes placing a large bounding box over all the content area. Next, this top-level box is divided into adjacent bounding boxes if the top-level content can be separated by a horizontal line (because there is more than one line of text). Then, each of these smaller divisions is refined by searching for its minimal bounding box. The process recurs until no further separating horizontal lines are found (a minimal code sketch of this recursion appears after this list).
  • vertical lines may be used to further extract words for recognition and indexing. Figures can be bound in a similar manner to lines of text.
  • the developing text lines may be recorded in a linked list, together with their local skew angle and the coordinates and center point of their bounding boxes.
  • figures may be recorded in a linked list together with the coordinates and center point of their bounding boxes.
  • frame-to-frame movement of text lines is small (although occlusion by the instructor sometimes exaggerates the apparent movement of their bounding boxes).
  • the present frame includes n text lines, and m text lines have already been detected and stored from all prior frames for the current instructional segment.
  • For each present text line T_i with center point C_i, a search is performed for the nearest center point among the prior text lines. If the nearest center point is found to be C_j* (i.e., the center point of T_j*), an approximate spatial translation and rotation transform from the current text lines to the prior text lines can be easily computed from θ_i and θ_j*, the respective skew angles of T_i and T_j*. This transform is then applied to all other prior text lines, and every present line center is paired to its nearest corresponding transformed prior line center.
  • the matching quality induced by this transform is measured by computing a single global error measure, including the sum of an error measure for each pairing. This, in turn, is the sum of the absolute values of the differences of each of the four corresponding bounding box coordinates in the pairing.
  • the above matching process is then repeated n times, starting with each present text line, and the one that yields the lowest global error is chosen.
  • the box-to-box matches are then confirmed by scaling one to match the other, and by matching their contexts pixel-wise.
  • the contents of several virtual slides obtained in the manner described above may be stitched together to achieve a summary slide that covers the entire visual content of the video.
  • a summary slide can be used, for example, to index and summarize the video so as to allow efficient and effective cross-video browsing.
  • temporal development can also be detected.
  • This enables the reconstruction of the content of the original video on a frame-by-frame basis (using the cleaner, rectified, unoccluded, whole virtual mosaiced slides), by displaying only those portions of the slide that have been created by that frame and synchronizing them to any audio that accompanies the video.
  • This reconstructed video maintains (and enhances) the visual content (including, for example, text lines, figures, emphasis marks, etc.) of the original video, while irrelevant areas may be detected and discarded.
  • a semantic teaching unit (e.g., of instructional videos) is defined herein as a temporal-spatial unit related to one topic of teaching content.
  • Key frames, and even mosaiced images are inadequate for this purpose, as they are units which are only based on spatial location of video content.
  • key frames and mosaiced images are simple spatial constructs that may contain more than one teaching topic, or only a portion of a teaching topic (e.g., when one teaching topic occupies several key frames).
  • semantic teaching units are based on teaching content, so they are more natural for students to view, and may aid students in understanding instructional videos. Summarizing and indexing instructional videos based on these units (which are more semantically related to the teaching purpose than spatial units (key frames) or temporal shots (video shots)) is thus more useful in practice.
  • a semantic teaching unit is characterized as a series of frames (not necessarily contiguous) displaying a certain sequence of instructor actions within the spatial neighborhood of a developing region of written content.
  • the visual content of one semantic teaching unit tends to have content pixels around a compact area in a board, and usually lasts for a substantial fraction of the video.
  • S represent the action of instructor starting a new topic by writing on board area which is not close to areas of other teaching units
  • W represent the action of the instructor writing one or more lines of content on the same topic
  • E represent the action of the instructor emphasizing teaching content by adding symbols, underlining, or adding few words
  • T represent the action of the instructor talking, either on or off camera.
  • the grammar of a semantic teaching unit may be defined as ST((W∪E)T)^n. That is, a semantic teaching unit usually starts with the instructor initiating a new topic by writing on the board (S), followed by discussion involving a talk or explanation (T), and is then usually followed by actions of writing (W) or emphasis (E) alternating with talking (T).
  • one semantic teaching unit may be ST ET WT WT (a realization of the grammar ST((W∪E)T)^n).
  • the instructor's actions first need to be classified into starting (S), writing (W), emphasizing (E), or talking (T). These categories are recognized by measuring the area occluded by the instructor, and by detecting the instructor's face. Since the region occluded by the instructor within the board area has already been segmented, the head region of the instructor can be detected and it can be determined whether the instructor is facing the board or the audience (and camera). In particular, if there is a large amount of skin tone color in the head region, the instructor is facing the audience (and camera). Otherwise, the instructor is facing the board. The actions of the instructor can be detected (classified) according to the steps of the flow chart shown in FIG. 6.
  • step 602 there is a check to see if an instructor region is found. If it is determined at step 604 that an instructor region is not found (e.g., because the instructor is out of the camera's view of the board), at step 606, the video segment is classified as talking (T).
  • step 604 If it is determined at step 604 that there is an instructor region, then it is determined at step 608 whether the instructor is facing the board. If the instructor is determined not to be facing the board at step 608, then at step 610 (and similar to step 606), the video segment is classified as talking (T).
  • step 612 it is determined whether there is an increase in content area after a specified (e.g., predetermined) period of time. If it is determined at step 612 that there is not an increase in content area after the specified period of time, at step 614, the video segment is again classified as talking (T).
  • step 612 Assuming it is determined at step 612 that there is an increase in content after the specified period of time, there is one of two possibilities. Namely, if at step 616 it is determined that the newly added content area is not close to any other content area, the action is classified as the starting of a new teaching unit (S) at step 618. Otherwise, if the newly added content area is determined to be close to another content area at step 616, at step 620, the action is classified as either writing (W) or emphasizing (E), depending on the relative increase in content and the relative amount of time spent facing the board (the action of emphasizing (E) is characterized by a shorter duration of action compared to writing (W)). As persons versed in the art will appreciate, recognizing emphasizing action (E) is important not only in recognizing teaching units, but also in locating emphasized topics in instructional videos. (A minimal code sketch of these classification steps, together with a check against the teaching-unit grammar, appears after this list.)
  • the detection of instructor actions takes into account the likelihood that semantic teaching units tend to concentrate on one area of the board. For example, the boundaries of each content region can be measured, and if there are two content regions relatively far away from each other, they may be considered as separate semantic teaching units. There may also be a restriction on the duration of semantic teaching units. For example, according to various embodiments, a semantic teaching unit cannot be too long or too short (e.g., must fall within a predetermined temporal range). These and other criteria may all be combined to assist in the detection of semantic teaching units in accordance with the principles of the present invention.
  • one teaching unit may follow another teaching unit in time (due, e.g., to its nature or content), or there may not be a required or desired temporal relationship between the two.
  • teaching topics can be detected at least in part based on the temporal relation among different semantic teaching units.
  • one teaching unit may refer to another (although it is not required to do so).
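The progressive-refinement search for text lines described above lends itself to a short recursive sketch. The following Python fragment is only one plausible reading of that description: the binarized content mask, the row-projection split test, and the helper names are assumptions made for illustration, not the patented implementation.

```python
import numpy as np

def split_text_lines(mask):
    """Hypothetical sketch of the progressive-refinement text-line search.
    'mask' is a binarized content area (True where chalk). The top-level
    bounding box is divided wherever an all-background horizontal line
    separates content, and each division is refined recursively until no
    further separating lines are found."""
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return []
    top, bottom = ys.min(), ys.max()
    left, right = xs.min(), xs.max()
    rows = mask[top:bottom + 1, left:right + 1].any(axis=1)

    # Find maximal runs of content rows, separated by empty rows.
    boxes, start = [], None
    for r, has_content in enumerate(rows):
        if has_content and start is None:
            start = r
        elif not has_content and start is not None:
            boxes.append((top + start, top + r - 1, left, right))
            start = None
    if start is not None:
        boxes.append((top + start, top + len(rows) - 1, left, right))

    if len(boxes) == 1:
        return boxes                     # no separating line: recursion stops
    # Refine each division to its own minimal bounding box, recursively.
    refined = []
    for (t, b, l, r) in boxes:
        sub = np.zeros_like(mask)
        sub[t:b + 1, l:r + 1] = mask[t:b + 1, l:r + 1]
        refined.extend(split_text_lines(sub))
    return refined
```

Per-line skew angles and the linked-list bookkeeping mentioned above are omitted from this sketch for brevity.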
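Similarly, the FIG. 6 action-classification steps and the ST((W∪E)T)^n teaching-unit grammar can be sketched as below. This is a hypothetical rendering: the threshold values, the facing-time heuristic for separating writing from emphasis, and the regular-expression form of the grammar are assumptions layered on the description, not details taken from the patent.

```python
import re

def classify_action(instructor_found, facing_board, content_increase,
                    near_existing_content, facing_time, increase_ratio):
    """Sketch of the FIG. 6 decision steps; the inputs are assumed to come
    from the segmentation and face-detection results described above."""
    if not instructor_found:          # step 604: instructor off camera
        return "T"
    if not facing_board:              # step 608: facing the audience
        return "T"
    if not content_increase:          # step 612: no new content appears
        return "T"
    if not near_existing_content:     # step 616: far from other content
        return "S"                    # start of a new teaching unit
    # step 620: writing vs. emphasizing, judged by the relative amount of
    # new content and time spent facing the board (emphasis is shorter).
    return "W" if increase_ratio > 0.1 or facing_time > 5.0 else "E"

# A classified action sequence can then be checked against the grammar
# ST((W|E)T)^n (assuming n >= 1) with an ordinary regular expression:
TEACHING_UNIT = re.compile(r"^ST(?:[WE]T)+$")
assert TEACHING_UNIT.match("STETWTWT")   # the example unit from the text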

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Closed-Circuit Television Systems (AREA)
  • Image Analysis (AREA)

Abstract

Methods and systems are provided for analyzing and summarizing instructional videos. According to certain embodiments, video frames are segmented into board background blocks, board content blocks, and irrelevant blocks, and are subsequently merged into a board background region, board content region, occlusion by instructor region, and irrelevant region. The number of chalk pixels in the content areas of each frame is used to define key frames. Mosaiced images are developed to compensate for occlusions by an instructor. Additionally, the concept of semantic teaching units is presented, where such semantic units are detected by a model based on the recognition of actions of instructors, and on the measurement of temporal duration and spatial location of board content. Various alternate embodiments are also disclosed.

Description

METHODS AND SYSTEMS FOR ANALYZING AND SUMMARIZING VIDEO
Cross-Reference to Related Application
[0001] This application claims the benefit of United States Provisional Patent Application No. 60/538,374, filed January 21, 2004, which is hereby incorporated by reference herein in its entirety.
Statement Regarding Federally Sponsored Research or Development
[0002] The invention was made with United States Government support under NSF contract EIA-00-71954, and the United States Government may have certain rights in the invention.
Field of the Invention
[0003] Generally speaking, the present invention relates to the field of instructional videos. More particularly, this invention relates to methods and systems for analyzing and summarizing instructional videos.
Background of the Invention
[0004] Two levels at which a video can be analyzed or understood include a syntactic level and an event level. Analyzing on a syntactic level (using, e.g., structural pattern analysis) refers to video analysis based on syntactic (low-level image) structures of videos, rather than content. At this level, the structures of videos are examined to determine the meaning and relationship between image objects. For example, assuming that a frame of video includes one or more sentences of text, a consideration of the text on a syntactic level includes reading and evaluating the video frame at a level of its letters, not at a level of the concepts being presented by the letters. Understanding on an event level, on the other hand, refers to an understanding based on video content, rather than how the video is captured or edited.
[0005] In the past, efforts relating to the analysis and summarization of video images have concentrated on highly structured and often commercially edited videos. The explicit and implicit rules of video construction have been a great aid in the analysis and summarization of these videos using syntactic structures found in these videos. However, approaches to analyzing and summarizing video content based on syntactic structures have several restrictions. For example, while syntactic structures aid in understanding videos with strong narrative structure (e.g., video shots that are organized to form a story line in movies), the diversity of editors and editing styles makes an accurate analysis of video content more difficult. Moreover, in completely unedited videos (e.g., some distance instructional videos or home videos having relatively low production quality), or semi-edited videos (e.g., real-time sports videos), purely syntactic approaches are likely to be ineffective due to the lack of syntactic structures within these videos.
[0006] Accordingly, it is desirable to provide new methods and systems for analyzing and summarizing instructional videos that are effective even with unedited or only semi-edited videos, such as instructional videos of board presentations that are taken from real classrooms.
Summary of the Invention
[0007] Methods and systems are provided for analyzing and summarizing instructional videos. According to certain embodiments of the invention, low-level image processing techniques are used to segment frames into board background blocks, board content blocks, and irrelevant blocks. These blocks are subsequently merged into four region types, namely, board background, board content, occlusion by instructor, and irrelevant regions. Using the number of chalk (or other form of writing) pixels in the content areas of each frame, the fluctuation of video content can be measured, where local maxima in the number of chalk pixels are used to define key frames. Mosaiced images are also developed in certain embodiments to compensate for the fact that some video content may not appear in some frames due to occlusion by the instructor. The concept of semantic teaching units is also presented, and these units are detected by a model based on the recognition of actions of instructors, and on the measurement of temporal duration and spatial location of board content.
[0008] In one embodiment, the invention provides a method for processing a video of a board presentation, wherein the video comprises a plurality of board image frames, the method comprising the steps of: modeling the color distribution of the board; dividing the board image of at least some frames into a plurality of blocks; for the at least some frames, determining which blocks are board content blocks; detecting the amount of combined content in the content blocks for the at least some frames; and selecting key frames from the video based at least in part on the detection of the total number of content chalk pixels in the content blocks for the at least some frames.
[0009] In a second embodiment, the invention provides a system for processing a video of a board presentation, wherein the video comprises a plurality of board image frames, the system comprising: a projection surface that is adapted to display writing; at least one detection device that captures the video; and a processor coupled to the at least one detection device that models the color distribution of the board, divides the board image of at least some frames into a plurality of blocks, determines which blocks are board content blocks for the at least some frames, detects the total number of content chalk pixels in the content blocks for the at least some frames, and selects key frames from the video based at least in part on the detection of the total number of content chalk pixels in the content blocks for the at least some frames.
[0010] In a third embodiment, the invention provides a system for processing a video of a board presentation, wherein the video comprises a plurality of board image frames, the system comprising: a processor that: models the color distribution of the board; divides the board image of at least some frames into a plurality of blocks; determines which blocks are board content blocks for the at least some frames; detects the total number of content chalk pixels in the content blocks for the at least some frames; and selects key frames from the video based at least in part on the detection of the total number of content chalk pixels in the content blocks for the at least some frames.
Brief Description of the Drawings
[0011] Additional embodiments of the invention, its nature and various advantages, will be more apparent upon consideration of the following detailed description, taken in conjunction with the accompanying drawings, in which like reference characters refer to like parts throughout, and in which:
[0012] FIG. 1 is a simplified illustration of one embodiment of a board-camera system in which certain embodiments of the present invention may be implemented; [0013] FIG. 2 is a simplified flow chart illustrating the steps performed in determining the classification of a given block in accordance with the principles of certain embodiments of the present invention;
[0014] FIG. 3 is a simplified illustration showing one frame of video that has been segmented into different regions in accordance with the principles of certain embodiments of the present invention;
[0015] FIG. 4 is a graph showing the fluctuation of the number of chalk or content pixels as measured for the first 4,500 frames of a sample video segment in accordance with the principles of certain embodiments of the present invention;
[0016] FIG. 5 is a graph showing the fluctuation of the area of the occlusion by the instructor region in the board area as measured for the first 4,500 frames of a sample video segment in accordance with the principles of certain embodiments of the present invention; and
[0017] FIG. 6 is a simplified flow chart illustrating the steps performed in determining the actions of the instructor in accordance with the principles of certain embodiments of the present invention.
Detailed Description of the Invention
[0018] The following description includes many specific details. The inclusion of such details is for the purpose of illustration only and should not be understood to limit the invention. Moreover, certain features which are well known in the art are not described in detail in order to avoid complication of the subject matter of the present invention. In addition, it will be understood that features in one embodiment may be combined with features in other embodiments of the invention.
[0019] Methods and systems are provided for analyzing and summarizing, for example, instructional videos of unrehearsed and unedited blackboard (board)-presentations. While the description below focuses on videos of board presentations in order to illustrate various principles of the present invention, it will be understood that the invention is also applicable to other types of videos. For example, the methods and systems described herein may be used to analyze and summarize instructional videos of computer presentations where successive hand- written slide images are projected onto a screen or wall, or instructional videos of marker board presentations where an instructor uses a marker, rather than chalk, to write on the board. Moreover, while it is assumed herein that instructional videos are produced by lightly trained individuals who do poor camerawork and little or no editing, the invention is not limited to this class of instructional videos.
[0020] FIG. 1 is a simplified illustration of one embodiment of a board-camera system 100 in which the present invention may be implemented. As shown, system 100 includes a video camera 104 for capturing video of board 108 (including, e.g., chalk content 112). For example, camera 104 may be used to capture video of board 108 while an instructor (not shown) is presenting instructional material to an audience (e.g., students). While only a single camera 104 is shown in FIG. 1, it will be understood that the invention is not limited in this manner. It will also be understood that camera 104 may be any suitable type of video camera, and that board 108 may be any suitable type of board on which an instructor may write (e.g., with a marker, chalk, etc.). Additionally, it should be noted that, according to various embodiments of the present invention, camera 104 may be replaced by any suitable type of detection device that is capable of detecting writing on (or the projection of writing onto) board 108 or any other suitable type of surface (e.g., a wall, screen, etc.).
[0021] System 100 also includes a computer 116 and a monitor 120 connected thereto, and although not shown in FIG. 1, may further include one or more input devices (e.g., a keyboard and mouse) connected to computer 116 via connection section 124. Computer 116 may be, for example, a DELL PRECISION computer with an INTEL PENTIUM processor (1.8 GHz) and 1 Gb of SDRAM memory. Video to be analyzed and summarized according to the invention by computer 116 is provided by camera 104 (not necessarily while the video is in the process of being captured, assuming that camera 104 has some form of media on which to store video content prior to transfer to computer 116). The video from camera 104 may be captured by computer 116 using any suitable type of video card 128. It will be understood that, in accordance with the principles of the present invention, the image processing described below may be occurring in a processor that is, for example, the central processing unit (CPU) of computer 116.
[0022] As will be appreciated by persons versed in the art, there are generally many difficulties involved in the processing of unedited videos such as instructional videos taken from real classrooms using, for example, camera 104 of FIG. 1. For example, a board that is the central focus of an instructional video may have multiple (potentially sliding) panels, where an instructor may choose to alternate the use of these panels for writing instructional material. Additionally, there may be changes in lighting conditions (e.g., as seen when changing from one camera to another camera when multiple cameras are being used). Moreover, the camera(s) being used may be subject to frequent (and often unintended) motion, particularly when being controlled by a cameraman instead of being placed in a stationary position, or when the camera needs to zoom in to the board in order to maintain adequate resolution (at which time it is possible that only a portion of the board is being captured by the camera). Similarly, an instructor is obviously prone to movement, and is likely to occlude the board image at least from time to time. There may also be spatial and temporal inconsistencies in the board background color, due to the wear of the board and the trail or residue due to the accumulation of partially erased chalk. The methods and systems described in greater detail below for analyzing and summarizing videos may be used to overcome some or all of these difficulties in analyzing and summarizing videos, such as instructional videos.
[0023] The non-uniform color distribution on a board mentioned above (which may be the result of, for example, chalk residue) generally precludes the use of straight-forward models that assume board area to have a uniform background color. Therefore, in accordance with the principles of the present invention, the color distribution of board background is modeled, and this model is used to segment video content areas.
[0024] Considering the color of a board background pixel as a combined effect of the board background and the trail of erased chalk characters, the color of each pixel is modeled as
{r, g, b} = {r̄ + k·r′, ḡ + k·g′, b̄ + k·b′}, where {r̄, ḡ, b̄} represents the color of the board, {r′, g′, b′} represents the color of chalk, and k represents how much chalk has been left on that pixel (sometimes unintentionally through accumulated erasures). Given that, for chalk, r′ ≈ g′ ≈ b′ when the chalk is white, the pixel color model provided above is simplified to {r, g, b} = {r̄ + n, ḡ + n, b̄ + n}. However, each pixel differs in the effects of chalk erasure. Based on experimental observations and empirical data, n is modeled as a binomial distribution, and the real board color {r̄, ḡ, b̄} and the parameters of the distribution of n are estimated by processing a single frame iteratively in the manner set forth below.
[0025] Initially, a block-based process is used to estimate the parameters of the unchalked board background. In particular, for each video frame, the frame is divided into blocks. According to various embodiments, the frame is divided into blocks of size 16 by 16 pixels. For all blocks having edge pixels (as determined using a standard edge detection method) with color intensities less than a predefined threshold (where the threshold may be obtained by first making a coarse estimation of board color by searching the dominant color of the frame using a histogram), the average color of these blocks is clustered. The main cluster is then obtained by iteratively discarding outliers (i.e., those blocks having a higher average color value) with regard to the determined average color of all these blocks having edge pixels with color intensities less than the predefined threshold. This iterative process may be repeated a certain number of times or until a threshold amount of outliers has been removed. All blocks in the main cluster are considered to be board background blocks, where the blocks with less standard deviation of color are those less affected by chalk dust or residue. The block with the least color variance in this cluster is assumed to be the one least affected by chalk pixels, and the average color {r̄, ḡ, b̄} of that block is selected as an estimation of board background color. All pixels in board background blocks are used to estimate the average and standard deviation of k, i.e., μ and σ. It should also be noted that, since board color {r̄, ḡ, b̄} depends on lighting conditions, each time a change of camera, camera direction, camera zoom, etc. (a new shot) is detected, the board color is re-estimated.
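To make the block-based estimation of paragraph [0025] concrete, the following Python sketch shows one plausible reading of it. The block size, histogram binning, edge measure, and outlier-rejection loop are illustrative assumptions, not the claimed implementation.

```python
import numpy as np

def estimate_board_color(frame, block=16, iters=5):
    """Minimal sketch of the block-based board-background estimation of
    paragraph [0025]; all thresholds are illustrative."""
    h, w, _ = frame.shape
    gray = frame.mean(axis=2)
    # Coarse board-color estimate: dominant gray level of the frame (histogram peak).
    hist, edges = np.histogram(gray, bins=64)
    coarse = edges[np.argmax(hist)]

    candidates = []  # (average color, color std) for each low-edge block
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            b = frame[y:y + block, x:x + block].astype(float)
            g = gray[y:y + block, x:x + block]
            # Simple gradient energy as a stand-in for a standard edge detector.
            edge = np.abs(np.diff(g, axis=0)).sum() + np.abs(np.diff(g, axis=1)).sum()
            if edge / block ** 2 < 0.1 * coarse:          # "almost no edges"
                candidates.append((b.reshape(-1, 3).mean(0), b.std()))
    if not candidates:
        return None

    colors = np.array([c for c, _ in candidates])
    keep = np.ones(len(colors), dtype=bool)
    for _ in range(iters):
        # Iteratively discard outlier blocks (here, by distance to the running mean).
        mean = colors[keep].mean(axis=0)
        dist = np.linalg.norm(colors - mean, axis=1)
        keep &= dist < dist[keep].mean() + 2 * dist[keep].std()

    # The kept block with the least color variance is taken as the cleanest sample.
    stds = np.array([s for _, s in candidates])
    stds[~keep] = np.inf
    board_rgb = colors[np.argmin(stds)]
    # mu and sigma (the chalk-residue parameters) would then be estimated from all
    # pixels of the kept background blocks; that step is omitted here for brevity.
    return board_rgb
```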
[0026] Based on the board model, the board background color {r̄, ḡ, b̄} and the value of chalk parameters μ and σ are estimated in successor frames of the same shot, and blocks of each frame are classified into one of three categories: board background blocks, board content blocks, and irrelevant blocks. While using a block-based process such as described in greater detail below provides potentially less accurate results when compared to using a pixel-based process (whereby each pixel is independently analyzed), such a process greatly reduces computation complexity and produces reasonable results based on empirical analyses.
[0027] FIG. 2 is a simplified flow chart illustrating the steps performed in determining the classification of a given block in accordance with the principles of the present invention. First, at step 202, a block color test is performed to determine whether the average color of the block is similar to that of a background block. For example, assuming that the average color of the block is {R, G, B}, the block is tested to see if the following conditions are satisfied: |R - r̄| < μ + aσ and |G - ḡ| < μ + aσ and |B - b̄| < μ + aσ (e.g., where a = 2).
Obviously, the similarity of the block to the background block can be determined in other ways (e.g., the median could be used instead of the average, etc.).
[0028] If this first test is satisfied at step 202, the block resembles a board background block. However, this does not automatically mean that all (or even a majority) of the pixels in the block are board background pixels. Therefore, at step 204, a pixel color test is performed, whereby each pixel {r, g, b} in the block is tested. For example, a pixel may be considered to be a board background pixel if the following conditions are met: |r - r̄| < μ + aσ and |g - ḡ| < μ + aσ and |b - b̄| < μ + aσ (e.g., where a = 2). The ratio of board background pixels to all pixels in a block is then calculated, and the ratio is compared to a predefined threshold value.
[0029] At step 206, an edge point test is used to determine whether the number of edge points in the block is less than a predefined threshold. In general, background blocks will contain almost no edge points.
[0030] At step 208, it is determined whether the block has passed (satisfied) all three of the above described tests (corresponding to steps 202-206). If the block has passed all three of these tests, it is determined to be a board background block at step 210.
[0031] If the block has not passed all three of the above described tests, it is determined at step 212 whether the block has passed (satisfies) the first two tests (corresponding to steps 202 and 204). If it has passed the first two tests, the block is determined to be a board content block at step 214.
[0032] If it is determined at step 212 that the block has not passed the first two tests, the block is determined to be an irrelevant block at step 216 (e.g., a block that corresponds to portions of the board that are obstructed or otherwise not suitable for writing).
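The three tests of FIG. 2 (steps 202 through 216) can be combined into a single classification routine. The sketch below is a hypothetical rendering of those steps; the constant a = 2 follows the description, while the background-pixel ratio, gradient, and edge-count thresholds are placeholders.

```python
import numpy as np

def classify_block(block, board_rgb, mu, sigma, a=2.0,
                   bg_ratio_thresh=0.5, grad_thresh=30, edge_thresh=20):
    """Sketch of FIG. 2: returns 'background', 'content', or 'irrelevant'."""
    block = block.astype(float)
    thr = mu + a * sigma

    # Test 1 (step 202): block average color close to the board color.
    avg = block.reshape(-1, 3).mean(axis=0)
    test1 = np.all(np.abs(avg - board_rgb) < thr)

    # Test 2 (step 204): enough individual pixels close to the board color.
    diffs = np.abs(block - board_rgb)
    bg_pixels = np.all(diffs < thr, axis=2)
    test2 = bg_pixels.mean() > bg_ratio_thresh

    # Test 3 (step 206): almost no edge points inside the block
    # (a simple gradient-magnitude count stands in for a standard edge detector).
    gray = block.mean(axis=2)
    edge_points = (np.abs(np.diff(gray, axis=0)) > grad_thresh).sum() \
                + (np.abs(np.diff(gray, axis=1)) > grad_thresh).sum()
    test3 = edge_points < edge_thresh

    if test1 and test2 and test3:   # step 210
        return "background"
    if test1 and test2:             # step 214
        return "content"
    return "irrelevant"             # step 216
```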
[0033] Once all blocks have been classified in the manner described above (using the steps described in connection with the flow chart of FIG. 2), they are merged into four area or region types: board background region, board content region, occlusion by instructor(s) region, and irrelevant region. Since, as explained above, there may be more than one panel in a video frame, a color similarity least squares method (e.g., as described in "Analysis and Enhancement of Videos of Electronic Slide Presentations," by Tiecheng Liu et al., International Conference on Multimedia and Expo, 2002, which is hereby incorporated by reference herein in its entirety) is first used to detect the boundaries of panels. The main panel is characterized by having the largest amount of board area. Within the main panel, all board background blocks are merged to form the board background region, all board content blocks are merged to form the board content region, all irrelevant blocks are merged to form the occlusion by instructor region, and all other blocks are merged to form the irrelevant region. The area of the occlusion by instructor region is recorded, and the position of the instructor is also estimated by the center of mass (or centroid) of the occlusion by instructor region. FIG. 3 shows one frame of video that has been segmented into different regions, where the region of board content has been determined, and the position of the instructor has been estimated. In particular, the frame shown in FIG. 3 includes board region 304, region occluded by instructor 308, and irrelevant region 312. Position 314 shown in FIG. 3 is the estimated center of mass of the instructor. Moreover, board region 304 shown in FIG. 3 generally includes both board content blocks (which includes teaching content, such as text 318) and board background blocks.
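As a small illustration of the region merging of paragraph [0033], the instructor position can be estimated as the centroid (center of mass) of the blocks classified as irrelevant inside the main panel, which together form the occlusion-by-instructor region. The sketch below assumes a 2-D array of per-block labels restricted to the main panel; the block size and label names are illustrative.

```python
import numpy as np

def instructor_centroid(labels, block=16):
    """Sketch: estimate the instructor position as the centroid of the blocks
    labelled 'irrelevant' within the main board panel (paragraph [0033])."""
    ys, xs = np.nonzero(labels == "irrelevant")
    if len(ys) == 0:
        return None                       # instructor not visible on the board
    # Convert block indices back to pixel coordinates of the block centres.
    cy = (ys.mean() + 0.5) * block
    cx = (xs.mean() + 0.5) * block
    return cx, cy
```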
[0034] After segmenting a frame into four classes of regions as described above, content chalk pixels are detected from the board content area. For each pixel {r, g, b} of the board content area, it is determined whether the pixel is a chalk pixel. In particular, a pixel may be considered to be a chalk pixel if its color is greater than the board background color by a threshold amount. For example, a pixel may be a chalk pixel if the following conditions are met: r - r̄ > μ + aσ and g - ḡ > μ + aσ and b - b̄ > μ + aσ (e.g., where a = 2). The number of total chalk pixels in the board content region is determined, and this number is used as a heuristic measurement of content (given that changes in the number of chalk pixels represent changing video content). A function is then obtained that relates the total number of chalk pixels to the frame number. Local maxima in the function correspond to frames that contain more chalk pixels, and thus more video content. By searching these maxima and tracking the movements of camera and instructor, key frames can be extracted from the video (and combined to provide a summary of the video) in the manner set forth below.
[0035] Prior to the detection of key frames, camera motion is first detected. It is noted that a change of the value of the content function may be caused not only by the addition or removal of actual content, but also by camera motion or by camera zoom. For example, camera zoom-in may cause a significant increase in the number of chalk pixels even when the content on the board has not changed. Similarly, if an instructor ceases to obstruct a content portion of the board, the number of chalk pixels detected may increase, even though the actual content on the board has not changed. Accordingly, comparing the values of the content function over a whole video fails to provide an accurate indication of actual board content. Board content is most reliably detected within those static video segments with little or no detected camera changes.
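A minimal version of the chalk-pixel content measure of paragraph [0034] might look as follows; the content-region mask and the margin test are written as assumptions consistent with the conditions above, and the function names are hypothetical.

```python
import numpy as np

def chalk_pixel_count(frame, content_mask, board_rgb, mu, sigma, a=2.0):
    """Sketch of the content measure of paragraph [0034]: count pixels of the
    board-content region that are brighter than the board colour by a margin."""
    diff = frame.astype(float) - board_rgb          # positive where brighter than board
    chalk = np.all(diff > mu + a * sigma, axis=2) & content_mask
    return int(chalk.sum())

# The content function simply maps frame number to this count, e.g.:
# content = [chalk_pixel_count(f, mask_of(f), board_rgb, mu, sigma) for f in frames]
```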
[0036] In connection with instructional videos, it is assumed that most camera motions are pans, some are zooms, and very few are tilts. Thus, according to the invention, a relatively fast approximation method is used to detect camera motion based on optical flow (where optical flow refers to an apparent motion in a frame based on pixel displacements). Using four border regions along the four edges of a frame, for each pixel in a border area, a least squares solution (e.g., as described in "The Computation of Optical Flow," by Beauchemin and Barron, ACM Computing Surveys, pages 433-467, 1995, which is hereby incorporated by reference herein in its entirety) to the local smoothness constraint optical flow equation is determined. This equation relates the optical flow vector to the image spatial gradient (obtained by the Sobel operator) and the image temporal gradient (obtained by subtracting temporally adjacent frames). The motion of a border area is then defined as the average motion vector of all pixels in that area. For example, for each frame I and each pixel at (x, y), the flow vector v is solved for as the value that minimizes the expression Σ_{(x,y)∈S} G(x,y)·(∇I(x,y,t)·v + I_t(x,y,t))², where S is the sampled pixel's 3 pixel by 3 pixel neighborhood and G(x,y) is a Gaussian smoothing function.
[0037] Because most motions in instructional videos of board presentations are motions of instructors, who rarely occupy the entire area of a frame, when at least one border area is determined to have almost no motion, the frame is determined to be static. That is, only when all four border areas show large motion vectors is it determined that the camera is in motion. This method is sufficiently accurate for most (if not all) intended purposes, and reduces the computation time taken for camera motion detection. Detection of movements of instructors (where the center of the region occluded by the instructor is chosen as the approximate position of the instructor), on the other hand, is based on the results of the board segmentation.
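A simplified sketch of the border-area motion test of paragraphs [0036]-[0037] is shown below; it substitutes OpenCV's dense Farnebäck optical flow for the least-squares formulation above, and the border width and motion threshold are illustrative assumptions.

import cv2
import numpy as np

def camera_is_static(prev_gray, curr_gray, border=20, motion_thresh=0.5):
    """Return True if the frame is considered static: at least one of the four
    border strips shows almost no average motion."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=-1)              # per-pixel motion magnitude
    strips = [mag[:border, :], mag[-border:, :],     # top and bottom borders
              mag[:, :border], mag[:, -border:]]     # left and right borders
    # Only when all four border areas show large motion is the camera moving.
    return any(float(s.mean()) < motion_thresh for s in strips)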
[0038] In order to detect key frames, for each segment of the content function that is without camera movement (as determined in the manner described above), the local content maxima (which correspond to frames with more video content than other frames) are determined. In the case that several frames have substantially the same (or at least similar) content value, according to at least one embodiment, the frame with minimal instructor occlusion is chosen by measuring the area of the regions occluded by the instructor in each frame.
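The key-frame rule of paragraph [0038] can be sketched as follows, assuming per-frame lists of content values and occluded-area measurements together with a list of static segments; the tolerance used to decide that two content values are "substantially the same" is a hypothetical parameter.

def select_key_frames(content, occlusion, static_segments, tolerance=0.02):
    """For each static segment, take the frame at the local maximum of the
    content function; among frames with nearly the same content value, prefer
    the one with the smallest area occluded by the instructor."""
    key_frames = []
    for start, end in static_segments:               # [start, end) frame indices
        segment = range(start, end)
        best = max(segment, key=lambda i: content[i])
        near_best = [i for i in segment
                     if content[i] >= content[best] * (1.0 - tolerance)]
        key_frames.append(min(near_best, key=lambda i: occlusion[i]))
    return key_frames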
[0039] As an illustration, FIGS. 4-5 show two results from the application of an algorithm (using, e.g., computer 116 shown in FIG. 1) in accordance with one embodiment of the invention to the extraction of board summary images from a 17-minute video segment of an instructional video taken in a real, non-instrumented classroom. In particular, FIG. 4 shows the fluctuation of the number of chalk or content pixels (reflecting the change of video content) as measured for the first 4,500 frames of the video segment. Marks 402, 404, 406, and 408 in FIG. 4 show approximate positions of extracted summary (key) frames. FIG. 5, on the other hand, shows the fluctuation of the area of the occlusion by the instructor region in the board area, measured in terms of the number of blocks occluded by the instructor (and includes marks 502, 504, 506, and 508 respectively corresponding to marks 402, 404, 406, and 408 of FIG. 4). The significance of knowing this fluctuation can be seen in connection with marks 406 and 506 of FIGS. 4 and 5, respectively. In particular, referring to FIG. 4, it can be seen that there are frames immediately to the left of mark 406 that include a greater number of chalk or content pixels. However, mark 406 is selected as representative of a key frame because it corresponds to a frame having less occlusion by the instructor, as indicated by corresponding mark 506 of FIG. 5.
[0040] In instructional videos of board presentations, even if the camera does not move, there may be some video content not shown in any single frame due to often inevitable occlusion by instructors. Therefore, in accordance with the principles of the present invention, for each static video segment, board content is pieced together as a semantic mosaic into board content images. It should be noted that, even if selected key frames contain all video content, a mosaic summary image obtained in accordance with the invention may nonetheless provide cleaner semantic summary images or may be used for compression purposes.
[0041] Mosaic images are made based on the results of frame region segmentation. For each segment of video without camera movement, image registration is used to get a mosaic image, where the occlusion caused by the instructor is effectively removed. To further simplify the process, detected content blocks are compared and registered. In each segment of video content without camera motion, the mosaiced image is compared with selected key frames. If this mosaic image is not of better quality than one of the already-selected key frames which also has no instructor occlusion, the mosaic image is discarded.
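A minimal sketch of piecing board content together for one static segment is given below; it assumes frames within a static segment are already aligned (so explicit image registration is omitted) and that a per-frame occlusion mask is available, which is a simplification of the registration-based mosaicing described above.

import numpy as np

def mosaic_segment(frames, occlusion_masks):
    """Build an unoccluded board image for a static segment by overwriting each
    pixel with its most recent observation that is not occluded by the instructor."""
    mosaic = frames[0].copy()
    for frame, occluded in zip(frames, occlusion_masks):
        visible = ~occluded              # pixels not covered by the instructor
        mosaic[visible] = frame[visible]
    return mosaic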
[0042] In accordance with the principles of the present invention, semantic mosaicing of slides is also provided. Semantic mosaicing begins with the extraction of content areas, using a method such as described above. Next, the text lines and figures in each frame are detected. The text lines are detected in each frame by first binarizing the content area. It will be understood that any suitable thresholding technique as known in the art may be used to binarize the content area of each frame. Since the handwritten text lines on the board tend to have their own skew angles, standard methods for finding lines of scanned printed text by calculating a global skew angle may not be adequate. Therefore, in accordance with the invention, a progressive refinement approach may be used to search for text lines in the board content area of each frame.
[0043] For example, in a given frame, the progressive refinement approach for searching for text lines includes placing a large bounding box over all the content area. Next, this top-level box is divided into adjacent bounding boxes if the top-level content can be separated by a horizontal line (because there is more than one line of text). Then, each of these smaller divisions is refined by searching for its minimal bounding box. The process recurs until no further separating horizontal lines are found. Similarly, although not described in detail herein, it will be understood that vertical lines may be used to further extract words for recognition and indexing. Figures can be bound in a similar manner to lines of text.
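The progressive refinement search of paragraph [0043] might be sketched as the following recursive routine over a binarized content area; the use of a row projection to find separating horizontal lines and the (top, bottom, left, right) box representation are illustrative assumptions.

import numpy as np

def find_text_lines(binary, top=0, bottom=None):
    """Recursively split the content area into text-line bounding boxes by
    searching for blank horizontal separators inside the current box."""
    if bottom is None:
        bottom = binary.shape[0]
    rows = binary[top:bottom].sum(axis=1)              # horizontal (row) projection
    ink = np.flatnonzero(rows > 0)
    if ink.size == 0:
        return []
    t, b = top + int(ink[0]), top + int(ink[-1]) + 1   # minimal bounding rows
    gaps = np.flatnonzero(rows[ink[0]:ink[-1]] == 0)   # blank rows inside the box
    if gaps.size == 0:                                 # no separating horizontal line left
        cols = np.flatnonzero(binary[t:b].sum(axis=0) > 0)
        return [(t, b, int(cols[0]), int(cols[-1]) + 1)]
    split = top + int(ink[0]) + int(gaps[0])           # first blank row inside the box
    return find_text_lines(binary, t, split) + find_text_lines(binary, split, b)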
[0044] Once detection is complete, as will now be explained, the text lines and figures are tracked and matched using linked lists that record the bounding box parameters, and the bounding boxes are mosaiced together into virtual slides. First, it should be noted that, during the process of mosaicing, the non-monotonic nature of instructional video text (i.e., an actual increase in video text does not always correspond to an increase in detected video text) must be accounted for. This is because, in general, content increases as presentations progress, but it may also appear to decrease due to occlusions by the instructor. However, it is possible to determine the number of chalk pixels in each frame, and to use this number as a heuristic approximation to detect the beginning of a new instructional segment (e.g., when the instructor erases the board before starting a discussion on a new topic) when the number of detected chalk pixels falls, and stays, near zero.
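For the erase-and-restart heuristic just described, a brief sketch follows; the "near zero" threshold and the minimum run length are hypothetical parameters.

def detect_new_segments(chalk_counts, near_zero=50, min_run=30):
    """Return frame indices where a new instructional segment likely begins,
    i.e., where the chalk-pixel count falls and stays near zero (for example,
    when the instructor erases the board) for at least min_run frames."""
    starts, run = [], 0
    for i, count in enumerate(chalk_counts):
        run = run + 1 if count <= near_zero else 0
        if run == min_run:                  # count has stayed near zero long enough
            starts.append(i - min_run + 1)
    return starts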
[0045] For each new instructional segment detected, the developing text lines may be recorded in a linked list, together with their local skew angle and the coordinates and center point of their bounding boxes. Similarly, figures may be recorded in a linked list together with the coordinates and center point of their bounding boxes. Usually, the frame-to-frame movement of text lines is small (although occlusion by the instructor sometimes exaggerates the apparent movement of their bounding boxes). Thus, it is possible to use the heuristic approach set forth below, which approximates camera and moving object parameters by simple translations and rotations.
[0046] Assume that the present frame includes n text lines, and that m text lines have already been detected and stored from all prior frames for the current instructional segment. For each present text line T_i with center point C_i, a search is performed for the nearest center point in the prior text lines. If the nearest center point is found to be C_i* (i.e., the center point of T_i*), an approximate spatial translation and rotation transform from the current text lines to the prior text lines can be easily computed from α_i and α_i*, the respective skew angles of T_i and T_i*. This transform is then applied to all other prior text lines, and every present line center is paired with its nearest corresponding transformed prior line center. The matching quality induced by this transform is measured by computing a single global error measure, namely the sum of an error measure for each pairing. This error measure, in turn, is the sum of the absolute values of the differences of each of the four corresponding bounding box coordinates in the pairing. The above matching process is then repeated n times, starting with each present text line, and the pairing that yields the lowest global error is chosen. The box-to-box matches are then confirmed by scaling one to match the other, and by matching their contents pixel-wise.
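A simplified sketch of the matching step of paragraph [0046] follows. For brevity it approximates the transform by translation only (the rotation derived from the skew angles is omitted), and each text line is represented as a (bounding box, skew angle) pair where the bounding box is (top, bottom, left, right); these representational choices are assumptions.

import math

def box_error(b1, b2):
    """Sum of absolute differences of the four bounding-box coordinates."""
    return sum(abs(a - b) for a, b in zip(b1, b2))

def centre(box):
    top, bottom, left, right = box
    return ((top + bottom) / 2.0, (left + right) / 2.0)

def match_text_lines(present, prior):
    """Try each present line as the anchor, derive a translation from its nearest
    prior centre, pair every present line with its nearest prior box under that
    translation, and keep the anchor giving the lowest global error."""
    if not present or not prior:
        return [], float("inf")
    best_error, best_pairs = float("inf"), None
    for anchor_box, _anchor_skew in present:
        nearest = min(prior, key=lambda p: math.dist(centre(p[0]), centre(anchor_box)))
        dy = centre(nearest[0])[0] - centre(anchor_box)[0]
        dx = centre(nearest[0])[1] - centre(anchor_box)[1]
        pairs, error = [], 0.0
        for box, _skew in present:
            moved = (box[0] + dy, box[1] + dy, box[2] + dx, box[3] + dx)
            match = min(prior, key=lambda p: box_error(moved, p[0]))
            pairs.append((box, match[0]))
            error += box_error(moved, match[0])
        if error < best_error:
            best_error, best_pairs = error, pairs
    return best_pairs, best_error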
[0047] It will be understood that the above-described tracking method has a complexity of O(mn). However, since m and n are relatively small, and the computations involved are on box coordinates, the method is sufficiently efficient. Also, as a side benefit, this tracking method readily detects those text lines which are being modified by the instructor.
[0048] According to various embodiments of the invention, the contents of several virtual slides obtained in the manner described above may be stitched together to achieve a summary slide that covers the entire visual content of the video. Such a summary slide can be used, for example, to index and summarize the video so as to allow efficient and effective cross-video browsing.
[0049] In addition, while extracting content text lines in the manner described above, temporal development can also be detected. This enables the reconstruction of the content of the original video on a frame-by-frame basis (using the cleaner, rectified, unoccluded, whole virtual mosaiced slides), by displaying only those portions of the slide that have been created by that frame and synchronizing them to any audio that accompanies the video. This reconstructed video maintains (and enhances) the visual content (including, for example, text lines, figures, emphasis marks, etc.) of the original video, while irrelevant areas may be detected and discarded.
[0050] In accordance with the principles of the present invention, in addition to key frame selection and content area segmentation, high-level semantic content analysis of board frames is also provided. In order to accomplish this, the concept of spatial-temporal "semantic teaching units" (or semantic temporal-spatial units of video content) is proposed, which is based in part on the recognition of significant actions of the instructor and the spatial and temporal coherence of the board content. As will now be explained in greater detail, these units are retrieved by recognizing actions of instructors, and by detecting the spatial location and temporal duration of video content.
[0051] A semantic teaching unit (e.g., of instructional videos) is defined herein as a temporal-spatial unit related to one topic of teaching content. Key frames, and even mosaiced images, are inadequate for this purpose, as they are units which are based only on the spatial location of video content. In other words, key frames and mosaiced images are simple spatial constructs that may contain more than one teaching topic, or only a portion of a teaching topic (e.g., when one teaching topic occupies several key frames). Because semantic teaching units are based on teaching content, they are more natural for students to view, and may aid students in understanding instructional videos. Summarizing and indexing instructional videos based on these units (which are more semantically related to the teaching purpose than spatial units (key frames) or temporal shots (video shots)) is thus more useful in practice.
[0052] According to the invention, a semantic teaching unit is characterized as a series of frames (not necessarily contiguous) displaying a certain sequence of instructor actions within the spatial neighborhood of a developing region of written content. The visual content of one semantic teaching unit tends to have content pixels around a compact area of the board, and usually lasts for a substantial fraction of the video. Let S represent the action of the instructor starting a new topic by writing on a board area which is not close to areas of other teaching units, W represent the action of the instructor writing one or more lines of content on the same topic, E represent the action of the instructor emphasizing teaching content by adding symbols, underlining, or adding a few words, and T represent the action of the instructor talking, either on or off camera.
[0053] Given the above representations, the grammar of a semantic teaching unit may be defined as ST((W|E)T)^n. That is, a semantic teaching unit usually starts with the instructor initiating a new topic by writing on the board (S), followed by discussion involving a talk or explanation (T), and is then usually followed by actions of writing (W) or emphasis (E) alternating with talking (T). As an example, one semantic teaching unit may be ST ET WT WT (a realization of the grammar ST((W|E)T)^n). By recognizing the actions of instructors, and by detecting the spatial location and temporal duration of the written content, it is possible to recognize from instructional videos these larger semantic units, which are more closely related to instructional topics.
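Since each video segment is ultimately labeled S, W, E, or T, the grammar ST((W|E)T)^n can be checked directly with a regular expression, as in the hedged sketch below (assuming n ≥ 1); the function and pattern names are illustrative.

import re

# The teaching-unit grammar ST((W|E)T)^n over the action labels S, W, E, T.
TEACHING_UNIT = re.compile(r"ST(?:[WE]T)+")

def is_teaching_unit(actions):
    """Check whether a sequence of classified actions, e.g. ['S','T','E','T','W','T'],
    realizes the semantic teaching unit grammar."""
    return TEACHING_UNIT.fullmatch("".join(actions)) is not None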
[0054] In order to recognize semantic teaching units, the instructor's actions first need to be classified into starting (S), writing (W), emphasizing (E), or talking (T). These categories are recognized by measuring the area occluded by the instructor, and by detecting the instructor's face. Since the region occluded by the instructor within the board area has already been segmented, the head region of the instructor can be detected and it can be determined whether the instructor is facing the board or the audience (and camera). In particular, if there is a large amount of skin tone color in the head region, the instructor is facing the audience (and camera). Otherwise, the instructor is facing the board.
[0055] The actions of the instructor can be detected (classified) according to the steps of the flow chart shown in FIG. 6. First, at step 602, there is a check to see if an instructor region is found. If it is determined at step 604 that an instructor region is not found (e.g., because the instructor is out of the camera's view of the board), at step 606, the video segment is classified as talking (T).
[0056] If it is determined at step 604 that there is an instructor region, then it is determined at step 608 whether the instructor is facing the board. If the instructor is determined not to be facing the board at step 608, then at step 610 (and similar to step 606), the video segment is classified as talking (T).
[0057] If it is determined at step 608 that the instructor is facing the board, then at step 612, it is determined whether there is an increase in content area after a specified (e.g., predetermined) period of time. If it is determined at step 612 that there is not an increase in content area after the specified period of time, at step 614, the video segment is again classified as talking (T).
[0058] Assuming it is determined at step 612 that there is an increase in content after the specified period of time, there is one of two possibilities. Namely, if at step 616 it is determined that the newly added content area is not close to any other content area, the action is classified as the starting of a new teaching unit (S) at step 618. Otherwise, if the newly added content area is determined to be close to another content area at step 616, at step 620, the action is classified as either writing (W) or emphasizing (E), depending on the relative increase in content and the relative amount of time spent facing the board (the action of emphasizing (E) is characterized by a shorter duration of action compared to writing (W)). As persons versed in the art will appreciate, recognizing the emphasizing action (E) is important not only in recognizing teaching units, but also in locating emphasized topics in instructional videos.
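The decision steps of FIG. 6 (paragraphs [0055]-[0058]) can be summarized by a small function such as the following; the thresholds separating emphasizing from writing are illustrative assumptions, since the description above characterizes emphasizing only as a shorter, smaller addition of content.

def classify_action(instructor_found, facing_board, content_increased,
                    near_existing_content, content_delta, facing_time,
                    emphasis_delta=200, emphasis_seconds=3.0):
    """Classify a video segment's instructor action following the flow of FIG. 6:
    T = talking, S = starting a new teaching unit, W = writing, E = emphasizing."""
    if not instructor_found:            # steps 602-606: instructor out of view
        return "T"
    if not facing_board:                # steps 608-610: facing the audience
        return "T"
    if not content_increased:           # steps 612-614: no new content appears
        return "T"
    if not near_existing_content:       # steps 616-618: new, separate content area
        return "S"
    # Step 620: small, brief additions are treated as emphasis; otherwise writing.
    if content_delta < emphasis_delta and facing_time < emphasis_seconds:
        return "E"
    return "W"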
[0059] According to various embodiments of the invention, the detection of instructor actions takes into account the likelihood that semantic teaching units tend to concentrate on one area of the board. For example, the boundaries of each content region can be measured, and if there are two content regions relatively far away from each other, they may be considered as separate semantic teaching units. There may also be a restriction on the duration of semantic teaching units. For example, according to various embodiments, a semantic teaching unit cannot be too long or too short (e.g., it must fall within a predetermined temporal range). These and other criteria may all be combined to assist in the detection of semantic teaching units in accordance with the principles of the present invention. Moreover, it should be noted that one teaching unit may follow another teaching unit in time (due, e.g., to its nature or content), or there may not be a required or desired temporal relationship between the two. In the former case, teaching topics can be detected at least in part based on the temporal relation among different semantic teaching units. Additionally, one teaching unit may refer to another (although it is not required to do so).
[0060] Although the invention has been described and illustrated in the foregoing illustrative embodiments, it is understood that the present disclosure has been made only by way of example, and that numerous changes in the details of implementation of the invention can be made without departing from the spirit and scope of the invention. For example, although it has been largely assumed above that a single instructor will be occluding the video being processed in accordance with the invention, the invention is not limited in this manner. Rather, according to various embodiments of the present invention, the present invention may be configured to account for multiple instructors. Similarly, while the above description generally refers to the use of a single camera, as mentioned above, it will be understood that multiple cameras may be used, and that the invention may also be configured to account for the use of such multiple cameras. Moreover, it should be reiterated that, while boards on which instructors write with chalk have been referenced above, the invention is not limited to video presentations involving only such boards and writing instruments.
[0061] Therefore, other embodiments, extensions, and modifications of the ideas presented above are comprehended and should be within the reach of one versed in the art upon reviewing the present disclosure. Accordingly, the scope of the present invention in its various aspects should not be limited by the examples presented above. The individual aspects of the present invention, and the entirety of the invention should be regarded so as to allow for such design modifications and future developments within the scope of the present disclosure. The present invention is limited only by the claims which follow.

Claims

What is claimed is:
1. A method for processing a video of a board presentation, wherein the video comprises a plurality of board image frames, the method comprising the steps of:
modeling the color distribution of the board;
dividing the board image of at least some frames into a plurality of blocks;
for the at least some frames, determining which blocks are board content blocks;
detecting the amount of combined content in the content blocks for the at least some frames; and
selecting key frames from the video based at least in part on the detected amount of combined content in the content blocks for the at least some frames.
2. The method of claim 1, wherein the detecting the amount of combined content in the content blocks for the at least some frames comprises detecting the total number of content chalk pixels in the content blocks for the at least some frames.
3. The method of claim 1, further comprising the step of summarizing the video using the selected key frames of the video.
4. The method of claim 1, further comprising the step of reconstructing at least some of the frames using one or more of the selected key frames.
5. The method of claim 1, further comprising the steps of:
for at least some frames, detecting text lines and/or figures;
establishing bounding boxes for at least some of the detected text lines and/or figures;
matching text lines using at least one linked list; and
combining at least some of the detected text lines to form at least one virtual slide.
6. The method of claim 5, wherein the detecting text lines and/or figures comprises the step of binarizing at least some of the board content blocks.
7. The method of claim 1, further comprising the step of determining the board background color of the board.
8. The method of claim 7, further comprising the step of determining which blocks are board background blocks for at least some frames.
9. The method of claim 8, wherein the determining which blocks are board background blocks comprises the steps of:
comparing the average color of a block to the average color of a board background block;
comparing at least some pixels of the block to the average pixel of a board background block; and
determining if the number of edge points in the block is less than a specified threshold.
10. The method of claim 1, further comprising the step of determining which blocks are irrelevant blocks for at least some frames.
11. The method of claim 1, further comprising the step of detecting motion of a camera that captured the video.
12. The method of claim 1, wherein an instructor is involved with the board presentation, the method further comprising the step of determining the region of the board that is occluded by the instructor for at least some frames.
13. The method of claim 1, wherein an instructor is involved with the board presentation, the method further comprising the step of classifying at least some actions by the instructor as one of a finite number of different types of actions.
14. The method of claim 13, wherein each of the at least some actions are classified as either starting a new teaching unit, writing, emphasizing, or talking.
15. The method of claim 13, wherein the classifying comprises the step of, for each of the at least some actions, determining whether there is a region of the board that is occluded by the instructor.
16. The method of claim 13, wherein the classifying comprises the step of, for each of the at least some actions, determining whether the instructor is facing the board.
17. The method of claim 16, wherein, if the instructor is facing the board, the method further comprises the step of determining whether there is an increase in content.
18. The method of claim 17, wherein, if there is an increase in content, the method further comprises the step of determining whether the increase in content is within a specified distance from other content located on the board.
19. A system for processing a video of a board presentation, wherein the video comprises a plurality of board image frames, the system comprising:
a projection surface that is adapted to display writing;
at least one detection device that captures the video; and
a processor coupled to the at least one detection device that models the color distribution of the board, divides the board image of at least some frames into a plurality of blocks, determines which blocks are board content blocks for the at least some frames, detects the total number of content chalk pixels in the content blocks for the at least some frames, and selects key frames from the video based at least in part on the detection of the total number of content chalk pixels in the content blocks for the at least some frames.
20. The system of claim 19, wherein the projection surface displays writing that is either written on the board or projected onto the board.
21. The system of claim 19, wherein the projection surface is one of a blackboard adapted to receive writing from chalk, a marker board adapted to receive writing from a marker, and a material that is adapted to receive writing in the form of a projection.
22. A system for processing a video of a board presentation, wherein the video comprises a plurality of board image frames, the system comprising a processor that:
models the color distribution of the board;
divides the board image of at least some frames into a plurality of blocks;
determines which blocks are board content blocks for the at least some frames;
detects the total number of content chalk pixels in the content blocks for the at least some frames; and
selects key frames from the video based at least in part on the detection of the total number of content chalk pixels in the content blocks for the at least some frames.
PCT/US2005/001859 2004-01-21 2005-01-21 Methods and systems for analyzing and summarizing video WO2005072239A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US53837404P 2004-01-21 2004-01-21
US60/538,374 2004-01-21

Publications (2)

Publication Number Publication Date
WO2005072239A2 true WO2005072239A2 (en) 2005-08-11
WO2005072239A3 WO2005072239A3 (en) 2006-05-11

Family

ID=34825974

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2005/001859 WO2005072239A2 (en) 2004-01-21 2005-01-21 Methods and systems for analyzing and summarizing video

Country Status (1)

Country Link
WO (1) WO2005072239A2 (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4821335A (en) * 1985-04-05 1989-04-11 Ricoh Company, Limited Electronic blackboard
US5930473A (en) * 1993-06-24 1999-07-27 Teng; Peter Video application server for mediating live video services
US6185560B1 (en) * 1998-04-15 2001-02-06 Sungard Eprocess Intelligance Inc. System for automatically organizing data in accordance with pattern hierarchies therein
US6564263B1 (en) * 1998-12-04 2003-05-13 International Business Machines Corporation Multimedia content description framework
US20010049087A1 (en) * 2000-01-03 2001-12-06 Hale Janet B. System and method of distance education
US20020178223A1 (en) * 2001-05-23 2002-11-28 Arthur A. Bushkin System and method for disseminating knowledge over a global computer network
US20030234772A1 (en) * 2002-06-19 2003-12-25 Zhengyou Zhang System and method for whiteboard and audio capture
US20040002049A1 (en) * 2002-07-01 2004-01-01 Jay Beavers Computer network-based, interactive, multimedia learning system and process

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8000533B2 (en) 2006-11-14 2011-08-16 Microsoft Corporation Space-time video montage
US8804005B2 (en) 2008-04-29 2014-08-12 Microsoft Corporation Video concept detection using multi-layer multi-instance learning
US9830736B2 (en) 2013-02-18 2017-11-28 Tata Consultancy Services Limited Segmenting objects in multimedia data
US10147230B2 (en) 2016-10-03 2018-12-04 International Business Machines Corporation Dynamic video visualization
US10528251B2 (en) 2016-12-13 2020-01-07 International Business Machines Corporation Alternate video summarization
US10901612B2 (en) 2016-12-13 2021-01-26 International Business Machines Corporation Alternate video summarization
CN111144255A (en) * 2019-12-18 2020-05-12 华中科技大学鄂州工业技术研究院 Method and device for analyzing non-language behaviors of teacher
CN111144255B (en) * 2019-12-18 2024-04-19 华中科技大学鄂州工业技术研究院 Analysis method and device for non-language behaviors of teacher
CN113190710A (en) * 2021-04-27 2021-07-30 南昌虚拟现实研究院股份有限公司 Semantic video image generation method, semantic video image playing method and related device
CN113190710B (en) * 2021-04-27 2023-05-02 南昌虚拟现实研究院股份有限公司 Semantic video image generation method, semantic video image playing method and related devices

Also Published As

Publication number Publication date
WO2005072239A3 (en) 2006-05-11

Similar Documents

Publication Publication Date Title
Ju et al. Summarization of videotaped presentations: automatic analysis of motion and gesture
Yin et al. Text detection, tracking and recognition in video: a comprehensive survey
Wu et al. A new technique for multi-oriented scene text line detection and tracking in video
Liu Beyond pixels: exploring new representations and applications for motion analysis
Black et al. Recognizing facial expressions in image sequences using local parameterized models of image motion
Cotsaces et al. Video shot detection and condensed representation. a review
Yang et al. Lecture video indexing and analysis using video ocr technology
Choudary et al. Summarization of visual content in instructional videos
Davila et al. Whiteboard video summarization via spatio-temporal conflict minimization
EP2034426A1 (en) Moving image analyzing, method and system
Ju et al. Analysis of gesture and action in technical talks for video indexing
Chatila et al. Integrated planning and execution control of autonomous robot actions
WO2005072239A2 (en) Methods and systems for analyzing and summarizing video
Li et al. Structuring lecture videos by automatic projection screen localization and analysis
JP2011505601A (en) Video processing method and video processing apparatus
Lee et al. Robust handwriting extraction and lecture video summarization
Wang et al. Structuring low-quality videotaped lectures for cross-reference browsing by video text analysis
Fan et al. Robust spatiotemporal matching of electronic slides to presentation videos
Kota et al. Automated detection of handwritten whiteboard content in lecture videos for summarization
Xu et al. Content extraction from lecture video via speaker action classification based on pose information
Ngo et al. Structuring lecture videos for distance learning applications
Ngo et al. Detection of slide transition for topic indexing
WO1999005865A1 (en) Content-based video access
Ghorpade et al. Extracting text from video
Wang et al. Gesture tracking and recognition for lecture video editing

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

NENP Non-entry into the national phase

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase