WO2006103625A1 - Method and apparatus for the detection of text in video data - Google Patents


Info

Publication number
WO2006103625A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
motion
detection module
video
mod
Prior art date
Application number
PCT/IB2006/050936
Other languages
French (fr)
Inventor
Jan Nesvadba
Igor Nagorski
Original Assignee
Koninklijke Philips Electronics N.V.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Koninklijke Philips Electronics N.V. filed Critical Koninklijke Philips Electronics N.V.
Publication of WO2006103625A1 publication Critical patent/WO2006103625A1/en


Classifications

    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B 27/19 Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
    • G11B 27/28 Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F 16/7844 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F 16/7847 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • G06F 16/786 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content using motion, e.g. object motion or camera motion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition

Definitions

  • An aspect of the invention relates to a video apparatus arranged to detect text in video data.
  • the video apparatus may be, for example, a digital video recorder that records video data on an optical disk (DVD) or a magnetic disk (HD), or both.
  • Other aspects of the invention relate to a method of detecting text in video data, and a computer program for a video apparatus.
  • the international patent application published under number WO 02/093910 describes detection of the presence, appearance or disappearance of subtitles in a video signal. This detection comprises operations that an MPEG encoder or decoder typically carries out. The detection therefore requires relatively few additional operations.
  • a detected subtitle may be subjected to a character recognition algorithm, which provides an electronic version of the text. The electronic text may be separately stored and subsequently used for indexing video scenes stored in a database.
  • a typical application thereof is retrieval of video scenes in a video recorder based on spoken keywords.
  • a video apparatus comprises a text detection module, a text-motion detection module, and a user interface.
  • the text detection module detects visual text in video data.
  • the text-motion detection module provides a motion indication for the visual text that the text detection module has detected.
  • the user interface allows a user to access a particular portion of the video data on the basis of the motion indication that the text-motion detection module has provided.
  • Video data is generally rich in information. In principle, this makes it relatively difficult for an average person to handle video data and, more specifically, to access a particular piece of information that the video data comprises.
  • video data may comprise visual text of various different types.
  • Visual text may be, for example, a subtitle, a title role that provides information about a program, which the video data comprises, or news in a telegraphic form. Visual text may also mark a particular event, such as, for example, the start of a program, the end of a program, or a new chapter.
  • Sequences of characters can be derived from visual text comprised in video data by means of a character recognition algorithm.
  • the sequences of characters, which have been derived from the video data, can be stored in a database and handled separately. For example, a user may browse through the database to find a subtitle that comprises a particular word.
  • the aforementioned prior art, which relates to subtitles, appears to suggest this possibility.
  • the database need not be restricted to subtitles only. That is, the database may comprise a collection of sequences of characters that relate to various different types of visual text. This complicates browsing. It will be relatively difficult to find a particular piece of information in the database.
  • a video apparatus comprises a text-motion detection module, which provides a motion indication for visual text that has been detected.
  • a user interface allows a user to access a particular portion of the video data on the basis of the motion indication that the text-motion detection module has provided.
  • the motion indication may indicate a particular event in the video data.
  • the motion indication may indicate the start or the end of a program.
  • the start and the end of a program generally comprise a title role in the form of vertically scrolling visual text, which the motion indication indicates.
  • the motion indication may also indicate a type of visual text in the video data, such as the aforementioned title role.
  • the title role may comprise useful information for cataloguing purposes, such as, for example, the title of the program and actors' names.
  • Another example concerns visual text that constitutes news data. Such news data is often presented in the form of so-called tickers.
  • a ticker is a horizontal bar on a screen that comprises visual text that moves in a horizontal direction, from the left to the right or vice versa.
  • the motion indication may indicate such a ticker and therefore indicate news data in the video data.
  • FIG. 1 is a block diagram that illustrates a digital video recorder, which comprises a text-motion detection module.
  • FIG. 2 is a flowchart diagram that illustrates a displacement-vector calculation, which the text-motion detection module carries out.
  • FIG. 3 is a conceptual diagram that illustrates an overlap calculation, which is based on a projection of a macroblock from a predicted frame to a reference frame.
  • FIG. 1 illustrates a digital video recorder DVR.
  • the digital video recorder DVR comprises an encoding-and-decoding module CODEC and a disk driver DRW for writing data on a disk and for reading data from a disk DSK.
  • the disk DSK may be, for example, an optical disk such as a digital versatile disk (DVD), or a hard disk (HD) on which data is magnetically stored.
  • the digital video recorder DVR further comprises a text detection module TXD, a character recognition module OCR, a text-motion detection module MOD, a data-association module DAS, a semantic database SDB, and a control module CTRL.
  • the control module CTRL comprises a user interface UIF and may receive commands from a remote control device RCD. Any of the aforementioned modules may be implemented by means of software or hardware, or a combination of software and hardware. A suitably programmed processor may carry out operations that will be described hereinafter with reference to the aforementioned modules.
  • the digital video recorder DVR operates as follows.
  • the encoding-and-decoding module CODEC receives video input data VI from an external entity.
  • the encoding-and-decoding module CODEC encodes the video input data VI in accordance with a video encoding standard, such as, for example, MPEG2 or MPEG4 (MPEG is a commonly used acronym for Moving Picture Experts Group).
  • the disk driver DRW receives encoded video input data VIC from the encoding-and-decoding module CODEC and writes that data onto the disk DSK, which is present in the disk driver DRW.
  • the disk driver DRW may receive the encoded video input data VIC directly from an external entity via a codec bypass, which is illustrated in broken lines.
  • the encoding-and-decoding module CODEC may decode the encoded video input data VIC, which the external entity provides. Accordingly, the digital video recorder DVR can retrieve coding parameters and other data from the encoded video input data VIC. This data is useful for text detection, which will be described in greater detail hereinafter.
  • the encoding-and-decoding module CODEC receives encoded video output data VOC from the disk driver DRW.
  • the encoding-and-decoding module CODEC decodes the encoded video output data VOC so as to obtain video output data VO, which can be applied to an external entity, such as, for example, a display device.
  • the disk driver DRW may apply the encoded video output data VOC, which is read from the disk, directly to an external entity that comprises a video decoder. In that case, the encoded video output data VOC passes via the codec bypass.
  • the encoding-and-decoding module CODEC may still decode the encoded video output data VOC, which the disk driver DRW provides. Accordingly, the digital video recorder DVR can retrieve coding parameters and other data from the encoded video output data VOC. This data is useful for text detection, which will be described in greater detail hereinafter.
  • the encoding-and-decoding module CODEC provides various coding parameters PAR and other data, which result from an encoding or a decoding in accordance with the relevant MPEG standard.
  • the coding parameters PAR comprise: a "b"-parameter, which indicates the number of bits used for encoding an image slice, excluding overhead bits; a "qs"-parameter, which indicates the quantizer scale for a slice; and a "c"-parameter, which indicates the transform coefficients (DC and
  • the encoding-and-decoding module CODEC further provides motion vectors MV, a predicted frame PF, a reference frame RF, and a frame index IX.
  • the predicted frame PF may be a so-called P-frame or a B-frame as defined by the relevant MPEG standard.
  • the reference frame RF is a so-called I-frame.
  • the frame index IX indicates the position of the frame within a sequence of frames that constitutes a video recording, which corresponds with, for example, a movie.
  • the frame index IX can be associated with a particular instant within an interval of time that corresponds with the video recording. For example, the frame index IX may correspond with the instant "5 minutes and 27 seconds" from the start of the video recording or any other reference in the recording.
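  • As an illustration only (the patent does not prescribe an implementation), the association between a frame index IX and an instant within the recording can be sketched as follows, assuming a constant frame rate; the function name and the 25 fps figure are assumptions:

```python
# Hypothetical sketch: map a frame index IX to a time offset within the
# recording, assuming a constant frame rate (25 fps is assumed here).
def frame_index_to_time(ix: int, fps: float = 25.0) -> str:
    """Return the instant as minutes and seconds from the start of the recording."""
    total_seconds = int(ix / fps)
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes} minutes and {seconds} seconds"

# Frame index 8175 at 25 fps lies 327 s into the recording:
print(frame_index_to_time(8175))  # 5 minutes and 27 seconds
```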
  • the text detection module TXD detects, in a reference frame (I-frame), one or more segments that comprise visual text.
  • the text detection module TXD provides text detection data TD that indicates these segments.
  • the text detection data TD may identify, for example, those macroblocks in the reference frame, which comprise visual text.
  • the text detection module TXD may operate in a manner similar to, for example, the subtitle detector in the international application published under number WO 02/093910. Such an implementation of the text detection module TXD detects visual-text segments on the basis of the coding parameters PAR mentioned hereinbefore.
  • the visual text, which the text detection module TXD detects, may be in the form of subtitles.
  • the visual text may also form part of a scene in the video of interest.
  • the scene may comprise an object with a certain text, such as, for example, a city-name board or a facade with a restaurant name.
  • this is static text in the sense that the text does not move when displayed on a screen.
  • the visual text may also be in the form of, for example, a title role of a movie, a television series, or any other artistic or documentary video. In many cases, such visual text scrolls on the screen in a vertical direction. This is vertically scrolling text.
  • the visual text may also be in the form of, for example, so-called tickers that comprise daily news, financial news, weather news, or other news. In many cases, such visual text scrolls on the screen in a horizontal direction. This is horizontally scrolling text.
  • the text detection module TXD is capable of providing text detection data TD, which indicates segments within a reference frame that comprise visual text.
  • the character recognition module OCR derives a sequence of characters TXT from a reference frame RF that comprises visual text.
  • the text detection data TD, which the text detection module TXD provides, assists the character recognition module OCR in this process. It is recalled that the text detection data TD indicates the segments in the reference frame RF that comprise visual text.
  • the sequence of characters TXT, which the character recognition module OCR provides, corresponds with the visual text comprised in these segments. It should be noted that the character recognition module OCR can derive various sequences of characters from a reference frame. Each sequence may correspond with, for example, a particular line of visual text. For example, a subtitle may comprise two lines.
  • the text-motion detection module MOD establishes a motion indication MI on the basis of text detection data TD, which the text detection module TXD provides, and the motion vectors MV, the predicted frame PF, and the reference frame RF, which the encoding- and-decoding module CODEC provides.
  • the motion indication MI indicates whether visual text, which is comprised in a sequence of frames, moves when displayed on the screen or not.
  • the motion indication MI may comprise, for example, a binary value, which indicates whether the visual text moves or not.
  • the motion indication MI may provide further information, for example, an indication whether the visual text moves in a horizontal direction or in a vertical direction.
  • the motion indication MI may also be in the form of a displacement vector, which indicates a displacement direction with relatively great precision.
  • the displacement vector has a length, which indicates a displacement speed, i.e. how fast the visual text moves when displayed on a screen.
  • the motion indication MI may further indicate acceleration or any other useful information that relates to the displacement of the visual text throughout
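  • As a sketch only, the motion indication MI described above might be represented as a small data structure; the type and field names below are hypothetical, and the displacement vector is assumed to be given in pixels per frame:

```python
import math
from dataclasses import dataclass

@dataclass
class MotionIndication:
    """Hypothetical container for the motion indication MI."""
    dx: float  # horizontal displacement per frame, in pixels
    dy: float  # vertical displacement per frame, in pixels

    @property
    def moving(self) -> bool:
        # Binary value: does the visual text move at all?
        return (self.dx, self.dy) != (0.0, 0.0)

    @property
    def speed(self) -> float:
        # The length of the displacement vector indicates the displacement speed.
        return math.hypot(self.dx, self.dy)

    @property
    def direction(self) -> str:
        # Coarse direction: horizontal (ticker) vs. vertical (title role).
        if not self.moving:
            return "static"
        return "horizontal" if abs(self.dx) >= abs(self.dy) else "vertical"

# A title role scrolling upwards by 3 pixels per frame:
mi = MotionIndication(dx=0.0, dy=-3.0)
print(mi.direction, mi.speed)  # vertical 3.0
```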
  • the data-association module DAS associates the motion indication MI, which the text-motion detection module MOD provides, with the sequence of characters TXT, which the character recognition module OCR provides.
  • the data-association module DAS stores the sequence of characters TXT and the motion indication MI associated therewith in the semantic database SDB.
  • the data-association module DAS can associate one or more frame indices IX with the sequence of characters TXT, and store these indices IX in the semantic database SDB too.
  • the semantic database SDB will comprise a collection of sequences of characters representing visual text comprised in each video program, which has been recorded.
  • the semantic database SDB further comprises the motion indication associated with each sequence of characters and, optionally, the frame indices, which indicate when the sequence of characters appears in the video program.
  • the aforementioned information in the semantic database SDB can assist a user in numerous manners. Some examples will be given hereinafter. These examples have in common that the user exploits the semantic database SDB through the control module CTRL of the digital video recorder DVR, which comprises the user interface UIF.
  • the user interface UIF, which comprises a software program, may cause, for example, one or more menus to be displayed from which the user may select a particular item. The user may navigate through various menus and make selections by means of the remote control RCD.
  • the user interface UIF may present the user an option within a menu that allows automatic identification of the start or the end, or both, of a video program that has been recorded. Let it be assumed that the user chooses this option.
  • the control module CTRL will search in the semantic database SDB for a motion indication that indicates vertically scrolling text. In many cases, a title role of a movie, a television series, or any other artistic or documentary video, comprises visual text that scrolls on the screen in a vertical direction.
  • the control module CTRL retrieves the frame indices that are associated with the motion indication that indicates vertically scrolling text.
  • the control module CTRL may, for example, associate a start marker with one or more frames that correspond with the frame indices retrieved from the semantic database SDB, and that are relatively close to the start of the recording.
  • the control module CTRL may further associate an end marker with one or more frames that correspond with the frame indices retrieved from the semantic database SDB, and that are relatively close to the end of the recording.
  • the video program of interest lies between the start marker and the end marker. Other parts of the recording may comprise commercials or other video scenes, which are of less interest.
  • control module CTRL may cause the disk driver DRW to add the start marker and the end marker to the recording.
  • control module CTRL may cause the start marker and the end marker to be stored within the digital video recorder DVR in association with an identification of the disk DSK on which the recording has been made.
  • the user interface UIF causes the digital video recorder DVR to play back respective portions of the recording that comprise vertically scrolling text.
  • the digital video recorder DVR can find these portions thanks to the frame indices associated with the motion indication in the semantic database SDB.
  • This selective playback allows the user to check that the detected vertically scrolling text corresponds with the start or the end of the video program.
  • the control module CTRL generates the start marker and the end marker when the user has validated this check.
  • the user may also fine-tune, as it were, the start marker and the end marker by placing these markers just after the vertically scrolling text at the start of the recording and just before the vertically scrolling text at the end of the recording, respectively.
  • the user interface UIF may further present the user an option that allows him or her to catalog the video program, which has been recorded.
  • This cataloguing of the video program allows content management and content browsing and navigation.
  • the cataloguing of the video program may comprise various different items, such as, for example, a title, one or more actors' names, a producer's name, a production date, and other characteristics of the video program, which appear in the title role. These characteristics generally appear in the form of vertically scrolling text. Let it be assumed that the user chooses the catalog option.
  • the control module CTRL will search in the semantic database SDB for a motion indication that indicates vertically scrolling text, as described hereinbefore.
  • the semantic database SDB comprises sequences of characters, which are associated with this motion indication.
  • These sequences of characters correspond with the vertically scrolling text comprised in the video program when displayed on a screen. Consequently, a sequence of characters may correspond with the title of the video program, another sequence may correspond with an actor who plays a role in the video program, yet another sequence may correspond with the producer of the video program, and yet another sequence may correspond with the production date of the video program, and so on.
  • the control module CTRL may retrieve these sequences of characters, which are associated with the motion indication that indicates vertically scrolling text, from the semantic database SDB so as to copy these sequences into, for example, a cache memory within the digital video recorder DVR.
  • the user may then browse through these sequences of characters so as to select one or more sequences that are of interest for the purpose of cataloguing.
  • the user may copy a selected sequence of characters into his or her catalog. Accordingly, the user can catalog the video program, which he or she has recorded, in a relatively simple manner thanks to the motion indication that indicates that the sequences of characters relate to vertically scrolling text, which is typical of a title role.
  • the user interface UIF may further present the user an option within a menu that allows automatic identification of news information.
  • tickers are commonly used for displaying textual news information.
  • a ticker is a horizontal bar on a screen that comprises visual text that moves in a horizontal direction, from the left to the right or vice versa. Let it be assumed that the user chooses the news-information option.
  • the control module CTRL will search in the semantic database SDB for a motion indication that indicates a horizontally scrolling text.
  • the semantic database SDB comprises sequences of characters, which are associated with this motion indication. These sequences of characters correspond with the horizontally scrolling text comprised in the program when displayed on a screen. That is, these sequences of characters generally correspond with news information comprised in a ticker.
  • a sequence of characters may correspond with a general news item, a financial news item, such as, for example, stock prices, or a weather forecast, and so on.
  • the control module CTRL may retrieve these sequences of characters, which are associated with the motion indication that indicates horizontally scrolling text, from the semantic database SDB so as to copy these sequences into, for example, a cache memory within the digital video recorder DVR. The user may then browse through these sequences of characters so as to select one or more items that are of interest.
  • the user interface UIF may also comprise a search engine that allows the user to find a particular news item.
  • FIG. 2 illustrates a displacement-vector calculation, which the text-motion detection module MOD carries out.
  • the displacement-vector calculation provides an indication of movement of visual text from a reference frame to a predicted frame that is subsequent to the reference frame. That is, the displacement-vector calculation, which FIG. 2 illustrates, applies to a forward prediction of a frame.
  • the reference frame, which may be an I-frame or a P-frame and on which the prediction is based, precedes the predicted frame, which may be a P-frame or a B-frame.
  • the text-motion detection module MOD receives the text detection data TD from the text detection module TXD (TD : MB[RF] ∋ VTX). It is assumed that the text detection module TXD has already detected one or more macroblocks in the reference frame that comprise visual text. Accordingly, the text detection data TD, which indicates these macroblocks, is available.
  • the displacement-vector calculation comprises a series of steps ST2-ST6 for each macroblock in the predicted frame.
  • Each macroblock in the predicted frame has a motion vector, which the encoding-and-decoding module CODEC provides.
  • the motion vector indicates a particular block of pixels in the reference frame, which is similar to the macroblock in the predicted frame.
  • the motion vector indicates movement of an object, which the respective macroblocks at least partially represent, from the predicted frame to the reference frame.
  • the displacement-vector calculation that FIG. 2 illustrates applies to a forward prediction: the predicted frame is subsequent to the reference frame. Consequently, the motion vector points backwards in time. What is more, the motion vector points from the predicted frame to the reference frame, but the text detection module TXD cannot provide text detection data that relates to the predicted frame. That is, the location of the visual text in the predicted frame is not yet known. What is known is the location of the visual text in the reference frame. The text detection data TD indicates this.
  • in step ST2, the text-motion detection module MOD projects the macroblock concerned from the predicted frame to the reference frame. The motion vector that belongs to the macroblock defines this projection.
  • the projected block will generally not precisely coincide with a macroblock in the reference frame.
  • the projected block will generally overlap a cluster of four different macroblocks in the reference frame. That is, the projected block will have a certain overlap with each of these four different macroblocks.
  • in step ST3, the text-motion detection module MOD detects whether the projected block overlaps with a macroblock in the reference frame that comprises visual text, or not (OVR[TD]?). It is recalled that the text detection data TD indicates the macroblocks in the reference frame that comprise visual text. Let it be assumed that, in reply to the test performed in step ST3, the projected block does not overlap (reply N) with a macroblock in the reference frame that comprises visual text. In that case, the text-motion detection module MOD directly carries out steps ST2 and ST3 anew for a subsequent macroblock in the predicted frame, without carrying out steps ST4-ST6 for the macroblock concerned.
  • the text-motion detection module MOD carries out steps ST4-ST6 for the macroblock concerned if the projected block overlaps (reply Y) with at least one macroblock in the reference frame that comprises visual text.
  • the overlap percentage indicates the size of the portion of the projected block that falls within the macroblock concerned in the reference frame.
  • the overlap percentage is 100% if the projected block fully coincides with the macroblock concerned in the reference frame.
  • the overlap percentage is 50% if half of the projected block falls within the macroblock concerned in the reference frame.
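  • The overlap calculation can be illustrated with a short sketch (hypothetical helper name; 16x16 macroblocks assumed, as in MPEG-2): the projected block, placed at pixel position (px, py) in the reference frame, is intersected with each macroblock of the grid that it touches:

```python
# Sketch of the overlap calculation (hypothetical helper name; 16x16
# macroblocks assumed).  (px, py) is the top-left corner of the projected
# block within the reference frame.
MB = 16  # macroblock size in pixels

def overlap_percentages(px: int, py: int) -> dict:
    """Return {(i, j): overlap percentage} for every reference-frame
    macroblock MB[i, j] that the projected block overlaps."""
    result = {}
    for i in (px // MB, px // MB + 1):
        for j in (py // MB, py // MB + 1):
            # Intersection of the projected block with macroblock (i, j).
            w = min(px + MB, (i + 1) * MB) - max(px, i * MB)
            h = min(py + MB, (j + 1) * MB) - max(py, j * MB)
            if w > 0 and h > 0:
                result[(i, j)] = 100.0 * w * h / (MB * MB)
    return result

# A projected block offset by (4, 8) pixels overlaps a cluster of four
# reference macroblocks; the four percentages sum to 100:
print(overlap_percentages(4, 8))
```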
  • FIG. 3 illustrates steps ST3 and ST4 described hereinbefore.
  • FIG. 3 illustrates, in full lines, four macroblocks MB[i,j], MB[i+1,j], MB[i,j+1], MB[i+1,j+1] in the reference frame.
  • FIG. 3 further illustrates the projected block PB.
  • a portion of the projected block PB overlaps macroblock MB[i,j].
  • the overlap percentage is 50%.
  • Another portion of the projected block overlaps macroblock MB[i+1,j].
  • the overlap percentage is 20%.
  • Yet another portion of the projected block overlaps macroblock MB[i,j+1].
  • the overlap percentage is 20% too.
  • Yet another portion of the projected block overlaps macroblock MB[i+1,j+1].
  • the overlap percentage is 10%.
  • the text-motion detection module MOD associates the inverted motion vector with each macroblock in the reference frame that overlaps the projected block (MB[RF] ← IV, OVR%).
  • the text-motion detection module MOD further associates therewith the overlap percentage that has been calculated, in step ST4, for the macroblock concerned in the reference frame.
  • the overlap percentage reflects a degree of confidence, as it were, that the inverted motion vector faithfully indicates movement of textual matter from the macroblock concerned in the reference frame to the predicted frame.
  • the text-motion detection module MOD temporarily stores the inverted motion vector and the overlap percentage for each macroblock concerned in the reference frame, together with an identification of that macroblock.
  • Referring to FIG. 3, the text-motion detection module MOD stores the inverted motion vector with an overlap percentage of 50% for macroblock MB[i,j], and stores the same inverted motion vector with overlap percentages of 20%, 20%, and 10% for macroblocks MB[i+1,j], MB[i,j+1], and MB[i+1,j+1], respectively.
  • FIG. 3 can further illustrate the aforementioned aspect, which relates to the degree of confidence in the inverted motion vector.
  • the overlap percentage for macroblock MB[i,j] is 50%, which is relatively high. Consequently, the inverted motion vector indicates with reasonable precision the movement of textual matter from macroblock MB[i,j] in the reference frame to the predicted frame. This is because the motion vector, which points from the predicted frame to the reference frame, causes a substantial overlap between the projected block and macroblock MB[i,j].
  • the text-motion detection module MOD has carried out the series of steps ST2-ST6 for each macroblock in the predicted frame.
  • the text-motion detection module MOD has generated the following data for each macroblock in the reference frame that comprises visual text: at least one inverted motion vector, and an overlap percentage associated with each inverted motion vector.
  • various inverted motion vectors will be generated for a macroblock in the reference frame that comprises visual text. This can be explained with reference to FIG. 3.
  • FIG. 3 illustrates the projection of a particular macroblock from the predicted frame to the reference frame.
  • the text-motion detection module MOD will project further, neighboring macroblocks from the predicted frame to the reference frame.
  • Each of these respective projections will provide respective projected blocks. Any of these projected blocks may overlap one or more of the macroblocks MB[i,j], MB[i+1,j], MB[i,j+1], MB[i+1,j+1], which FIG. 3 illustrates.
  • each respective projection will be based on a different motion vector, namely the motion vector that belongs to the macroblock in the predicted frame, which is projected.
  • the text-motion detection module MOD calculates a displacement vector for the visual text (CLC[DV]).
  • the text-motion detection module MOD makes a weighted combination of the respective inverted motion vectors, which have been established for macroblocks in the reference frame that comprise visual text.
  • the respective overlap percentages, which are associated with the respective inverted motion vectors, constitute weighting factors.
  • the text-motion detection module MOD may also carry out another displacement-vector calculation, which provides an indication of movement of visual text from a predicted frame (P-frame or B-frame) to a reference frame (I-frame) that is subsequent to the predicted frame.
  • This displacement-vector calculation is different from the one that FIG. 2 illustrates.
  • the main features of the other displacement-vector calculation are as follows.
  • the text-motion detection module MOD first establishes a set of macroblocks in the predicted frame that comprise visual text. This text detection may be based on, for example, one or more displacement vectors that have been calculated for previous frames. Each macroblock in the aforementioned set has a motion vector that points from the predicted frame to the reference frame. The motion vector points forward in time because the reference frame is subsequent to the predicted frame. The text-motion detection module MOD calculates an average of all relevant motion vectors, that is, all motion vectors that belong to the set of macroblocks that comprise visual text. The average constitutes the displacement vector.
  • the text-motion detection module MOD may calculate a sequence of displacement vectors for a sequence of frames. To that end, the text-motion detection module MOD may carry out one or the other of the displacement-vector calculations, which have been described hereinbefore.
  • displacement vectors that form part of the sequence should be similar. This is because text generally moves on the screen in a steady, monotonous fashion, that is, text generally scrolls with constant speed.
  • the text-motion detection module MOD may check whether the displacement vectors are indeed similar, or not.
  • the motion indication, which the text-motion detection module MOD provides, may indicate an anomaly when the displacement vectors are substantially different.
  • the text-motion detection module MOD may provide a no-motion indication, or may signal this anomaly to the text detection module TXD. This anomaly signaling may cause the text detection module TXD to make one or more further detections or to make a different, more precise detection.
  • the text-motion detection module MOD may also signal the anomaly to the character recognition module OCR, so as to prevent erroneous character recognition.
  • the motion indication MI, which the text-motion detection module MOD provides, may comprise the following elements: an average of the displacement vectors, which have been established for a sequence of frames, accompanied by an indication of the frame at which the visual text concerned enters the screen and an indication of the frame at which the visual text leaves the screen.
  • the text-motion detection module MOD may receive frame indices from the encoding-and-decoding module CODEC.
  • the database manager may carry out this marking of the visual text entering the screen and leaving the screen, respectively.
  • a video apparatus (digital video recorder DVR) comprises a text detection module (TXD), a text-motion detection module (MOD), and a user interface (UIF).
  • the text detection module (TXD) detects visual text in video data.
  • the text-motion detection module (MOD) provides a motion indication (MI) for the visual text that the text detection module (TXD) has detected.
  • the user interface (UIF) allows a user to access a particular portion of the video data on the basis of the motion indication (MI) that the text-motion detection module (MOD) has provided.
  • a control module marks a particular portion of the video data, which comprises the visual text, on the basis of the motion indication (MI), which the text-motion detection module (MOD) has provided.
  • the control module may, for example, insert a marker in the video data or store a marker in a database in association with the video data.
  • the marker may be a start marker, an end marker, or any other marker that is useful for content management.
  • a character recognition module derives a sequence of characters (TXT) from the visual text in the video data that the text detection module (TXD) has detected.
  • a data-association module associates the sequence of characters (TXT), which the character recognition module (OCR) has derived, with the motion indication (MI), which the text-motion detection module (MOD) has provided.
  • MI motion indication
  • MOD text-motion detection module
  • a video processing module provides motion vectors (MV) that indicate movement of an object, which the video data represents, from a predicted frame (PF) to a reference frame (RF).
  • the text-motion detection module establishes the motion indication (MI) for the visual text that the text detection module (TXD) has detected, on the basis of the motion vectors (MV), which the video processing module (CODEC) has provided.
  • the digital video recorder described in detail hereinbefore is merely an example of a video apparatus in accordance with the invention.
  • the video apparatus may also be in the form of, for example, a settop box, a television set, or a mobile phone.
  • the digital video apparatus need not necessarily be MPEG-based.
  • the invention can be applied in any video apparatus that comprises a video processor providing some form of motion indication.
  • the digital video apparatus may be based on the H.263 standard for mobile video telephony.
  • the digital video apparatus need not necessarily comprise a disk driver or any video storage device.
  • the digital video apparatus need not necessarily comprise any video coder or decoder. It is possible to detect visual text in plain, uncompressed video data without the use of any (de-)coder parameters. It is also possible to detect text motion in plain, uncompressed video data without the use of any motion vectors, which a video (de-)coder typically provides. In case visual text is detected on the basis of (de-)coder parameters, these parameters need not necessarily be standard coding parameters. For example, visual text detection may involve non-standard parameters, which are specific to a particular implementation of a video coder or decoder. That is, the video coder or decoder generates proprietary parameters, which may be used for the purpose of visual text detection. Character recognition, if any, may be carried out in a classical fashion, without the use of any text indication derived from (de-)coder parameters or any other parameters relating to the video data.
  • a motion-vector weighting based on an overlap calculation is not mandatory, although such a weighting calculation is advantageous. It is not necessary to weight motion vectors, which a (de-)coder provides, in order to establish a text-motion indication. That is, the text-motion detection that FIG. 2 illustrates can be simplified. Such simplification may, however, be at the expense of motion-detection precision.
  • text-motion detection results need not necessarily be stored in a semantic database. For example, an application may use text- motion detection results for the purpose of content marking or image quality improvement only.
  • text-motion detection results can be stored in an ordinary memory.
  • the terms “frame” and “image” should be understood in a broad sense. These terms are interchangeable and include a field or any other entity that may wholly or partially constitute an image or picture.
  • the term “scrolling text” should equally be understood in a broad sense. Scrolling text may move in a non-monotonous fashion. For example, scrolling text may move in a discontinuous, jumpy fashion or have a significant acceleration.
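The displacement-vector calculation and the steadiness check outlined in the points above can be sketched as follows. This is a minimal illustration in Python; the function names, the (vx, vy, overlap) tuple layout, and the tolerance value are assumptions for the sketch, not part of the original description:

```python
def displacement_vector(inverted_mvs):
    """Weighted combination of inverted motion vectors.

    `inverted_mvs` is a list of (vx, vy, overlap_pct) tuples: one entry per
    inverted motion vector stored for a text macroblock in the reference
    frame, with its overlap percentage acting as the weighting factor.
    """
    total = sum(w for _, _, w in inverted_mvs)
    if total == 0:
        return (0.0, 0.0)  # no text macroblocks -> no displacement
    dx = sum(vx * w for vx, _, w in inverted_mvs) / total
    dy = sum(vy * w for _, vy, w in inverted_mvs) / total
    return (dx, dy)

def is_steady(displacements, tolerance=2.0):
    """Anomaly check: text generally scrolls at constant speed, so the
    displacement vectors of successive frames should be similar."""
    for (ax, ay), (bx, by) in zip(displacements, displacements[1:]):
        if abs(ax - bx) > tolerance or abs(ay - by) > tolerance:
            return False  # substantially different -> signal an anomaly
    return True

# Two equally weighted inverted motion vectors average to their midpoint.
dv = displacement_vector([(0, -4, 50), (0, -2, 50)])   # -> (0.0, -3.0)
```

With overlap percentages as weighting factors, macroblocks whose inverted motion vector is considered more reliable contribute more to the displacement vector.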

Abstract

A video apparatus (DVR) comprises a text detection module (TXD), a text-motion detection module (MOD), and a user interface (UIF). The text detection module (TXD) detects visual text in video data. The text-motion detection module (MOD) provides a motion indication (MI) for the visual text that the text detection module (TXD) has detected. The user interface (UIF) allows a user to access a particular portion of the video data on the basis of the motion indication (MI) that the text-motion detection module (MOD) has provided. For example, the user may request a program-change search. In response, the video apparatus (DVR) retrieves a portion of the video data for which the text-motion detection module (MOD) has provided a motion indication (MI) that indicates vertically scrolling visual text. Vertically scrolling visual text generally corresponds with a title role at the start or the end of a program, or both. There are numerous other manners to use the motion indication (MI) that the text-motion detection module (MOD) provides.

Description

METHOD AND APPARATUS FOR THE DETECTION OF TEXT IN VIDEO DATA
FIELD OF THE INVENTION
An aspect of the invention relates to a video apparatus arranged to detect text in video data. The video apparatus may be, for example, a digital video recorder that records video data on an optical disk (DVD) or a magnetic disk (HD), or both. Other aspects of the invention relate to a method of detecting text in video data, and a computer program for a video apparatus.
DESCRIPTION OF PRIOR ART
The international patent application published under number WO 02/093910 describes detection of the presence, appearance or disappearance of subtitles in a video signal. This detection comprises operations that an MPEG encoder or decoder typically carries out. The detection therefore requires relatively few additional operations. A detected subtitle may be subjected to a character recognition algorithm, which provides an electronic version of the text. The electronic text may be separately stored and subsequently used for indexing video scenes stored in a database. A typical application thereof is retrieval of video scenes in a video recorder based on spoken keywords.
SUMMARY OF THE INVENTION
According to an aspect of the invention, a video apparatus comprises a text detection module, a text-motion detection module, and a user interface. The text detection module detects visual text in video data. The text-motion detection module provides a motion indication for the visual text that the text detection module has detected. The user interface allows a user to access a particular portion of the video data on the basis of the motion indication that the text-motion detection module has provided. The invention takes the following aspects into consideration. Video data is generally rich in information. In principle, this makes it relatively difficult for an average person to handle video data and, more specifically, to access a particular piece of information that the video data comprises. For example, video data may comprise visual text of various different types. Visual text may be, for example, a subtitle, a title role that provides information about a program, which the video data comprises, or news in a telegraphic form. Visual text may also mark a particular event, such as, for example, the start of a program, the end of a program, or a new chapter.
In principle, it is possible to derive sequences of characters from visual text comprised in video data by means of a character recognition algorithm. The sequences of characters, which have been derived from the video data, can be stored in a database and handled separately. For example, a user may browse through the database to find a subtitle that comprises a particular word. The aforementioned prior art, which relates to subtitles, appears to suggest this possibility. However, the database need not be restricted to subtitles only. That is, the database may comprise a collection of sequences of characters that relate to various different types of visual text. This complicates browsing. It will be relatively difficult to find a particular piece of information in the database.
In accordance with the aforementioned aspect of the invention, a video apparatus comprises a text-motion detection module, which provides a motion indication for visual text that has been detected. A user interface allows a user to access a particular portion of the video data on the basis of the motion indication that the text-motion detection module has provided.
The motion indication may indicate a particular event in the video data. For example, the motion indication may indicate the start or the end of a program. The start and the end of a program generally comprise a title role in the form of vertically scrolling visual text, which the motion indication indicates. The motion indication may also indicate a type of visual text in the video data, such as the aforementioned title role. The title role may comprise useful information for cataloguing purposes, such as, for example, the title of the program and actors' names. Another example concerns visual text that constitutes news data. Such news data is often presented in the form of so-called tickers. A ticker is a horizontal bar on a screen that comprises visual text that moves in a horizontal direction, from the left to the right or vice versa. The motion indication may indicate such a ticker and therefore indicate news data in the video data. Those examples illustrate that the invention allows greater user convenience. These and other aspects of the invention will be described in greater detail hereinafter with reference to drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram that illustrates a digital video recorder, which comprises a text-motion detection module.
FIG. 2 is a flowchart diagram that illustrates a displacement-vector calculation, which the text-motion detection module carries out.
FIG. 3 is a conceptual diagram that illustrates an overlap calculation, which is based on a projection of a macroblock from a predicted frame to a reference frame.
DETAILED DESCRIPTION
FIG. 1 illustrates a digital video recorder DVR. The digital video recorder DVR comprises an encoding-and-decoding module CODEC and a disk driver DRW for writing data on a disk and for reading data from a disk DSK. The disk DSK may be, for example, an optical disk such as a digital versatile disk (DVD) or a hard disk (HD), on which data is magnetically stored. The digital video recorder DVR further comprises a text detection module TXD, a character recognition module OCR, a text-motion detection module MOD, a data-association module DAS, a semantic database SDB, and a control module CTRL. The control module CTRL comprises a user interface UIF and may receive commands from a remote control device RCD. Any of the aforementioned modules may be implemented by means of software or hardware, or a combination of software and hardware. A suitably programmed processor may carry out operations that will be described hereinafter with reference to the aforementioned modules.
The digital video recorder DVR operates as follows. In a recording mode, the encoding-and-decoding module CODEC receives video input data VI from an external entity. The encoding-and-decoding module CODEC encodes the video input data VI in accordance with a video encoding standard, such as, for example, MPEG2 or MPEG4 (MPEG is a commonly used acronym for Moving Pictures Experts Group). The disk driver DRW receives encoded video input data VIC from the encoding-and-decoding module CODEC and writes that data onto the disk DSK, which is present in the disk driver DRW. Alternatively, the disk driver DRW may receive the encoded video input data VIC directly from an external entity via a codec bypass, which is illustrated in broken lines. In that case, the encoding-and-decoding module CODEC may decode the encoded video input data VIC, which the external entity provides. Accordingly, the digital video recorder DVR can retrieve coding parameters and other data from the encoded video input data VIC. This data is useful for text detection, which will be described in greater detail hereinafter.
In a playback mode, the encoding-and-decoding module CODEC receives encoded video output data VOC from the disk driver DRW. The encoding-and-decoding module CODEC decodes the encoded video output data VOC so as to obtain video output data VO, which can be applied to an external entity, such as, for example, a display device. Alternatively, the disk driver DRW may apply the encoded video output data VOC, which is read from the disk, directly to an external entity that comprises a video decoder. In that case, the coded video output data VOC transits via the codec bypass. The encoding-and-decoding module CODEC may still decode the encoded video output data VOC, which the disk driver DRW provides. Accordingly, the digital video recorder DVR can retrieve coding parameters and other data from the encoded video output data VOC. This data is useful for text detection, which will be described in greater detail hereinafter.
The encoding-and-decoding module CODEC provides various coding parameters PAR and other data, which result from an encoding or a decoding in accordance with the relevant MPEG standard. The coding parameters PAR comprise:
- a "b"-parameter, which indicates the number of bits used for encoding an image slice, excluding overhead bits,
- a "qs"-parameter, which indicates the quantizer scale for a slice,
- a "c"-parameter, which indicates the transform coefficients (DC and AC) of a macroblock,
- a "mad"-parameter, which indicates the mean absolute difference between an image block and the prediction block found by a motion estimator, which forms part of the encoding-and-decoding module CODEC.
The international application published under number WO 02/093910 describes these coding parameters.
The encoding-and-decoding module CODEC further provides motion vectors MV, a predicted frame PF, a reference frame RF, and a frame index IX. The predicted frame PF may be a so-called P-frame or a B-frame as defined by the relevant MPEG standard. The reference frame RF is a so-called I-frame. The frame index IX indicates the position of the frame within a sequence of frames that constitutes a video recording, which corresponds with, for example, a movie. The frame index IX can be associated with a particular instant within an interval of time that corresponds with the video recording. For example, the frame index IX may correspond with the instant "5 minutes and 27 seconds" from the start of the video recording or any other reference in the recording. The text detection module TXD detects, in a reference frame (I-frame), one or more segments that comprise visual text. The text detection module TXD provides text detection data TD that indicates these segments. The text detection data TD may identify, for example, those macroblocks in the reference frame, which comprise visual text. The text detection module TXD may operate in a manner similar to, for example, the subtitle detector in the international application published under number WO 02/093910. Such an implementation of the text detection module TXD detects visual-text segments on the basis of the coding parameters PAR mentioned hereinbefore.
The visual text, which the text detection module TXD detects, may be in the form of subtitles. The visual text may also form part of a scene in the video of interest. For example, the scene may comprise an object with a certain text, such as, for example, a city-name board or a facade with a restaurant name. Generally, this is static text in the sense that the text does not move when displayed on a screen. The visual text may also be in the form of, for example, a title role of a movie, a television series, or any other artistic or documentary video. In many cases, such visual text scrolls on the screen in a vertical direction. This is vertically scrolling text. The visual text may also be in the form of, for example, so-called tickers that comprise daily news, financial news, weather news, or other news. In many cases, such visual text scrolls on the screen in a horizontal direction. This is horizontally scrolling text. Whatever type of text, the text detection module TXD is capable of providing text detection data TD, which indicates segments within a reference frame that comprise visual text.
The character recognition module OCR derives a sequence of characters TXT from a reference frame RF that comprises visual text. The text detection data TD, which the text detection module TXD provides, assists the character recognition module OCR in this process. It is recalled that the text detection data TD indicates the segments in the reference frame RF that comprise visual text. The sequence of characters TXT, which the character recognition module OCR provides, corresponds with the visual text comprised in these segments. It should be noted that the character recognition module OCR can derive various sequences of characters from a reference frame. Each sequence may correspond with, for example, a particular line of visual text. For example, a subtitle may comprise two lines.
The text-motion detection module MOD establishes a motion indication MI on the basis of text detection data TD, which the text detection module TXD provides, and the motion vectors MV, the predicted frame PF, and the reference frame RF, which the encoding-and-decoding module CODEC provides. The motion indication MI indicates whether visual text, which is comprised in a sequence of frames, moves when displayed on the screen or not. The motion indication MI may comprise, for example, a binary value, which indicates whether the visual text moves or not. The motion indication MI may provide further information, for example, an indication whether the visual text moves in a horizontal direction or in a vertical direction. The motion indication MI may also be in the form of a displacement vector, which indicates a displacement direction with relatively great precision. The displacement vector has a length, which indicates a displacement speed, i.e. how fast the visual text moves when displayed on a screen. The motion indication MI may further indicate acceleration or any other useful information that relates to the displacement of the visual text throughout a sequence of frames.
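As a sketch, a motion indication MI of this kind could be derived from a displacement vector as follows. The threshold value and the dictionary layout are assumptions for the illustration, not part of the description:

```python
import math

def motion_indication(dv, threshold=1.0):
    """Classify a displacement vector (in pixels per frame) into a simple
    motion indication: no motion, or horizontal/vertical movement with a
    displacement speed given by the vector's length."""
    dx, dy = dv
    speed = math.hypot(dx, dy)       # length of the displacement vector
    if speed < threshold:
        return {"moving": False}
    direction = "vertical" if abs(dy) >= abs(dx) else "horizontal"
    return {"moving": True, "direction": direction, "speed": speed}

motion_indication((0.0, -3.0))  # -> {'moving': True, 'direction': 'vertical', 'speed': 3.0}
```

A vertically dominant displacement suggests a title role; a horizontally dominant one suggests a ticker.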
The data-association module DAS associates the motion indication MI, which the text-motion detection module MOD provides, with the sequence of characters TXT, which the character recognition module OCR provides. The data-association module DAS stores the sequence of characters TXT and the motion indication MI associated therewith in the semantic database SDB. Optionally, the data-association module DAS can associate one or more frame indices IX with the sequence of characters TXT, and store these indices IX in the semantic database SDB too.
Let it be assumed that the digital video recorder DVR has recorded one or more video programs. The semantic database SDB will comprise a collection of sequences of characters representing visual text comprised in each video program, which has been recorded. The semantic database SDB further comprises the motion indication associated with each sequence of characters and, optionally, the frame indices, which indicate when the sequence of characters appears in the video program.
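The kind of record that ends up in the semantic database SDB can be sketched as follows. The field names and the dictionary layout of the motion indication are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class TextRecord:
    characters: str        # sequence of characters TXT from the OCR module
    motion: dict           # motion indication MI from the MOD module
    frame_indices: list = field(default_factory=list)  # optional indices IX

# A toy semantic database: one record per detected piece of visual text.
sdb = []
sdb.append(TextRecord(
    "Directed by J. Smith",
    {"moving": True, "direction": "vertical", "speed": 3.0},
    [120, 150],
))

# A query for vertically scrolling text (e.g. a title role):
hits = [r for r in sdb if r.motion.get("direction") == "vertical"]
```

A query of this shape is what the control module CTRL would run when the user asks for automatic start/end identification or cataloguing.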
The aforementioned information in the semantic database SDB can assist a user in numerous manners. Some examples will be given hereinafter. These examples have in common that the user exploits the semantic database SDB through the control module CTRL of the digital video recorder DVR, which comprises the user interface UIF. The user interface UIF, which comprises a software program, may cause, for example, one or more menus to be displayed from which the user may select a particular item. The user may navigate through various menus and make selections by means of the remote control RCD.
The user interface UIF may present the user an option within a menu that allows automatic identification of the start or the end, or both, of a video program that has been recorded. Let it be assumed that the user chooses this option. The control module CTRL will search in the semantic database SDB for a motion indication that indicates vertically scrolling text. In many cases, a title role of a movie, a television series, or any other artistic or documentary video, comprises visual text that scrolls on the screen in a vertical direction.
The control module CTRL retrieves the frame indices that are associated with the motion indication that indicates vertically scrolling text. The control module CTRL may, for example, associate a start marker with one or more frames that correspond with the frame indices retrieved from the semantic database SDB, and that are relatively close to the start of the recording. The control module CTRL may further associate an end marker with one or more frames that correspond with the frame indices retrieved from the semantic database SDB, and that are relatively close to the end of the recording. The video program of interest lies between the start marker and the end marker. Other parts of the recording may comprise commercials or other video scenes, which are of less interest.
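The marker placement described above can be sketched as follows. The 10% margin used to decide what is "relatively close" to the start or end of the recording, and all names, are assumptions for the illustration:

```python
def place_markers(scroll_indices, total_frames, margin=0.1):
    """Pick start and end markers from the frame indices associated with
    vertically scrolling text: indices near the start of the recording
    yield the start marker, indices near the end yield the end marker."""
    near_start = [i for i in scroll_indices if i < margin * total_frames]
    near_end = [i for i in scroll_indices if i > (1 - margin) * total_frames]
    start_marker = max(near_start) if near_start else None  # after opening titles
    end_marker = min(near_end) if near_end else None        # before closing titles
    return start_marker, end_marker

place_markers([300, 500, 98000], total_frames=100000)  # -> (500, 98000)
```

The video program of interest then lies between the two markers, with commercials or other scenes outside them.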
Optionally, the control module CTRL may cause the disk driver DRW to add the start marker and the end marker to the recording. Alternatively, the control module CTRL may cause the start marker and the end marker to be stored within the digital video recorder DVR in association with an identification of the disk DSK on which the recording has been made.
Preferably, the user interface UIF causes the digital video recorder DVR to play back respective portions of the recording that comprise vertically scrolling text. The digital video recorder DVR can find these portions thanks to the frame indices associated with the motion indication in the semantic database SDB. This selective playback allows the user to check that the detected vertically scrolling text corresponds with the start or the end of the video program. The control module CTRL generates the start marker and the end marker when the user has validated this check. The user may also fine-tune, as it were, the start marker and the end marker by placing these markers just after the vertically scrolling text at the start of the recording and just before the vertically scrolling text at the end of the recording, respectively.
The user interface UIF may further present the user an option that allows him or her to catalog the video program, which has been recorded. This cataloguing of the video program allows content management and content browsing and navigation. The cataloguing of the video program may comprise various different items, such as, for example, a title, one or more actors' names, a producer's name, a production date, and other characteristics of the video program, which appear in the title role. These characteristics generally appear in the form of vertically scrolling text. Let it be assumed that the user chooses the catalog option. The control module CTRL will search in the semantic database SDB for a motion indication that indicates vertically scrolling text, as described hereinbefore. The semantic database SDB comprises sequences of characters, which are associated with this motion indication. These sequences of characters correspond with the vertically scrolling text comprised in the video program when displayed on a screen. Consequently, a sequence of characters may correspond with the title of the video program, another sequence may correspond with an actor who plays a role in the video program, yet another sequence may correspond with the producer of the video program, and yet another sequence may correspond with the production date of the video program, and so on.
The control module CTRL may retrieve these sequences of characters, which are associated with the motion indication that indicates vertically scrolling text, from the semantic database SDB so as to copy these sequences into, for example, a cache memory within the digital video recorder DVR. The user may then browse through these sequences of characters so as to select one or more sequences that are of interest for the purpose of cataloguing. The user may copy a selected sequence of characters into his or her catalog. Accordingly, the user can catalog the video program, which he or she has recorded, in a relatively simple manner thanks to the motion indication that indicates that the sequences of characters relate to vertically scrolling text, which is typical of a title role.
The user interface UIF may further present the user an option within a menu that allows automatic identification of news information. Certain programs comprise one or more tickers for displaying textual news information. A ticker is a horizontal bar on a screen that comprises visual text that moves in a horizontal direction, from the left to the right or vice versa. Let it be assumed that the user chooses the news-information option. The control module CTRL will search in the semantic database SDB for a motion indication that indicates a horizontally scrolling text. The semantic database SDB comprises sequences of characters, which are associated with this motion indication. These sequences of characters correspond with the horizontally scrolling text comprised in the program when displayed on a screen. That is, these sequences of characters generally correspond with news information comprised in a ticker. Consequently, a sequence of characters may correspond with a general news item, a financial news item, such as, for example, stock prices, or a weather forecast, and so on.
The control module CTRL may retrieve these sequences of characters, which are associated with the motion indication that indicates horizontally scrolling text, from the semantic database SDB so as to copy these sequences into, for example, a cache memory within the digital video recorder DVR. The user may then browse through these sequences of characters so as to select one or more items that are of interest. The user interface UIF may also comprise a search engine that allows the user to find a particular news item.
FIG. 2 illustrates a displacement-vector calculation, which the text-motion detection module MOD carries out. The displacement-vector calculation provides an indication of movement of visual text from a reference frame to a predicted frame that is subsequent to the reference frame. That is, the displacement-vector calculation, which FIG. 2 illustrates, applies to a forward prediction of a frame. The reference frame, which may be an I-frame or a P-frame, on which the prediction is based, precedes the predicted frame, which may be a P-frame or a B-frame.
In a first step ST1, the text-motion detection module MOD receives the text detection data TD from the text detection module TXD (TD : MB[RF] ∋ VTX). It is assumed that the text detection module TXD has already detected one or more macroblocks in the reference frame that comprise visual text. Accordingly, the text detection data TD, which indicates these macroblocks, is available.
The displacement-vector calculation, which FIG. 2 illustrates, comprises a series of steps ST2-ST6 for each macroblock in the predicted frame. Each macroblock in the predicted frame has a motion vector, which the encoding-and-decoding module CODEC provides. The motion vector indicates a particular block of pixels in the reference frame, which is similar to the macroblock in the predicted frame. In general, the motion vector indicates movement of an object, which the respective macroblocks at least partially represent, from the predicted frame to the reference frame.
As mentioned hereinbefore, the displacement-vector calculation that FIG. 2 illustrates applies to a forward prediction: the predicted frame is subsequent to the reference frame. Consequently, the motion vector points backwards in time. What is more, the motion vector points from the predicted frame to the reference frame, but the text detection module TXD cannot provide text indication data that relates to the predicted frame. That is, the location of the visual text in the predicted frame is not yet known. What is known is the location of the visual text in the reference frame. The text indication data TD indicates this. In step ST2, the text-motion detection module MOD projects the macroblock concerned from the predicted frame to the reference frame. The motion vector that belongs to the macroblock defines this projection. Accordingly, a projected block within the reference frame is obtained (MB[PF] & MV : PROJ[PF→RF] => PB). The projected block will generally not precisely coincide with a macroblock in the reference frame. The projected block will generally overlap a cluster of four different macroblocks in the reference frame. That is, the projected block will have a certain overlap with each of these four different macroblocks.
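The projection of step ST2, and the fact that the projected block generally straddles four macroblocks, can be sketched as follows. The 16x16 macroblock size, the pixel-coordinate convention, and the function names are assumptions made for illustration; they are not specified by the description above.

```python
MB = 16  # assumed macroblock size in pixels (as in MPEG video)

def project(mb_pos, mv):
    """ST2: displace a predicted-frame macroblock along its (backward-
    pointing) motion vector to obtain the projected block's top-left
    pixel position in the reference frame."""
    return (mb_pos[0] + mv[0], mb_pos[1] + mv[1])

def straddled_macroblocks(pb):
    """Grid indices (i, j) of the up-to-four reference-frame macroblocks
    that a projected block at pixel position pb overlaps."""
    i, j = pb[0] // MB, pb[1] // MB
    # a block aligned with the grid in one dimension straddles fewer blocks
    cols = [i] if pb[0] % MB == 0 else [i, i + 1]
    rows = [j] if pb[1] % MB == 0 else [j, j + 1]
    return [(c, r) for r in rows for c in cols]
```

A block projected to pixel position (36, 20), for instance, straddles the four macroblocks with grid indices (2, 1), (3, 1), (2, 2), and (3, 2), whereas a block that lands exactly on the grid coincides with a single macroblock.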
In step ST3, the text-motion detection module MOD detects whether the projected block overlaps with a macroblock in the reference frame that comprises visual text, or not (OVR[TD]?). It is recalled that the text detection data TD indicates the macroblocks in the reference frame that comprise visual text. Let it be assumed that, in reply to the test performed in step ST3, the projected block does not overlap (reply N) with a macroblock in the reference frame that comprises visual text. In that case, the text-motion detection module MOD directly carries out steps ST2 and ST3 anew for a subsequent macroblock in the predicted frame, without carrying out steps ST4-ST6 for the macroblock concerned. Conversely, the text-motion detection module MOD carries out steps ST4-ST6 for the macroblock concerned if the projected block overlaps (reply Y) with at least one macroblock in the reference frame that comprises visual text.

In step ST4, the text-motion detection module MOD establishes an overlap percentage for each macroblock in the reference frame that comprises visual text and that overlaps the projected block (CLC[OVR%]; PB & MB[RF] => OVR%). The overlap percentage indicates the size of the portion of the projected block that falls within the macroblock concerned in the reference frame. The overlap percentage is 100% if the projected block fully coincides with the macroblock concerned in the reference frame. As an example, the overlap percentage is 50% if half of the projected block falls within the macroblock concerned in the reference frame.
FIG. 3 illustrates steps ST3 and ST4 described hereinbefore. FIG. 3 illustrates, in full lines, four macroblocks MB[i,j], MB[i+1,j], MB[i,j+1], MB[i+1,j+1] in the reference frame. FIG. 3 further illustrates the projected block PB. A portion of the projected block PB overlaps macroblock MB[i,j]. The overlap percentage is 50%. Another portion of the projected block overlaps macroblock MB[i+1,j]. The overlap percentage is 20%. Yet another portion of the projected block overlaps macroblock MB[i,j+1]. The overlap percentage is 20% too. Yet another portion of the projected block overlaps macroblock MB[i+1,j+1]. The overlap percentage is 10%.
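The overlap percentages of step ST4 follow from elementary geometry once the projected block's pixel position is known. The sketch below assumes 16x16 macroblocks and top-left pixel coordinates; it is illustrative, not the patent's own implementation.

```python
MB = 16  # assumed macroblock size in pixels

def overlap_percentages(px, py):
    """ST4: overlap percentage of a projected block at pixel position
    (px, py) with each reference-frame macroblock it straddles.
    Returns a dict keyed by macroblock grid index (i, j)."""
    dx, dy = px % MB, py % MB   # offset within the top-left macroblock
    i, j = px // MB, py // MB   # grid index of the top-left macroblock
    areas = {
        (i,     j):     (MB - dx) * (MB - dy),
        (i + 1, j):     dx * (MB - dy),
        (i,     j + 1): (MB - dx) * dy,
        (i + 1, j + 1): dx * dy,
    }
    # drop zero-area entries (block aligned with the grid in a dimension)
    return {k: 100.0 * a / (MB * MB) for k, a in areas.items() if a > 0}
```

For an offset of (4, 4) pixels the percentages come out as 56.25%, 18.75%, 18.75%, and 6.25%, and they always sum to 100% — close to the rounded 50/20/20/10 split that FIG. 3 depicts.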
In step ST5, the text-motion detection module MOD inverts the motion vector that belongs to the macroblock in the predicted frame for which steps ST2-ST6 are carried out. It is recalled that the motion vector defines the projection of this macroblock, which results in the projected block. Step ST5 provides an inverted motion vector, which points from the reference frame to the predicted frame (INV[MV] => IV). Consequently, the inverted motion vector points forward in time.
In step ST6, the text-motion detection module MOD associates the inverted motion vector with each macroblock in the reference frame that overlaps the projected block (MB[RF] ↔ IV, OVR%). The text-motion detection module MOD further associates with each such macroblock the overlap percentage that has been calculated, in step ST4, for the macroblock concerned in the reference frame. The overlap percentage reflects a degree of confidence, as it were, that the inverted motion vector faithfully indicates movement of textual matter from the macroblock concerned in the reference frame to the predicted frame. The text-motion detection module MOD temporarily stores the inverted motion vector and the overlap percentage for each macroblock concerned in the reference frame together with an identification of that macroblock. Referring to FIG. 3, the text-motion detection module MOD stores the inverted motion vector with an overlap percentage of 50% for macroblock MB[i,j], and stores the same inverted motion vector with an overlap percentage of 20%, 20%, and 10% for macroblocks MB[i+1,j], MB[i,j+1], and MB[i+1,j+1], respectively.
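Steps ST5 and ST6 amount to inverting the motion vector and to per-macroblock bookkeeping. A minimal sketch follows; the data-structure choices are assumptions, not taken from the description above.

```python
from collections import defaultdict

def invert(mv):
    """ST5: invert a motion vector so that it points forward in time."""
    return (-mv[0], -mv[1])

def record(store, text_mbs, overlaps, mv):
    """ST6: associate the inverted motion vector and the overlap
    percentage with every text-bearing reference-frame macroblock
    that the projected block overlaps.
    store:    defaultdict(list) mapping (i, j) -> [(iv, weight), ...]
    text_mbs: set of (i, j) indices flagged by the text detection data TD
    overlaps: dict (i, j) -> overlap percentage for one projected block"""
    iv = invert(mv)
    for mb, pct in overlaps.items():
        if mb in text_mbs:
            store[mb].append((iv, pct))
```

Macroblocks that overlap the projected block but carry no visual text are simply skipped, mirroring the reply-N branch of step ST3.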
FIG. 3 can further illustrate the aforementioned aspect, which relates to the degree of confidence in the inverted motion vector. In FIG. 3, the overlap percentage for macroblock MB[i,j] is 50%, which is relatively high. Consequently, the inverted motion vector indicates with reasonable precision movement of textual matter from macroblock MB[i,j] in the reference frame to the predicted frame. This is because the motion vector, which points from the predicted frame to the reference frame, causes a substantial overlap between the projected block and macroblock MB[i,j].
Let it be assumed that the overlap percentage for macroblock MB[i,j] were 100%. In that case, the projected block PB would fully coincide with macroblock MB[i,j]. The motion vector has caused this projection. Let it now be assumed that an inverse projection is made based on the inverted motion vector. Macroblock MB[i,j] is projected from the reference frame to the predicted frame. In this 100% overlap example, the projection of macroblock MB[i,j] will fully coincide with the relevant macroblock in the predicted frame to which the motion vector belongs. Stated boldly, a 100% overlap indicates that the motion vector, when inverted, provides an accurate projection in the opposite direction.
Let it be assumed that the text-motion detection module MOD has carried out the series of steps ST2-ST6 for each macroblock in the predicted frame. The text-motion detection module MOD has generated the following data for each macroblock in the reference frame that comprises visual text: at least one inverted motion vector, and an overlap percentage associated with each inverted motion vector. Generally, various inverted motion vectors will be generated for a macroblock in the reference frame that comprises visual text. This can be explained with reference to FIG. 3.
FIG. 3 illustrates the projection of a particular macroblock from the predicted frame to the reference frame. The text-motion detection module MOD will project further, neighboring macroblocks from the predicted frame to the reference frame. Each of these respective projections will provide respective projected blocks. Any of these projected blocks may overlap one or more of the macroblocks MB[i,j], MB[i+1,j], MB[i,j+1], MB[i+1,j+1], which FIG. 3 illustrates. Furthermore, each respective projection will be based on a different motion vector, namely the motion vector that belongs to the macroblock in the predicted frame, which is projected.
In step ST7, the text-motion detection module MOD calculates a displacement vector for the visual text (CLC[DV]). The text-motion detection module MOD makes a weighed combination of the respective inverted motion vectors, which have been established for macroblocks in the reference frame that comprise visual text. The respective overlap percentages, which are associated with the respective inverted motion vectors, constitute weighing factors. The aforementioned weighed combination constitutes the displacement vector (AVG[IV,OVR%] => DV).
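The weighed combination of step ST7 is then a weighted average over all recorded inverted motion vectors, with the overlap percentages as weights. A sketch under an assumed data structure of per-macroblock lists of (inverted vector, weight) pairs:

```python
def displacement_vector(store):
    """ST7: weighted average of the inverted motion vectors recorded for
    text-bearing macroblocks; the overlap percentages act as weights."""
    sx = sy = w = 0.0
    for entries in store.values():
        for (ivx, ivy), pct in entries:
            sx += ivx * pct
            sy += ivy * pct
            w += pct
    # no recorded vectors means no detected text motion
    return (sx / w, sy / w) if w else (0.0, 0.0)
```

A vector recorded with a 50% overlap thus pulls the displacement vector toward itself more strongly than one recorded with a 10% overlap, reflecting the confidence interpretation described above.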
The text-motion detection module MOD may also carry out another displacement-vector calculation, which provides an indication of movement of visual text from a predicted frame (P-frame or B-frame) to a reference frame (I-frame) that is subsequent to the predicted frame. This displacement-vector calculation is different from the one that FIG. 2 illustrates. The main features of the other displacement-vector calculation are as follows.
The text-motion detection module MOD first establishes a set of macroblocks in the predicted frame that comprises visual text. This text detection may be based on, for example, one or more displacement vectors that have been calculated for previous frames. Each macroblock in the aforementioned set has a motion vector that points from the predicted frame to the reference frame. The motion vector points forward in time because the reference frame is subsequent to the predicted frame. The text-motion detection module MOD calculates an average of all relevant motion vectors, that is, all motion vectors that belong to the set of macroblocks that comprises visual text. The average constitutes the displacement vector.
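In this other calculation the motion vectors already point forward in time, so the displacement vector reduces to a plain average. A sketch, assuming vectors given as (x, y) tuples:

```python
def displacement_vector_fwd(motion_vectors):
    """Average the motion vectors of the macroblocks already known to
    contain visual text; no inversion is needed because these vectors
    point forward in time (the reference frame follows the predicted
    frame)."""
    n = len(motion_vectors)
    sx = sum(v[0] for v in motion_vectors)
    sy = sum(v[1] for v in motion_vectors)
    return (sx / n, sy / n)
```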
Accordingly, the text-motion detection module MOD may calculate a sequence of displacement vectors for a sequence of frames. To that end, the text-motion detection module MOD may carry out either of the displacement-vector calculations described hereinbefore. Generally, displacement vectors that form part of the sequence should be similar. This is because text generally moves on the screen in a steady, monotonous fashion, that is, text generally scrolls with constant speed.
The text-motion detection module MOD may check whether the displacement vectors are indeed similar, or not. The motion indication, which the text-motion detection module MOD provides, may indicate an anomaly when the displacement vectors are substantially different. Alternatively, the text-motion detection module MOD may provide a no-motion indication, or may signal this anomaly to the text detection module TXD. This anomaly signaling may cause the text detection module TXD to make one or more further detections or to make a different, more precise detection. The text-motion detection module MOD may also signal the anomaly to the character recognition module OCR, so as to prevent erroneous character recognition.
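The similarity check can be sketched as a deviation-from-mean test over the sequence of displacement vectors; the tolerance value below is a hypothetical parameter, not specified by the description.

```python
def check_motion(dvs, tol=1.0):
    """Flag an anomaly when displacement vectors in a sequence deviate
    substantially from their mean; tol is an assumed per-component
    tolerance in pixels (scrolling text should move at near-constant
    speed). Returns (anomaly, mean displacement vector)."""
    n = len(dvs)
    mean = (sum(v[0] for v in dvs) / n, sum(v[1] for v in dvs) / n)
    anomaly = any(abs(v[0] - mean[0]) > tol or abs(v[1] - mean[1]) > tol
                  for v in dvs)
    return anomaly, mean
```

The returned mean corresponds to the averaged displacement vector of the motion indication MI described next; the anomaly flag is what would be signaled to the text detection module TXD or the character recognition module OCR.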
The motion indication MI, which the text-motion detection module MOD provides, may comprise the following elements: an average of the displacement vectors, which have been established for a sequence of frames, accompanied by an indication of the frame in which the visual text concerned enters the screen and an indication of the frame in which the visual text leaves the screen. To that end, the text-motion detection module MOD may receive frame indices from the encoding-and-decoding module CODEC. Alternatively, the database manager may carry out this marking of the visual text entering the screen and leaving the screen, respectively.

CONCLUDING REMARKS
The detailed description hereinbefore with reference to the drawings illustrates the following characteristics, which are cited in claim 1. A video apparatus (digital video recorder DVR) comprises a text detection module (TXD), a text-motion detection module (MOD), and a user interface (UIF). The text detection module (TXD) detects visual text in video data. The text-motion detection module (MOD) provides a motion indication (MI) for the visual text that the text detection module (TXD) has detected. The user interface (UIF) allows a user to access a particular portion of the video data on the basis of the motion indication (MI) that the text-motion detection module (MOD) has provided.
The detailed description hereinbefore further illustrates the following optional characteristics, which are cited in claim 2. A control module (CTRL) marks a particular portion of the video data, which comprises the visual text, on the basis of the motion indication (MI), which the text-motion detection module (MOD) has provided. The control module may, for example, insert a marker in the video data or store a marker in a database in association with the video data. The marker may be a start marker, an end marker, or any other marker that is useful for content management. These characteristics allow the user to conveniently access the video data any subsequent time.
The detailed description hereinbefore further illustrates the following optional characteristics, which are cited in claim 3. A character recognition module (OCR) derives a sequence of characters (TXT) from the visual text in the video data that the text detection module (TXD) has detected. A data-association module (DAS) associates the sequence of characters (TXT), which the character recognition module (OCR) has derived, with the motion indication (MI), which the text-motion detection module (MOD) has provided. These characteristics facilitate the retrieval and the handling of textual information comprised in video data. Consequently, these characteristics further contribute to more user convenience. The detailed description hereinbefore illustrates various aspects thereof, which are cited in claims 4, 5, and 6.
The detailed description hereinbefore further illustrates the following optional characteristics, which are cited in claim 7. A video processing module (CODEC) provides motion vectors (MV) that indicate movement of an object, which the video data represents, from a predicted frame (PF) to a reference frame (RF). The text-motion detection module (MOD) establishes the motion indication (MI) for the visual text that the text detection module (TXD) has detected, on the basis of the motion vectors (MV), which the video processing module (CODEC) has provided.
The aforementioned characteristics can be implemented in numerous different manners. In order to illustrate this, some alternatives are briefly indicated. The digital video recorder described in detail hereinbefore is merely an example of a video apparatus in accordance with the invention. The video apparatus may also be in the form of, for example, a set-top box, a television set, or a mobile phone. The digital video apparatus need not necessarily be MPEG-based. The invention can be applied in any video apparatus that comprises a video processor providing some form of motion indication. For example, the digital video apparatus may be based on the H.263 standard for mobile video telephony. The digital video apparatus need not necessarily comprise a disk drive or any video storage device.
The digital video apparatus need not necessarily comprise any video coder or decoder. It is possible to detect visual text in plain, uncompressed video data without the use of any (de-)coder parameters. It is also possible to detect text motion in plain, uncompressed video data without the use of any motion vectors, which a video (de-)coder typically provides. In case visual text is detected on the basis of (de-)coder parameters, these parameters need not necessarily be standard coding parameters. For example, visual text detection may involve non-standard parameters, which are specific to a particular implementation of a video coder or a decoder. That is, the video coder or decoder generates proprietary parameters, which may be used for the purpose of visual text detection. Character recognition, if any, may be carried out in a classical fashion, without the use of any text indication derived from (de-)coder parameters or any other parameters relating to the video data.
There are numerous different techniques to detect text motion. The technique described hereinbefore with reference to FIG. 2 is merely an example. For example, a motion-vector weighing based on overlap calculation is not mandatory, although such a weighing calculation is advantageous. It is not necessary to weigh motion vectors, which a (de-)coder provides, in order to establish a text-motion indication. That is, the text-motion detection that FIG. 2 illustrates can be simplified. Such simplification may, however, be at the expense of motion-detection precision. It should further be noted that text-motion detection results need not necessarily be stored in a semantic database. For example, an application may use text-motion detection results for the purpose of content marking or image quality improvement only. In such an application, text-motion detection results can be stored in an ordinary memory. The terms "frame" and "image" should be understood in a broad sense. These terms are interchangeable and include a field or any other entity that may wholly or partially constitute an image or picture. The term "scrolling text" should equally be understood in a broad sense. Scrolling text may move in a non-monotonous fashion. For example, scrolling text may move in a discontinuous, jumpy fashion or have a significant acceleration.
There are numerous ways of implementing functions by means of items of hardware or software, or both. In this respect, the drawings are very diagrammatic, each representing only one possible embodiment of the invention. Thus, although a drawing shows different functions as different blocks, this by no means excludes that a single item of hardware or software carries out several functions. Nor does it exclude that an assembly of items of hardware or software or both carry out a function.
The remarks made hereinbefore demonstrate that the detailed description, with reference to the drawings, illustrates rather than limits the invention. There are numerous alternatives, which fall within the scope of the appended claims. Any reference sign in a claim should not be construed as limiting the claim. The word "comprising" does not exclude the presence of other elements or steps than those listed in a claim. The word "a" or "an" preceding an element or step does not exclude the presence of a plurality of such elements or steps.

Claims

1. A video apparatus (DVR) comprising: a text detection module (TXD) arranged to detect visual text in video data; a text-motion detection module (MOD) arranged to provide a motion indication (MI) for the visual text that the text detection module (TXD) has detected; and a user interface (UIF) arranged to allow a user to access a particular portion of the video data on the basis of the motion indication (MI) that the text-motion detection module (MOD) has provided.
2. A video apparatus as claimed in claim 1 comprising a control module
(CTRL) arranged to mark a particular portion of the video data, which comprises the visual text, on the basis of the motion indication (MI), that the text-motion detection module (MOD) has provided.
3. A video apparatus as claimed in claim 1 comprising: a character recognition module (OCR) arranged to derive a sequence of characters (TXT) from the visual text in the video data that the text detection module (TXD) has detected, and a data-association module (DAS) arranged to associate the sequence of characters (TXT), which the character recognition module (OCR) has derived, with the motion indication (MI), that the text-motion detection module (MOD) has provided.
4. A video apparatus as claimed in claim 1, the user interface (UIF) being arranged to allow a user to request a program-change search, the video apparatus (DVR) comprising: a control module (CTRL) arranged to select, in response to the program-change search, a particular portion of the video data for which the text-motion detection module (MOD) has provided a motion indication (MI) that indicates vertically scrolling visual text.
5. A video apparatus as claimed in claim 3, the user interface (UIF) being arranged to allow a user to present a program-data query, the video apparatus (DVR) comprising: a control module (CTRL) arranged to retrieve, in response to the program-data query, a sequence of characters (TXT) associated with a text motion indication (MI) that indicates vertically scrolling text.
6. A video apparatus as claimed in claim 3, the user interface (UIF) being arranged to allow a user to present a news-data query, the video apparatus (DVR) comprising: - a control module (CTRL) arranged to retrieve, in response to the news- data query, a sequence of characters (TXT) associated with a text motion indication (MI) that indicates horizontally scrolling text.
7. A video apparatus as claimed in claim 1 comprising: - a video processing module (CODEC) arranged to provide motion vectors (MV) that indicate movement of an object, which the video data represents, from a predicted frame (PF) to a reference frame (RF), the text-motion detection module (MOD) being arranged to establish the motion indication (MI) for the visual text that the text detection module (TXD) has detected, on the basis of the motion vectors (MV), that the video processing module (CODEC) has provided.
8. A video apparatus as claimed in claim 7, the text detection module (TXD) being arranged to indicate pixel-blocks in the reference frame (RF) that comprise visual text, the text-motion detection module (MOD) being arranged to calculate a displacement vector, which indicates movement of the visual text from the reference frame (RF) to the predicted frame (PF), on the basis of respective motion vectors (MV) that cause respective pixel-blocks in the predicted frame (PF) to be at least partially projected to a pixel- block in the reference frame (RF) that comprises visual text.
9. A video apparatus as claimed in claim 8, the text-motion detection module (MOD) being arranged to calculate a measure of overlap (OVR%) that indicates the extent to which a projection of a pixel-block from the predicted frame (PF) to the reference frame (RF) in accordance with the motion vector for that pixel-block, overlaps a pixel-block in the reference frame (RF), the motion detection module being further arranged to use the measure of overlap (OVR%) as a weighing factor for the motion vector in the calculation of the displacement vector.
10. A video apparatus as claimed in claim 1, the text-motion detection module (MOD) being arranged to establish a plurality of motion indications (MI) for a sequence of images, a motion indication (MI) relating to a displacement, if any, of visual text from one image to another, the text-motion detection module (MOD) further being arranged to check whether the motion indications (MI) are similar or not, and to provide an anomaly indication when the motions indications are not similar.
11. A method of handling video data comprising: a text detection step (TXD) in which visual text in the video data is detected ; - a text-motion detection step (MOD) in which a motion indication (MI) for the visual text that the text detection module (TXD) has detected, is provided; and a user interface (UIF) step which allows a user to access a particular portion of the video data on the basis of the motion indication (MI) that the text-motion detection step has provided.
12. A computer program product for a video apparatus (DVR), the computer program product comprising a set of instructions that, when loaded into the video apparatus (DVR), causes the video apparatus (DVR) to carry out: a text detection step (TXD) in which visual text in the video data is detected; a text-motion detection step (MOD) in which a motion indication (MI) for the visual text that the text detection module (TXD) has detected is provided; and a user interface (UIF) step which allows a user to access a particular portion of the video data on the basis of the motion indication (MI) that the text-motion detection step has provided.
PCT/IB2006/050936 2005-03-30 2006-03-28 Method and apparatus for the detection of text in video data WO2006103625A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP05300230 2005-03-30
EP05300230.9 2005-03-30

Publications (1)

Publication Number Publication Date
WO2006103625A1 true WO2006103625A1 (en) 2006-10-05

Family

ID=36649428

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2006/050936 WO2006103625A1 (en) 2005-03-30 2006-03-28 Method and apparatus for the detection of text in video data

Country Status (1)

Country Link
WO (1) WO2006103625A1 (en)


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CRANDALL D ET AL: "Extraction of special effects caption text events from digital video", INTERNATIONAL JOURNAL ON DOCUMENT ANALYSIS AND RECOGNITION, 2003, pages 138 - 157, XP002390630, Retrieved from the Internet <URL:http://citeseer.ist.psu.edu/crandall03extraction.html> [retrieved on 20060803] *
CRANDALL D: "EXTRACTION OF UNCONSTRAINED CAPTION TEXT FROM GENERAL-PURPOSE VIDEO", May 2001, THE PENNSYLVANIA STATE UNIVERSITY, XP002390637 *
HUIPING LI ET AL: "Automatic Text Detection and Tracking in Digital Video", IEEE TRANSACTIONS ON IMAGE PROCESSING, IEEE SERVICE CENTER, PISCATAWAY, NJ, US, vol. 9, no. 1, January 2000 (2000-01-01), XP011025505, ISSN: 1057-7149 *
JUNG K ET AL: "Text information extraction in images and video: a survey", PATTERN RECOGNITION, ELSEVIER, KIDLINGTON, GB, vol. 37, no. 5, May 2004 (2004-05-01), pages 977 - 997, XP004496837, ISSN: 0031-3203 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7912289B2 (en) 2007-05-01 2011-03-22 Microsoft Corporation Image text replacement
CN116911924A (en) * 2023-09-12 2023-10-20 南京闲侠信息科技有限公司 Intelligent advertisement data comparison method and system
CN116911924B (en) * 2023-09-12 2023-11-21 南京闲侠信息科技有限公司 Intelligent advertisement data comparison method and system

Similar Documents

Publication Publication Date Title
Tan et al. Rapid estimation of camera motion from compressed video with application to video annotation
US6912327B1 (en) Imagine information describing method, video retrieval method, video reproducing method, and video reproducing apparatus
US7046731B2 (en) Extracting key frames from a video sequence
KR100915847B1 (en) Streaming video bookmarks
US7469010B2 (en) Extracting key frames from a video sequence
US9147112B2 (en) Advertisement detection
US20100322310A1 (en) Video Processing Method
JPH04207878A (en) Moving image management device
JP2007082088A (en) Contents and meta data recording and reproducing device and contents processing device and program
US20140147100A1 (en) Methods and systems of editing and decoding a video file
KR100846770B1 (en) Method for encoding a moving picture and apparatus therefor
CN1312614C (en) Method and apparatus for detecting fast motion scenes
US6801294B2 (en) Recording and/or reproducing apparatus and method using key frame
Zhang Content-based video browsing and retrieval
Smeaton Indexing, browsing and searching of digital video
KR20070028535A (en) Video/audio stream processing device and video/audio stream processing method
WO2006103625A1 (en) Method and apparatus for the detection of text in video data
US7353451B2 (en) Meta data creation apparatus and meta data creation method
JP2002281433A (en) Device for retrieving and reading editing moving image and recording medium
JP2001119661A (en) Dynamic image editing system and recording medium
KR20060102639A (en) System and method for playing mutimedia data
US20060215995A1 (en) Information recording device information recording method information reproduction device information reproduction method information recording program information reproduction program information recording medium and recording medium
JP2002199348A (en) Information reception recording and reproducing device
KR20020040503A (en) Shot detecting method of video stream
AU762791B2 (en) Extracting key frames from a video sequence

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

NENP Non-entry into the national phase

Ref country code: RU

WWW Wipo information: withdrawn in national office

Country of ref document: RU

122 Ep: pct application non-entry in european phase

Ref document number: 06727753

Country of ref document: EP

Kind code of ref document: A1

WWW Wipo information: withdrawn in national office

Ref document number: 6727753

Country of ref document: EP