WO2006103625A1 - Method and apparatus for the detection of text in video data - Google Patents


Info

Publication number
WO2006103625A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
motion
detection module
video
mod
Prior art date
Application number
PCT/IB2006/050936
Other languages
French (fr)
Inventor
Jan Nesvadba
Igor Nagorski
Original Assignee
Koninklijke Philips Electronics N.V.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Koninklijke Philips Electronics N.V. filed Critical Koninklijke Philips Electronics N.V.
Publication of WO2006103625A1 publication Critical patent/WO2006103625A1/en


Classifications

    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B 27/19 Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
    • G11B 27/28 Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F 16/7844 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F 16/7847 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • G06F 16/786 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content using motion, e.g. object motion or camera motion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition

Definitions

  • An aspect of the invention relates to a video apparatus arranged to detect text in video data.
  • the video apparatus may be, for example, a digital video recorder that records video data on an optical disk (DVD) or a magnetic disk (HD), or both.
  • Other aspects of the invention relate to a method of detecting text in video data, and a computer program for a video apparatus.
  • the international patent application published under number WO 02/093910 describes detection of the presence, appearance or disappearance of subtitles in a video signal. This detection comprises operations that an MPEG encoder or decoder typically carries out. The detection therefore requires relatively few additional operations.
  • a detected subtitle may be subjected to a character recognition algorithm, which provides an electronic version of the text. The electronic text may be separately stored and subsequently used for indexing video scenes stored in a database.
  • a typical application thereof is retrieval of video scenes in a video recorder based on spoken keywords.
  • a video apparatus comprises a text detection module, a text-motion detection module, and a user interface.
  • the text detection module detects visual text in video data.
  • the text-motion detection module provides a motion indication for the visual text that the text detection module has detected.
  • the user interface allows a user to access a particular portion of the video data on the basis of the motion indication that the text-motion detection module has provided.
  • Video data is generally rich in information. In principle, this makes it relatively difficult for an average person to handle video data and, more specifically, to access a particular piece of information that the video data comprises.
  • video data may comprise visual text of various different types.
  • Visual text may be, for example, a subtitle, a title role that provides information about a program, which the video data comprises, or news in a telegraphic form. Visual text may also mark a particular event, such as, for example, the start of a program, the end of a program, or a new chapter.
  • Sequences of characters can be derived from visual text comprised in video data by means of a character recognition algorithm.
  • the sequences of characters, which have been derived from the video data, can be stored in a database and handled separately. For example, a user may browse through the database to find a subtitle that comprises a particular word.
  • the aforementioned prior art, which relates to subtitles, appears to suggest this possibility.
  • the database need not be restricted to subtitles only. That is, the database may comprise a collection of sequences of characters that relate to various different types of visual text. This complicates browsing. It will be relatively difficult to find a particular piece of information in the database.
  • a video apparatus comprises a text-motion detection module, which provides a motion indication for visual text that has been detected.
  • a user interface allows a user to access a particular portion of the video data on the basis of the motion indication that the text-motion detection module has provided.
  • the motion indication may indicate a particular event in the video data.
  • the motion indication may indicate the start or the end of a program.
  • the start and the end of a program generally comprise a title role in the form of vertically scrolling visual text, which the motion indication indicates.
  • the motion indication may also indicate a type of visual text in the video data, such as the aforementioned title role.
  • the title role may comprise useful information for cataloguing purposes, such as, for example, the title of the program and actors' names.
  • Another example concerns visual text that constitutes news data. Such news data is often presented in the form of so-called tickers.
  • a ticker is a horizontal bar on a screen that comprises visual text that moves in a horizontal direction, from the left to the right or vice versa.
  • the motion indication may indicate such a ticker and therefore indicate news data in the video data.
  • FIG. 1 is a block diagram that illustrates a digital video recorder, which comprises a text-motion detection module.
  • FIG. 2 is a flowchart diagram that illustrates a displacement-vector calculation, which the text-motion detection module carries out.
  • FIG. 3 is a conceptual diagram that illustrates an overlap calculation, which is based on a projection of a macroblock from a predicted frame to a reference frame.
  • FIG. 1 illustrates a digital video recorder DVR.
  • the digital video recorder DVR comprises an encoding-and-decoding module CODEC and a disk driver DRW for writing data on a disk and for reading data from a disk DSK.
  • the disk DSK may be, for example, an optical disk such as a digital versatile disk (DVD), or a hard disk (HD) on which data is magnetically stored.
  • the digital video recorder DVR further comprises a text detection module TXD, a character recognition module OCR, a text-motion detection module MOD, a data-association module DAS, a semantic database SDB, and a control module CTRL.
  • the control module CTRL comprises a user interface UIF and may receive commands from a remote control device RCD. Any of the aforementioned modules may be implemented by means of software or hardware, or a combination of software and hardware. A suitably programmed processor may carry out operations that will be described hereinafter with reference to the aforementioned modules.
  • the digital video recorder DVR operates as follows.
  • the encoding-and-decoding module CODEC receives video input data VI from an external entity.
  • the encoding-and-decoding module CODEC encodes the video input data VI in accordance with a video encoding standard, such as, for example, MPEG2 or MPEG4 (MPEG is a commonly used acronym for Moving Picture Experts Group).
  • the disk driver DRW receives encoded video input data VIC from the encoding-and-decoding module CODEC and writes that data onto the disk DSK, which is present in the disk driver DRW.
  • the disk driver DRW may receive the encoded video input data VIC directly from an external entity via a codec bypass, which is illustrated in broken lines.
  • the encoding-and-decoding module CODEC may decode the encoded video input data VIC, which the external entity provides. Accordingly, the digital video recorder DVR can retrieve coding parameters and other data from the encoded video input data VIC. This data is useful for text detection, which will be described in greater detail hereinafter.
  • the encoding-and-decoding module CODEC receives encoded video output data VOC from the disk driver DRW.
  • the encoding-and-decoding module CODEC decodes the encoded video output data VOC so as to obtain video output data VO, which can be applied to an external entity, such as, for example, a display device.
  • the disk driver DRW may apply the encoded video output data VOC, which is read from the disk, directly to an external entity that comprises a video decoder. In that case, the encoded video output data VOC passes via the codec bypass.
  • the encoding-and-decoding module CODEC may still decode the encoded video output data VOC, which the disk driver DRW provides. Accordingly, the digital video recorder DVR can retrieve coding parameters and other data from the encoded video output data VOC. This data is useful for text detection, which will be described in greater detail hereinafter.
  • the encoding-and-decoding module CODEC provides various coding parameters PAR and other data, which result from an encoding or a decoding in accordance with the relevant MPEG standard.
  • the coding parameters PAR comprise: a "b"-parameter, which indicates the number of bits used for encoding an image slice, excluding overhead bits; a "qs"-parameter, which indicates the quantizer scale for a slice; and a "c"-parameter, which indicates the transform coefficients (DC and
  • the encoding-and-decoding module CODEC further provides motion vectors MV, a predicted frame PF, a reference frame RF, and a frame index IX.
  • the predicted frame PF may be a so-called P-frame or a B-frame as defined by the relevant MPEG standard.
  • the reference frame RF is a so-called I-frame.
  • the frame index IX indicates the position of the frame within a sequence of frames that constitutes a video recording, which corresponds with, for example, a movie.
  • the frame index IX can be associated with a particular instant within an interval of time that corresponds with the video recording. For example, the frame index IX may correspond with the instant "5 minutes and 27 seconds" from the start of the video recording or any other reference in the recording.
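  • As an illustration only (the patent does not prescribe an implementation), the association between a frame index IX and an instant within the recording can be sketched as follows, assuming a constant frame rate; the function name and the 25 fps figure are assumptions:

```python
# Hypothetical sketch: map a frame index IX to a time offset within the
# recording, assuming a constant frame rate (25 fps is assumed here).
def frame_index_to_time(ix: int, fps: float = 25.0) -> str:
    """Return the instant as minutes and seconds from the start of the recording."""
    total_seconds = int(ix / fps)
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes} minutes and {seconds} seconds"

# Frame index 8175 at 25 fps lies 327 s into the recording:
print(frame_index_to_time(8175))  # 5 minutes and 27 seconds
```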
  • the text detection module TXD detects, in a reference frame (I-frame), one or more segments that comprise visual text.
  • the text detection module TXD provides text detection data TD that indicates these segments.
  • the text detection data TD may identify, for example, those macroblocks in the reference frame, which comprise visual text.
  • the text detection module TXD may operate in a manner similar to, for example, the subtitle detector in the international application published under number WO 02/093910. Such an implementation of the text detection module TXD detects visual-text segments on the basis of the coding parameters PAR mentioned hereinbefore.
  • the visual text, which the text detection module TXD detects, may be in the form of subtitles.
  • the visual text may also form part of a scene in the video of interest.
  • the scene may comprise an object with a certain text, such as, for example, a city-name board or a facade with a restaurant name.
  • this is static text in the sense that the text does not move when displayed on a screen.
  • the visual text may also be in the form of, for example, a title role of a movie, a television series, or any other artistic or documentary video. In many cases, such visual text scrolls on the screen in a vertical direction. This is vertically scrolling text.
  • the visual text may also be in the form of, for example, so-called tickers that comprise daily news, financial news, weather news, or other news. In many cases, such visual text scrolls on the screen in a horizontal direction. This is horizontally scrolling text.
  • the text detection module TXD is capable of providing text detection data TD, which indicates segments within a reference frame that comprise visual text.
  • the character recognition module OCR derives a sequence of characters TXT from a reference frame RF that comprises visual text.
  • the text detection data TD, which the text detection module TXD provides, assists the character recognition module OCR in this process. It is recalled that the text detection data TD indicates the segments in the reference frame RF that comprise visual text.
  • the sequence of characters TXT, which the character recognition module OCR provides, corresponds with the visual text comprised in these segments. It should be noted that the character recognition module OCR can derive various sequences of characters from a reference frame. Each sequence may correspond with, for example, a particular line of visual text. For example, a subtitle may comprise two lines.
  • the text-motion detection module MOD establishes a motion indication MI on the basis of text detection data TD, which the text detection module TXD provides, and the motion vectors MV, the predicted frame PF, and the reference frame RF, which the encoding- and-decoding module CODEC provides.
  • the motion indication MI indicates whether visual text, which is comprised in a sequence of frames, moves when displayed on the screen or not.
  • the motion indication MI may comprise, for example, a binary value, which indicates whether the visual text moves or not.
  • the motion indication MI may provide further information, for example, an indication whether the visual text moves in a horizontal direction or in a vertical direction.
  • the motion indication MI may also be in the form of a displacement vector, which indicates a displacement direction with relatively great precision.
  • the displacement vector has a length, which indicates a displacement speed, i.e. how fast the visual text moves when displayed on a screen.
  • the motion indication MI may further indicate acceleration or any other useful information that relates to the displacement of the visual text throughout
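  • As a sketch only, the motion indication MI described above might be represented as a small data structure; the type and field names below are hypothetical, and the displacement vector is assumed to be given in pixels per frame:

```python
import math
from dataclasses import dataclass

@dataclass
class MotionIndication:
    """Hypothetical container for the motion indication MI."""
    dx: float  # horizontal displacement per frame, in pixels
    dy: float  # vertical displacement per frame, in pixels

    @property
    def moving(self) -> bool:
        # Binary value: does the visual text move at all?
        return (self.dx, self.dy) != (0.0, 0.0)

    @property
    def speed(self) -> float:
        # The length of the displacement vector indicates the displacement speed.
        return math.hypot(self.dx, self.dy)

    @property
    def direction(self) -> str:
        # Coarse direction: horizontal (ticker) vs. vertical (title role).
        if not self.moving:
            return "static"
        return "horizontal" if abs(self.dx) >= abs(self.dy) else "vertical"

# A title role scrolling upwards by 3 pixels per frame:
mi = MotionIndication(dx=0.0, dy=-3.0)
print(mi.direction, mi.speed)  # vertical 3.0
```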
  • the data-association module DAS associates the motion indication MI, which the text-motion detection module MOD provides, with the sequence of characters TXT, which the character recognition module OCR provides.
  • the data-association module DAS stores the sequence of characters TXT and the motion indication MI associated therewith in the semantic database SDB.
  • the data-association module DAS can associate one or more frame indices IX with the sequence of characters TXT, and store these indices IX in the semantic database SDB too.
  • the semantic database SDB will comprise a collection of sequences of characters representing visual text comprised in each video program, which has been recorded.
  • the semantic database SDB further comprises the motion indication associated with each sequence of characters and, optionally, the frame indices, which indicate when the sequence of characters appears in the video program.
  • the aforementioned information in the semantic database SDB can assist a user in numerous manners. Some examples will be given hereinafter. These examples have in common that the user exploits the semantic database SDB through the control module CTRL of the digital video recorder DVR, which comprises the user interface UIF.
  • the user interface UIF, which comprises a software program, may cause, for example, one or more menus to be displayed from which the user may select a particular item. The user may navigate through various menus and make selections by means of the remote control RCD.
  • the user interface UIF may present the user an option within a menu that allows automatic identification of the start or the end, or both, of a video program that has been recorded. Let it be assumed that the user chooses this option.
  • the control module CTRL will search in the semantic database SDB for a motion indication that indicates vertically scrolling text. In many cases, a title role of a movie, a television series, or any other artistic or documentary video, comprises visual text that scrolls on the screen in a vertical direction.
  • the control module CTRL retrieves the frame indices that are associated with the motion indication that indicates vertically scrolling text.
  • the control module CTRL may, for example, associate a start marker with one or more frames that correspond with the frame indices retrieved from the semantic database SDB, and that are relatively close to the start of the recording.
  • the control module CTRL may further associate an end marker with one or more frames that correspond with the frame indices retrieved from the semantic database SDB, and that are relatively close to the end of the recording.
  • the video program of interest lies between the start marker and the end marker. Other parts of the recording may comprise commercials or other video scenes, which are of less interest.
  • control module CTRL may cause the disk driver DRW to add the start marker and the end marker to the recording.
  • control module CTRL may cause the start marker and the end marker to be stored within the digital video recorder DVR in association with an identification of the disk DSK on which the recording has been made.
  • the user interface UIF causes the digital video recorder DVR to play back respective portions of the recording that comprise vertically scrolling text.
  • the digital video recorder DVR can find these portions thanks to the frame indices associated with the motion indication in the semantic database SDB.
  • This selective playback allows the user to check that the detected vertically scrolling text corresponds with the start or the end of the video program.
  • the control module CTRL generates the start marker and the end marker when the user has validated this check.
  • the user may also fine-tune, as it were, the start marker and the end marker by placing these markers just after the vertically scrolling text at the start of the recording and just before the vertically scrolling text at the end of the recording, respectively.
  • the user interface UIF may further present the user an option that allows him or her to catalog the video program, which has been recorded.
  • This cataloguing of the video program allows content management and content browsing and navigation.
  • the cataloguing of the video program may comprise various different items, such as, for example, a title, one or more actors' names, a producer's name, a production date, and other characteristics of the video program, which appear in the title role. These characteristics generally appear in the form of vertically scrolling text. Let it be assumed that the user chooses the catalog option.
  • the control module CTRL will search in the semantic database SDB for a motion indication that indicates vertically scrolling text, as described hereinbefore.
  • the semantic database SDB comprises sequences of characters, which are associated with this motion indication.
  • These sequences of characters correspond with the vertically scrolling text comprised in the video program when displayed on a screen. Consequently, a sequence of characters may correspond with the title of the video program, another sequence may correspond with an actor who plays a role in the video program, yet another sequence may correspond with the producer of the video program, and yet another sequence may correspond with the production date of the video program, and so on.
  • the control module CTRL may retrieve these sequences of characters, which are associated with the motion indication that indicates vertically scrolling text, from the semantic database SDB so as to copy these sequences into, for example, a cache memory within the digital video recorder DVR.
  • the user may then browse through these sequences of characters so as to select one or more sequences that are of interest for the purpose of cataloguing.
  • the user may copy a selected sequence of characters into his or her catalog. Accordingly, the user can catalog the video program, which he or she has recorded, in a relatively simple manner thanks to the motion indication that indicates that the sequences of characters relate to vertically scrolling text, which is typical of a title role.
  • the user interface UIF may further present the user an option within a menu that allows automatic identification of news information.
  • tickers are commonly used for displaying textual news information.
  • a ticker is a horizontal bar on a screen that comprises visual text that moves in a horizontal direction, from the left to the right or vice versa. Let it be assumed that the user chooses the news-information option.
  • the control module CTRL will search in the semantic database SDB for a motion indication that indicates a horizontally scrolling text.
  • the semantic database SDB comprises sequences of characters, which are associated with this motion indication. These sequences of characters correspond with the horizontally scrolling text comprised in the program when displayed on a screen. That is, these sequences of characters generally correspond with news information comprised in a ticker.
  • a sequence of characters may correspond with a general news item, a financial news item, such as, for example, stock prices, or a weather forecast, and so on.
  • the control module CTRL may retrieve these sequences of characters, which are associated with the motion indication that indicates horizontally scrolling text, from the semantic database SDB so as to copy these sequences into, for example, a cache memory within the digital video recorder DVR. The user may then browse through these sequences of characters so as to select one or more items that are of interest.
  • the user interface UIF may also comprise a search engine that allows the user to find a particular news item.
  • FIG. 2 illustrates a displacement-vector calculation, which the text-motion detection module MOD carries out.
  • the displacement-vector calculation provides an indication of movement of visual text from a reference frame to a predicted frame that is subsequent to the reference frame. That is, the displacement-vector calculation, which FIG. 2 illustrates, applies to a forward prediction of a frame.
  • the reference frame, which may be an I-frame or a P-frame and on which the prediction is based, precedes the predicted frame, which may be a P-frame or a B-frame.
  • the text-motion detection module MOD receives the text detection data TD from the text detection module TXD (TD : MB[RF] ∋ VTX). It is assumed that the text detection module TXD has already detected one or more macroblocks in the reference frame that comprise visual text. Accordingly, the text detection data TD, which indicates these macroblocks, is available.
  • the displacement-vector calculation comprises a series of steps ST2-ST6 for each macroblock in the predicted frame.
  • Each macroblock in the predicted frame has a motion vector, which the encoding-and-decoding module CODEC provides.
  • the motion vector indicates a particular block of pixels in the reference frame, which is similar to the macroblock in the predicted frame.
  • the motion vector indicates movement of an object, which the respective macroblocks at least partially represent, from the predicted frame to the reference frame.
  • the displacement-vector calculation that FIG. 2 illustrates applies to a forward prediction: the predicted frame is subsequent to the reference frame. Consequently, the motion vector points backwards in time. What is more, the motion vector points from the predicted frame to the reference frame, but the text detection module TXD cannot provide text detection data that relates to the predicted frame. That is, the location of the visual text in the predicted frame is not yet known. What is known is the location of the visual text in the reference frame. The text detection data TD indicates this.
  • in step ST2, the text-motion detection module MOD projects the macroblock concerned from the predicted frame to the reference frame. The motion vector that belongs to the macroblock defines this projection.
  • the projected block will generally not precisely coincide with a macroblock in the reference frame.
  • the projected block will generally overlap a cluster of four different macroblocks in the reference frame. That is, the projected block will have a certain overlap with each of these four different macroblocks.
  • in step ST3, the text-motion detection module MOD detects whether the projected block overlaps with a macroblock in the reference frame that comprises visual text, or not (OVR[TD]?). It is recalled that the text detection data TD indicates the macroblocks in the reference frame that comprise visual text. Let it be assumed that, in reply to the test performed in step ST3, the projected block does not overlap (reply N) with a macroblock in the reference frame that comprises visual text. In that case, the text-motion detection module MOD directly carries out steps ST2 and ST3 anew for a subsequent macroblock in the predicted frame, without carrying out steps ST4-ST6 for the macroblock concerned.
  • the text-motion detection module MOD carries out steps ST4-ST6 for the macroblock concerned if the projected block overlaps (reply Y) with at least one macroblock in the reference frame that comprises visual text.
  • the overlap percentage indicates the size of the portion of the projected block that falls within the macroblock concerned in the reference frame.
  • the overlap percentage is 100% if the projected block fully coincides with the macroblock concerned in the reference frame.
  • the overlap percentage is 50% if half of the projected block falls within the macroblock concerned in the reference frame.
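  • The overlap calculation can be illustrated with a short sketch (hypothetical helper name; 16x16 macroblocks assumed, as in MPEG-2): the projected block, placed at pixel position (px, py) in the reference frame, is intersected with each macroblock of the grid that it touches:

```python
# Sketch of the overlap calculation (hypothetical helper name; 16x16
# macroblocks assumed).  (px, py) is the top-left corner of the projected
# block within the reference frame.
MB = 16  # macroblock size in pixels

def overlap_percentages(px: int, py: int) -> dict:
    """Return {(i, j): overlap percentage} for every reference-frame
    macroblock MB[i, j] that the projected block overlaps."""
    result = {}
    for i in (px // MB, px // MB + 1):
        for j in (py // MB, py // MB + 1):
            # Intersection of the projected block with macroblock (i, j).
            w = min(px + MB, (i + 1) * MB) - max(px, i * MB)
            h = min(py + MB, (j + 1) * MB) - max(py, j * MB)
            if w > 0 and h > 0:
                result[(i, j)] = 100.0 * w * h / (MB * MB)
    return result

# A projected block offset by (4, 8) pixels overlaps a cluster of four
# reference macroblocks; the four percentages sum to 100:
print(overlap_percentages(4, 8))
```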
  • FIG. 3 illustrates steps ST3 and ST4 described hereinbefore.
  • FIG. 3 illustrates, in full lines, four macroblocks MB[i,j], MB[i+1,j], MB[i,j+1], MB[i+1,j+1] in the reference frame.
  • FIG. 3 further illustrates the projected block PB.
  • a portion of the projected block PB overlaps macroblock MB[i,j].
  • the overlap percentage is 50%.
  • Another portion of the projected block overlaps macroblock MB[i+1,j].
  • the overlap percentage is 20%.
  • Yet another portion of the projected block overlaps macroblock MB[i,j+1].
  • the overlap percentage is 20% too.
  • Yet another portion of the projected block overlaps macroblock MB[i+1,j+1].
  • the overlap percentage is 10%.
  • the text-motion detection module MOD associates the inverted motion vector with each macroblock in the reference frame that overlaps the projected block (MB[RF] ← IV, OVR%).
  • the text-motion detection module MOD further associates therewith the overlap percentage that has been calculated, in step ST4, for the macroblock concerned in the reference frame.
  • the overlap percentage reflects a degree of confidence, as it were, that the inverted motion vector faithfully indicates movement of textual matter from the macroblock concerned in the reference frame to the predicted frame.
  • the text-motion detection module MOD temporarily stores the inverted motion vector and the overlap percentage for each macroblock concerned in the reference frame, together with an identification of that macroblock.
  • Referring to FIG. 3, the text-motion detection module MOD stores the inverted motion vector with an overlap percentage of 50% for macroblock MB[i,j], and stores the same inverted motion vector with overlap percentages of 20%, 20%, and 10% for macroblocks MB[i+1,j], MB[i,j+1], and MB[i+1,j+1], respectively.
  • FIG. 3 can further illustrate the aforementioned aspect, which relates to the degree of confidence in the inverted motion vector.
  • the overlap percentage for macroblock MB[i,j] is 50%, which is relatively high. Consequently, the inverted motion vector indicates with reasonable precision the movement of textual matter from macroblock MB[i,j] in the reference frame to the predicted frame. This is because the motion vector, which points from the predicted frame to the reference frame, causes a substantial overlap between the projected block and macroblock MB[i,j].
  • the text-motion detection module MOD has carried out the series of steps ST2-ST6 for each macroblock in the predicted frame.
  • the text-motion detection module MOD has generated the following data for each macroblock in the reference frame that comprises visual text: at least one inverted motion vector, and an overlap percentage associated with each inverted motion vector.
  • various inverted motion vectors will be generated for a macroblock in the reference frame that comprises visual text. This can be explained with reference to FIG. 3.
  • FIG. 3 illustrates the projection of a particular macroblock from the predicted frame to the reference frame.
  • the text-motion detection module MOD will project further, neighboring macroblocks from the predicted frame to the reference frame.
  • Each of these respective projections will provide respective projected blocks. Any of these projected blocks may overlap one or more of the macroblocks MB[i,j], MB[i+1,j], MB[i,j+1], MB[i+1,j+1], which FIG. 3 illustrates.
  • each respective projection will be based on a different motion vector, namely the motion vector that belongs to the macroblock in the predicted frame, which is projected.
  • the text-motion detection module MOD calculates a displacement vector for the visual text (CLC[DV]).
  • the text-motion detection module MOD makes a weighted combination of the respective inverted motion vectors, which have been established for macroblocks in the reference frame that comprise visual text.
  • the respective overlap percentages, which are associated with the respective inverted motion vectors, constitute weighting factors.
  • the text-motion detection module MOD may also carry out another displacement-vector calculation, which provides an indication of movement of visual text from a predicted frame (P-frame or B-frame) to a reference frame (I-frame) that is subsequent to the predicted frame.
  • This displacement-vector calculation is different from the one that FIG. 2 illustrates.
  • the main features of the other displacement-vector calculation are as follows.
  • the text-motion detection module MOD first establishes a set of macroblocks in the predicted frame that comprise visual text. This text detection may be based on, for example, one or more displacement vectors that have been calculated for previous frames. Each macroblock in the aforementioned set has a motion vector that points from the predicted frame to the reference frame. The motion vector points forward in time because the reference frame is subsequent to the predicted frame. The text-motion detection module MOD calculates an average of all relevant motion vectors, that is, all motion vectors that belong to the set of macroblocks that comprise visual text. The average constitutes the displacement vector.
  • the text-motion detection module MOD may calculate a sequence of displacement vectors for a sequence of frames. To that end, the text-motion detection module MOD may carry out one or the other of the displacement-vector calculations, which have been described hereinbefore.
  • displacement vectors that form part of the sequence should be similar. This is because text generally moves on the screen in a steady, monotonous fashion, that is, text generally scrolls with constant speed.
  • the text-motion detection module MOD may check whether the displacement vectors are indeed similar, or not.
  • the motion indication, which the text-motion detection module MOD provides, may indicate an anomaly when the displacement vectors are substantially different.
  • the text-motion detection module MOD may provide a no-motion indication, or may signal this anomaly to the text detection module TXD. This anomaly signaling may cause the text detection module TXD to make one or more further detections or to make a different, more precise detection.
  • the text-motion detection module MOD may also signal the anomaly to the character recognition module OCR, so as to prevent erroneous character recognition.
  • the motion indication MI, which the text-motion detection module MOD provides, may comprise the following elements: an average of the displacement vectors, which have been established for a sequence of frames, accompanied by an indication of the frame at which the visual text concerned enters the screen and an indication of the frame at which the visual text leaves the screen.
  • the text-motion detection module MOD may receive frame indices from the encoding-and-decoding module CODEC.
  • the database manager may carry out this marking of the visual text entering the screen and leaving the screen, respectively.
  • a video apparatus (digital video recorder DVR) comprises a text detection module (TXD), a text-motion detection module (MOD), and a user interface (UIF).
  • the text detection module (TXD) detects visual text in video data.
  • the text-motion detection module (MOD) provides a motion indication (MI) for the visual text that the text detection module (TXD) has detected.
  • the user interface (UIF) allows a user to access a particular portion of the video data on the basis of the motion indication (MI) that the text-motion detection module (MOD) has provided.
  • a control module marks a particular portion of the video data, which comprises the visual text, on the basis of the motion indication (MI), which the text-motion detection module (MOD) has provided.
  • the control module may, for example, insert a marker in the video data or store a marker in a database in association with the video data.
  • the marker may be a start marker, an end marker, or any other marker that is useful for content management.
  • a character recognition module derives a sequence of characters (TXT) from the visual text in the video data that the text detection module (TXD) has detected.
  • a data-association module associates the sequence of characters (TXT), which the character recognition module (OCR) has derived, with the motion indication (MI), which the text-motion detection module (MOD) has provided.
  • MI motion indication
  • MOD text-motion detection module
  • a video processing module provides motion vectors (MV) that indicate movement of an object, which the video data represents, from a predicted frame (PF) to a reference frame (RF).
  • the text-motion detection module establishes the motion indication (MI) for the visual text that the text detection module (TXD) has detected, on the basis of the motion vectors (MV), which the video processing module (CODEC) has provided.
  • the digital video recorder described in detail hereinbefore is merely an example of a video apparatus in accordance with the invention.
  • the video apparatus may also be in the form of, for example, a settop box, a television set, or a mobile phone.
  • the digital video apparatus need not necessarily be MPEG-based.
  • the invention can be applied in any video apparatus that comprises a video processor providing some form of motion indication.
  • the digital video apparatus may be based on the H.263 standard for mobile video telephony.
  • the digital video apparatus need not necessarily comprise a disk driver or any video storage device.
  • the digital video apparatus need not necessarily comprise any video coder or decoder. It is possible to detect visual text in plain, uncompressed video data without the use of any (de-)coder parameters. It is also possible to detect text motion in plain, uncompressed video data without the use of any motion vectors, which a video (de-)coder typically provides. In case visual text is detected on the basis of (de-)coder parameters, these parameters need not necessarily be standard coding parameters. For example, visual text detection may involve non-standard parameters, which are specific to a particular implementation of a video coder or decoder. That is, the video coder or decoder generates proprietary parameters, which may be used for the purpose of visual text detection. Character recognition, if any, may be carried out in a classical fashion, without the use of any text indication derived from (de-)coder parameters or any other parameters relating to the video data.
  • a motion-vector weighting based on an overlap calculation is not mandatory, although such a weighting calculation is advantageous. It is not necessary to weight motion vectors, which a (de-)coder provides, in order to establish a text-motion indication. That is, the text-motion detection that FIG. 2 illustrates can be simplified. Such simplification may, however, be at the expense of motion-detection precision.
  • text-motion detection results need not necessarily be stored in a semantic database. For example, an application may use text- motion detection results for the purpose of content marking or image quality improvement only.
  • text-motion detection results can be stored in an ordinary memory.
  • the terms “frame” and “image” should be understood in a broad sense. These terms are interchangeable and include a field or any other entity that may wholly or partially constitute an image or picture.
  • the term “scrolling text” should equally be understood in a broad sense. Scrolling text may move in a non-monotonous fashion. For example, scrolling text may move in a discontinuous, jumpy fashion or have a significant acceleration.
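The displacement-vector calculation and the steadiness check outlined in the points above can be sketched as follows. This is a minimal illustration in Python; the function names, the (vx, vy, overlap) tuple layout, and the tolerance value are assumptions for the sketch, not part of the original description:

```python
def displacement_vector(inverted_mvs):
    """Weighted combination of inverted motion vectors.

    `inverted_mvs` is a list of (vx, vy, overlap_pct) tuples: one entry per
    inverted motion vector stored for a text macroblock in the reference
    frame, with its overlap percentage acting as the weighting factor.
    """
    total = sum(w for _, _, w in inverted_mvs)
    if total == 0:
        return (0.0, 0.0)  # no text macroblocks -> no displacement
    dx = sum(vx * w for vx, _, w in inverted_mvs) / total
    dy = sum(vy * w for _, vy, w in inverted_mvs) / total
    return (dx, dy)

def is_steady(displacements, tolerance=2.0):
    """Anomaly check: text generally scrolls at constant speed, so the
    displacement vectors of successive frames should be similar."""
    for (ax, ay), (bx, by) in zip(displacements, displacements[1:]):
        if abs(ax - bx) > tolerance or abs(ay - by) > tolerance:
            return False  # substantially different -> signal an anomaly
    return True

# Two equally weighted inverted motion vectors average to their midpoint.
dv = displacement_vector([(0, -4, 50), (0, -2, 50)])   # -> (0.0, -3.0)
```

With overlap percentages as weighting factors, macroblocks whose inverted motion vector is considered more reliable contribute more to the displacement vector.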

Abstract

A video apparatus (DVR) comprises a text detection module (TXD), a text-motion detection module (MOD), and a user interface (UIF). The text detection module (TXD) detects visual text in video data. The text-motion detection module (MOD) provides a motion indication (MI) for the visual text that the text detection module (TXD) has detected. The user interface (UIF) allows a user to access a particular portion of the video data on the basis of the motion indication (MI) that the text-motion detection module (MOD) has provided. For example, the user may request a program-change search. In response, the video apparatus (DVR) retrieves a portion of the video data for which the text-motion detection module (MOD) has provided a motion indication (MI) that indicates vertically scrolling visual text. Vertically scrolling visual text generally corresponds with a title role at the start or the end of a program, or both. There are numerous other manners to use the motion indication (MI) that the text-motion detection module (MOD) provides.

Description

METHOD AND APPARATUS FOR THE DETECTION OF TEXT IN VIDEO DATA
FIELD OF THE INVENTION
An aspect of the invention relates to a video apparatus arranged to detect text in video data. The video apparatus may be, for example, a digital video recorder that records video data on an optical disk (DVD) or a magnetic disk (HD), or both. Other aspects of the invention relate to a method of detecting text in video data, and a computer program for a video apparatus.
DESCRIPTION OF PRIOR ART
The international patent application published under number WO 02/093910 describes detection of the presence, appearance or disappearance of subtitles in a video signal. This detection comprises operations that an MPEG encoder or decoder typically carries out. The detection therefore requires relatively few additional operations. A detected subtitle may be subjected to a character recognition algorithm, which provides an electronic version of the text. The electronic text may be separately stored and subsequently used for indexing video scenes stored in a database. A typical application thereof is retrieval of video scenes in a video recorder based on spoken keywords.
SUMMARY OF THE INVENTION
According to an aspect of the invention, a video apparatus comprises a text detection module, a text-motion detection module, and a user interface. The text detection module detects visual text in video data. The text-motion detection module provides a motion indication for the visual text that the text detection module has detected. The user interface allows a user to access a particular portion of the video data on the basis of the motion indication that the text-motion detection module has provided. The invention takes the following aspects into consideration. Video data is generally rich in information. In principle, this makes it relatively difficult for an average person to handle video data and, more specifically, to access a particular piece of information that the video data comprises. For example, video data may comprise visual text of various different types. Visual text may be, for example, a subtitle, a title role that provides information about a program, which the video data comprises, or news in a telegraphic form. Visual text may also mark a particular event, such as, for example, the start of a program, the end of a program, or a new chapter.
In principle, it is possible to derive sequences of characters from visual text comprised in video data by means of a character recognition algorithm. The sequences of characters, which have been derived from the video data, can be stored in a database and handled separately. For example, a user may browse through the database to find a subtitle that comprises a particular word. The aforementioned prior art, which relates to subtitles, appears to suggest this possibility. However, the database need not be restricted to subtitles only. That is, the database may comprise a collection of sequences of characters that relate to various different types of visual text. This complicates browsing. It will be relatively difficult to find a particular piece of information in the database.
In accordance with the aforementioned aspect of the invention, a video apparatus comprises a text-motion detection module, which provides a motion indication for visual text that has been detected. A user interface allows a user to access a particular portion of the video data on the basis of the motion indication that the text-motion detection module has provided.
The motion indication may indicate a particular event in the video data. For example, the motion indication may indicate the start or the end of a program. The start and the end of a program generally comprise a title role in the form of vertically scrolling visual text, which the motion indication indicates. The motion indication may also indicate a type of visual text in the video data, such as the aforementioned title role. The title role may comprise useful information for cataloguing purposes, such as, for example, the title of the program and actors' names. Another example concerns visual text that constitutes news data. Such news data is often presented in the form of so-called tickers. A ticker is a horizontal bar on a screen that comprises visual text that moves in a horizontal direction, from the left to the right or vice versa. The motion indication may indicate such a ticker and therefore indicate news data in the video data. Those examples illustrate that the invention allows greater user convenience. These and other aspects of the invention will be described in greater detail hereinafter with reference to drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram that illustrates a digital video recorder, which comprises a text-motion detection module.
FIG. 2 is a flowchart diagram that illustrates a displacement-vector calculation, which the text-motion detection module carries out.
FIG. 3 is a conceptual diagram that illustrates an overlap calculation, which is based on a projection of a macroblock from a predicted frame to a reference frame.
DETAILED DESCRIPTION
FIG. 1 illustrates a digital video recorder DVR. The digital video recorder DVR comprises an encoding-and-decoding module CODEC and a disk driver DRW for writing data on a disk and for reading data from a disk DSK. The disk DSK may be, for example, an optical disk such as a digital versatile disk (DVD) or a hard disk (HD), on which data is magnetically stored. The digital video recorder DVR further comprises a text detection module TXD, a character recognition module OCR, a text-motion detection module MOD, a data-association module DAS, a semantic database SDB, and a control module CTRL. The control module CTRL comprises a user interface UIF and may receive commands from a remote control device RCD. Any of the aforementioned modules may be implemented by means of software or hardware, or a combination of software and hardware. A suitably programmed processor may carry out operations that will be described hereinafter with reference to the aforementioned modules.
The digital video recorder DVR operates as follows. In a recording mode, the encoding-and-decoding module CODEC receives video input data VI from an external entity. The encoding-and-decoding module CODEC encodes the video input data VI in accordance with a video encoding standard, such as, for example, MPEG2 or MPEG4 (MPEG is a commonly used acronym for Moving Pictures Experts Group). The disk driver DRW receives encoded video input data VIC from the encoding-and-decoding module CODEC and writes that data onto the disk DSK, which is present in the disk driver DRW. Alternatively, the disk driver DRW may receive the encoded video input data VIC directly from an external entity via a codec bypass, which is illustrated in broken lines. In that case, the encoding-and-decoding module CODEC may decode the encoded video input data VIC, which the external entity provides. Accordingly, the digital video recorder DVR can retrieve coding parameters and other data from the encoded video input data VIC. This data is useful for text detection, which will be described in greater detail hereinafter.
In a playback mode, the encoding-and-decoding module CODEC receives encoded video output data VOC from the disk driver DRW. The encoding-and-decoding module CODEC decodes the encoded video output data VOC so as to obtain video output data VO, which can be applied to an external entity, such as, for example, a display device. Alternatively, the disk driver DRW may apply the encoded video output data VOC, which is read from the disk, directly to an external entity that comprises a video decoder. In that case, the coded video output data VOC transits via the codec bypass. The encoding-and-decoding module CODEC may still decode the encoded video output data VOC, which the disk driver DRW provides. Accordingly, the digital video recorder DVR can retrieve coding parameters and other data from the encoded video output data VOC. This data is useful for text detection, which will be described in greater detail hereinafter.
The encoding-and-decoding module CODEC provides various coding parameters PAR and other data, which result from an encoding or a decoding in accordance with the relevant MPEG standard. The coding parameters PAR comprise:
- a "b"-parameter, which indicates the number of bits used for encoding an image slice, excluding overhead bits,
- a "qs"-parameter, which indicates the quantizer scale for a slice,
- a "c"-parameter, which indicates the transform coefficients (DC and AC) of a macroblock,
- a "mad"-parameter, which indicates the mean absolute difference between an image block and the prediction block found by a motion estimator, which forms part of the encoding-and-decoding module CODEC.
The international application published under number WO 02/093910 describes these coding parameters.
The encoding-and-decoding module CODEC further provides motion vectors MV, a predicted frame PF, a reference frame RF, and a frame index IX. The predicted frame PF may be a so-called P-frame or a B-frame as defined by the relevant MPEG standard. The reference frame RF is a so-called I-frame. The frame index IX indicates the position of the frame within a sequence of frames that constitutes a video recording, which corresponds with, for example, a movie. The frame index IX can be associated with a particular instant within an interval of time that corresponds with the video recording. For example, the frame index IX may correspond with the instant "5 minutes and 27 seconds" from the start of the video recording or any other reference in the recording. The text detection module TXD detects, in a reference frame (I-frame), one or more segments that comprise visual text. The text detection module TXD provides text detection data TD that indicates these segments. The text detection data TD may identify, for example, those macroblocks in the reference frame, which comprise visual text. The text detection module TXD may operate in a manner similar to, for example, the subtitle detector in the international application published under number WO 02/093910. Such an implementation of the text detection module TXD detects visual-text segments on the basis of the coding parameters PAR mentioned hereinbefore.
The visual text, which the text detection module TXD detects, may be in the form of subtitles. The visual text may also form part of a scene in the video of interest. For example, the scene may comprise an object with a certain text, such as, for example, a city-name board or a facade with a restaurant name. Generally, this is static text in the sense that the text does not move when displayed on a screen. The visual text may also be in the form of, for example, a title role of a movie, a television series, or any other artistic or documentary video. In many cases, such visual text scrolls on the screen in a vertical direction. This is vertically scrolling text. The visual text may also be in the form of, for example, so-called tickers that comprise daily news, financial news, weather news, or other news. In many cases, such visual text scrolls on the screen in a horizontal direction. This is horizontally scrolling text. Whatever type of text, the text detection module TXD is capable of providing text detection data TD, which indicates segments within a reference frame that comprise visual text.
The character recognition module OCR derives a sequence of characters TXT from a reference frame RF that comprises visual text. The text detection data TD, which the text detection module TXD provides, assists the character recognition module OCR in this process. It is recalled that the text detection data TD indicates the segments in the reference frame RF that comprise visual text. The sequence of characters TXT, which the character recognition module OCR provides, corresponds with the visual text comprised in these segments. It should be noted that the character recognition module OCR can derive various sequences of characters from a reference frame. Each sequence may correspond with, for example, a particular line of visual text. For example, a subtitle may comprise two lines.
The text-motion detection module MOD establishes a motion indication MI on the basis of text detection data TD, which the text detection module TXD provides, and the motion vectors MV, the predicted frame PF, and the reference frame RF, which the encoding-and-decoding module CODEC provides. The motion indication MI indicates whether visual text, which is comprised in a sequence of frames, moves when displayed on the screen or not. The motion indication MI may comprise, for example, a binary value, which indicates whether the visual text moves or not. The motion indication MI may provide further information, for example, an indication whether the visual text moves in a horizontal direction or in a vertical direction. The motion indication MI may also be in the form of a displacement vector, which indicates a displacement direction with relatively great precision. The displacement vector has a length, which indicates a displacement speed, i.e. how fast the visual text moves when displayed on a screen. The motion indication MI may further indicate acceleration or any other useful information that relates to the displacement of the visual text throughout a sequence of frames.
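As a sketch, a motion indication MI of this kind could be derived from a displacement vector as follows. The threshold value and the dictionary layout are assumptions for the illustration, not part of the description:

```python
import math

def motion_indication(dv, threshold=1.0):
    """Classify a displacement vector (in pixels per frame) into a simple
    motion indication: no motion, or horizontal/vertical movement with a
    displacement speed given by the vector's length."""
    dx, dy = dv
    speed = math.hypot(dx, dy)       # length of the displacement vector
    if speed < threshold:
        return {"moving": False}
    direction = "vertical" if abs(dy) >= abs(dx) else "horizontal"
    return {"moving": True, "direction": direction, "speed": speed}

motion_indication((0.0, -3.0))  # -> {'moving': True, 'direction': 'vertical', 'speed': 3.0}
```

A vertically dominant displacement suggests a title role; a horizontally dominant one suggests a ticker.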
The data-association module DAS associates the motion indication MI, which the text-motion detection module MOD provides, with the sequence of characters TXT, which the character recognition module OCR provides. The data-association module DAS stores the sequence of characters TXT and the motion indication MI associated therewith in the semantic database SDB. Optionally, the data-association module DAS can associate one or more frame indices IX with the sequence of characters TXT, and store these indices IX in the semantic database SDB too.
Let it be assumed that the digital video recorder DVR has recorded one or more video programs. The semantic database SDB will comprise a collection of sequences of characters representing visual text comprised in each video program, which has been recorded. The semantic database SDB further comprises the motion indication associated with each sequence of characters and, optionally, the frame indices, which indicate when the sequence of characters appears in the video program.
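The kind of record that ends up in the semantic database SDB can be sketched as follows. The field names and the dictionary layout of the motion indication are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class TextRecord:
    characters: str        # sequence of characters TXT from the OCR module
    motion: dict           # motion indication MI from the MOD module
    frame_indices: list = field(default_factory=list)  # optional indices IX

# A toy semantic database: one record per detected piece of visual text.
sdb = []
sdb.append(TextRecord(
    "Directed by J. Smith",
    {"moving": True, "direction": "vertical", "speed": 3.0},
    [120, 150],
))

# A query for vertically scrolling text (e.g. a title role):
hits = [r for r in sdb if r.motion.get("direction") == "vertical"]
```

A query of this shape is what the control module CTRL would run when the user asks for automatic start/end identification or cataloguing.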
The aforementioned information in the semantic database SDB can assist a user in numerous manners. Some examples will be given hereinafter. These examples have in common that the user exploits the semantic database SDB through the control module CTRL of the digital video recorder DVR, which comprises the user interface UIF. The user interface UIF, which comprises a software program, may cause, for example, one or more menus to be displayed from which the user may select a particular item. The user may navigate through various menus and make selections by means of the remote control RCD.
The user interface UIF may present the user an option within a menu that allows automatic identification of the start or the end, or both, of a video program that has been recorded. Let it be assumed that the user chooses this option. The control module CTRL will search in the semantic database SDB for a motion indication that indicates vertically scrolling text. In many cases, a title role of a movie, a television series, or any other artistic or documentary video, comprises visual text that scrolls on the screen in a vertical direction.
The control module CTRL retrieves the frame indices that are associated with the motion indication that indicates vertically scrolling text. The control module CTRL may, for example, associate a start marker with one or more frames that correspond with the frame indices retrieved from the semantic database SDB, and that are relatively close to the start of the recording. The control module CTRL may further associate an end marker with one or more frames that correspond with the frame indices retrieved from the semantic database SDB, and that are relatively close to the end of the recording. The video program of interest lies between the start marker and the end marker. Other parts of the recording may comprise commercials or other video scenes, which are of less interest.
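The marker placement described above can be sketched as follows. The 10% margin used to decide what is "relatively close" to the start or end of the recording, and all names, are assumptions for the illustration:

```python
def place_markers(scroll_indices, total_frames, margin=0.1):
    """Pick start and end markers from the frame indices associated with
    vertically scrolling text: indices near the start of the recording
    yield the start marker, indices near the end yield the end marker."""
    near_start = [i for i in scroll_indices if i < margin * total_frames]
    near_end = [i for i in scroll_indices if i > (1 - margin) * total_frames]
    start_marker = max(near_start) if near_start else None  # after opening titles
    end_marker = min(near_end) if near_end else None        # before closing titles
    return start_marker, end_marker

place_markers([300, 500, 98000], total_frames=100000)  # -> (500, 98000)
```

The video program of interest then lies between the two markers, with commercials or other scenes outside them.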
Optionally, the control module CTRL may cause the disk driver DRW to add the start marker and the end marker to the recording. Alternatively, the control module CTRL may cause the start marker and the end marker to be stored within the digital video recorder DVR in association with an identification of the disk DSK on which the recording has been made.
Preferably, the user interface UIF causes the digital video recorder DVR to play back respective portions of the recording that comprise vertically scrolling text. The digital video recorder DVR can find these portions thanks to the frame indices associated with the motion indication in the semantic database SDB. This selective playback allows the user to check that the detected vertically scrolling text corresponds with the start or the end of the video program. The control module CTRL generates the start marker and the end marker when the user has validated this check. The user may also fine-tune, as it were, the start marker and the end marker by placing these markers just after the vertically scrolling text at the start of the recording and just before the vertically scrolling text at the end of the recording, respectively.
The user interface UIF may further present the user an option that allows him or her to catalog the video program, which has been recorded. This cataloguing of the video program allows content management and content browsing and navigation. The cataloguing of the video program may comprise various different items, such as, for example, a title, one or more actors' names, a producer's name, a production date, and other characteristics of the video program, which appear in the title role. These characteristics generally appear in the form of vertically scrolling text. Let it be assumed that the user chooses the catalog option. The control module CTRL will search in the semantic database SDB for a motion indication that indicates vertically scrolling text, as described hereinbefore. The semantic database SDB comprises sequences of characters, which are associated with this motion indication. These sequences of characters correspond with the vertically scrolling text comprised in the video program when displayed on a screen. Consequently, a sequence of characters may correspond with the title of the video program, another sequence may correspond with an actor who plays a role in the video program, yet another sequence may correspond with the producer of the video program, and yet another sequence may correspond with the production date of the video program, and so on.
The control module CTRL may retrieve these sequences of characters, which are associated with the motion indication that indicates vertically scrolling text, from the semantic database SDB so as to copy these sequences into, for example, a cache memory within the digital video recorder DVR. The user may then browse through these sequences of characters so as to select one or more sequences that are of interest for the purpose of cataloguing. The user may copy a selected sequence of characters into his or her catalog. Accordingly, the user can catalog the video program, which he or she has recorded, in a relatively simple manner thanks to the motion indication that indicates that the sequences of characters relate to vertically scrolling text, which is typical of a title role.
The user interface UIF may further present the user an option within a menu that allows automatic identification of news information. Certain programs comprise one or more tickers for displaying textual news information. A ticker is a horizontal bar on a screen that comprises visual text that moves in a horizontal direction, from the left to the right or vice versa. Let it be assumed that the user chooses the news-information option. The control module CTRL will search in the semantic database SDB for a motion indication that indicates a horizontally scrolling text. The semantic database SDB comprises sequences of characters, which are associated with this motion indication. These sequences of characters correspond with the horizontally scrolling text comprised in the program when displayed on a screen. That is, these sequences of characters generally correspond with news information comprised in a ticker. Consequently, a sequence of characters may correspond with a general news item, a financial news item, such as, for example, stock prices, or a weather forecast, and so on.
The control module CTRL may retrieve these sequences of characters, which are associated with the motion indication that indicates horizontally scrolling text, from the semantic database SDB so as to copy these sequences into, for example, a cache memory within the digital video recorder DVR. The user may then browse through these sequences of characters so as to select one or more items that are of interest. The user interface UIF may also comprise a search engine that allows the user to find a particular news item.
FIG. 2 illustrates a displacement-vector calculation, which the text-motion detection module MOD carries out. The displacement-vector calculation provides an indication of movement of visual text from a reference frame to a predicted frame that is subsequent to the reference frame. That is, the displacement-vector calculation, which FIG. 2 illustrates, applies to a forward prediction of a frame. The reference frame, which may be an I-frame or a P-frame, on which the prediction is based, precedes the predicted frame, which may be a P-frame or a B-frame.
In a first step ST1, the text-motion detection module MOD receives the text detection data TD from the text detection module TXD (TD : MB[RF] ∋ VTX). It is assumed that the text detection module TXD has already detected one or more macroblocks in the reference frame that comprise visual text. Accordingly, the text detection data TD, which indicates these macroblocks, is available.
The displacement-vector calculation, which FIG. 2 illustrates, comprises a series of steps ST2-ST6 for each macroblock in the predicted frame. Each macroblock in the predicted frame has a motion vector, which the encoding-and-decoding module CODEC provides. The motion vector indicates a particular block of pixels in the reference frame, which is similar to the macroblock in the predicted frame. In general, the motion vector indicates movement of an object, which the respective macroblocks at least partially represent, from the predicted frame to the reference frame.
As mentioned hereinbefore, the displacement-vector calculation that FIG. 2 illustrates applies to a forward prediction: the predicted frame is subsequent to the reference frame. Consequently, the motion vector points backwards in time. What is more, the motion vector points from the predicted frame to the reference frame, but the text detection module TXD cannot provide text indication data that relates to the predicted frame. That is, the location of the visual text in the predicted frame is not yet known. What is known is the location of the visual text in the reference frame. The text indication data TD indicates this. In step ST2, the text-motion detection module MOD projects the macroblock concerned from the predicted frame to the reference frame. The motion vector that belongs to the macroblock defines this projection. Accordingly, a projected block within the reference frame is obtained (MB[PF] & MV : PROJ[PF→RF] => PB). The projected block will generally not precisely coincide with a macroblock in the reference frame. The projected block will generally overlap a cluster of four different macroblocks in the reference frame. That is, the projected block will have a certain overlap with each of these four different macroblocks.
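The projection of step ST2, and the fact that the projected block generally straddles four macroblocks, can be sketched as follows. The 16x16 macroblock size, the pixel-coordinate convention, and the function names are assumptions made for illustration; they are not specified by the description above.

```python
MB = 16  # assumed macroblock size in pixels (as in MPEG video)

def project(mb_pos, mv):
    """ST2: displace a predicted-frame macroblock along its (backward-
    pointing) motion vector to obtain the projected block's top-left
    pixel position in the reference frame."""
    return (mb_pos[0] + mv[0], mb_pos[1] + mv[1])

def straddled_macroblocks(pb):
    """Grid indices (i, j) of the up-to-four reference-frame macroblocks
    that a projected block at pixel position pb overlaps."""
    i, j = pb[0] // MB, pb[1] // MB
    # a block aligned with the grid in one dimension straddles fewer blocks
    cols = [i] if pb[0] % MB == 0 else [i, i + 1]
    rows = [j] if pb[1] % MB == 0 else [j, j + 1]
    return [(c, r) for r in rows for c in cols]
```

A block projected to pixel position (36, 20), for instance, straddles the four macroblocks with grid indices (2, 1), (3, 1), (2, 2), and (3, 2), whereas a block that lands exactly on the grid coincides with a single macroblock.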
In step ST3, the text-motion detection module MOD detects whether the projected block overlaps with a macroblock in the reference frame that comprises visual text, or not (OVR[TD]?). It is recalled that the text detection data TD indicates the macroblocks in the reference frame that comprise visual text. Let it be assumed that, in reply to the test performed in step ST3, the projected block does not overlap (reply N) with a macroblock in the reference frame that comprises visual text. In that case, the text-motion detection module MOD directly carries out steps ST2 and ST3 anew for a subsequent macroblock in the predicted frame, without carrying out steps ST4-ST6 for the macroblock concerned. Conversely, the text-motion detection module MOD carries out steps ST4-ST6 for the macroblock concerned if the projected block overlaps (reply Y) with at least one macroblock in the reference frame that comprises visual text.

In step ST4, the text-motion detection module MOD establishes an overlap percentage for each macroblock in the reference frame that comprises visual text and that overlaps the projected block (CLC[OVR%]; PB & MB[RF] => OVR%). The overlap percentage indicates the size of the portion of the projected block that falls within the macroblock concerned in the reference frame. The overlap percentage is 100% if the projected block fully coincides with the macroblock concerned in the reference frame. As an example, the overlap percentage is 50% if half of the projected block falls within the macroblock concerned in the reference frame.
FIG. 3 illustrates steps ST3 and ST4 described hereinbefore. FIG. 3 illustrates, in full lines, four macroblocks MB[i,j], MB[i+1,j], MB[i,j+1], MB[i+1,j+1] in the reference frame. FIG. 3 further illustrates the projected block PB. A portion of the projected block PB overlaps macroblock MB[i,j]. The overlap percentage is 50%. Another portion of the projected block overlaps macroblock MB[i+1,j]. The overlap percentage is 20%. Yet another portion of the projected block overlaps macroblock MB[i,j+1]. The overlap percentage is 20% too. Yet another portion of the projected block overlaps macroblock MB[i+1,j+1]. The overlap percentage is 10%.
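The overlap percentages of step ST4 follow from elementary geometry once the projected block's pixel position is known. The sketch below assumes 16x16 macroblocks and top-left pixel coordinates; it is illustrative, not the patent's own implementation.

```python
MB = 16  # assumed macroblock size in pixels

def overlap_percentages(px, py):
    """ST4: overlap percentage of a projected block at pixel position
    (px, py) with each reference-frame macroblock it straddles.
    Returns a dict keyed by macroblock grid index (i, j)."""
    dx, dy = px % MB, py % MB   # offset within the top-left macroblock
    i, j = px // MB, py // MB   # grid index of the top-left macroblock
    areas = {
        (i,     j):     (MB - dx) * (MB - dy),
        (i + 1, j):     dx * (MB - dy),
        (i,     j + 1): (MB - dx) * dy,
        (i + 1, j + 1): dx * dy,
    }
    # drop zero-area entries (block aligned with the grid in a dimension)
    return {k: 100.0 * a / (MB * MB) for k, a in areas.items() if a > 0}
```

For an offset of (4, 4) pixels the percentages come out as 56.25%, 18.75%, 18.75%, and 6.25%, and they always sum to 100% — close to the rounded 50/20/20/10 split that FIG. 3 depicts.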
In step ST5, the text-motion detection module MOD inverts the motion vector that belongs to the macroblock in the predicted frame for which steps ST2-ST6 are carried out. It is recalled that the motion vector defines the projection of this macroblock, which results in the projected block. Step ST5 provides an inverted motion vector, which points from the reference frame to the predicted frame (INV[MV] => IV). Consequently, the inverted motion vector points forward in time.
In step ST6, the text-motion detection module MOD associates the inverted motion vector with each macroblock in the reference frame that overlaps the projected block (MB[RF] ↔ IV, OVR%). The text-motion detection module MOD further associates with each such macroblock the overlap percentage that has been calculated, in step ST4, for the macroblock concerned in the reference frame. The overlap percentage reflects a degree of confidence, as it were, that the inverted motion vector faithfully indicates movement of textual matter from the macroblock concerned in the reference frame to the predicted frame. The text-motion detection module MOD temporarily stores the inverted motion vector and the overlap percentage for each macroblock concerned in the reference frame together with an identification of that macroblock. Referring to FIG. 3, the text-motion detection module MOD stores the inverted motion vector with an overlap percentage of 50% for macroblock MB[i,j], and stores the same inverted motion vector with an overlap percentage of 20%, 20%, and 10% for macroblocks MB[i+1,j], MB[i,j+1], and MB[i+1,j+1], respectively.
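Steps ST5 and ST6 amount to inverting the motion vector and to per-macroblock bookkeeping. A minimal sketch follows; the data-structure choices are assumptions, not taken from the description above.

```python
from collections import defaultdict

def invert(mv):
    """ST5: invert a motion vector so that it points forward in time."""
    return (-mv[0], -mv[1])

def record(store, text_mbs, overlaps, mv):
    """ST6: associate the inverted motion vector and the overlap
    percentage with every text-bearing reference-frame macroblock
    that the projected block overlaps.
    store:    defaultdict(list) mapping (i, j) -> [(iv, weight), ...]
    text_mbs: set of (i, j) indices flagged by the text detection data TD
    overlaps: dict (i, j) -> overlap percentage for one projected block"""
    iv = invert(mv)
    for mb, pct in overlaps.items():
        if mb in text_mbs:
            store[mb].append((iv, pct))
```

Macroblocks that overlap the projected block but carry no visual text are simply skipped, mirroring the reply-N branch of step ST3.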
FIG. 3 can further illustrate the aforementioned aspect, which relates to the degree of confidence in the inverted motion vector. In FIG. 3, the overlap percentage for macroblock MB[i,j] is 50%, which is relatively high. Consequently, the inverted motion vector indicates with reasonable precision movement of textual matter from macroblock MB[i,j] in the reference frame to the predicted frame. This is because the motion vector, which points from the predicted frame to the reference frame, causes a substantial overlap between the projected block and macroblock MB[i,j].
Let it be assumed that the overlap percentage for macroblock MB[i,j] were 100%. In that case, the projected block PB would fully coincide with macroblock MB[i,j]. The motion vector has caused this projection. Let it now be assumed that an inverse projection is made based on the inverted motion vector. Macroblock MB[i,j] is projected from the reference frame to the predicted frame. In this 100% overlap example, the projection of macroblock MB[i,j] will fully coincide with the relevant macroblock in the predicted frame to which the motion vector belongs. Stated boldly, a 100% overlap indicates that the motion vector, when inverted, provides an accurate projection in the opposite direction.
Let it be assumed that the text-motion detection module MOD has carried out the series of steps ST2-ST6 for each macroblock in the predicted frame. The text-motion detection module MOD has generated the following data for each macroblock in the reference frame that comprises visual text: at least one inverted motion vector, and an overlap percentage associated with each inverted motion vector. Generally, various inverted motion vectors will be generated for a macroblock in the reference frame that comprises visual text. This can be explained with reference to FIG. 3.
FIG. 3 illustrates the projection of a particular macroblock from the predicted frame to the reference frame. The text-motion detection module MOD will project further, neighboring macroblocks from the predicted frame to the reference frame. Each of these respective projections will provide respective projected blocks. Any of these projected blocks may overlap one or more of the macroblocks MB[i,j], MB[i+1,j], MB[i,j+1], MB[i+1,j+1], which FIG. 3 illustrates. Furthermore, each respective projection will be based on a different motion vector, namely the motion vector that belongs to the macroblock in the predicted frame, which is projected.
In step ST7, the text-motion detection module MOD calculates a displacement vector for the visual text (CLC[DV]). The text-motion detection module MOD makes a weighed combination of the respective inverted motion vectors, which have been established for macroblocks in the reference frame that comprise visual text. The respective overlap percentages, which are associated with the respective inverted motion vectors, constitute weighing factors. The aforementioned weighed combination constitutes the displacement vector (AVG[IV,OVR%] => DV).
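The weighed combination of step ST7 is then a weighted average over all recorded inverted motion vectors, with the overlap percentages as weights. A sketch under an assumed data structure of per-macroblock lists of (inverted vector, weight) pairs:

```python
def displacement_vector(store):
    """ST7: weighted average of the inverted motion vectors recorded for
    text-bearing macroblocks; the overlap percentages act as weights."""
    sx = sy = w = 0.0
    for entries in store.values():
        for (ivx, ivy), pct in entries:
            sx += ivx * pct
            sy += ivy * pct
            w += pct
    # no recorded vectors means no detected text motion
    return (sx / w, sy / w) if w else (0.0, 0.0)
```

A vector recorded with a 50% overlap thus pulls the displacement vector toward itself more strongly than one recorded with a 10% overlap, reflecting the confidence interpretation described above.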
The text-motion detection module MOD may also carry out another displacement-vector calculation, which provides an indication of movement of visual text from a predicted frame (P-frame or B-frame) to a reference frame (I-frame) that is subsequent to the predicted frame. This displacement-vector calculation is different from the one that FIG. 2 illustrates. The main features of the other displacement-vector calculation are as follows.
The text-motion detection module MOD first establishes a set of macroblocks in the predicted frame that comprises visual text. This text detection may be based on, for example, one or more displacement vectors that have been calculated for previous frames. Each macroblock in the aforementioned set has a motion vector that points from the predicted frame to the reference frame. The motion vector points forward in time because the reference frame is subsequent to the predicted frame. The text-motion detection module MOD calculates an average of all relevant motion vectors, that is, all motion vectors that belong to the set of macroblocks that comprises visual text. The average constitutes the displacement vector.
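In this other calculation the motion vectors already point forward in time, so the displacement vector reduces to a plain average. A sketch, assuming vectors given as (x, y) tuples:

```python
def displacement_vector_fwd(motion_vectors):
    """Average the motion vectors of the macroblocks already known to
    contain visual text; no inversion is needed because these vectors
    point forward in time (the reference frame follows the predicted
    frame)."""
    n = len(motion_vectors)
    sx = sum(v[0] for v in motion_vectors)
    sy = sum(v[1] for v in motion_vectors)
    return (sx / n, sy / n)
```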
Accordingly, the text-motion detection module MOD may calculate a sequence of displacement vectors for a sequence of frames. To that end, the text-motion detection module MOD may carry out either of the displacement-vector calculations described hereinbefore. Generally, displacement vectors that form part of the sequence should be similar. This is because text generally moves on the screen in a steady, monotonous fashion, that is, text generally scrolls with constant speed.
The text-motion detection module MOD may check whether the displacement vectors are indeed similar, or not. The motion indication, which the text-motion detection module MOD provides, may indicate an anomaly when the displacement vectors are substantially different. Alternatively, the text-motion detection module MOD may provide a no-motion indication, or may signal this anomaly to the text detection module TXD. This anomaly signaling may cause the text detection module TXD to make one or more further detections or to make a different, more precise detection. The text-motion detection module MOD may also signal the anomaly to the character recognition module OCR, so as to prevent erroneous character recognition.
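The similarity check can be sketched as a deviation-from-mean test over the sequence of displacement vectors; the tolerance value below is a hypothetical parameter, not specified by the description.

```python
def check_motion(dvs, tol=1.0):
    """Flag an anomaly when displacement vectors in a sequence deviate
    substantially from their mean; tol is an assumed per-component
    tolerance in pixels (scrolling text should move at near-constant
    speed). Returns (anomaly, mean displacement vector)."""
    n = len(dvs)
    mean = (sum(v[0] for v in dvs) / n, sum(v[1] for v in dvs) / n)
    anomaly = any(abs(v[0] - mean[0]) > tol or abs(v[1] - mean[1]) > tol
                  for v in dvs)
    return anomaly, mean
```

The returned mean corresponds to the averaged displacement vector of the motion indication MI described next; the anomaly flag is what would be signaled to the text detection module TXD or the character recognition module OCR.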
The motion indication MI, which the text-motion detection module MOD provides, may comprise the following elements: an average of the displacement vectors, which have been established for a sequence of frames, accompanied by an indication of the frame in which the visual text concerned enters the screen and an indication of the frame in which the visual text leaves the screen. To that end, the text-motion detection module MOD may receive frame indices from the encoding-and-decoding module CODEC. Alternatively, the database manager may carry out this marking of the visual text entering the screen and leaving the screen, respectively.

CONCLUDING REMARKS
The detailed description hereinbefore with reference to the drawings illustrates the following characteristics, which are cited in claim 1. A video apparatus (digital video recorder DVR) comprises a text detection module (TXD), a text-motion detection module (MOD), and a user interface (UIF). The text detection module (TXD) detects visual text in video data. The text-motion detection module (MOD) provides a motion indication (MI) for the visual text that the text detection module (TXD) has detected. The user interface (UIF) allows a user to access a particular portion of the video data on the basis of the motion indication (MI) that the text-motion detection module (MOD) has provided.
The detailed description hereinbefore further illustrates the following optional characteristics, which are cited in claim 2. A control module (CTRL) marks a particular portion of the video data, which comprises the visual text, on the basis of the motion indication (MI), which the text-motion detection module (MOD) has provided. The control module may, for example, insert a marker in the video data or store a marker in a database in association with the video data. The marker may be a start marker, an end marker, or any other marker that is useful for content management. These characteristics allow the user to conveniently access the video data any subsequent time.
The detailed description hereinbefore further illustrates the following optional characteristics, which are cited in claim 3. A character recognition module (OCR) derives a sequence of characters (TXT) from the visual text in the video data that the text detection module (TXD) has detected. A data-association module (DAS) associates the sequence of characters (TXT), which the character recognition module (OCR) has derived, with the motion indication (MI), which the text-motion detection module (MOD) has provided. These characteristics facilitate the retrieval and the handling of textual information comprised in video data. Consequently, these characteristics further contribute to more user convenience. The detailed description hereinbefore illustrates various aspects thereof, which are cited in claims 4, 5, and 6.
The detailed description hereinbefore further illustrates the following optional characteristics, which are cited in claim 7. A video processing module (CODEC) provides motion vectors (MV) that indicate movement of an object, which the video data represents, from a predicted frame (PF) to a reference frame (RF). The text-motion detection module (MOD) establishes the motion indication (MI) for the visual text that the text detection module (TXD) has detected, on the basis of the motion vectors (MV), which the video processing module (CODEC) has provided.
The aforementioned characteristics can be implemented in numerous different manners. In order to illustrate this, some alternatives are briefly indicated. The digital video recorder described in detail hereinbefore is merely an example of a video apparatus in accordance with the invention. The video apparatus may also be in the form of, for example, a set-top box, a television set, or a mobile phone. The digital video apparatus need not necessarily be MPEG-based. The invention can be applied in any video apparatus that comprises a video processor providing some form of motion indication. For example, the digital video apparatus may be based on the H.263 standard for mobile video telephony. The digital video apparatus need not necessarily comprise a disk drive or any video storage device.
The digital video apparatus need not necessarily comprise any video coder or decoder. It is possible to detect visual text in plain, uncompressed video data without the use of any (de-)coder parameters. It is also possible to detect text motion in plain, uncompressed video data without the use of any motion vectors, which a video (de-)coder typically provides. In case visual text is detected on the basis of (de-)coder parameters, these parameters need not necessarily be standard coding parameters. For example, visual text detection may involve non-standard parameters, which are specific to a particular implementation of a video coder or a decoder. That is, the video coder or decoder generates proprietary parameters, which may be used for the purpose of visual text detection. Character recognition, if any, may be carried out in a classical fashion, without the use of any text indication derived from (de-)coder parameters or any other parameters relating to the video data.
There are numerous different techniques to detect text motion. The technique described hereinbefore with reference to FIG. 2 is merely an example. For example, a motion-vector weighing based on overlap calculation is not mandatory, although such a weighing calculation is advantageous. It is not necessary to weigh motion vectors, which a (de-)coder provides, in order to establish a text-motion indication. That is, the text-motion detection that FIG. 2 illustrates can be simplified. Such simplification may, however, be at the expense of motion-detection precision. It should further be noted that text-motion detection results need not necessarily be stored in a semantic database. For example, an application may use text-motion detection results for the purpose of content marking or image quality improvement only. In such an application, text-motion detection results can be stored in an ordinary memory. The terms "frame" and "image" should be understood in a broad sense. These terms are interchangeable and include a field or any other entity that may wholly or partially constitute an image or picture. The term "scrolling text" should equally be understood in a broad sense. Scrolling text may move in a non-monotonous fashion. For example, scrolling text may move in a discontinuous, jumpy fashion or have a significant acceleration.
There are numerous ways of implementing functions by means of items of hardware or software, or both. In this respect, the drawings are very diagrammatic, each representing only one possible embodiment of the invention. Thus, although a drawing shows different functions as different blocks, this by no means excludes that a single item of hardware or software carries out several functions. Nor does it exclude that an assembly of items of hardware or software or both carry out a function.
The remarks made hereinbefore demonstrate that the detailed description, with reference to the drawings, illustrates rather than limits the invention. There are numerous alternatives, which fall within the scope of the appended claims. Any reference sign in a claim should not be construed as limiting the claim. The word "comprising" does not exclude the presence of other elements or steps than those listed in a claim. The word "a" or "an" preceding an element or step does not exclude the presence of a plurality of such elements or steps.

Claims

1. A video apparatus (DVR) comprising: a text detection module (TXD) arranged to detect visual text in video data; a text-motion detection module (MOD) arranged to provide a motion indication (MI) for the visual text that the text detection module (TXD) has detected; and a user interface (UIF) arranged to allow a user to access a particular portion of the video data on the basis of the motion indication (MI) that the text-motion detection module (MOD) has provided.
2. A video apparatus as claimed in claim 1 comprising a control module
(CTRL) arranged to mark a particular portion of the video data, which comprises the visual text, on the basis of the motion indication (MI), that the text-motion detection module (MOD) has provided.
3. A video apparatus as claimed in claim 1 comprising: a character recognition module (OCR) arranged to derive a sequence of characters (TXT) from the visual text in the video data that the text detection module (TXD) has detected, and a data-association module (DAS) arranged to associate the sequence of characters (TXT), which the character recognition module (OCR) has derived, with the motion indication (MI), that the text-motion detection module (MOD) has provided.
4. A video apparatus as claimed in claim 1, the user interface (UIF) being arranged to allow a user to request a program-change search, the video apparatus (DVR) comprising: a control module (CTRL) arranged to select, in response to the program-change search, a particular portion of the video data for which the text-motion detection module (MOD) has provided a motion indication (MI) that indicates vertically scrolling visual text.
5. A video apparatus as claimed in claim 3, the user interface (UIF) being arranged to allow a user to present a program-data query, the video apparatus (DVR) comprising: a control module (CTRL) arranged to retrieve, in response to the program-data query, a sequence of characters (TXT) associated with a text motion indication (MI) that indicates vertically scrolling text.
6. A video apparatus as claimed in claim 3, the user interface (UIF) being arranged to allow a user to present a news-data query, the video apparatus (DVR) comprising: - a control module (CTRL) arranged to retrieve, in response to the news- data query, a sequence of characters (TXT) associated with a text motion indication (MI) that indicates horizontally scrolling text.
7. A video apparatus as claimed in claim 1 comprising: - a video processing module (CODEC) arranged to provide motion vectors (MV) that indicate movement of an object, which the video data represents, from a predicted frame (PF) to a reference frame (RF), the text-motion detection module (MOD) being arranged to establish the motion indication (MI) for the visual text that the text detection module (TXD) has detected, on the basis of the motion vectors (MV), that the video processing module (CODEC) has provided.
8. A video apparatus as claimed in claim 7, the text detection module (TXD) being arranged to indicate pixel-blocks in the reference frame (RF) that comprise visual text, the text-motion detection module (MOD) being arranged to calculate a displacement vector, which indicates movement of the visual text from the reference frame (RF) to the predicted frame (PF), on the basis of respective motion vectors (MV) that cause respective pixel-blocks in the predicted frame (PF) to be at least partially projected to a pixel- block in the reference frame (RF) that comprises visual text.
9. A video apparatus as claimed in claim 8, the text-motion detection module (MOD) being arranged to calculate a measure of overlap (OVR%) that indicates the extent to which a projection of a pixel-block from the predicted frame (PF) to the reference frame (RF) in accordance with the motion vector for that pixel-block, overlaps a pixel-block in the reference frame (RF), the motion detection module being further arranged to use the measure of overlap (OVR%) as a weighing factor for the motion vector in the calculation of the displacement vector.
10. A video apparatus as claimed in claim 1, the text-motion detection module (MOD) being arranged to establish a plurality of motion indications (MI) for a sequence of images, a motion indication (MI) relating to a displacement, if any, of visual text from one image to another, the text-motion detection module (MOD) further being arranged to check whether the motion indications (MI) are similar or not, and to provide an anomaly indication when the motions indications are not similar.
11. A method of handling video data comprising: a text detection step (TXD) in which visual text in the video data is detected ; - a text-motion detection step (MOD) in which a motion indication (MI) for the visual text that the text detection module (TXD) has detected, is provided; and a user interface (UIF) step which allows a user to access a particular portion of the video data on the basis of the motion indication (MI) that the text-motion detection step has provided.
12. A computer program product for a video apparatus (DVR), the computer program product comprising a set of instructions that, when loaded into the video apparatus (DVR), causes the video apparatus (DVR) to carry out: a text detection step (TXD) in which visual text in the video data is detected; a text-motion detection step (MOD) in which a motion indication (MI) for the visual text that the text detection module (TXD) has detected is provided; and a user interface (UIF) step which allows a user to access a particular portion of the video data on the basis of the motion indication (MI) that the text-motion detection step has provided.
PCT/IB2006/050936 2005-03-30 2006-03-28 Method and apparatus for the detection of text in video data WO2006103625A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP05300230 2005-03-30
EP05300230.9 2005-03-30

Publications (1)

Publication Number Publication Date
WO2006103625A1 true WO2006103625A1 (en) 2006-10-05

Family

ID=36649428

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2006/050936 WO2006103625A1 (en) 2005-03-30 2006-03-28 Method and apparatus for the detection of text in video data

Country Status (1)

Country Link
WO (1) WO2006103625A1 (en)


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CRANDALL D ET AL: "Extraction of special effects caption text events from digital video", INTERNATIONAL JOURNAL ON DOCUMENT ANALYSIS AND RECOGNITION, 2003, pages 138 - 157, XP002390630, Retrieved from the Internet <URL:http://citeseer.ist.psu.edu/crandall03extraction.html> [retrieved on 20060803] *
CRANDALL D: "EXTRACTION OF UNCONSTRAINED CAPTION TEXT FROM GENERAL-PURPOSE VIDEO", May 2001, THE PENNSYLVANIA STATE UNIVERSITY, XP002390637 *
HUIPING LI ET AL: "Automatic Text Detection and Tracking in Digital Video", IEEE TRANSACTIONS ON IMAGE PROCESSING, IEEE SERVICE CENTER, PISCATAWAY, NJ, US, vol. 9, no. 1, January 2000 (2000-01-01), XP011025505, ISSN: 1057-7149 *
JUNG K ET AL: "Text information extraction in images and video: a survey", PATTERN RECOGNITION, ELSEVIER, KIDLINGTON, GB, vol. 37, no. 5, May 2004 (2004-05-01), pages 977 - 997, XP004496837, ISSN: 0031-3203 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7912289B2 (en) 2007-05-01 2011-03-22 Microsoft Corporation Image text replacement
CN116911924A (en) * 2023-09-12 2023-10-20 南京闲侠信息科技有限公司 Intelligent advertisement data comparison method and system
CN116911924B (en) * 2023-09-12 2023-11-21 南京闲侠信息科技有限公司 Intelligent advertisement data comparison method and system

Similar Documents

Publication Publication Date Title
Tan et al. Rapid estimation of camera motion from compressed video with application to video annotation
US6912327B1 (en) Imagine information describing method, video retrieval method, video reproducing method, and video reproducing apparatus
US7046731B2 (en) Extracting key frames from a video sequence
KR100915847B1 (en) Streaming video bookmarks
US7469010B2 (en) Extracting key frames from a video sequence
US9147112B2 (en) Advertisement detection
US20100322310A1 (en) Video Processing Method
JPH04207878A (en) Moving image management device
JP2007082088A (en) Contents and meta data recording and reproducing device and contents processing device and program
US20140147100A1 (en) Methods and systems of editing and decoding a video file
KR100846770B1 (en) Method for encoding a moving picture and apparatus therefor
CN1312614C (en) Method and apparatus for detecting fast motion scenes
US6801294B2 (en) Recording and/or reproducing apparatus and method using key frame
Zhang Content-based video browsing and retrieval
Smeaton Indexing, browsing and searching of digital video
KR20070028535A (en) Video/audio stream processing device and video/audio stream processing method
WO2006103625A1 (en) Method and apparatus for the detection of text in video data
US7353451B2 (en) Meta data creation apparatus and meta data creation method
JP2002281433A (en) Device for retrieving and reading editing moving image and recording medium
JP2001119661A (en) Dynamic image editing system and recording medium
KR20060102639A (en) System and method for playing mutimedia data
US20060215995A1 (en) Information recording device information recording method information reproduction device information reproduction method information recording program information reproduction program information recording medium and recording medium
JP2002199348A (en) Information reception recording and reproducing device
KR20020040503A (en) Shot detecting method of video stream
AU762791B2 (en) Extracting key frames from a video sequence

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

NENP Non-entry into the national phase

Ref country code: RU

WWW Wipo information: withdrawn in national office

Country of ref document: RU

122 Ep: pct application non-entry in european phase

Ref document number: 06727753

Country of ref document: EP

Kind code of ref document: A1

WWW Wipo information: withdrawn in national office

Ref document number: 6727753

Country of ref document: EP