WO2015165524A1 - Extracting text from video - Google Patents

Extracting text from video

Info

Publication number
WO2015165524A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
frames
video
track
frame
Application number
PCT/EP2014/058832
Other languages
French (fr)
Inventor
Hoang-Vu DANG
Stephen Davis
Original Assignee
Longsand Limited
Application filed by Longsand Limited filed Critical Longsand Limited
Priority to PCT/EP2014/058832 priority Critical patent/WO2015165524A1/en
Publication of WO2015165524A1 publication Critical patent/WO2015165524A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/635 Overlay text, e.g. embedded captions in a TV program

Definitions

  • the display device (109) may be provided to allow a user of the system (100) to interact with and implement the functionality of the system (100).
  • the peripheral device adapters (104) may also create an interface between the processor (101 ) and the display device (109), a printer, or other media output devices.
  • the network adapter (103) may provide an interface to other computing devices within, for example, a network, thereby enabling the transmission of data between the system (100) and other devices located within the network.
  • the system (100) may, when executed by the processor (101), display a number of graphical user interfaces (GUIs) on the display device (109) associated with the executable program code representing the number of applications stored on the data storage device (102).
  • the GUIs may display, for example, user-interactive commands and text extracted from the video input.
  • via making a number of interactive gestures on the GUIs of the display device (109), a user may set a number of predefined aspects of the present systems and methods to adjust their operation.
  • Examples of display devices (109) include a computer screen, a laptop screen, a mobile device screen, a personal digital assistant (PDA) screen, and a tablet screen, among other display devices (109). Examples of the GUIs displayed on the display device (109) will be described in more detail below.
  • the system (100) further comprises a number of modules used in the implementation of text extraction from video input.
  • the various modules within the system (100) comprise executable program code that may be executed separately.
  • the various modules may be stored as separate computer program products.
  • the various modules within the system (100) may be combined within a number of computer program products; each computer program product comprising a number of the modules.
  • the system (100) may include a potential text module (112) to, when executed by the processor (101), detect a number of regions in the video frames in which text is located.
  • the text located within these regions may be referred to as text images.
  • text image or similar language is meant to be understood broadly as a visual image of text within a frame of a video input.
  • Text images may be images displayed in or with the video images in the form of subtitles, closed captioning, film credits, news headlines, advertisements, information boxes on shopping channels, statistical boxes in sports coverage, captions including names of interviewees, , among many other text-related images.
  • the potential text module (112) bounds a region containing the text and determines the text's geometric position within the frames using, for example, a rectangular bounding box. In one example, the potential text module (112) does not obtain or store an actual image of the region bounded by the rectangular bounding box, but, instead, provides a reference to the relevant frame and coordinates within the frame to identify the regions in which text is located. In this example, the system refrains from storing actual images that may consume a large amount of storage space and computing resources.
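As a rough illustration of this reference-based approach, a region can be stored as a frame index plus bounding-box coordinates, with pixels fetched only on demand. This is a minimal sketch, not the patent's implementation; the class and field names are hypothetical, and frames are assumed to be grayscale NumPy-style arrays.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TextRegion:
    """Potential text stored by reference: a frame index and a rectangular
    bounding box, rather than a copy of the underlying pixels."""
    frame_index: int  # which stored frame the region belongs to
    x: int            # left edge of the bounding box
    y: int            # top edge of the bounding box
    width: int
    height: int

    def crop(self, frames):
        """Fetch the actual pixels on demand from the frame history."""
        frame = frames[self.frame_index]
        return frame[self.y:self.y + self.height, self.x:self.x + self.width]
```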
  • the system (100) may include a potential text matching module (113) to, when executed by the processor (101), compare a line of the potential text within a first frame with a number of tracks appearing in a number of recent frames.
  • the potential text matching module (113) further creates a new track if comparing the line of the potential text within the first frame with the tracks appearing in the recent frames results in no match.
  • the potential text matching module (113) further appends a line to a best matching track that already exists if comparing the line of the potential text within the first frame with the tracks appearing in the recent frames results in a match.
  • the system (100) may include a track creation module (114) to, when executed by the processor (101), create a track image.
  • the track creation module (114) may further combine a number of tracks to create a track image, and determine whether a number of conditions are met for a given track to be converted to text.
  • the system (100) may include an optical character recognition (OCR) module (111) to, when executed by the processor (101), extract a number of characters from the track image and convert them into a convenient character encoding such as the American Standard Code for Information Interchange (ASCII) format. Other examples of character encodings include the Universal Character Set (UCS) and Unicode Transformation Formats (UTF) such as UTF-8.
  • the converted text may then be searched and analyzed by other text-based applications.
  • the system (100) may include modules in addition to those described above to bring about the processes described herein.
  • Fig. 2 is a flowchart depicting a method (200) of extracting text from video, according to one example of the principles described herein.
  • the method (200) may begin by storing (block 201) a number of frames of a video input in a memory.
  • Video may be any video data presented to the system (100) via the video input (Fig. 1, 110), and stored in, for example, the cache (Fig. 1, 118) associated with the processor (101) of the system (100).
  • the cache may be the RAM (Fig. 1, 106).
  • the size of the memory may vary depending on the number of frames of video data to be stored in the memory.
  • the number of frames of video data to be stored in the memory may depend on the number of frames to be compared.
  • the video may be any form of video data which the video input (Fig. 1, 110) obtains from a source.
  • the video may comprise data stored in a separate data storage device from which the video input (Fig. 1, 110) obtains the video.
  • the video may be streaming video.
  • the video may be live video obtained from a live television broadcast.
  • the video may be interlaced video or progressive video.
  • the video may be in analog or digital format.
  • the video may be a combination of the above forms of video.
  • the method of Fig. 2 may further comprise identifying (block 202) potential text within the frames.
  • the term "frame" is meant to be understood broadly as any still image among a number of still images that, when displayed in sequence, produce the illusion of a moving image.
  • each frame may be flashed on a screen for a short time, and then immediately replaced by the next one.
  • Such a display of frames creates a video display.
  • the frames may be represented as analog waveforms in which varying voltages represent the intensity of light in an analog raster scan across the screen.
  • analog blanking intervals may separate video frames in the same way that frame lines do in film.
  • the frames may be represented in digital format in which the video system represents the video frame as a rectangular raster of pixels.
  • Video frames may be identified using SMPTE time codes that contain binary coded decimal "hour:minute:second:frame" identification. Time codes may use a number of frame rates such as, for example, 24 frames per second (fps), 25 fps, 29.97 (30 ÷ 1.001) fps, and 30 fps.
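For non-drop-frame time codes, converting "hour:minute:second:frame" identification into an absolute frame number is simple arithmetic. A small sketch under that assumption (29.97 fps drop-frame time code skips frame numbers and would need extra handling):

```python
def timecode_to_frame(timecode: str, fps: int = 30) -> int:
    """Convert a non-drop-frame SMPTE "HH:MM:SS:FF" time code into an
    absolute frame number at a nominal integer frame rate."""
    hours, minutes, seconds, frames = (int(part) for part in timecode.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * fps + frames

# Example: timecode_to_frame("01:02:03:04", fps=25) returns 93079.
```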
  • Potential text may be identified (block 202) using image regions within each frame of the video feed.
  • an image region may comprise a single, potential text character.
  • the image region may be defined by a rectangular bounding box locating a given image region's geometric position within a given frame.
  • the identification (block 202) of the image regions does not store an actual image, but, instead, comprises a reference to the relevant frame and the geometric position within that frame.
  • the rectangular bounding box of the potential text character, characters, words, phrases, sentences, paragraphs, other groupings of text, at least one of the above, or combinations thereof may be sufficient in identifying (block 202) the potential text.
  • Latin-based alphabets have relatively large variations between the dimensions of different characters.
  • far East Asian-based text such as Chinese, Korean, Japanese, and Hindi, among many other alphabets, may be the target potential text.
  • These types of texts consist mostly of characters of very similar dimensions. In this situation, it may be advantageous to match extra features like foreground-background ratios or component counts to improve matching accuracy even at the cost of extra processing time.
  • potential text regions may comprise potential words or word fragments, rather than the single characters described above. This will, for example, be more convenient in Arabic-based languages where letters are generally joined together. Breaking such words into individual letters may require additional analysis which is inconvenient, introduces the potential for inaccuracies in the process, and may not actually be necessary for the methods described herein.
  • text images may be images of text within the video images as captured by a video capture device such as a video camera, as well as text that is superimposed onto a video image.
  • text images may also be images of text within the video images themselves such as street signs, protest signs, or other text captured in the video.
  • the process of Fig. 2 may further comprise comparing (block 203) a line of the potential text within a first frame with a number of tracks appearing in a number of recent frames.
  • the term "line" is meant to be understood broadly as a sequence of potential characters in the same frame with their corresponding character features. These lines are produced by the potential text identification of block 202 described above.
  • the term "track" is meant to be understood broadly as a continuous body of text corresponding to lines appearing in multiple frames.
  • the text in a track may comprise stationary text such as text found in news headlines and channel logos.
  • the text in a track may comprise horizontally scrolling text such as text found in breaking news feeds, tickers and in the display of stock prices.
  • the text in a track may comprise vertically scrolling text such as text found in film credits displayed at the beginning or end of a movie production or similar scenarios.
  • the text in a track may comprise text that jumps up line by line such as text found in computer-generated subtitles.
  • a track may be a sequence of potential characters that undergo a geometric translation across different frames.
  • the translation vector is the same for all characters in the same track. In practice, it may be that most tracks have constant translations. However, the present systems and methods allow the translation vector to change over time. Stationary text displayed in a video frame is a particular case where the translation vector equals zero.
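A track can therefore be modeled as an ordered character sequence plus a per-frame translation vector, with (0, 0) for stationary text. The sketch below is illustrative only; the field and method names are hypothetical rather than taken from the patent.

```python
from dataclasses import dataclass, field

@dataclass
class Track:
    """A continuous body of text built from lines matched across frames."""
    characters: list = field(default_factory=list)  # per-character features, in order
    translation: tuple = (0, 0)  # per-frame (dx, dy); (0, 0) for stationary text
    last_matched_frame: int = 0  # used later to decide when the track expires
    lifetime: int = 1            # how many frames the track has been alive

    def advance(self, frame_index: int, new_tail: list) -> None:
        """Record a matching line, appending only the characters that extend
        beyond the current tail (new characters scroll in at the line's end)."""
        self.characters.extend(new_tail)
        self.last_matched_frame = frame_index
        self.lifetime += 1
```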
  • At block 203, if comparing the line of the potential text within the first frame with the tracks appearing in the recent frames results in no match (block 203, determination NO), then a new track is created (block 204) from the line. If, however, comparing the line of the potential text within the first frame with the tracks appearing in the recent frames results in a match (block 203, determination YES), then the line is appended (block 205) to a best matching track that already exists.
  • Block 203 reduces redundancy by removing duplicates of potential text that has already been identified. This reduces or eliminates the need to process frames or regions of text within the frames a plurality of times.
  • the method of Fig. 2 may then continue with combining (block 206) a number of tracks to create a track image, and, with an optical character recognition module, extracting (block 207) a number of characters from the track image.
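The overall flow of blocks 201 through 207 can be sketched as a single loop. This skeleton is a reading aid, not the claimed method; every callable it accepts (detect_lines, match_track, and so on) is a hypothetical stand-in for the corresponding block.

```python
def extract_text(frames, detect_lines, match_track, new_track, append_line,
                 combine_tracks, run_ocr):
    """Skeleton of the Fig. 2 method, with each block supplied as a callable."""
    tracks = []
    for frame in frames:                        # block 201: frames held in memory
        for line in detect_lines(frame):        # block 202: identify potential text
            best = match_track(line, tracks)    # block 203: compare line to tracks
            if best is None:
                tracks.append(new_track(line))  # block 204: no match, new track
            else:
                append_line(best, line)         # block 205: append to best match
    track_image = combine_tracks(tracks)        # block 206: build the track image
    return run_ocr(track_image)                 # block 207: extract characters once
```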
  • the method of Fig. 2 will now be described in more detail in connection with Figs. 3A and 3B.
  • Figs. 3A and 3B depict a flowchart depicting a method (300) of extracting text from video, according to another example of the principles described herein.
  • the method (300) may begin by storing (block 301) a number of frames of a video input in a memory as described above in connection with block 201 of Fig. 2.
  • the number of frames stored (block 301) by the system is user-definable.
  • a user may choose to have more frames stored when processing video with a high frame rate, whereas the user may choose to store relatively fewer frames when processing video with a relatively lower frame rate.
  • if the input video comprises a frame rate of 30 fps, a second's worth of video input would provide 30 frames for storage. In this example, the user may set the threshold for the number of frames to store at 30 or more frames for the 30 fps video input.
  • in another example, the user may set the threshold for the number of frames to store at 15 or more frames for the 30 fps video input.
  • the number of frames stored (block 301) by the system is two or more. In this example, as few as two frames may be compared as will be described in more detail below in connection with block 302. In another example, the number of frames stored (block 301) by the system may be based on the amount of storage available. In still another example, the number of frames stored (block 301) by the system may be based on the frame rate of the video feed. In still another example, if the analysis of block 302 is not applied within the overall method (300) of Fig. 3, then the number of frames stored (block 301) by the system may be one.
  • the number of frames stored (block 301) by the system may be based on a combination of the above.
  • the method of Fig. 3 may further comprise comparing (block 302) a number of the frames with one another to identify regions that vary across the frames; rapidly changing regions may be disqualified as potential text.
  • a threshold may be set to determine what type of variation would disqualify a region among a number of frames as potential text. For example, a threshold may be set to where, if the region experiences a variation every two frames, then the region is disqualified as containing potential text. In this example, the threshold would then be set at every 1/15th of a second in a 30 fps video feed. Other thresholds may be longer or shorter. In one example, this frame variation threshold is user-definable.
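One plausible way to implement such a filter is to average the per-pixel change across the stored frames and keep only regions below the threshold. A minimal sketch, assuming grayscale frames as NumPy arrays; the function name and default threshold are illustrative:

```python
import numpy as np

def stable_region_mask(frames, change_threshold=12.0):
    """Mark pixels whose mean absolute frame-to-frame change stays below a
    threshold; rapidly changing areas are disqualified as potential text."""
    stack = np.stack([f.astype(np.float32) for f in frames])  # (N, H, W)
    change = np.abs(np.diff(stack, axis=0)).mean(axis=0)      # per-pixel variation
    return change < change_threshold  # True where superimposed text could live
```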
  • the video feed may comprise scrolling text.
  • Scrolling text is sometimes referred to as tickers, a crawler, or a slide.
  • the scrolling text may, for example, be presented in, for example, a lower third of the video frames and may be used to present headlines, minor pieces of news, and stock values, among other information.
  • the text scrolled across consecutive frames may be detected as translations of parts of an image at a generally constant speed. If the language being used in the video frames is known, the expected direction of horizontal scrolling, right-to-left for Latin- or Cyrillic-based alphabets and left-to-right for Arabic-based alphabets, may be used to eliminate additional non-text regions as potential text.
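A constant-speed translation can be estimated by testing small horizontal shifts between the same band of two consecutive frames and keeping the one that aligns best. This sketch is one possible realization, assuming grayscale NumPy arrays; the sign of the returned shift indicates the scroll direction.

```python
import numpy as np

def estimate_scroll_shift(prev_band, next_band, max_shift=20):
    """Find the horizontal shift (pixels per frame) that best re-aligns a
    candidate text band between two consecutive frames."""
    prev = prev_band.astype(np.float32)
    nxt = next_band.astype(np.float32)
    best_shift, best_error = 0, float("inf")
    for shift in range(-max_shift, max_shift + 1):
        diff = np.abs(np.roll(nxt, shift, axis=1) - prev)
        error = diff[:, max_shift:-max_shift].mean()  # ignore wrapped columns
        if error < best_error:
            best_shift, best_error = shift, error
    return best_shift  # the sign of the shift gives the scrolling direction
```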
  • the method of Fig. 3 may further comprise identifying (block 303), with the processor (101) executing the potential text module (Fig. 1, 112), potential text within the frames as described above in connection with block 202 of Fig. 2.
  • Potential text may be identified (block 303) using image regions within each frame of the video feed.
  • a number of sub-processes may be used to identify (block 303) potential text within a number of frames.
  • the processor (101) executing the potential text module (Fig. 1, 112) identifies parts of the image with character-like characteristics, such as lines with uniform thicknesses, shapes with the same level of complexity as a character in an alphabet, pen stroke width uniformity, pen stroke angle distribution, vertical pixel density, horizontal pixel density, edge detection, among others, at least one of the above, or combinations thereof.
  • a second sub-process may identify regions of uniform color within the number of frames. In most cases, the characters in a line of text will all have the same color. These lines of text contrast with the video background, and are easily identifiable for this reason, as the uniform letter color may be identified against arbitrary background images. In addition to or instead of the above, the sub-process may also use a uniform background to identify potential text regions.
  • the least computationally expensive sub-processes of block 303, including the identification of parts of the image with character-like characteristics and the identification of regions of uniform color within the number of frames, may be performed first. In this manner, less data will need to be analyzed by the remainder of the sub-processes. This will result in an increase in overall processing speed.
  • the identification of parts of the image with character-like characteristics is relatively less computationally expensive than the identification of regions of uniform color within the number of frames.
  • other forms or sub-processes to identify (block 303) potential text within the frames may be used alone or in combination with the above-described sub-processes.
  • blocks 302 and 303 may be performed in any order, including the identification of parts of the image with character-like characteristics and identification of regions of uniform color within the number of frames described above in connection with block 303.
  • the processor (101 ) executing the potential text module (Fig. 1 , 1 12), may determine which process or sub-process will result in the most effective or useful data, and perform that process or sub-process first.
  • the process of Fig. 3 may further comprise comparing (block 304), with the processor (101) executing the potential text matching module (113), a line of the potential text within a first frame with a number of tracks appearing in a number of recent frames. If comparing the line of the potential text within the first frame with the tracks appearing in the recent frames results in no match (block 304, determination NO), then a new track is created (block 305) from the line. If, however, comparing the line of the potential text within the first frame with the tracks appearing in the recent frames results in a match (block 304, determination YES), then the line is appended (block 306) to a best matching track that already exists.
  • the regions of the number of frames corresponding to the potential characters may be saved with the tracks in, for example, the data storage device (Fig. 1, 102).
  • any frames processed and no longer needed after the analysis associated with blocks 302 or 303 are deleted from the memory such as the cache (Fig. 1, 118), deleted from the data storage device (Fig. 1, 102), or otherwise not stored, and are discarded.
  • previous frames corresponding to the potential characters are stored in, for example, the frame history (Fig. 1, 117) of the data storage device (Fig. 1, 102).
  • These frames may be retained in the frame history (Fig. 1, 117) of the data storage device (Fig. 1, 102) until they are no longer needed, as determined by block 311 described below.
  • a sequence of character features for each line is analyzed by the processor (101) executing the potential text matching module (113).
  • the processor (101 ) identifies a geometric translation that maps the sequence of character features to the track. For scrolling text, a match between an initial segment of the line and a tail segment of the track is identified since each line contains new characters at the end.
  • the Random Sample Consensus (RANSAC) algorithm may be utilized.
  • the RANSAC algorithm offers high resistance to variations in character features due to noise. In most cases, however, relatively simpler and, hence, faster matching algorithms may be sufficient.
  • the parameter space of the frames may be searched in order to find a correct translation. This may be advantageous in situations where the lines within the frames are short.
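A RANSAC-style matcher repeatedly hypothesizes a translation from one line-to-track character correspondence and keeps the hypothesis with the most inliers, which makes it tolerant of characters that noise has broken up or dropped. A simplified sketch, using character bounding-box positions as the features; the parameter values are illustrative, not from the patent:

```python
import random

def ransac_translation(line_boxes, track_boxes, iterations=50, tolerance=3):
    """Estimate the translation mapping a line's character positions onto a
    track's tail. Boxes are (x, y) positions of character bounding boxes."""
    best_translation, best_inliers = None, 0
    for _ in range(iterations):
        lx, ly = random.choice(line_boxes)  # hypothesize one correspondence
        tx, ty = random.choice(track_boxes)
        dx, dy = tx - lx, ty - ly
        inliers = sum(                      # count characters the hypothesis explains
            any(abs(x + dx - u) <= tolerance and abs(y + dy - v) <= tolerance
                for (u, v) in track_boxes)
            for (x, y) in line_boxes)
        if inliers > best_inliers:
            best_translation, best_inliers = (dx, dy), inliers
    return best_translation, best_inliers
```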
  • the rectangular bounding boxes in a track may be relatively stable throughout the analysis of that region of the frames. However, poor video quality may cause inconsistencies in text detection, resulting in characters being broken up or joined together. To address this potential scenario, alternative geometric features that do not depend on how the line is split into individual characters may be used.
  • ascenders are the parts of tall lowercase and uppercase characters that extend above a standard or median height.
  • Examples of ascenders include the letters "b," "h," "A," and "E," where the top portions of the letters extend above the circle portion of the "b," the "n" portion of the "h," the triangle portion of the "A," and the top lines of the "E," for example.
  • Descenders are the parts of characters that extend below a bottom line.
  • Examples of descenders include the letters "p," "g," "j," and "y," where the bottom portions of the letters extend below the circle portion of the "p" and "g," the tail bottom of the "j," and the "v" portion of the "y," for example.
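One example of a geometric feature that does not depend on character segmentation is a coarse vertical ink profile of the whole line, which captures ascender and descender shape directly. A hedged sketch, assuming a grayscale line image where dark pixels are text; the band count and ink threshold are illustrative:

```python
import numpy as np

def line_profile_features(line_image, bands=8):
    """Split-independent features for a text line: the fraction of ink in
    each horizontal band reflects ascender and descender shape without
    relying on how the line was segmented into characters."""
    ink = (line_image < 128).astype(np.float32)      # dark pixels treated as ink
    rows = np.array_split(ink, bands, axis=0)        # top-to-bottom bands
    return np.array([band.mean() for band in rows])  # one value per band
```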
  • the process may further comprise determining (block 307), with the processor (101) executing the track creation module (114), whether a number of conditions are met for a given track to be converted to text.
  • conversion of the text may occur via optical character recognition (OCR).
  • One such condition may include determining (block 307) whether the track has not been matched for more than a set period of frames, and is thus assumed to have expired.
  • the set period of frames may be twenty frames. If a given track's lifetime is relatively short such as, for example, less than ten frames, it may be categorized as noise and may be discarded.
  • Another such condition may include determining (block 307) whether the track is stationary and the track's lifetime has exceeded a pre-set length without change. In this condition, no more significant update is expected to be found within the stationary track. For example, if a headline within a news video feed is stationary within a number of frames, and the number of frames exceeds the pre-set length without a change, then the number of frames, making up the track, may be ready to be converted to text.
  • the track's pre-set length may be user-definable. Further, the track's pre-set length may be based on the number of frames included within the track, the time (e.g., in seconds or portions of seconds) the track includes, or a combination thereof.
  • Another condition may include determining (block 307), in a scrolling track comprising scrolling text, whether the scrolling track is partially covered by each frame and will continuously grow in size as its contents are built up through the generation of additional text presented at either side of the scrolling text. In this condition, if the track's storage buffer has exceeded a preset limit, then the track may be converted to text.
  • Still another condition may include determining (block 307) whether the final frame of a given video input has been received. In this condition, if the last frame has been received, no further processing is required, and the track may be converted to text.
  • Yet another condition may include determining (block 307) whether an amount of computer memory in use has exceeded a predefined limit. In this condition, if the amount of computer memory in use has exceeded a predefined limit, then the system (Fig. 1, 100) may convert the track to text. Processing the track in this manner will free up more memory. In some situations, premature processing of the track may reduce accuracy. However, clearing system memory for use in processing additional frames may take priority over the accuracy that is lost.
  • the predefined amount of computer memory available before this condition is met is user-definable. In another example, the predefined amount of computer memory available before this condition is met is based on the amount of memory available in the system if no data were stored in the memory. In this example, a percentage, for example, of the memory may be defined as the threshold at which the track may be converted to text.
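Taken together, the block 307 conditions amount to a single disjunction. The sketch below reuses the hypothetical Track fields from the earlier sketch; every limit value is an illustrative default, not a figure from the patent.

```python
def ready_for_ocr(track, frame_index, memory_in_use, *, expiry_frames=20,
                  stationary_limit=90, buffer_limit=4096,
                  memory_limit=256 * 2**20, final_frame=False):
    """Return True when any block-307 style condition says the track
    should now be converted to text."""
    expired = frame_index - track.last_matched_frame > expiry_frames
    stationary_done = (track.translation == (0, 0)
                       and track.lifetime > stationary_limit)
    buffer_full = len(track.characters) > buffer_limit  # scrolling-track buffer
    memory_full = memory_in_use > memory_limit          # free memory by converting
    return expired or stationary_done or buffer_full or memory_full or final_frame
```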
  • At block 307, if none of these conditions is met (block 307, determination NO), then the process (300) may loop back to block 301 following indicator "B" to Fig. 3A, where an additional number of frames of a video input are stored (block 301) in a memory and processed per blocks 302 through 311.
  • the processing of block 307 provides for control over the balance between memory usage and speed in processing frames of video input. Generally, higher storage limits within the memory will reduce the frequency of OCR runs.
  • if one of these conditions is met, then the process (300) may continue to block 308. In another example, if all of these conditions are met, then the process (300) may continue to block 308. In still another example, if a number of these conditions are met, then the process (300) may continue to block 308. In yet another example, if any combination of these conditions is met, then the process (300) may continue to block 308.
  • the process may continue with combining (block 308), with the processor (Fig. 1, 101) executing the track creation module (114), a number of tracks to create a track image.
  • the track image is created by combining line fragments that have been appended to a given track. Line fragments are parts of the frames that are potential text. These line fragments may be the result of block 303 where potential text is identified (block 303) within the frames.
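Combining the fragments can be as simple as pasting each one at its known offset and averaging where fragments overlap, which is also where the noise-suppressing averaging described earlier takes place. A minimal sketch, assuming grayscale NumPy fragments and known x offsets that fit within the output canvas:

```python
import numpy as np

def build_track_image(fragments, offsets, width, height):
    """Average aligned line fragments into one track image; averaging over
    many frames suppresses compression artifacts and other noise."""
    accum = np.zeros((height, width), dtype=np.float32)
    count = np.zeros((height, width), dtype=np.float32)
    for fragment, x in zip(fragments, offsets):
        h, w = fragment.shape  # each fragment must fit within the canvas
        accum[:h, x:x + w] += fragment
        count[:h, x:x + w] += 1
    return (accum / np.maximum(count, 1)).astype(np.uint8)
```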
  • the process may continue with the processor (Fig. 1, 101), executing the optical character recognition module (111), extracting (block 309) a number of characters from the track image and converting them into a convenient character encoding such as the American Standard Code for Information Interchange (ASCII) format.
  • the converted text may then be searched and analyzed by other text-based applications.
  • the present systems and methods convert human-readable text images presented in a number of video frames of a video input into computer-readable data.
  • the converted text is presented (block 310) to a user.
  • the results of the text conversion may be filtered to further reduce errors. For example, text lines containing dots and hyphens may be rejected as noise.
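Such a filter can be a short post-processing pass over the OCR output. The sketch below rejects separator-only lines and lines with too few alphanumeric characters; the pattern and threshold are illustrative choices, not the patent's rules.

```python
import re

NOISE_PATTERN = re.compile(r"^[\s.\-_~:|]*$")  # lines of dots, hyphens, and the like

def filter_results(lines, min_letters=2):
    """Drop OCR result lines that look like noise rather than text."""
    kept = []
    for line in lines:
        letters = sum(ch.isalnum() for ch in line)
        if NOISE_PATTERN.match(line) or letters < min_letters:
            continue  # reject noise such as "....." or "- - -"
        kept.append(line)
    return kept
```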
  • the user may then use the converted text in a variety of ways including, for example, creation of a text document, general reading, research, and creation of transcripts, among other uses.
  • the processor (101) frees (block 311) the memory storage. As alluded to above in connection with the execution of blocks 302 and 303, memory storage that is no longer required for processing is freed. In one example, the memory may be freed (block 311) such that all storage related to the analyzed and converted track is freed.
  • the image data associated with the analyzed track is discarded after conversion of the track to text through the OCR process.
  • the character features are kept to detect future duplicates until the track expires.
  • the image data and character features from the newest frame of the track are retained, while the image data and character features from the remainder of the frames are discarded.
  • blocks 304 through 312 may continue to be applied to this retained track.
  • all data stored in the memory is freed irrespective of the data status or type.
  • all the data stored relating to any track, for example, is freed, as opposed to the above example in which only the data relating to a given track is freed.
  • one of the above examples of freeing (block 311) memory storage may be applied as appropriate to a given situation. For example, if a given track is to be further analyzed by a number of the above processes, then the option of freeing all data stored in the memory irrespective of the data status or type would not be appropriate. Further, in another example, if a freed track was the last to appear in the frame history, then the associated frames are discarded from the frame history (Fig. 1, 117) and/or cache (Fig. 1, 118). Thus, the above examples of freeing (block 311) memory storage may be applied to the conditions set forth above in connection with block 307 as appropriate to the given situation.
  • At block 312, it may be determined whether there are additional frames of video input for processing. If there are no additional frames of video input for processing (block 312, determination NO), then the process may terminate. If, however, there are additional frames of video input for processing (block 312, determination YES), then, following the indicator "C," the process may loop back to block 301 of Fig. 3A, where an additional number of frames of a video input are stored (block 301) in a memory and processed per blocks 302 through 311.
  • the computer usable program code may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the computer usable program code, when executed via, for example, the processor (101) of the text extraction computing device (100) or other programmable data processing apparatus, implements the functions or acts specified in the flowchart and/or block diagram block or blocks.
  • the computer usable program code may be embodied within a computer readable storage medium; the computer readable storage medium being part of the computer program product.
  • the computer readable storage medium is a non-transitory computer readable medium.
  • the specification and figures describe a method of extracting text from video comprising, with a processor, storing a number of frames of a video input in a memory, identifying potential text within the frames, and comparing a line of the potential text within a first frame with a number of tracks appearing in a number of recent frames, in which, if the comparing results in no match, a new track is created from the line, and if the comparing results in a match, the line is appended to a best matching track.
  • the method further comprises combining a number of tracks to create a track image, and with an optical character recognition module, extracting a number of characters from the track image.
  • the present systems and methods for extracting text from video may have a number of advantages, including the ability to remove duplicates in order to reduce the amount of processing and conserve data storage space.
  • Another advantage may be the reduction or elimination of noise elements within the video frames that are the result of, for example, lighting problems and video compression, by analyzing and averaging potential text over several frames.
  • Still another advantage may be the increase in accuracy of text recognition through noise reduction.
  • a further advantage may be the provision of automatic, efficient storage management that allows the system to run indefinitely without memory or other resource exhaustion.
  • Yet another advantage is the provision for full text to be built up or accumulated over multiple frames in order to eliminate the need to otherwise produce and analyze text fragments.
  • Still another advantage is the scalability of the present systems and methods in terms of accuracy versus efficiency.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A method of extracting text from video includes, with a processor, storing a number of frames of a video input in a memory, identifying potential text within the frames, and comparing a line of the potential text within a first frame with a number of tracks appearing in a number of recent frames. If comparing the line of the potential text within the first frame with the tracks appearing in the recent frames results in no match, then a new track is created from the line. If comparing the line of the potential text within the first frame with the tracks appearing in the recent frames results in a match, then the line is appended to a best matching track. The method further includes combining a number of tracks to create a track image, and, with an optical character recognition module, extracting a number of characters from the track image.

Description

EXTRACTING TEXT FROM VIDEO
BACKGROUND
[0001] The consumption of information via video images by consumers and other individuals is ever increasing. These video images may be obtained through television programming, movie productions, the Internet, or other venues that provide video regarding a seemingly infinite number of topics. These video images may contain text images displayed in or with the video images in the form of subtitles, closed captioning, film credits, news headlines, advertisements, and other text-related images.
[0002] In some situations, it may be advantageous to extract data relating to displayed text from the images in order to convert the extracted text into a character encoding format. This may be useful in gathering headlines or other portions of news-related text from news feeds, and obtaining text from displayed subtitles and film credits, among other uses. However, this may be difficult due to the large amount of computing resources and data used in extraction of text images from video images.
BRIEF DESCRIPTION OF THE DRAWINGS
[0003] The accompanying drawings illustrate various examples of the principles described herein and are a part of the specification. The illustrated examples are given merely for illustration, and do not limit the scope of the claims.
[0004] Fig. 1 is a diagram of a system for extracting text from video, according to one example of the principles described herein.
[0005] Fig. 2 is a flowchart depicting a method of extracting text from video, according to one example of the principles described herein.
[0006] Figs. 3A and 3B depict a flowchart depicting a method of extracting text from video, according to another example of the principles described herein.
[0007] Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
DETAILED DESCRIPTION
[0008] The present description, therefore, describes systems, methods, and computer program products to extract text from video. Due to the high frequency at which video frames appear, the majority of text in each frame may be duplicated from previous frames. Processes or sub-processes dealing with optical character recognition (OCR) may be the slowest part of an overall text extraction process. Performing an OCR process once per unique portion of text identified within a video feed will result in an increase in the speed of text extraction. In some scenarios, an OCR analysis is to be done in real time, such as in a scenario in which a live television stream is the video input. Thus, the speed of a system in extracting text from a video feed is a consideration.
[0009] Further, raw input from any single video frame may be compromised by a number of factors such as, for example, signal quality in analogue broadcasts or compression loss in digital online videos. Frame-by-frame OCR suffers in accuracy from the low input quality. The present systems and methods comprise a correction mechanism that averages images of text over multiple frames. Using this averaging technique, the effects of video compression artifacts and other noise are reduced or eliminated.
[0010] Further, the present systems and methods automatically combine duplicate text from consecutive frames into a single result. This eliminates additional post-processing that may otherwise occur in connection with the duplicate text. In combining duplicate text from consecutive frames into a single result, a matching process is utilized. The matching process provides robustness against OCR errors where inconsistencies across a number of frames are resolved by, for example, a majority, and errors are automatically excluded. This process adds another layer to the intrinsic accuracy advantage of the present holistic treatment of the video as a whole, as opposed to processing the frames as separate images.
[0011] Further, the present systems and methods provide automatic, efficient storage management that allows the system to run indefinitely without memory or other resource exhaustion. This is advantageous in use cases where live video, which may run for an indeterminate duration, is input into the system. In one example, the balance between storage requirements and processing speed is user-adjustable or user-definable to accommodate for the user's requirements. Increasing storage limits directly reduces the frequency of OCR runs, which is a time-limiting factor in these processes.
[0012] Still further, in some video feeds, the text may be placed on a static, uniform background. This provides a viewer with the ability to better view the text as it stands out from the remainder of the video. In other examples of video feeds, text is superimposed onto a moving, filmed background image where the text itself is static over several frames. Another example of video feeds may include text superimposed onto a moving, filmed background image where the text is scrolling text, moving at a slow, uniform speed across a number of frames. Measuring the changes in the image over multiple frames provides an effective way of determining which parts of these backgrounds comprise text. In one example, any rapidly changing part of the video feed may be eliminated from further analysis, as this type of textual display within the video feed is either unintentional or an error. This increases the speed of processing, and reduces the chance of incorrectly identifying background as superimposed text.
[0013] Scrolling text may also be incorporated into the video feed as, for example, a part of television news broadcasts. The present systems and methods provide for full text to be built up or accumulated over multiple frames. Doing so eliminates the need to otherwise produce and analyze sentence fragments. If frames are analyzed individually, incorrect results will be produced for characters of the scrolling text that are located partially off the edge of the video frame. In addition, applying an OCR process to words or characters within text that are located partially off a video frame results in less accurate capture of text, as dictionary spell-checking used in OCR processes will be ineffective at reducing character recognition errors.
[0014] The present systems and methods are scalable in terms of accuracy versus efficiency. This is achieved by adjusting the feature set of each potential text character. Every character may be represented by a combination of graphical and statistical features. The number of graphical and statistical features may vary depending on the particular requirements. Users with high-quality video feeds may use a minimal feature set to capitalize on speed and storage gains, while users with more demanding input, or who require higher confidence, can extend the feature set to maximize accuracy.
[0015] As used in the present specification and in the appended claims, the term "a number of" or similar language is meant to be understood broadly as any positive number comprising 1 to infinity; zero not being a number, but the absence of a number.
[0016] In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present systems and methods. It will be apparent, however, to one skilled in the art that the present apparatus, systems and methods may be practiced without these specific details. Reference in the specification to "an example" or similar language means that a particular feature, structure, or characteristic described in connection with that example is included as described, but may not be included in other examples.
[0017] Turning now to the figures, Fig. 1 is a diagram of a computing system (100) for extracting text from video, according to one example of the principles described herein. The system (100) may be implemented in an electronic device. Examples of electronic devices include servers, desktop computers, laptop computers, personal digital assistants (PDAs), mobile devices, smartphones, gaming systems, and tablets, among other electronic devices.
[0018] The system (100) may be utilized in any data processing scenario including stand-alone hardware, mobile applications, through a computing network, at least one of the above, or combinations thereof. Further, the system (100) may be used in a computing network, a public cloud network, a private cloud network, a hybrid cloud network, other forms of networks, at least one of the above, or combinations thereof. In one example, the methods provided by the system (100) are provided as a service over a network by, for example, a third party. In this example, the service may comprise, for example, the following: a Software as a Service (SaaS) hosting a number of applications; a Platform as a Service (PaaS) hosting a computing platform comprising, for example, operating systems, hardware, and storage, among others; an Infrastructure as a Service (IaaS) hosting equipment such as, for example, servers, storage components, and network components, among others; an application program interface (API) as a service (APIaaS); other forms of network services; at least one of the above; or combinations thereof. The present systems may be implemented on one or multiple hardware platforms, in which the modules in the system can be executed on one or across multiple platforms. Such modules can run on various forms of cloud technologies and hybrid cloud technologies, or be offered as a SaaS (Software as a Service) that can be implemented on or off the cloud. In another example, the methods provided by the system (100) are executed by a local administrator.
[0019] To achieve its desired functionality, the system (100) comprises various hardware components. Among these hardware components may be a number of processors (101), a number of data storage devices (102), a number of peripheral device adapters (104), and a number of network adapters (103). These hardware components may be interconnected through the use of a number of busses and/or network connections. In one example, the processor (101), data storage device (102), peripheral device adapters (104), and a network adapter (103) may be communicatively coupled via a bus (105).
[0020] The processor (101) may include the hardware architecture to retrieve executable code from the data storage device (102) and execute the executable code. The executable code may, when executed by the processor (101), cause the processor (101) to implement at least the functionality of extracting text from video input, according to the methods of the present specification described herein. In the course of executing code, the processor (101) may receive input from and provide output to a number of the remaining hardware units.
[0021] The data storage device (102) may store data such as executable program code that is executed by the processor (101) or other processing device. As will be discussed, the data storage device (102) may specifically store computer code representing a number of applications that the processor (101) executes to implement at least the functionality described herein.
[0022] The data storage device (102) may include various types of memory modules, including volatile and nonvolatile memory. For example, the data storage device (102) of the present example includes Random Access Memory (RAM) (106), Read Only Memory (ROM) (107), and Hard Disk Drive (HDD) memory (108). Many other types of memory may also be utilized, and the present specification contemplates the use of many varying type(s) of memory in the data storage device (102) as may suit a particular application of the principles described herein. In certain examples, different types of memory in the data storage device (102) may be used for different data storage needs. For example, in certain examples the processor (101) may boot from Read Only Memory (ROM) (107), maintain nonvolatile storage in the Hard Disk Drive (HDD) memory (108), and execute program code stored in Random Access Memory (RAM) (106).
[0023] Generally, the data storage device (102) may comprise a computer readable medium, a computer readable storage medium, or a non-transitory computer readable medium, among others. For example, the data storage device (102) may be, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of the computer readable storage medium may include, for example, the following: an electrical connection having a number of wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store computer usable program code for use by or in connection with an instruction execution system, apparatus, or device. In another example, a computer readable storage medium may be any non-transitory medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
[0024] The hardware adapters (103, 104) in the system (100) enable the processor (101) to interface with various other hardware elements, external and internal to the system (100). For example, the peripheral device adapters (104) may provide an interface to input/output devices such as, for example, a display device (109), a mouse, or a keyboard. The peripheral device adapters (104) may also provide access to other external devices such as an external storage device, a number of network devices such as, for example, servers, switches, and routers, client devices, other types of computing devices, and combinations thereof.
[0025] The display device (109) may be provided to allow a user of the system (100) to interact with and implement the functionality of the system (100). The peripheral device adapters (104) may also create an interface between the processor (101) and the display device (109), a printer, or other media output devices. The network adapter (103) may provide an interface to other computing devices within, for example, a network, thereby enabling the transmission of data between the system (100) and other devices located within the network.
[0026] The system (100) may, when executed by the processor (101), display the number of graphical user interfaces (GUIs) on the display device (109) associated with the executable program code representing the number of applications stored on the data storage device (102). The GUIs may display, for example, user-interactive commands and text extracted from the video input. Additionally, by making a number of interactive gestures on the GUIs of the display device (109), a user may set a number of predefined aspects of the present systems and methods to adjust their operation. Examples of display devices (109) include a computer screen, a laptop screen, a mobile device screen, a personal digital assistant (PDA) screen, and a tablet screen, among other display devices (109). Examples of the GUIs displayed on the display device (109) will be described in more detail below.
[0027] The system (100) further comprises a number of modules used in the implementation of text extraction from video input. The various modules within the system (100) comprise executable program code that may be executed separately. In this example, the various modules may be stored as separate computer program products. In another example, the various modules within the system (100) may be combined within a number of computer program products; each computer program product comprising a number of the modules.
[0028] The system (100) may include a potential text module (112) to, when executed by the processor (101), detect a number of regions in the video frames in which text is located. The text located within these regions may be referred to as text images. As used in the present specification and in the appended claims, the term "text image" or similar language is meant to be understood broadly as a visual image of text within a frame of a video input. Text images may be images displayed in or with the video images in the form of subtitles, closed captioning, film credits, news headlines, advertisements, information boxes on shopping channels, statistical boxes in sports coverage, and captions including names of interviewees, among many other text-related images.
[0029] In one example, the potential text module (112) bounds a region containing the text and determines the text's geometric position within the frames using, for example, a rectangular bounding box. In one example, the potential text module (112) does not obtain or store an actual image of the region bounded by the rectangular bounding box, but, instead, provides a reference to the relevant frame and coordinates within the frame to identify the regions in which text is located. In this example, the system refrains from storing actual images that may consume a large amount of storage space and computing resources.
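A minimal sketch of this "reference, not image" arrangement follows; the class and field names are illustrative assumptions, not part of the original disclosure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TextRegionRef:
    frame_index: int  # index into the buffer of stored frames
    left: int
    top: int
    right: int
    bottom: int

    def crop(self, frames):
        """Resolve the reference against the frame buffer only when needed."""
        frame = frames[self.frame_index]
        return [row[self.left:self.right] for row in frame[self.top:self.bottom]]
```

Because only five integers are stored per region, the pixel data is materialized only at the moment a downstream step, such as OCR, actually needs it.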
[0030] The system (100) may include a potential text matching module (113) to, when executed by the processor (101), compare a line of the potential text within a first frame with a number of tracks appearing in a number of recent frames. The potential text matching module (113) further creates a new track if comparing the line of the potential text within the first frame with the tracks appearing in the recent frames results in no match. The potential text matching module (113) further appends a line to a best matching track that already exists if comparing the line of the potential text within the first frame with the tracks appearing in the recent frames results in a match.
[0031] The system (100) may include a track creation module (114) to, when executed by the processor (101), combine a number of tracks to create a track image and determine whether a number of conditions are met for a given track to be converted to text. Using optical character recognition (OCR) on a video frame allows the text in it to be converted to a computer-readable character encoding, such as the American Standard Code for Information Interchange (ASCII) format. Other examples of character encodings include the Universal Character Set (UCS) and Unicode Transformation Formats (UTF) such as UTF-8. The converted text may then be searched and analyzed by other text-based applications. The system (100) may include modules in addition to those described above to bring about the processes described herein.
[0032] Fig. 2 is a flowchart depicting a method (200) of extracting text from video, according to one example of the principles described herein. As depicted in Fig. 2, the method (200) may begin by storing (block 201) a number of frames of a video input in a memory. Video may be any video data presented to the system (100) via the video input (Fig. 1, 110), and stored in, for example, the cache (Fig. 1, 118) associated with the processor (101) of the system (100). In another example, the cache may be the RAM (Fig. 1, 106). The size of the memory may vary depending on the number of frames of video data to be stored in the memory. The number of frames of video data to be stored in the memory may depend on the number of frames to be compared.
[0033] The video may be any form of video data which the video input (Fig. 1 , 1 10) obtains from a source. In one example, the video may comprise data stored in a separate data storage device from which the video input (Fig. 1 , 1 10) obtains the video. In another example, the video may be streaming video. In still another example, the video may be live video obtained from a live television broadcast. The video may be interlaced video or progressive video. Further, the video may be in analog or digital format. In still another example, the video may be a combination of the above forms of video.
[0034] The method of Fig. 2 may further comprise identifying (block 202) potential text within the frames. As used in the present specification and in the appended claims, the term "frame" is meant to be understood broadly as any still image among a number of still images that, when displayed in sequence, produce the illusion of a moving image. In one example, each frame may be flashed on a screen for a short time, and then immediately replaced by the next one. Such a display of frames creates a video display. In one example, the frames may be represented as analog waveforms in which varying voltages represent the intensity of light in an analog raster scan across the screen. In this example, analog blanking intervals may separate video frames in the same way that frame lines do in film. In another example, the frames may be represented in digital format in which the video system represents the video frame as a rectangular raster of pixels. Video frames may be identified using SMPTE time codes that contain binary coded decimal "hour:minute:second:frame" identification. Time codes may use a number of frame rates such as, for example, 24 frames per second (fps), 25 fps, 29.97 (30 ÷ 1.001) fps, and 30 fps.
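As a hedged illustration of the time code format described above, the following sketch converts a frame number to an "hour:minute:second:frame" identifier at an integer frame rate. Drop-frame rates such as 29.97 fps require the more involved drop-frame algorithm and are deliberately not handled here.

```python
def frame_to_timecode(frame_number: int, fps: int = 25) -> str:
    """Render a frame number as hh:mm:ss:ff at an integer frame rate."""
    frames = frame_number % fps
    total_seconds = frame_number // fps
    seconds = total_seconds % 60
    minutes = (total_seconds // 60) % 60
    hours = total_seconds // 3600
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}:{frames:02d}"

# frame_to_timecode(90000, fps=25) -> "01:00:00:00"
```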
[0035] Potential text may be identified (block 202) using image regions within each frame of the video feed. In one example, an image region may comprise a single, potential text character. The image region may be defined by a rectangular bounding box locating a given image region's geometric position within a given frame. In one example, in order to reduce storage use, the identification (block 202) of the image regions does not store an actual image, but, instead, comprises a reference to the relevant frame and the geometric position within that frame.
[0036] In identifying (block 202) potential text within each frame of the video feed, it may be advantageous to track text over multiple frames. To do so, a number of features of potential characters in the text are extracted and matched against characters in other frames. In one example, in order to increase speed and minimize storage, these features are relatively simpler than those used in an OCR process.
[0037] In one example in which Latin-based alphabets make up the target potential text, the rectangular bounding box of the potential text character, characters, words, phrases, sentences, paragraphs, other groupings of text, at least one of the above, or combinations thereof may be sufficient for identifying (block 202) the potential text. This is because Latin-based alphabets have relatively large variations between the dimensions of different characters. In another example, East Asian scripts such as Chinese, Korean, and Japanese, or other scripts such as Hindi, among many other alphabets, may be the target potential text. These scripts consist mostly of characters of very similar dimensions. In this situation, it may be advantageous to match extra features such as foreground-background ratios or component counts to improve matching accuracy, even at the cost of extra processing time.
[0038] In another example, potential text regions may comprise potential words or word fragments, rather than the single characters described above. This will, for example, be more convenient in Arabic-based languages where letters are generally joined together. Breaking such words into individual letters may require additional analysis which is inconvenient, introduces the potential for inaccuracies in the process, and may not actually be necessary for the methods described herein.
[0039] Thus, text images may be text that is superimposed onto a video image, or images of text within the video images themselves as captured by a video capture device such as a video camera, including street signs, protest signs, and other text captured in the video.
[0040] The process of Fig. 2 may further comprise comparing (block 203) a line of the potential text within a first frame with a number of tracks appearing in a number of recent frames. As used in the present specification and in the appended claims, the term "line" is meant to be understood broadly as a sequence of potential characters in the same frame with their corresponding character features. These lines are produced by the potential text identification of block 202 described above. Further, as used in the present specification and in the appended claims, the term "track" is meant to be understood broadly as a continuous body of text corresponding to lines appearing in multiple frames.
[0041] As mentioned above, the text in a track may comprise stationary text such as text found in news headlines and channel logos. In another example, the text in a track may comprise horizontally scrolling text such as text found in breaking news feeds, tickers and in the display of stock prices. In still another example, the text in a track may comprise vertically scrolling text such as text found in film credits displayed at the beginning or end of a movie production or similar scenarios. In yet another example, the text in a track may comprise text that jumps up line by line such as text found in computer-generated subtitles. An objective of the present systems and methods is to assign or associate each line to a unique track, while identifying duplicated characters in the process.
[0042] A track may be a sequence of potential characters that undergo a geometric translation across different frames. In each frame, the translation vector is the same for all characters in the same track. In practice, it may be that most tracks have constant translations. However, the present systems and methods allow the translation vector to change over time. Stationary text displayed in a video frame is a particular case where the translation vector equals zero.
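The track model of the preceding paragraph can be illustrated by a small consistency check: all characters of one track must agree on a single per-frame translation vector, with the zero vector covering stationary text. This sketch assumes characters are already paired between frames by their order; that pairing, and the tolerance value, are assumptions for exposition.

```python
def common_translation(prev_boxes, curr_boxes, tolerance: int = 1):
    """Return the (dx, dy) shared by all paired character boxes, or None.

    prev_boxes, curr_boxes: lists of (x, y) top-left corners, paired by index.
    """
    if not prev_boxes or len(prev_boxes) != len(curr_boxes):
        return None
    dx = curr_boxes[0][0] - prev_boxes[0][0]
    dy = curr_boxes[0][1] - prev_boxes[0][1]
    for (px, py), (cx, cy) in zip(prev_boxes, curr_boxes):
        if abs((cx - px) - dx) > tolerance or abs((cy - py) - dy) > tolerance:
            return None  # characters disagree: not a single rigid track
    return dx, dy  # (0, 0) indicates stationary text
```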
[0043] If comparing the line of the potential text within the first frame with the tracks appearing in the recent frames results in no match (block 203, determination NO), then a new track is created (block 204) from the line. If, however, comparing the line of the potential text within the first frame with the tracks appearing in the recent frames results in a match (block 203, determination YES), then the line is appended (block 205) to a best matching track that already exists. Block 203 reduces redundancy by removing duplicates of potential text that has already been identified. This reduces or eliminates the need to process frames, or regions of text within the frames, a plurality of times.
[0044] The method of Fig. 2 may then continue with combining (block 206) a number of tracks to create a track image, and, with an optical character recognition module, extracting (block 207) a number of characters from the track image. The method of Fig. 2 will now be described in more detail in connection with Figs. 3A and 3B.
[0045] Figs. 3A and 3B depict a flowchart depicting a method (300) of extracting text from video, according to another example of the principles described herein. As depicted in Fig. 3A, the method (300) may begin by storing (block 301) a number of frames of a video input in a memory as described above in connection with block 201 of Fig. 2. In one example, the number of frames stored (block 301) by the system is user-definable. In this example, a user may choose to have more frames stored when processing video with a high frame rate, whereas the user may choose to have relatively fewer frames stored when processing video with a relatively lower frame rate. Thus, if the input video comprises a frame rate of 30 fps, then a second's worth of video input would provide 30 frames for storage. In this example, if a user desires to capture text that appears within the frames of the video for one second, for example, then the user would set the threshold for the number of frames to store at 30 or more frames for the 30 fps video input. Similarly, if the user desires to capture text that appears within the frames of the video for half a second, for example, then the user would set the threshold for the number of frames to store at 15 or more frames for the 30 fps video input.
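The sizing rule described above amounts to simple arithmetic, sketched here for illustration; the function name is an assumption.

```python
import math

def frames_to_store(fps: float, dwell_seconds: float) -> int:
    """Minimum frames to buffer so text visible for dwell_seconds is captured."""
    return math.ceil(fps * dwell_seconds)

# frames_to_store(30, 1.0) -> 30; frames_to_store(30, 0.5) -> 15
```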
[0046] In another example, the number of frames stored (block 301) by the system is two or more. In this example, as few as two frames may be compared, as will be described in more detail below in connection with block 302. In another example, the number of frames stored (block 301) by the system may be based on the amount of storage available. In still another example, the number of frames stored (block 301) by the system may be based on the frame rate of the video feed. In still another example, if the analysis of block 302 is not applied within the overall method (300) of Fig. 3, then the number of frames stored (block 301) by the system may be one. In yet another example, the number of frames stored (block 301) by the system may be based on a combination of the above.
[0047] The method of Fig. 3 may further comprise comparing (block 302) consecutive video frames to eliminate regions of the frames as potential text based on the variation of those regions between frames. The system (100) may eliminate regions of frames as not containing potential text based on how these regions vary among consecutive frames. Thus, the system eliminates non-text regions. In one example, a threshold may be set to determine what type of variation disqualifies a region among a number of frames as potential text. For example, a threshold may be set such that, if the region experiences a variation every two frames, then the region is disqualified as containing potential text. In this example, the threshold would be set at every 1/15th of a second in a 30 fps video feed. Other thresholds may be longer or shorter. In one example, this frame variation threshold is user-definable.
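A minimal sketch of this elimination step follows, assuming grayscale frames held as NumPy arrays. The mean-absolute-difference metric and its threshold are illustrative assumptions; the specification leaves the precise variation measure open.

```python
import numpy as np

def is_stable_region(frames, box, max_mean_diff: float = 8.0) -> bool:
    """Keep a region as potential text only if it varies little across frames.

    frames: list of 2-D grayscale arrays; box: (top, bottom, left, right).
    """
    top, bottom, left, right = box
    crops = [f[top:bottom, left:right].astype(np.float32) for f in frames]
    diffs = [np.abs(a - b).mean() for a, b in zip(crops, crops[1:])]
    # A large frame-to-frame difference suggests motion or noise, not text.
    return max(diffs, default=0.0) <= max_mean_diff
```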
[0048] In connection with block 302, the video feed may comprise scrolling text. Scrolling text is sometimes referred to as a ticker, a crawler, or a slide. The scrolling text may be presented in, for example, a lower third of the video frames, and may be used to present headlines, minor pieces of news, and stock values, among other information. In the use case of scrolling text, the text scrolled across consecutive frames may be detected as translations of parts of an image at a generally constant speed. If the language being used in the video frames is known, the expected direction of horizontal scrolling, right-to-left for Latin- or Cyrillic-based alphabets and left-to-right for Arabic-based alphabets, may be used to eliminate additional non-text regions as potential text.
[0049] The method of Fig. 3 may further comprise identifying (block 303), with the processor (101) executing the potential text module (Fig. 1, 112), potential text within the frames as described above in connection with block 202 of Fig. 2. Potential text may be identified (block 303) using image regions within each frame of the video feed. In addition to the description of block 202 above, a number of sub-processes may be used to identify (block 303) potential text within a number of frames. In a first sub-process, the processor (101), executing the potential text module (Fig. 1, 112), identifies parts of the image with character-like characteristics, such as lines with uniform thicknesses or shapes with the same level of complexity as a character in an alphabet, using, for example, pen stroke width uniformity, pen stroke angle distribution, vertical pixel density, horizontal pixel density, edge detection, among others, at least one of the above, or combinations thereof.
[0050] In a second sub-process, the processor (101), executing the potential text module (Fig. 1, 112), may identify regions of uniform color within the number of frames. In most cases, the characters in a line of text will all have the same color. These lines of text contrast with the video background and are easily identifiable for this reason, as the uniform letter color may be identified against arbitrary background images. In another example, or in addition to the above, the sub-process may also use a uniform background behind the text to identify potential text regions.
[0051] In one example, the least computationally expensive sub-process of block 303, as between the identification of parts of the image with character-like characteristics and the identification of regions of uniform color within the number of frames, may be performed first. In this manner, less data will need to be analyzed by the remaining sub-processes. This will result in an increase in overall processing speed. In one example, the identification of parts of the image with character-like characteristics is relatively less computationally expensive than the identification of regions of uniform color within the number of frames. In another example, other sub-processes to identify (block 303) potential text within the frames may be used alone or in combination with the above-described sub-processes.
[0052] In one example, blocks 302 and 303 may be performed in any order, including the identification of parts of the image with character-like characteristics and the identification of regions of uniform color within the number of frames described above in connection with block 303. In this example, the processor (101), executing the potential text module (Fig. 1, 112), may determine which process or sub-process will result in the most effective or useful data, and perform that process or sub-process first.
[0053] The process of Fig. 3 may further comprise comparing (block 304), with the processor (101) executing the potential text matching module (113), a line of the potential text within a first frame with a number of tracks appearing in a number of recent frames. If comparing the line of the potential text within the first frame with the tracks appearing in the recent frames results in no match (block 304, determination NO), then a new track is created (block 305) from the line. If, however, comparing the line of the potential text within the first frame with the tracks appearing in the recent frames results in a match (block 304, determination YES), then the line is appended (block 306) to a best matching track that already exists. In one example, the regions of the number of frames corresponding to the potential characters may be saved with the tracks in, for example, the data storage device (Fig. 1, 102). In this manner, any frames processed and no longer needed after the analysis associated with blocks 302 or 303 are deleted from the memory, such as the cache (Fig. 1, 118), deleted from the data storage device (Fig. 1, 102), or otherwise not stored, and are discarded. Otherwise, previous frames corresponding to the potential characters are stored in, for example, the frame history (Fig. 1, 117) of the data storage device (Fig. 1, 102). These frames may be retained until they are no longer needed, as determined by block 311 described below.
[0054] With regard to blocks 304 and 306, to determine which track to add a line to, a sequence of character features for each line is analyzed by the processor (101) executing the potential text matching module (113). The processor (101) identifies a geometric translation that maps the sequence of character features to the track. For scrolling text, a match between an initial segment of the line and a tail segment of the track is identified, since each line contains new characters at the end.
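For the scrolling case, the head-of-line to tail-of-track alignment described above can be sketched as follows. Character features are abstracted here to directly comparable tokens, which is a simplification for exposition; the minimum-overlap parameter is likewise an assumption.

```python
def match_scrolling(track_feats, line_feats, min_overlap: int = 3) -> int:
    """Return the longest overlap where the line's head matches the track's tail."""
    limit = min(len(track_feats), len(line_feats))
    for k in range(limit, min_overlap - 1, -1):
        if track_feats[-k:] == line_feats[:k]:
            return k
    return 0  # no match: the caller creates a new track instead

def append_line(track_feats, line_feats, overlap: int):
    """Append only the genuinely new trailing characters to the track."""
    return track_feats + line_feats[overlap:]
```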
[0055] In one example, the Random Sample Consensus (RANSAC) algorithm may be utilized. The RANSAC algorithm offers high resistance to variations in character features due to noise. In most cases, however, relatively simpler and, hence, faster matching algorithms may be sufficient. In another example, the parameter space of the frames may be searched in order to find a correct translation. This may be advantageous in situations where the lines within the frames are short.

[0056] In another example, the rectangular bounding boxes in a track may be relatively stable throughout the analysis of that region of the frames. However, poor video quality may cause inconsistencies in text detection, resulting in characters being broken up or joined together. To address this potential scenario, alternative geometric features that do not depend on how the line is split into individual characters may be used; for example, the position of ascenders and descenders in the line, or the variation of the distribution of stroke angles, such as where the letter "W" is more diagonal in nature than the letter "H," distinguishing the corresponding parts of the line in this manner. In typography, ascenders are the parts of tall lowercase and uppercase characters that extend above a standard or median height. Examples of ascenders include the letters "b," "h," "A," and "E," where the top portions of the letters extend above the circle portion of the "b," the "n" portion of the "h," the triangle portion of the "A," and the top lines of the "E," for example. Descenders are the parts of characters that extend below a bottom line. Examples of descenders include the letters "p," "g," "j," and "y," where the bottom portions of the letters extend below the circle portion of the "p" and "g," the tail bottom of the "j," and the "v" portion of the "y," for example.
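A RANSAC-flavoured sketch of the robust translation estimate mentioned in paragraph [0055] is given below: sample one candidate character pairing, hypothesize a translation, and keep the hypothesis with the most inliers. Character features are reduced here to 2-D positions, and the iteration count and tolerance are illustrative assumptions.

```python
import random

def ransac_translation(src_pts, dst_pts, iters: int = 50, tol: float = 2.0):
    """Estimate a (dx, dy) that maps many src_pts onto dst_pts despite noise."""
    if not src_pts or not dst_pts:
        return None, 0
    best_t, best_inliers = None, 0
    for _ in range(iters):
        sx, sy = random.choice(src_pts)   # hypothesize one pairing
        tx, ty = random.choice(dst_pts)
        dx, dy = tx - sx, ty - sy
        inliers = sum(
            1 for (px, py) in src_pts
            if any(abs(px + dx - qx) <= tol and abs(py + dy - qy) <= tol
                   for (qx, qy) in dst_pts)
        )
        if inliers > best_inliers:
            best_t, best_inliers = (dx, dy), inliers
    return best_t, best_inliers
```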
[0057] Following the indicator "A" to Fig. 3B, the process may further comprise determining (block 307), with the processor (101) executing the track creation module (114), whether a number of conditions are met for a given track to be converted to text. In one example, conversion of the text may occur via optical character recognition (OCR). One such condition may include determining (block 307) whether the track has not been matched for more than a set period of frames, and is thus assumed to have expired. In one example, the set period of frames may be twenty frames. If a given track's lifetime is relatively short such as, for example, less than ten frames, it may be categorized as noise and may be discarded.
[0058] Another such condition may include determining (block 307) whether the track is stationary and the track's lifetime has exceeded a pre-set length without change. In this condition, no more significant update is expected to be found within the stationary track. For example, if a headline within a news video feed is stationary within a number of frames, and the number of frames exceeds the pre-set length without a change, then the number of frames, making up the track, may be ready to be converted to text. The track's pre-set length may be user-definable. Further, the track's pre-set length may be based on the number of frames included within the track, the time (e.g., in seconds or portions of seconds) the track includes, or a combination thereof.
[0059] Another condition may include determining (block 307), in a scrolling track comprising scrolling text, whether the scrolling track is partially covered by each frame and will continuously grow in size as its contents are built up through the generation of additional text presented at either side of the scrolling text. In this condition, if the track's storage buffer has exceeded a preset limit, then the track may be converted to text.
[0060] Still another condition may include determining (block 307) whether the final frame of a given video input has been received. In this condition, if the last frame has been received, no further processing is required, and the track may be converted to text.
[0061] Yet another condition may include determining (block 307) whether an amount of computer memory in use has exceeded a predefined limit. In this condition, if the amount of computer memory in use has exceeded a predefined limit, then the system (Fig. 1, 100) may convert the track to text. Processing the track in this manner will free up more memory. In some situations, premature processing of the track may reduce accuracy. However, clearing of system memory for use in processing of additional frames may take priority over the accuracy that is lost. In one example, the predefined amount of computer memory available before this condition is met is user definable. In another example, the predefined amount of computer memory available before this condition is met is based on the amount of memory available in the system if no data were stored in the memory. In this example, a percentage, for example, of the memory may be defined as the threshold at which the track may be converted to text.
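The conditions described in paragraphs [0057] through [0061] can be summarized in a single predicate, sketched here with assumed field names and thresholds; in this illustration any one condition being met suffices.

```python
from dataclasses import dataclass

@dataclass
class TrackState:
    frames_since_match: int   # frames since the track last matched a line
    is_stationary: bool
    unchanged_frames: int     # lifetime without change, in frames
    buffer_bytes: int         # size of the scrolling track's storage buffer

def ready_for_ocr(t: TrackState, *, expiry: int = 20, stable_limit: int = 60,
                  buffer_limit: int = 1 << 20, last_frame: bool = False,
                  memory_exceeded: bool = False) -> bool:
    return (
        t.frames_since_match > expiry                          # track expired
        or (t.is_stationary and t.unchanged_frames > stable_limit)
        or t.buffer_bytes > buffer_limit                       # buffer full
        or last_frame                                          # end of input
        or memory_exceeded                                     # free memory early
    )
```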
[0062] Turning again to block 307, if none of these conditions are met (block 307, determination NO), then the process (300) may loop back to block 301 following indicator "B" to Fig. 3A, where an additional number of frames of a video input are stored (block 301) in a memory and processed per blocks 302 through 311. The processing of block 307 provides for control over the balance between memory usage and speed in processing frames of video input. Generally, higher storage limits within the memory will reduce the frequency of OCR runs.
[0063] In one example, if any one of these conditions is met, then the process (300) may continue to block 308. In another example, if all of these conditions are met, then the process (300) may continue to block 308. In still another example, if a number of these conditions are met, then the process (300) may continue to block 308. In yet another example, if any combination of these conditions is met, then the process (300) may continue to block 308.
[0064] If these conditions are met (block 307, determination YES), then the process may continue with combining (block 308), with the processor (Fig. 1, 101) executing the track creation module (114), a number of tracks to create a track image. In one example, the track image is created by combining line fragments that have been appended to a given track. Line fragments are parts of the frames that are potential text. These line fragments may be the result of block 303, where potential text is identified (block 303) within the frames.
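A minimal sketch of building the track image follows, reflecting the averaging over multiple frames described earlier in this specification: aligned line fragments from several frames are averaged into one image, damping compression artifacts and noise. The fragments are assumed to be pre-aligned and equal-sized, which is a simplification.

```python
import numpy as np

def build_track_image(fragments):
    """Average equal-shaped 2-D grayscale fragments of one track into one image."""
    stack = np.stack([f.astype(np.float32) for f in fragments])
    return stack.mean(axis=0).astype(np.uint8)
```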
[0065] The process may continue with the processor (Fig. 1, 101), executing the optical character recognition module (111), extracting (block 309) a number of characters from the track image. Using optical character recognition (OCR) on a video frame allows the text in it to be converted to a convenient character encoding, such as the American Standard Code for Information Interchange (ASCII) format. The converted text may then be searched and analyzed by other text-based applications. In this manner, the present systems and methods convert human-readable text images presented in a number of video frames of a video input into computer-readable data.
[0066] The converted text is presented (block 310) to a user. In one example, the results of the text conversion (block 310) may be filtered to further reduce errors. For example, text lines containing dots and hyphens may be rejected as noise. The user may then use the converted text in a variety of ways including, for example, creation of a text document, general reading, research, and creation of transcripts, among other uses.
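A small post-filter of the kind described above might look like the following sketch; the exact rejection rule (lines consisting only of dots, hyphens, and whitespace) is an assumption for illustration.

```python
import re

NOISE_LINE = re.compile(r"^[\s.\-]*$")

def filter_ocr_lines(lines):
    """Drop converted lines that consist only of dots, hyphens, or whitespace."""
    return [line for line in lines if not NOISE_LINE.match(line)]
```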
[0067] The processor (101) frees (block 311) the memory storage. As alluded to above in connection with the execution of blocks 302 and 303, memory storage that is no longer required for processing is freed. In one example, the memory may be freed (block 311) such that all storage related to the analyzed and converted track is freed.
[0068] In another example, the image data associated with the analyzed track is discarded after conversion of the track to text through the OCR process. In this example, the character features are kept to detect future duplicates until the track expires.
[0069] In still another example, the image data and character features from the newest frame of the track are retained, while the image data and character features from the remainder of the frames are discarded. Thus, blocks 304 through 312 may continue to be applied to this retained track.
[0070] In yet another example, all data stored in the memory is freed irrespective of the data status or type. In this example, all the data stored relating to any track is freed, as opposed to the above examples in which only data stored relating to a given track is freed.
[0071] In one example, one of the above examples of freeing (block 311) memory storage may be applied as appropriate to a given situation. For example, if a given track is to be further analyzed by a number of the above processes, then the option to free all data stored in the memory, irrespective of the data status or type, would not be appropriate. Further, in another example, if a freed track was the last to appear in the frame history (Fig. 1, 117), then the associated frame is discarded from the frame history (Fig. 1, 117) and/or cache (Fig. 1, 118). Thus, the above examples of freeing (block 311) memory storage may be applied to the conditions set forth above in connection with block 307 as appropriate to the given situation.
[0072] At block 312, it may be determined whether there are additional frames of video input for processing. If there are no additional frames of video input for processing (block 312, determination NO), then the process may terminate. If, however, there are additional frames of video input for processing (block 312, determination YES), then, following the indicator "C," the process may loop back to block 301 of Fig. 3A where an additional number of frames of a video input are stored (block 301) in a memory and processed per blocks 302 through 311.
[0073] Aspects of the present system and method are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to examples of the principles described herein. Each block of the flowchart illustrations and block diagrams, and combinations of blocks in the flowchart illustrations and block diagrams, may be implemented by computer usable program code. The computer usable program code may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the computer usable program code, when executed via, for example, the processor (101) of the text extraction computing device (100) or other programmable data processing apparatus, implements the functions or acts specified in the flowchart and/or block diagram block or blocks. In one example, the computer usable program code may be embodied within a computer readable storage medium; the computer readable storage medium being part of the computer program product. In one example, the computer readable storage medium is a non-transitory computer readable medium.
[0074] The specification and figures describe a method of extracting text from video comprising, with a processor, storing a number of frames of a video input in a memory, identifying potential text within the frames, comparing a line of the potential text within a first frame with a number of tracks appearing in a number of recent frames, in which, if the comparing the line of the potential text within the first frame with the tracks appearing in the recent frames results in no match, then creating a new track from the line, and if the comparing the line of the potential text within the first frame with the tracks appearing in the recent frames results in a match, then appending the line to a best matching track. The method further comprises combining a number of tracks to create a track image, and with an optical character recognition module, extracting a number of characters from the track image.
[0075] The present systems and methods for extracting text from video may have a number of advantages, including the ability to remove duplicates in order to reduce the amount of processing and conserve data storage space. Another advantage may be the reduction or elimination of noise elements within the video frames that result from, for example, lighting problems and video compression, by analyzing and averaging potential text over several frames. Still another advantage may be the increase in accuracy of text recognition through noise reduction. A further advantage may be the provision of automatic, efficient storage management that allows the system to run indefinitely without memory or other resource exhaustion. Yet another advantage is the provision for full text to be built up or accumulated over multiple frames in order to eliminate the need to otherwise produce and analyze text fragments. Still another advantage is the scalability of the present systems and methods in terms of accuracy versus efficiency.
[0076] The preceding description has been presented to illustrate and describe examples of the principles described. This description is not intended to be exhaustive or to limit these principles to any precise form disclosed. Many modifications and variations are possible in light of the above teaching.

Claims

WHAT IS CLAIMED IS:
1. A method of extracting text from video comprising, with a processor: storing a number of frames of a video input in a memory;
identifying potential text within the frames;
comparing a line of the potential text within a first frame with a number of tracks appearing in a number of recent frames, in which:
if the comparing the line of the potential text within the first frame with the tracks appearing in the recent frames results in no match, then creating a new track from the line, and
if the comparing the line of the potential text within the first frame with the tracks appearing in the recent frames results in a match, then appending the line to a best matching track;
combining a number of tracks to create a track image; and
with an optical character recognition module, extracting a number of characters from the track image.
2. The method of claim 1, in which potential text comprises a region of the first frame that comprises a number of text characters bounded by a rectangular bounding box that identifies a geometric position of the text characters within a number of frames.
3. The method of claim 1, in which the number of frames stored in the memory is at least one of user-defined, determined based on a frame rate, and determined based on available storage in the memory.
4. The method of claim 1, in which discovering text within the frames comprises discovering character-like characteristics within each frame of the video input and discovering a number of regions of uniform color, the regions of uniform color comprising text.
5. The method of claim 4, further comprising:
determining which of discovering character-like characteristics within each frame of the video input, and discovering a number of regions of uniform color is less computationally expensive; and
performing the least computationally expensive one first.
6. The method of claim 1, further comprising:
comparing consecutive video frames, and
eliminating regions of frames as potential text based on variation of the frames.
7. The method of claim 1, further comprising:
determining whether a number of conditions are met for a given track to be converted to text; and
if a number of the conditions are met, then combining the tracks to create the track image.
8. The method of claim 1, further comprising freeing memory storage.
9. A system for extracting text from video, comprising:
a memory for storing a number of frames of a video input; and a processor to:
with a potential text module, identify potential text within the frames;
with a potential text matching module, compare a line of the potential text within a first frame with a number of tracks appearing in a number of recent frames, in which:
if the comparing the line of the potential text within the first frame with the tracks appearing in the recent frames results in no match, then creating a new track from the line, and
if the comparing the line of the potential text within the first frame with the tracks appearing in the recent frames results in a match, then appending the line to a best matching track; with a track creation module, combine a number of tracks to create a track image; and
with an optical character recognition module, extract a number of characters from the track image.
10. The system of claim 9, in which an optical character recognition module converts text to a character encoding format.
11. The system of claim 9, in which the video input receives video from a source, in which the video is at least one of streaming video, live video obtained from a live video source, interlaced video, progressive video, analog video, and digital video.
12. The system of claim 9, in which the number of frames of the video input stored in the memory is at least one of user definable, based on the amount of storage available in the memory, and based on the frame rate of the video input.
13. A computer program product for extracting text from video, the computer program product comprising:
a computer readable storage medium comprising computer usable program code embodied therewith, the computer usable program code comprising:
computer usable program code to, when executed by a processor, identify potential text within a number of frames of a video input;
computer usable program code to, when executed by a processor, compare a line of the potential text within a first frame with a number of tracks appearing in a number of recent frames, in which:
if the comparing the line of the potential text within the first frame with the tracks appearing in the recent frames results in no match, then creating a new track from the line, and if the comparing the line of the potential text within the first frame with the tracks appearing in the recent frames results in a match, then appending the line to a best matching track;
computer usable program code to, when executed by a processor, combine a number of tracks to create a track image; and
computer usable program code to, when executed by a processor, extract a number of characters from the track image.
14. The computer program product of claim 13, further comprising:
computer usable program code to, when executed by a processor, compare consecutive video frames, and
computer usable program code to, when executed by a processor, eliminate regions of frames as potential text based on variation of the frames.
15. The computer program product of claim 13, further comprising:
computer usable program code to, when executed by a processor, determine whether a number of conditions are met for a given track to be converted to text; and
computer usable program code to, when executed by a processor, combine the tracks to create the track image if a number of the conditions are met.
Priority Applications (1)
PCT/EP2014/058832 — filed 2014-04-30, priority 2014-04-30 — Extracting text from video

Publications (1)
WO2015165524A1 — published 2015-11-05

Family ID: 50736047

Legal Events

121 — Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 14725024; Country of ref document: EP; Kind code of ref document: A1)
NENP — Non-entry into the national phase (Ref country code: DE)
122 — Ep: PCT application non-entry in European phase (Ref document number: 14725024; Country of ref document: EP; Kind code of ref document: A1)