WO2024039362A1 - Methods and systems for text recognition with image preprocessing - Google Patents

Methods and systems for text recognition with image preprocessing

Info

Publication number
WO2024039362A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
text
text region
generating
region
Prior art date
Application number
PCT/US2022/040331
Other languages
French (fr)
Inventor
Kaiyu ZHANG
Yuan Lin
Original Assignee
Innopeak Technology, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Priority to PCT/US2022/040331 priority Critical patent/WO2024039362A1/en
Publication of WO2024039362A1 publication Critical patent/WO2024039362A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/18Extraction of features or characteristics of the image
    • G06V30/1801Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
    • G06V30/18019Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections by matching or filtering
    • G06V30/18038Biologically-inspired filters, e.g. difference of Gaussians [DoG], Gabor filters
    • G06V30/18048Biologically-inspired filters, e.g. difference of Gaussians [DoG], Gabor filters with interaction between the responses of different filters, e.g. cortical complex cells
    • G06V30/18057Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/146Aligning or centring of the image pick-up or image-field
    • G06V30/147Determination of region of interest
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19147Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Definitions

  • the present invention is directed to text recognition methods and techniques.
  • OCR optical character recognition
  • the input image may be obtained from a scanned document, a photo of a document, a scene photo, or subtitle text superimposed on an image.
  • Optical character recognition is useful for extracting information from images, and its applications include searching, positioning, translation, recommendation, and many others. Over the years, many conventional text recognition systems have been proposed, but they have proven inadequate for the reasons detailed below.
  • the present invention is directed to text recognition methods and techniques.
  • a text region within an image is identified.
  • Feature data are obtained from the text region using a preprocessing model, which is trained using a machine learning process.
  • the text region is enhanced using the feature data before text recognition is performed.
  • Embodiments of the present invention can be implemented in conjunction with existing systems and processes.
  • the text recognition system according to the present invention can be used in a wide variety of systems, including mobile devices, communication systems, and the like.
  • various techniques according to the present invention can be adopted into existing systems via training of a convolutional neural network (CNN) model, which is compatible with most optical character recognition (OCR) applications.
  • CNN convolutional neural network
  • OCR optical character recognition
  • a system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions.
  • One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions.
  • One general aspect includes a method for text recognition. The method also includes obtaining an image, the image including at least a first text region. The method also includes storing the image in a memory. The method also includes identifying the first text region. The method also includes obtaining an image preprocessing model from storage. The method also includes generating a feature map using the image preprocessing model and the first text region.
  • the method also includes providing an enhanced first text region using the feature map and the first text region.
  • the method also includes generating a data sequence using the enhanced first text region by a convolutional neural network.
  • the method also includes mapping the data sequence to an identified text.
  • Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
  • Implementations may include one or more of the following features.
  • the method may include receiving the image from a network interface.
  • the method may include cropping the first text region.
  • the method may include extracting feature data in multiple iterations from the first text region using the image preprocessing model.
  • the method may include identifying and removing a background from the image.
  • the method may include overlaying the identified text over the first text region.
  • the method may include detecting a language using at least the feature map.
  • the convolutional neural network may include a u-net architecture.
  • the method may include generating a low-frequency information level based on the first text region using the u-net architecture.
  • the low-frequency information level and the first text region are characterized by different dimensions.
  • the method may include removing a background from the image.
  • One general aspect includes a system for text recognition.
  • the system also includes a housing.
  • the system also includes a camera mounted on the housing and configured to capture an input image.
  • the system also includes a memory configured to store the input image.
  • the system also includes a storage configured to store a preprocessing model.
  • the system also includes a user interface configured to display an identified text.
  • the system also includes a processor coupled to the storage and the memory, the processor being configured to: identify a text region on the input image, generate a feature map using the preprocessing model and the text region, provide an enhanced text region using the feature map and the text region, generate a data sequence using the enhanced text region, and map the data sequence to the identified text.
  • Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
  • Implementations may include one or more of the following features.
  • the system where the processor may include a neural network processor configured for generating the feature map using a u-net architecture.
  • Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
  • One general aspect includes a method for providing a training model for processing text images.
  • the method also includes generating a text string.
  • the method also includes providing a background image.
  • the method also includes generating a reference image containing the text string.
  • the method also includes generating an initial training image containing the text string and the background image.
  • the method also includes generating a modified training image by reducing the quality of the initial training image using one or more randomized processes.
  • the method also includes calculating a pattern between the reference image and the modified training image using a convolutional neural network.
  • the method also includes storing the pattern as an image preprocessing model at a storage device.
  • Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
  • Implementations may include one or more of the following features.
  • the method may include retrieving the background image from a network source.
  • the method may include encoding the initial training image.
  • the method may include compressing and/or scaling the initial training image.
  • the method may include introducing one or more types of noise to the initial training image.
  • Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
  • embodiments of the present invention provide many advantages over conventional techniques. Among other things, the present systems and methods for image processing increase the model performance directly, resulting in improved image quality and easier text recognition. Additionally, the use of the lightweight U-net model will have a large positive influence on the OCR result without adding much computational cost.
  • Figure 1 is a simplified block diagram illustrating a system for capturing images and recognizing text from the captured images according to embodiments of the present invention.
  • Figure 2 is a simplified block diagram illustrating a server for generating an imaging preprocessing model and/or performing text recognition according to embodiments of the present invention.
  • Figure 3 is a simplified flow diagram illustrating a method for training a preprocessing model according to embodiments of the present invention.
  • Figure 4 is a simplified diagram illustrating synthetic images generated for training according to embodiments of the present invention.
  • Figure 5 is a simplified flow diagram illustrating a method for recognizing texts in an image according to embodiments of the present invention.
  • Figure 6 is a simplified flow diagram illustrating a text recognition method with a shared backbone according to embodiments of the present invention.
  • Figure 7A is a simplified diagram illustrating a U-net architecture used in image preprocessing according to embodiments of the present invention.
  • Figure 7B is a simplified diagram illustrating a U-net architecture used in image preprocessing according to embodiments of the present invention.
  • the present invention is directed to text recognition methods and techniques.
  • a text region within an image is identified.
  • Feature data are obtained from the text region using a preprocessing model, which is trained using a machine learning process.
  • the text region is enhanced using the feature data before text recognition is performed.
  • CNN convolutional neural network
  • CTC connectionist temporal classification
  • attention-based neural network using cross-entropy loss.
  • the CNN model approach is better suited to general and long scene text recognition (e.g., text, papers, magazines) since it can extract more sequence features and does not require a specific input size.
  • the attention-based model approach can focus on finer character-pixel details, which performs better for street views, vertical text, etc.
  • the present invention provides methods and systems that use deep learning techniques to remove noise and complex backgrounds of a scene text image in order to improve the performance of text recognition models.
  • Figure 1 is a simplified block diagram illustrating a system 100 for capturing images and recognizing text from the captured images according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications.
  • the text recognition system 100 can be configured within a housing 110 and can include a camera device 120 (or other image or video capturing device), a processor device 130, a memory device 140, and a storage device 150.
  • the camera 120 can be mounted on the housing 110 and be configured to capture an input image.
  • the input image can be stored in the memory 140, which can include a random-access memory (RAM) device, an image buffer device, or the like.
  • the storage device 150 can be configured to store a preprocessing model, which is used to evaluate the input image for text recognition.
  • the processor 130 can be coupled to each of the previously mentioned components and be configured to communicate between these components.
  • the processor 130 can include a central processing unit (CPU), a network processing unit (NPU), or the like.
  • the system 100 can also include a user interface 160 and a network interface 170.
  • the user interface 160 can be configured to display an identified text (e.g., from the input image).
  • the user interface 160 can include a display region 162 to display the identified text. This display region 162 can also be a touchscreen display (e.g., in a mobile device, tablet, etc.)
  • the user interface 160 can also include a touch interface 164 for receiving user input (e.g., keyboard or keypad in a mobile device, laptop, or other computing devices).
  • the user interface 160 can be used in real-time applications (RTAs), such as multimedia streaming, video conferencing/messaging, navigation, and the like.
  • RTAs real-time applications
  • the network interface 170 can be configured to transmit and receive images (e.g., using Wi-Fi, Bluetooth, Ethernet, etc.) for text recognition.
  • the network interface 170 can also be configured to compress or down-sample images for transmission or further processing.
  • the network interface 170 can also be configured to send one or more images to a server for OCR.
  • the processor 130 can also be coupled to and configured to communicate between the user interface 160, the network interface 170, and any other interfaces.
  • the processor 130 can be configured to identify a text region on the input image, which can be stored in the memory device 140; to generate a feature map using the preprocessing model and the text region; to provide an enhanced text region using the feature map and the text region; to generate a data sequence using the enhanced text region; and to map the data sequence to the identified text.
  • processor 130 includes a neural network processor configured for generating the feature map using a U-net architecture.
  • FIG. 2 is a simplified block diagram illustrating a server 200 for generating an imaging preprocessing model and/or performing text recognition according to embodiments of the present invention.
  • This diagram is merely an example, which should not unduly limit the scope of the claims.
  • One of ordinary skill in the art would recognize many variations, alternatives, and modifications.
  • server 200 includes a processor 210 coupled to a storage device 220 and a network interface 230. Similar to the text recognition system 100 of Figure 1, the processor 210 can be configured to communicate with the storage device 220 and the network interface 230.
  • the storage device 220 can also be configured to store a preprocessing model, which is used to evaluate the input image for text recognition.
  • the processor 210 can include a central processing unit (CPU), a network processing unit (NPU), or the like.
  • the network interface 230 can be configured to transmit and receive one or more input images (e.g., using Wi-Fi, Bluetooth, Ethernet, etc.) for text recognition.
  • the network interface 230 can also be configured to compress or down-sample images for transmission or further processing.
  • the server 200 is configured to perform OCR, or similar text recognition processing, on images received over a network from various network-enabled devices, such as mobile devices 291 and other computing devices 292. The server 200 can transmit the resulting identified text from the text recognition process back to these devices (291, 292) over the respective networks.
  • server 200 can be configured to perform OCR services for network-enabled devices (e.g., mobile device 291, computing device 292, network-enabled user devices, etc.).
  • server 200 can be configured to process low-quality/noisy images or images with complex backgrounds for text recognition.
  • text recognition services can be performed by server 200 for real-time applications and/or batch processing applications.
  • the processor 210 can be configured to identify a text region on the input image; to generate a feature map using the preprocessing model (from the storage device 220) and the text region; to provide an enhanced text region using the feature map and the text region; to generate a data sequence using the enhanced text region; and to map the data sequence to the identified text.
  • the processor 210 includes a neural network processor configured for generating the feature map using a U-net architecture.
  • server 200 can also be configured to perform neural network model training for use in text recognition processes.
  • the training can include obtaining training images and backgrounds, generating training data sets, and creating or updating a preprocessing model (e.g., image preprocessing model stored in storage 220, or the like). Further details of the model training process are described with reference to Figure 3.
  • the processor 210 of the server 200 can also be coupled to memory devices, other peripheral devices, a user interface, and/or other interfaces.
  • Other embodiments of this system include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
  • Figure 3 is a simplified flow diagram illustrating a method for training a preprocessing model according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims.
  • One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, rearranged, replaced, modified, and/or overlapped, and they should not limit the scope of the claims.
  • the method 300 includes step 302 of generating a text string and step 304 of providing a background image.
  • the text string can include any words, phrases, sentences, etc.
  • the method can include retrieving the background image from a network source.
  • a processor e.g., processor 130 of system 100, processor 210 of server 200, or the like
  • the method includes generating a reference image containing the text string and generating an initial training image containing the text string and the background image, respectively.
  • the method includes generating a modified training image by reducing the quality of the initial training image using one or more randomized processes. Further details of the training images are discussed with reference to Figure 4.
  • the method includes calculating a pattern between the reference image and the modified training image using a convolutional neural network (CNN).
  • CNN convolutional neural network
  • the reference image and the modified training image can be used to train a neural network model (e.g., CNN, or the like) to serve as an image preprocessing model.
  • This image preprocessing model can be used to remove the complex background portions or noise in an input image that makes it difficult to identify a target text region of the input image.
  • the model is trained using U-shape down sampling and up sampling neural network (U-net) architecture for removing such complex backgrounds or noise.
  • U-net U-shape down sampling and up sampling neural network
  • the input image can undergo a down-sampling encoding process to obtain a string of features smaller than the original image, which is similar to the effect of an image compression process.
  • the method can include artificially altering a training image (e.g., adding noise, reducing quality, or other image manipulation) to generate the modified training image to be used to train an image preprocessing model. Further details of the image preprocessing model training are discussed with reference to Figures 6, 7A, and 7B.
  • the method includes storing the pattern as an image preprocessing model at a storage device.
  • the processor 130 can be configured to store the image preprocessing model at the storage device 150 for future use in a text recognition process (e.g., method 500 of Figure 5). Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
  • Figure 4 is a simplified diagram illustrating synthetic images generated for training according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications.
  • Figure 4 shows multiple training images that can be used for generating an image preprocessing model, as described previously for method 300 in Figure 3.
  • the generated text string is “Super Text”.
  • Training image 400 shows a clean reference image with the generated text string on a clear background.
  • Training images 401-403 show the text string on a background with a repeating pattern, a background of noise/random variations, and a background of the text, respectively.
  • a synthetic data generator can be used to produce a modified training image (e.g., step 310 of method 300).
  • the synthetic data generator can use images or combinations of images retrieved from a network source, image dataset, or the like.
  • the generator can also use a white or black background with randomized text using different transparency levels.
  • the generator can randomly add different types of image noise (e.g., Gaussian, Poisson, etc.), apply different levels of compression (i.e., reducing image quality at randomized levels), and apply different levels of scaling (i.e., enlarging the image, reducing the image, etc.).
  • the clean reference image 400 can be used with any of the training images 401-403, or similar training images, to train a neural network model as an image preprocessing model to be used to remove complex background portions or noise from an input image.
  • the implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
  • Figure 5 is a simplified flow diagram illustrating a method for recognizing texts in an image according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims.
  • One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, rearranged, replaced, modified, and/or overlapped, and they should not limit the scope of the claims.
  • method 500 includes step 502 of obtaining an image, which includes at least a first text region, and step 504 of storing the image in a memory device.
  • the processor 130 can be configured to obtain the image using the camera device 120 and to store the image in the memory 140.
  • the method can include receiving the image from a network interface.
  • the processor 130 can also be configured to obtain the image from the network interface 170.
  • the method includes identifying the first text region.
  • the method can include cropping the first text region to remove unrelated portions of the image.
  • a second text region and/or additional text regions may be identified and cropped as well.
  • the text regions can be located by the coordinates of corner points that outline the boundaries to be cropped.
  • the method includes obtaining an image preprocessing model from a storage and generating a feature map using the image preprocessing model and the first text region, respectively.
  • the processor 130 can be configured to obtain the image preprocessing model from the storage device 150 and to generate the feature map.
  • the method can include extracting feature data in multiple iterations from the first text region using the image preprocessing model.
  • the method can also include extracting such feature data from a second text region and/or additional text regions.
  • the method includes providing an enhanced first text region using the feature map and the first text region.
  • the method can include identifying and removing a background from the image.
  • the method can also include providing an enhanced second text region and/or additional enhanced text regions using the image preprocessing model.
  • the method includes generating a data sequence using the enhanced first text region by a convolutional neural network.
  • the convolutional neural network includes a U-net architecture, or the like.
  • the method can also include generating a low-frequency information level based on the first text region using the U-net architecture.
  • the low-frequency information level and the first text region can be characterized by different dimensions.
  • the method includes mapping the data sequence to an identified text.
  • the method can include overlaying the identified text over the first text region.
  • Figure 6 is a simplified flow diagram illustrating a text recognition method 600 with a shared backbone according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims.
  • One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, rearranged, replaced, modified, and/or overlapped, and they should not limit the scope of the claims.
  • method 600 includes step 602 of receiving an input image for text recognition processing.
  • this input image includes at least a text region and a background region (e.g., background pattern, background text, image noise, etc.).
  • the method includes training an image preprocessing model using a U-net architecture. As discussed previously, these steps can be carried out by a processor in a text recognition system, a server, or other computing devices.
  • the method includes generating a background removed image, i.e., an enhanced input image.
  • the generating the background removed image includes generating a feature map using the image preprocessing model that was trained using the U-net architecture, and then using the feature map to produce an enhanced text region.
  • the method includes applying a CNN model to extract one or more features associated with the text region, and applying a sequence module (i.e., processing by a sequence layer) to encode the extracted features into a feature sequence, respectively.
  • the step of extracting features data can be performed in multiple iterations on the text region using the CNN model.
  • the method includes applying a dense module (i.e., processing by a dense layer) to map the feature sequence to one or more identified text portions (e.g., characters, words, etc.).
  • dense module i.e., processing by a dense layer
  • Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium. Further details of these techniques are discussed with reference to Figures 7 A and 7B.
  • FIG. 7A is a simplified diagram illustrating a U-net architecture used in image preprocessing according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications.
  • diagram 701 shows example inputs and outputs to an image preprocessing model.
  • the input and output dimensions of the model are the same with respect to batch size, channel, height, and width.
  • the first input dimensions are shown as (64, 1, 32, 320), referring to the previous dimensions, respectively.
  • the input image is a greyscale image, which is denoted by a channel value of one.
  • the method includes down-sampling the input image (left-side), which is characterized as follows: (64, 1, 32, 320) -> (64, 64, 16, 160) -> (64, 128, 8, 80) -> (64, 160, 4, 80) -> (64, 256, 2, 80) (see the U-net code sketch following these definitions).
  • the method includes reverting the down-sampling process (right- side) to obtain the image at the original size.
  • the reversion process includes applying a CNN model and applying concatenation processes to effectively up-sample the image.
  • the U-net network architecture has a plurality of layers, including an upper layer (upper part of the U-shape) and a bottom layer configured to process the input image/graph.
  • the upper layer of the network can be configured to obtain detailed information of the graph.
  • the bottom layer can be configured to obtain low-frequency information of the graph (e.g., using a large receptive field to obtain large outline information).
  • the method can include using one or more skip connections to retain information at each level, which enables the network to remember all of the graph information.
  • FIG. 7B is a simplified diagram illustrating a u-net architecture used in image preprocessing according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications.
  • flow diagram 702 illustrates a method of training a U-net model, similar to that shown in Figure 7A.
  • the method includes down-sampling an input image 710 (left-side).
  • the different stages of the down-sampled image are shown by the block flow 720 of transformed blocks representing the reduction of the image quality.
  • the down-sampling reaches a bottleneck 720, where a CNN model can be applied to revert the down-sampling process, which is shown by the block flow 730.
  • skip connections (shown by the arrows between the block flow 720 and the block flow 730) can be used to retain the image information at each level, which can be used with the CNN model to effectively up-sample the image back to the original size at the output image 740.
  • the present invention can use a CNN-based model, which is compatible with most OCR applications, to generate a data sequence to be mapped to an identified text.
  • the basic CNN architecture can be a general CNN backbone without the last several dense/classification layers (e.g., Resnet, VGG-16, ghostNet, EfficientNet, etc.) as long as it can extract the feature at the pixel level.
  • the dense/classification layers may be used to extract the pixel-level features, as shown in method 600 of Figure 6.
  • the type of CNN depends on the restriction of the speed and memory.
  • a recurrent neural network can be added for feature encoding.
  • the sliced feature can be mapped with each word or character using bidirectional long short-term memory (BiLSTM).
  • the LSTM is directional, i.e., it only uses past contexts. However, in image-based sequences, contexts from both directions are useful and complementary to each other. Therefore, two LSTMs, one forward and one backward, can be combined into a bidirectional LSTM. Furthermore, multiple bidirectional LSTMs can be stacked, resulting in a deep bidirectional LSTM.
  • LSTM Besides LSTM, other recurrent neural networks, such as a gated recurrent unit (GRU) or the like, may be used as well.
  • GRU gated recurrent unit
  • the GRU is very similar to LSTM, having update and reset gates.
  • the deep structure can allow for a higher level of abstraction compared to a shallow one. Such advantages result in significant performance improvements in the task of text recognition.
  • the basic model consists of a ConvNet and a recurrent neural network.
  • the convolutional layers automatically extract a feature sequence from each input image.
  • a recurrent network is built for predicting each frame of the feature sequence that is outputted by the convolutional layers.
  • Each vector in the extracted feature sequence is associated with a receptive field on the input image, and can be considered as the feature vector of that field.
  • the present invention uses a connectionist temporal classification (CTC) algorithm to align sequences where timing is variable.
  • CTC connectionist temporal classification
  • the CTC algorithm is alignment-free, i.e., it doesn't require an alignment between the input and the output.
  • CTC works by summing over the probability of all possible alignments between the two sequences.
  • the CTC alignments give us a natural way to go from probabilities at each time-step to the probability of an output sequence.
  • the CTC function is defined as follows: Y* = argmax_Y p(Y | X), i.e., the identified text is the label sequence Y with the highest probability given the input sequence X, where p(Y | X) sums over all valid alignments (a BiLSTM-plus-CTC code sketch follows these definitions).
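For illustration only: the dimension flow quoted above for Figure 7A can be realized by a small encoder-decoder. The PyTorch sketch below follows the (64, 1, 32, 320) -> (64, 64, 16, 160) -> (64, 128, 8, 80) -> (64, 160, 4, 80) -> (64, 256, 2, 80) progression and the skip-connection concatenation described for the U-net; the kernel sizes, activation choices, and output head are assumptions made for the sketch, not the exact network of this disclosure.

```python
# Minimal U-net-style preprocessing sketch (assumed layer choices; only the quoted
# (batch, channel, height, width) progression comes from the text above).
import torch
import torch.nn as nn

def block(c_in, c_out, stride=1):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1),
                         nn.ReLU(inplace=True))

class TextCleanUNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: (64,1,32,320)->(64,64,16,160)->(64,128,8,80)->(64,160,4,80)->(64,256,2,80)
        self.d1 = block(1, 64, 2)
        self.d2 = block(64, 128, 2)
        self.d3 = block(128, 160, (2, 1))   # halve height only
        self.d4 = block(160, 256, (2, 1))
        # Decoder: transposed convolutions plus skip-connection concatenation
        self.u1, self.c1 = nn.ConvTranspose2d(256, 160, (2, 1), stride=(2, 1)), block(320, 160)
        self.u2, self.c2 = nn.ConvTranspose2d(160, 128, (2, 1), stride=(2, 1)), block(256, 128)
        self.u3, self.c3 = nn.ConvTranspose2d(128, 64, 2, stride=2), block(128, 64)
        self.u4 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.out = nn.Conv2d(33, 1, 3, padding=1)   # concatenated with the input, predicts a clean image

    def forward(self, x):
        s1 = self.d1(x)
        s2 = self.d2(s1)
        s3 = self.d3(s2)
        b = self.d4(s3)                     # bottleneck carrying low-frequency outline information
        y = self.c1(torch.cat([self.u1(b), s3], dim=1))
        y = self.c2(torch.cat([self.u2(y), s2], dim=1))
        y = self.c3(torch.cat([self.u3(y), s1], dim=1))
        return self.out(torch.cat([self.u4(y), x], dim=1))

if __name__ == "__main__":
    print(TextCleanUNet()(torch.randn(64, 1, 32, 320)).shape)   # torch.Size([64, 1, 32, 320])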
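Similarly, a hedged sketch of the recognition head described above (stacked bidirectional LSTM feature encoding followed by a dense layer and CTC) is given below; the feature dimension, hidden size, layer count, and vocabulary size are placeholder assumptions, and the feature sequence is assumed to come from a CNN backbone.

```python
import torch
import torch.nn as nn

class RecognitionHead(nn.Module):
    """Stacked bidirectional LSTM over a feature sequence, followed by a dense (classification) layer."""
    def __init__(self, feat_dim=512, hidden=256, num_classes=37):   # e.g., 36 symbols + CTC blank (assumed)
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2, bidirectional=True, batch_first=True)
        self.dense = nn.Linear(2 * hidden, num_classes)

    def forward(self, feats):               # feats: (batch, time, feat_dim) sliced from the CNN feature map
        seq, _ = self.rnn(feats)
        return self.dense(seq).log_softmax(dim=-1)    # per-frame class log-probabilities

head = RecognitionHead()
log_probs = head(torch.randn(4, 80, 512)).permute(1, 0, 2)   # (time, batch, classes) as CTCLoss expects
targets = torch.randint(1, 37, (4, 10))                      # dummy label indices (0 is the blank)
# CTC loss sums probability over all alignments between the per-frame predictions
# and the shorter label sequence, so no explicit alignment is needed.
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           torch.full((4,), 80, dtype=torch.long),
                           torch.full((4,), 10, dtype=torch.long))
# Greedy decoding approximates Y* = argmax_Y p(Y | X): take the best class per frame,
# then collapse repeated symbols and drop blanks.
best_path = log_probs.argmax(dim=-1).permute(1, 0)           # (batch, time)
```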

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

The present invention is directed to text recognition methods and techniques. According to a specific embodiment, a text region within an image is identified. Feature data are obtained from the text region using a preprocessing model, which is trained using a machine learning process. The text region is enhanced using the feature data before text recognition is performed. There are other embodiments as well.

Description

METHODS AND SYSTEMS FOR TEXT RECOGNITION WITH IMAGE PREPROCESSING
BACKGROUND OF THE INVENTION
[0001] The present invention is directed to text recognition methods and techniques.
[0002] As more and more documents and images are stored electronically, recognizing and extracting text from images have become ubiquitous. For example, one of the tools is optical character recognition (OCR), which refers to electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text. The input image may be obtained from a scanned document, a photo of a document, a scene photo, or subtitle text superimposed on an image. Optical character recognition is useful for extracting information from images, and its applications include searching, positioning, translation, recommendation, and many others. Over the years, many conventional text recognition systems have been proposed, but they have proven inadequate for the reasons detailed below.
[0003] Therefore, new and improved methods and systems for text recognition are desired.
BRIEF SUMMARY OF THE INVENTION
[0004] The present invention is directed to text recognition methods and techniques. According to a specific embodiment, a text region within an image is identified. Feature data are obtained from the text region using a preprocessing model, which is trained using a machine learning process. The text region is enhanced using the feature data before text recognition is performed. There are other embodiments as well.
[0005] Embodiments of the present invention can be implemented in conjunction with existing systems and processes. For example, the text recognition system according to the present invention can be used in a wide variety of systems, including mobile devices, communication systems, and the like. Additionally, various techniques according to the present invention can be adopted into existing systems via training of a convolutional neural network (CNN) model, which is compatible with most optical character recognition (OCR) applications. There are other benefits as well.
[0006] A system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions. One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions. One general aspect includes a method for text recognition. The method also includes obtaining an image, the image including at least a first text region. The method also includes storing the image in a memory. The method also includes identifying the first text region. The method also includes obtaining an image preprocessing model from storage. The method also includes generating a feature map using the image preprocessing model and the first text region. The method also includes providing an enhanced first text region using the feature map and the first text region. The method also includes generating a data sequence using the enhanced first text region by a convolutional neural network. The method also includes mapping the data sequence to an identified text. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
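As an illustration only, the steps of this general aspect could be wired together as in the following Python sketch; the callables (detector, preprocessing model, recognizer, decoder) are placeholders standing in for the components described in this document, and the additive combination of the feature map with the text region is an assumption, since the enhancement operation is not specified here.

```python
# Hypothetical end-to-end flow for the text-recognition method described above.
import torch

def recognize_text(image: torch.Tensor, detector, preprocess_model, recognizer, decoder) -> str:
    text_region = detector(image)                  # identify and crop the first text region
    feature_map = preprocess_model(text_region)    # feature map from the trained image preprocessing model
    enhanced_region = text_region + feature_map    # enhanced first text region (combination assumed)
    data_sequence = recognizer(enhanced_region)    # convolutional/sequence model generates a data sequence
    return decoder(data_sequence)                  # map the data sequence to an identified text
```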
[0007] Implementations may include one or more of the following features. The method may include receiving the image from a network interface. The method may include cropping the first text region. The method may include extracting feature data in multiple iterations from the first text region using the image preprocessing model. The method may include identifying and removing a background from the image. The method may include overlaying the identified text over the first text region. The method may include detecting a language using at least the feature map. The convolutional neural network may include a u-net architecture. The method may include generating a low-frequency information level based on the first text region using the u-net architecture. The low-frequency information level and the first text region are characterized by different dimensions. The method may include removing a background from the image. The method may include detecting a language associated with the first text region. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
[0008] One general aspect includes a system for text recognition. The system also includes a housing. The system also includes a camera mounted on the housing and configured to capture an input image. The system also includes a memory configured to store the input image. The system also includes a storage configured to store a preprocessing model. The system also includes a user interface configured to display an identified text. The system also includes a processor coupled to the storage and the memory, the processor being configured to: identify a text region on the input image, generate a feature map using the preprocessing model and the text region, provide an enhanced text region using the feature map and the text region, generate a data sequence using the enhanced text region, and map the data sequence to the identified text. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
[0009] Implementations may include one or more of the following features. The system where the processor may include a neural network processor configured for generating the feature map using a u-net architecture. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
[0010] One general aspect includes a method for providing a training model for processing text images. The method also includes generating a text string. The method also includes providing a background image. The method also includes generating a reference image containing the text string. The method also includes generating an initial training image containing the text string and the background image. The method also includes generating a modified training image by reducing the quality of the initial training image using one or more randomized processes. The method also includes calculating a pattern between the reference image and the modified training image using a convolutional neural network. The method also includes storing the pattern as an image preprocessing model at a storage device. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
[0011] Implementations may include one or more of the following features. The method may include retrieving the background image from a network source. The method may include encoding the initial training image. The method may include compressing and/or scaling the initial training image. The method may include introducing one or more types of noise to the initial training image. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
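A minimal sketch of the randomized degradations listed here (noise, compression, scaling), using OpenCV and NumPy; the probabilities and parameter ranges are arbitrary examples chosen for the sketch, not values from this disclosure.

```python
import random
import cv2
import numpy as np

def degrade(image: np.ndarray) -> np.ndarray:
    """Reduce the quality of an initial training image using randomized processes."""
    out = image.astype(np.float32)
    # Randomly introduce noise of different types (e.g., Gaussian or Poisson-like).
    if random.random() < 0.5:
        out += np.random.normal(0.0, random.uniform(2.0, 15.0), out.shape)
    else:
        out = np.random.poisson(np.clip(out, 0, 255)).astype(np.float32)
    out = np.clip(out, 0, 255).astype(np.uint8)
    # Apply a randomized level of JPEG compression.
    _, buf = cv2.imencode(".jpg", out, [int(cv2.IMWRITE_JPEG_QUALITY), random.randint(20, 90)])
    out = cv2.imdecode(buf, cv2.IMREAD_UNCHANGED)
    # Apply a randomized level of scaling (shrink or enlarge, then restore the original size).
    h, w = out.shape[:2]
    scale = random.uniform(0.3, 1.5)
    out = cv2.resize(out, (max(1, int(w * scale)), max(1, int(h * scale))))
    return cv2.resize(out, (w, h))
```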
[0012] It is to be appreciated that embodiments of the present invention provide many advantages over conventional techniques. Among other things, the present systems and methods for image processing increase the model performance directly, resulting in improved image quality and easier text recognition. Additionally, the use of the lightweight U-net model will have a large positive influence on the OCR result without adding much computational cost.
[0013] The present invention achieves these benefits and others in the context of known technology. However, a further understanding of the nature and advantages of the present invention may be realized by reference to the latter portions of the specification and attached drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] Figure 1 is a simplified block diagram illustrating a system for capturing images and recognizing text from the captured images according to embodiments of the present invention.
[0015] Figure 2 is a simplified block diagram illustrating a server for generating an imaging preprocessing model and/or performing text recognition according to embodiments of the present invention.
[0016] Figure 3 is a simplified flow diagram illustrating a method for training a preprocessing model according to embodiments of the present invention.
[0017] Figure 4 is a simplified diagram illustrating synthetic images generated for training according to embodiments of the present invention.
[0018] Figure 5 is a simplified flow diagram illustrating a method for recognizing texts in an image according to embodiments of the present invention.
[0019] Figure 6 is a simplified flow diagram illustrating a text recognition method with a shared backbone according to embodiments of the present invention.
[0020] Figure 7A is a simplified diagram illustrating a U-net architecture used in image preprocessing according to embodiments of the present invention.
[0021] Figure 7B is a simplified diagram illustrating a U-net architecture used in image preprocessing according to embodiments of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
[0022] The present invention is directed to text recognition methods and techniques. According to a specific embodiment, a text region within an image is identified. Feature data are obtained from the text region using a preprocessing model, which is trained using a machine learning process. The text region is enhanced using the feature data before text recognition is performed. There are other embodiments as well.
[0023] Over the years, many techniques for scene text recognition (STR) have been developed, including both traditional and deep learning approaches. For deep learning, the pros and cons of the following approaches are considered: convolutional neural network (CNN) backbone using sequence module and connectionist temporal classification (CTC) loss; and attention-based neural network using cross-entropy loss. The CNN model approach is better suited to general and long scene text recognition (e.g., text, papers, magazines) since it can extract more sequence features and does not require a specific input size. On the other hand, the attention-based model approach can focus on finer character-pixel details, which performs better for street views, vertical text, etc.
[0024] However, conventional deep-learning STR techniques, such as those described above, struggle when applied to various complicated images, such as those with messy backgrounds or image noise, because the recognition model will tend to regard extraneous portions of the image as a part of the text to be identified. These errors occur due to the contours/corners of these extraneous portions being evaluated as features for recognition. Since conventional deep learning techniques do not have good solutions for these situations, traditional techniques often use image processing methods to increase the image quality to differentiate the text to be identified.
[0025] There are many drawbacks with using traditional image processing methods (e.g., denoising and increasing contrast). Denoising methods using low-frequency pass bands can remove many text contour details. Since the edges and corners of text are also present within the high-frequency information, denoising methods that smooth or remove the high-frequency features can also cause the loss of text details. Methods that increase the contrast to enhance the text region can also enhance the noise or extraneous portions of the background. In both cases, the processing may actually reduce the image quality and reduce the accuracy, resulting in a poor performance by the recognition model.
[0026] Thus, a general aspect of the present invention is to provide a new solution for removing noise in scene text recognition. In various embodiments, the present invention provides methods and systems that use deep learning techniques to remove noise and complex backgrounds of a scene text image in order to improve the performance of text recognition models.
[0027] The following description is presented to enable one of ordinary skill in the art to make and use the invention and to incorporate it in the context of particular applications. Various modifications, as well as a variety of uses in different applications, will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to a wide range of embodiments. Thus, the present invention is not intended to be limited to the embodiments presented, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
[0028] In the following detailed description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. However, it will be apparent to one skilled in the art that the present invention may be practiced without necessarily being limited to these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention.
[0029] The reader’s attention is directed to all papers and documents which are filed concurrently with this specification and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference. All the features disclosed in this specification (including any accompanying claims, abstract, and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.
[0030] Furthermore, any element in a claim that does not explicitly state “means for” performing a specified function, or “step for” performing a specific function, is not to be interpreted as a “means” or “step” clause as specified in 35 U.S.C. Section 112, Paragraph 6. In particular, the use of “step of” or “act of” in the Claims herein is not intended to invoke the provisions of 35 U.S.C. 112, Paragraph 6.
[0031] Please note, if used, the labels left, right, front, back, top, bottom, forward, reverse, clockwise and counter-clockwise have been used for convenience purposes only and are not intended to imply any particular fixed direction. Instead, they are used to reflect relative locations and/or directions between various portions of an object.
[0032] Figure 1 is a simplified block diagram illustrating a system 100 for capturing images and recognizing text from the captured images according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications.
[0033] As shown, the text recognition system 100 can be configured within a housing 110 and can include a camera device 120 (or other image or video capturing device), a processor device 130, a memory device 140, and a storage device 150. The camera 120 can be mounted on the housing 110 and be configured to capture an input image. The input image can be stored in the memory 140, which can include a random-access memory (RAM) device, an image buffer device, or the like. The storage device 150 can be configured to store a preprocessing model, which is used to evaluate the input image for text recognition. The processor 130 can be coupled to each of the previously mentioned components and be configured to communicate between these components. In a specific example, the processor 130 can include a central processing unit (CPU), a network processing unit (NPU), or the like.
[0034] The system 100 can also include a user interface 160 and a network interface 170. The user interface 160 can be configured to display an identified text (e.g., from the input image). In a specific example, the user interface 160 can include a display region 162 to display the identified text. This display region 162 can also be a touchscreen display (e.g., in a mobile device, tablet, etc.). Alternatively, the user interface 160 can also include a touch interface 164 for receiving user input (e.g., keyboard or keypad in a mobile device, laptop, or other computing devices). The user interface 160 can be used in real-time applications (RTAs), such as multimedia streaming, video conferencing/messaging, navigation, and the like.
[0035] The network interface 170 can be configured to transmit and receive images (e.g., using Wi-Fi, Bluetooth, Ethernet, etc.) for text recognition. In a specific example, the network interface 170 can also be configured to compress or down-sample images for transmission or further processing. The network interface 170 can also be configured to send one or more images to a server for OCR. The processor 130 can also be coupled to and configured to communicate between the user interface 160, the network interface 170, and any other interfaces.
[0036] In an example, the processor 130 can be configured to identify a text region on the input image, which can be stored in the memory device 140; to generate a feature map using the preprocessing model and the text region; to provide an enhanced text region using the feature map and the text region; to generate a data sequence using the enhanced text region; and to map the data sequence to the identified text. In a specific example, processor 130 includes a neural network processor configured for generating the feature map using a U-net architecture.
[0037] Other embodiments of this system include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods. Further details of methods for text recognition, model training for processing text images, and related techniques are discussed with reference to the following figures.
[0038] Figure 2 is a simplified block diagram illustrating a server 200 for generating an imaging preprocessing model and/or performing text recognition according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications.
[0039] As shown, server 200 includes a processor 210 coupled to a storage device 220 and a network interface 230. Similar to the text recognition system 100 of Figure 1, the processor 210 can be configured to communicate with the storage device 220 and the network interface 230. The storage device 220 can also be configured to store a preprocessing model, which is used to evaluate the input image for text recognition. In a specific example, the processor 210 can include a central processing unit (CPU), a network processing unit (NPU), or the like.
[0040] The network interface 230 can be configured to transmit and receive one or more input images (e.g., using Wi-Fi, Bluetooth, Ethernet, etc.) for text recognition. In a specific example, the network interface 230 can also be configured to compress or down-sample images for transmission or further processing. In a specific example, the server 200 is configured to perform OCR, or similar text recognition processing, on images received over a network from various network-enabled devices, such as mobile devices 291 and other computing devices 292. The server 200 can transmit the resulting identified text from the text recognition process back to these devices (291, 292) over the respective networks.
[0041] In an example, server 200 can be configured to perform OCR services for network-enabled devices (e.g., mobile device 291, computing device 292, network-enabled user devices, etc.). Using the techniques of the present invention, server 200 can be configured to process low-quality/noisy images or images with complex backgrounds for text recognition. Such text recognition services can be performed by server 200 for real-time applications and/or batch processing applications.
[0042] Similar to the processor 130 of the system 100, the processor 210 can be configured to identify a text region on the input image; to generate a feature map using the preprocessing model (from the storage device 220) and the text region; to provide an enhanced text region using the feature map and the text region; to generate a data sequence using the enhanced text region; and to map the data sequence to the identified text. In a specific example, the processor 210 includes a neural network processor configured for generating the feature map using a U-net architecture.
[0043] In an example, server 200 can also be configured to perform neural network model training for use in text recognition processes. The training can include obtaining training images and backgrounds, generating training data sets, and creating or updating a preprocessing model (e.g., image preprocessing model stored in storage 220, or the like). Further details of the model training process are described with reference to Figure 3.
[0044] In an example, the processor 210 of the server 200 can also be coupled to memory devices, other peripheral devices, a user interface, and/or other interfaces. Other embodiments of this system include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
[0045] Figure 3 is a simplified flow diagram illustrating a method for training a preprocessing model according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, rearranged, replaced, modified, and/or overlapped, and they should not limit the scope of the claims.
[0046] As shown, the method 300 includes step 302 of generating a text string and step 304 of providing a background image. The text string can include any words, phrases, sentences, etc. In a specific example, the method can include retrieving the background image from a network source. A processor (e.g., processor 130 of system 100, processor 210 of server 200, or the like) can be configured to generate the text string and to retrieve the background image from the network source using a network interface (e.g., network interface 170 of system 100, network interface 230 of server 200, or the like).
[0047] In steps 306 and 308, the method includes generating a reference image containing the text string and generating an initial training image containing the text string and the background image, respectively. In step 310, the method includes generating a modified training image by reducing the quality of the initial training image using one or more randomized processes. Further details of the training images are discussed with reference to Figure 4.
[0048] In step 312, the method includes calculating a pattern between the reference image and the modified training image using a convolutional neural network (CNN). In an example, the reference image and the modified training image can be used to train a neural network model (e.g., CNN, or the like) to serve as an image preprocessing model. This image preprocessing model can be used to remove the complex background portions or noise in an input image that makes it difficult to identify a target text region of the input image. In a specific example, the model is trained using a U-shape down-sampling and up-sampling neural network (U-net) architecture for removing such complex backgrounds or noise.
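Merely as an illustrative, non-limiting sketch of this training step (assuming Python with PyTorch, which the present disclosure does not require), the reference/modified image pairs may drive a pixel-wise reconstruction loss as follows; the stand-in model and the random tensors are placeholders rather than a specific embodiment:

```python
# Illustrative sketch (PyTorch assumed): train a preprocessing network so that
# its output for a degraded training image matches the clean reference image.
# The stand-in model below is a placeholder for the U-net of Figures 7A/7B.
import torch
import torch.nn as nn

model = nn.Sequential(                      # placeholder for a U-net
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.L1Loss()                       # pixel-wise reconstruction loss

for step in range(100):
    # In practice these come from the synthetic data generator (Figure 4);
    # random tensors are used here only to keep the sketch self-contained.
    reference = torch.rand(8, 1, 32, 320)   # clean image containing the text string
    degraded = reference + 0.1 * torch.randn_like(reference)  # modified image

    restored = model(degraded)              # model learns the degradation pattern
    loss = loss_fn(restored, reference)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```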
[0049] With the U-net architecture, the input image can be processed by a down-sampling encoding process to obtain a string of features smaller than the original image, which is similar to the effect of an image compression process. After a decoding process, the goal is to restore the original image. Thus, the method can include artificially altering a training image (e.g., adding noise, reducing quality, or other image manipulation) to generate the modified training image to be used to train an image preprocessing model. Further details of the image preprocessing model training are discussed with reference to Figures 6, 7A, and 7B.
[0050] In step 314, the method includes storing the pattern as an image preprocessing model at a storage device. Referring to system 100, the processor 130 can be configured to store the image preprocessing model at the storage device 150 for future use in a text recognition process (e.g., method 500 of Figure 5). Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
[0051] Figure 4 is a simplified diagram illustrating synthetic images generated for training according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications.
[0052] As shown, Figure 4 includes multiple training images that can be used for generating an image preprocessing model, as described previously for method 300 in Figure 3. In these cases, the generated text string is "Super Text". Training image 400 shows a clean reference image with the generated text string on a clear background. Training images 401-403 show the text string on a background with a repeating pattern, a background of noise/random variations, and a background of text, respectively.
[0053] In a specific example, a synthetic data generator can be used to produce a modified training image (e.g., step 310 of method 300). The synthetic data generator can use images or combinations of images retrieved from a network source, image dataset, or the like. The generator can also use a white or black background with randomized text using different transparency levels. Further, the generator can randomly add different types of image noise (e.g., Gaussian, Poisson, etc.), apply different levels of compression (i.e., reducing image quality at randomized levels), and apply different levels of scaling (i.e., enlarging the image, reducing the image, etc.).
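Merely as an illustrative, non-limiting sketch of such randomized degradations (assuming Python with NumPy and OpenCV, neither of which is required by the present disclosure), a generator of this kind might apply noise, compression, and scaling as follows:

```python
# Illustrative sketch (NumPy + OpenCV assumed) of the kinds of randomized
# degradations described above: additive noise, JPEG compression, and scaling.
import cv2
import numpy as np

def degrade(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    img = image.astype(np.float32)

    # Randomly add Gaussian or Poisson-style noise.
    if rng.random() < 0.5:
        img += rng.normal(0.0, rng.uniform(2, 12), img.shape)
    else:
        img = rng.poisson(np.clip(img, 0, 255)).astype(np.float32)
    img = np.clip(img, 0, 255).astype(np.uint8)

    # Randomized JPEG compression level (lower quality = more artifacts).
    quality = int(rng.integers(20, 90))
    ok, buf = cv2.imencode(".jpg", img, [cv2.IMWRITE_JPEG_QUALITY, quality])
    img = cv2.imdecode(buf, cv2.IMREAD_GRAYSCALE)

    # Randomized down/up scaling to simulate resolution loss.
    h, w = img.shape[:2]
    scale = rng.uniform(0.4, 1.0)
    small = cv2.resize(img, (max(1, int(w * scale)), max(1, int(h * scale))))
    return cv2.resize(small, (w, h))

rng = np.random.default_rng(0)
clean = np.full((32, 320), 255, dtype=np.uint8)      # stand-in reference image
cv2.putText(clean, "Super Text", (5, 24), cv2.FONT_HERSHEY_SIMPLEX, 0.8, 0, 2)
noisy = degrade(clean, rng)                          # modified training image
```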
[0054] The clean reference image 400 can be used with any of the training images 401-403, or similar training images, to train a neural network model as an image preprocessing model to be used to remove complex background portions or noise from an input image. The implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
[0055] Figure 5 is a simplified flow diagram illustrating a method for recognizing texts in an image according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, rearranged, replaced, modified, and/or overlapped, and they should not limit the scope of the claims.
[0056] As shown, method 500 includes step 502 of obtaining an image, which includes at least a first text region, and step 504 of storing the image in a memory device. Referring to the text recognition system 100 of Figure 1, the processor 130 can be configured to obtain the image using the camera device 120 and to store the image in the memory 140. In a specific example, the method can include receiving the image from a network interface. In such a case, the processor 130 can also be configured to obtain the image from the network interface 180.
[0057] In step 506, the method includes identifying the first text region. In a specific example, the method can include cropping the first text region to remove unrelated portions of the image. In other examples, a second text region and/or additional text regions may be identified and cropped as well. Using a text detection module, the text regions can be located by the coordinates of corner points that outline the boundaries to be cropped.
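Merely as an illustrative, non-limiting sketch of this cropping step (assuming Python with NumPy, which the present disclosure does not require), a first text region can be cut out of the input image from the corner-point coordinates as follows; the coordinates shown are placeholders:

```python
# Illustrative sketch: crop a detected text region given its corner points
# (e.g., from a text detection module), assuming NumPy array image conventions.
import numpy as np

def crop_text_region(image: np.ndarray, corners: np.ndarray) -> np.ndarray:
    """Crop the axis-aligned bounding box that encloses the corner points."""
    xs, ys = corners[:, 0], corners[:, 1]
    x0, x1 = int(xs.min()), int(np.ceil(xs.max()))
    y0, y1 = int(ys.min()), int(np.ceil(ys.max()))
    return image[y0:y1, x0:x1]

image = np.zeros((480, 640), dtype=np.uint8)           # placeholder input image
corners = np.array([[100, 200], [300, 200], [300, 240], [100, 240]])
first_text_region = crop_text_region(image, corners)   # shape (40, 200)
```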
[0058] In steps 508 and 510, the method includes obtaining an image preprocessing model from a storage and generating a feature map using the image preprocessing model and the first text region, respectively. Referring to system 100 of Figure 1, the processor 130 can be configured to obtain the image preprocessing model from the storage device 150 and to generate the feature map. In a specific example, the method can include extracting feature data in multiple iterations from the first text region using the image preprocessing model.
Similarly, the method can also include extracting such feature data from a second text region and/or additional text regions.
[0059] In step 512, the method includes providing an enhanced first text region using the feature map and the first text region. In a specific example, the method can include identifying and removing a background from the image. Similarly, the method can also include providing an enhanced second text region and/or additional enhanced text regions using the image preprocessing model.
[0060] In step 514, the method includes generating a data sequence using the enhanced first text region by a convolutional neural network. In a specific example, the convolutional neural network includes a U-net architecture, or the like. The method can also include generating a low-frequency information level based on the first text region using the U-net architecture.
The low-frequency information level and the first text region can be characterized by different dimensions.
[0061] In step 516, the method includes mapping the data sequence to an identified text. In a specific example, the method can include overlaying the identified text over the first text region. The method can also include detecting a language associated with the first text region, the second text region, and/or any additional text regions. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
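Merely as an illustrative, non-limiting sketch of overlaying the identified text (assuming Python with OpenCV and NumPy, neither of which is required by the present disclosure), the recognized string can be drawn over the first text region as follows; the box coordinates and text are placeholders:

```python
# Illustrative sketch: draw the identified text on top of its text region,
# assuming OpenCV and an (x0, y0, x1, y1) box for the first text region.
import cv2
import numpy as np

image = np.full((480, 640, 3), 255, dtype=np.uint8)    # placeholder input image
box = (100, 200, 300, 240)                              # first text region box
identified_text = "Super Text"                          # output of the mapping step

x0, y0, x1, y1 = box
cv2.rectangle(image, (x0, y0), (x1, y1), (0, 255, 0), 2)
cv2.putText(image, identified_text, (x0, y0 - 8),
            cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 255), 2)
```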
[0062] Figure 6 is a simplified flow diagram illustrating a text recognition method 600 with a shared backbone according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, rearranged, replaced, modified, and/or overlapped, and they should not limit the scope of the claims.
[0063] As shown, method 600 includes step 602 of receiving an input image for text recognition processing. In an example, this input image includes at least a text region and a background region (e.g., background pattern, background text, image noise, etc.). In step 604, the method includes training an image preprocessing model using a U-net architecture. As discussed previously, these steps can be carried out by a processor in a text recognition system, a server, or other computing devices.
[0064] In step 606, the method includes generating a background-removed image, i.e., an enhanced input image. In a specific example, generating the background-removed image includes generating a feature map using the image preprocessing model that was trained using the U-net architecture, and then using the feature map to produce an enhanced text region.
[0065] In steps 608 and 610, the method includes applying a CNN model to extract one or more features associated with the text region, and applying a sequence module (i.e., processing by a sequence layer) to encode the extracted features into a feature sequence, respectively. In a specific example, extracting feature data can be performed in multiple iterations on the text region using the CNN model.
[0066] In step 612, the method includes applying a dense module (i.e., processing by a dense layer) to map the feature sequence to one or more identified text portions (e.g., characters, words, etc.). Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium. Further details of these techniques are discussed with reference to Figures 7A and 7B.
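Merely as an illustrative, non-limiting sketch tying steps 606-612 together (assuming Python with PyTorch, which the present disclosure does not require), the preprocessing, backbone, sequence, and dense modules may be composed as follows; the layer sizes are placeholders rather than a specific embodiment:

```python
# Illustrative composition (PyTorch assumed): a preprocessing stand-in removes
# background, a CNN backbone extracts features, a sequence layer encodes them,
# and a dense layer maps each step of the sequence to character scores.
import torch
import torch.nn as nn

class TextRecognizer(nn.Module):
    def __init__(self, num_classes: int = 37):        # e.g., 0-9, a-z, blank
        super().__init__()
        self.preprocess = nn.Sequential(               # stand-in for the U-net
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )
        self.backbone = nn.Sequential(                 # CNN feature extractor
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
        )
        self.sequence = nn.LSTM(128 * 8, 256, bidirectional=True, batch_first=True)
        self.dense = nn.Linear(2 * 256, num_classes)   # per-step class scores

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.preprocess(x)                         # background-removed image
        f = self.backbone(x)                           # (N, C, H, W) feature map
        n, c, h, w = f.shape
        seq = f.permute(0, 3, 1, 2).reshape(n, w, c * h)   # one vector per column
        seq, _ = self.sequence(seq)                    # feature sequence encoding
        return self.dense(seq)                         # (N, W, num_classes)

model = TextRecognizer()
scores = model(torch.rand(2, 1, 32, 320))              # -> shape (2, 160, 37)
```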
[0067] Figure 7A is a simplified diagram illustrating a U-net architecture used in image preprocessing according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications.
[0068] As shown, diagram 701 shows example inputs and outputs to an image preprocessing model. For an input image, the input and output dimensions of the model are the same with respect to batch size, channel, height, and width. Here, the first input dimensions are shown as (64, 1, 32, 320), referring to the previous dimensions, respectively. In this case, the input image is a greyscale image, which is denoted by a channel value of one. In this example, the method includes down-sampling the input image (left side), which is characterized as follows: (64, 1, 32, 320) -> (64, 64, 16, 160) -> (64, 128, 8, 80) -> (64, 160, 4, 80) -> (64, 256, 2, 80) -> (64, 256, 1, 160). Then, the method includes reverting the down-sampling process (right side) to obtain the image at the original size. In a specific example, the reversion process includes applying a CNN model and applying concatenation processes to effectively up-sample the image.
[0069] In a specific example, the U-net network architecture has a plurality of layers, including an upper layer (upper part of the U-shape) and a bottom layer configured to process the input image/graph. The upper layer of the network can be configured to obtain detailed information of the graph. The bottom layer can be configured to obtain low-frequency information of the graph (e.g., using a large receptive field to obtain large outline information). The method can include using one or more skip connections to retain information at each level, which enables the network to remember all of the graph information.
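Merely as an illustrative, non-limiting sketch of such a U-shape network (assuming Python with PyTorch, which the present disclosure does not require), a small encoder/decoder with skip connections may be written as follows; the channel counts are placeholders and do not reproduce the specific dimensions of diagram 701:

```python
# Illustrative sketch (PyTorch assumed) of a small U-net: the encoder halves the
# spatial size at each level, the decoder reverts it, and skip connections
# concatenate encoder features into the decoder to retain detail at each level.
import torch
import torch.nn as nn

def block(cin: int, cout: int) -> nn.Sequential:
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU())

class SmallUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1, self.enc2 = block(1, 64), block(64, 128)
        self.bottom = block(128, 256)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = block(256, 128)              # 128 skip + 128 upsampled channels
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = block(128, 64)               # 64 skip + 64 upsampled channels
        self.out = nn.Conv2d(64, 1, 1)           # back to a single-channel image

    def forward(self, x):
        e1 = self.enc1(x)                        # detailed, high-resolution features
        e2 = self.enc2(self.pool(e1))
        b = self.bottom(self.pool(e2))           # low-frequency, large-outline features
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.out(d1)

y = SmallUNet()(torch.rand(1, 1, 32, 320))       # output keeps the input size
```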
[0070] Figure 7B is a simplified diagram illustrating a U-net architecture used in image preprocessing according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications.
[0071] As shown, flow diagram 702 illustrates a method of training a U-net model, similar to that shown in Figure 7A. In an example, the method includes down-sampling an input image 710 (left side). The different stages of the down-sampled image are shown by the block flow 720 of transformed blocks representing the progressive reduction in quality of the image. The down-sampling reaches a bottleneck 720, where a CNN model can be applied to revert the down-sampling process, which is shown by the block flow 730. As discussed previously, skip connections (shown by the arrows between the block flow 720 and the block flow 730) can be used to retain the image information at each level, which can be used with the CNN model to effectively up-sample the image back to the original size at the output image 740.
[0072] As discussed previously, the present invention can use a CNN-based model, which is compatible with most OCR applications, to generate a data sequence to be mapped to an identified text. In a specific example, the basic CNN architecture can be a general CNN backbone without the last several dense/classification layers (e.g., ResNet, VGG-16, GhostNet, EfficientNet, etc.), as long as it can extract features at the pixel level. In other cases, the dense/classification layers may be used to extract the pixel-level features, as shown in method 600 of Figure 6. The choice of CNN depends on speed and memory restrictions.
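Merely as an illustrative, non-limiting sketch of reusing such a general backbone (assuming Python with PyTorch and torchvision, neither of which is required by the present disclosure), the final pooling and classification layers of a standard network can be dropped so that spatial features remain:

```python
# Illustrative sketch (PyTorch/torchvision assumed): reuse a general CNN backbone
# without its final pooling/classification layers so that it outputs a spatial
# (pixel-level) feature map rather than a single class prediction.
import torch
import torch.nn as nn
import torchvision

resnet = torchvision.models.resnet18(weights=None)      # untrained backbone
# Drop the global average pooling and the fully connected classifier.
backbone = nn.Sequential(*list(resnet.children())[:-2])

features = backbone(torch.rand(1, 3, 32, 320))
print(features.shape)   # spatial feature map, e.g. torch.Size([1, 512, 1, 10])
```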
[0073] Once the feature extraction is finished, a recurrent neural network, known as the sequence module, can be added for feature encoding. For feature encoding, the sliced feature can be mapped with each word or character using bidirectional long short-term memory (BiLSTM). The LSTM is directional, i.e., it only uses past contexts. However, in image-based sequences, contexts from both directions are useful and complementary to each other. Therefore, two LSTMs, one forward and one backward, can be combined into a bidirectional LSTM. Furthermore, multiple bidirectional LSTMs can be stacked, resulting in a deep bidirectional LSTM.
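Merely as an illustrative, non-limiting sketch of such a stacked bidirectional encoder (assuming Python with PyTorch, which the present disclosure does not require), a deep BiLSTM over a sliced feature sequence may look as follows; the sizes are placeholders:

```python
# Illustrative sketch (PyTorch assumed): encode a sliced feature sequence with a
# stacked (deep) bidirectional LSTM so that both past and future context is used.
import torch
import torch.nn as nn

features = torch.rand(4, 80, 512)        # (batch, time steps, feature size)

bilstm = nn.LSTM(input_size=512, hidden_size=256, num_layers=2,
                 bidirectional=True, batch_first=True)   # two stacked BiLSTM layers
encoded, _ = bilstm(features)            # (4, 80, 512): forward + backward states
```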
[0074] Besides LSTM, other recurrent neural networks, such as a gated recurrent unit (GRU) or the like, may be used as well. The GRU is very similar to LSTM, having update and reset gates. A two-direction GRU can be used as a BiGRU module for feature encoding. The deep structure can allow for a higher level of abstraction compared to a shallow one. Such advantages can result in significant performance improvements in the task of text recognition.
[0075] In a specific example, the basic model consists of a ConvNet and recurrent neural networks. In such a case, the convolutional layers automatically extract a feature sequence from each input image. On top of the convolutional network, a recurrent network is built for predicting each frame of the feature sequence that is outputted by the convolutional layers.
Each vector in the extracted feature sequence is associated with a receptive field on the input image, and can be considered as the feature vector of that field.
[0076] In a specific example, the present invention uses a connectionist temporal classification (CTC) algorithm to align sequences where timing is variable. The CTC algorithm is alignment-free, i.e., it does not require an alignment between the input and the output. However, to get the probability of an output given an input, CTC works by summing over the probability of all possible alignments between the two sequences. The CTC alignments provide a natural way to go from probabilities at each time step to the probability of an output sequence. The CTC function is defined as follows:
p(Y | X) = Σ_{A ∈ A_{X,Y}} Π_{t=1}^{T} p_t(a_t | X), where A_{X,Y} denotes the set of valid alignments between the input sequence X and the output sequence Y, and p_t(a_t | X) is the probability of alignment symbol a_t at time step t.
[0077] After the training of the model, applying the model to find a likely output for a given input requires solving the following:
Y* = argmax_Y p(Y | X)
[0078] One heuristic is to take the most likely output at each time step. This approach gives the alignment with the highest probability, shown as follows:
A* = argmax_A Π_{t=1}^{T} p_t(a_t | X)
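Merely as an illustrative, non-limiting sketch of CTC training and of this best-path heuristic (assuming Python with PyTorch, which the present disclosure does not require), the loss and a greedy decoder may be written as follows; the sizes and label set are placeholders:

```python
# Illustrative sketch (PyTorch assumed): CTC training loss, which sums over all
# alignments, and greedy best-path decoding (most likely symbol at each time
# step, then collapse repeats and remove blanks).
import torch
import torch.nn as nn

T, N, C = 160, 2, 37                      # time steps, batch, classes (0 = blank)
log_probs = torch.randn(T, N, C).log_softmax(dim=2)

targets = torch.randint(1, C, (N, 10))    # label indices (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)                 # marginalizes over all valid alignments
loss = ctc(log_probs, targets, input_lengths, target_lengths)

def greedy_decode(log_probs: torch.Tensor, blank: int = 0) -> list:
    """Best-path decoding: argmax per step, collapse repeats, drop blanks."""
    best = log_probs.argmax(dim=2)        # (T, N) most likely symbol per step
    results = []
    for n in range(best.shape[1]):
        seq, prev = [], None
        for t in best[:, n].tolist():
            if t != blank and t != prev:
                seq.append(t)
            prev = t
        results.append(seq)
    return results

predicted = greedy_decode(log_probs)      # list of label-index sequences
```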
[0079] By finding the most probable words for each feature, the final result of the prediction can be obtained. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
[0080] While the above is a full description of the specific embodiments, various modifications, alternative constructions and equivalents may be used. Therefore, the above description and illustrations should not be taken as limiting the scope of the present invention, which is defined by the appended claims.

Claims

WHAT IS CLAIMED IS:
1. A method for text recognition, the method comprising: obtaining an image, the image including at least a first text region; storing the image in a memory; identifying the first text region; obtaining an image preprocessing model from a storage; generating a feature map using the image preprocessing model and the first text region; providing an enhanced first text region using the feature map and the first text region; generating a data sequence using the enhanced first text region by a convolutional neural network; and mapping the data sequence to an identified text.
2. The method of claim 1 further comprising receiving the image from a network interface.
3. The method of claim 1 further comprising cropping the first text region.
4. The method of claim 1 further comprising extracting feature data in multiple iterations from the first text region using the image preprocessing model.
5. The method of claim 1 further comprising identifying and removing a background from the image.
6. The method of claim 1 further comprising overlaying the identified text over the first text region.
7. The method of claim 1 further comprising detecting a language using at least the feature map.
8. The method of claim 1 wherein the convolutional neural network comprises a U-net architecture.
9. The method of claim 8 further comprising generating a low-frequency information level based on the first text region using the U-net architecture.
10. The method of claim 9 wherein the low-frequency information level and the first text region are characterized by different dimensions.
11. The method of claim 1 further comprising removing a background from the image.
12. The method of claim 1 further comprising: identifying a second text region; and providing an enhanced second text region using the image preprocessing model and the second text region.
13. The method of claim 1 further comprising detecting a language associated with the first text region.
14. A system for text recognition, the system comprising: a housing; a camera mounted on the housing and configured to capture an input image; a memory configured to store the input image; a storage configured to store a preprocessing model; a user interface configured to display an identified text; and a processor coupled to the storage and the memory, the processor being configured to: identify a text region on the input image; generate a feature map using the preprocessing model and the text region; provide an enhanced text region using the feature map and the text region; generate a data sequence using the enhanced text region; and map the data sequence to the identified text.
15. The system of claim 14 wherein the processor comprises a neural network processor configured for generating the feature map using a U-net architecture.
16. A method for providing a training model for processing text images, the method comprising: generating a text string; providing a background image; generating a reference image containing the text string; generating an initial training image containing the text string and the background image; generating a modified training image by reducing a quality of the initial training image using one or more randomized processes; calculating a pattern between the reference image and the modified training image using a convolutional neural network; and storing the pattern as an image preprocessing model at a storage device.
17. The method of claim 16 further comprising retrieving the background image from a network source.
18. The method of claim 16 further comprising encoding the initial training image.
19. The method of claim 16 further comprising compressing and/or scaling the initial training image.
20. The method of claim 16 further comprising introducing one or more types of noise to the initial training image.
PCT/US2022/040331 2022-08-15 2022-08-15 Methods and systems for text recognition with image preprocessing WO2024039362A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2022/040331 WO2024039362A1 (en) 2022-08-15 2022-08-15 Methods and systems for text recognition with image preprocessing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2022/040331 WO2024039362A1 (en) 2022-08-15 2022-08-15 Methods and systems for text recognition with image preprocessing

Publications (1)

Publication Number Publication Date
WO2024039362A1 true WO2024039362A1 (en) 2024-02-22

Family

ID=89942115

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/040331 WO2024039362A1 (en) 2022-08-15 2022-08-15 Methods and systems for text recognition with image preprocessing

Country Status (1)

Country Link
WO (1) WO2024039362A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5625719A (en) * 1992-10-19 1997-04-29 Fast; Bruce B. OCR image preprocessing method for image enhancement of scanned documents
US20030200505A1 (en) * 1997-07-25 2003-10-23 Evans David A. Method and apparatus for overlaying a source text on an output text
US20150254507A1 (en) * 2012-11-29 2015-09-10 A9.Com, Inc. Image-Based Character Recognition
US20160110340A1 (en) * 2014-10-17 2016-04-21 Machine Zone, Inc. Systems and Methods for Language Detection
US20210064871A1 (en) * 2019-08-27 2021-03-04 Lg Electronics Inc. Apparatus and method for recognition of text information

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
AZAD REZA; BOZORGPOUR I AFSHIN; ASADI-AGHBOLAGHI MARYAM; MERHOF DORIT; ESCALERA SERGIO: "Deep Frequency Re-calibration U-Net for Medical Image Segmentation", 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 11 October 2021 (2021-10-11), pages 3267 - 3276, XP034027723, DOI: 10.1109/ICCVW54120.2021.00366 *
BAOGUANG SHI, XIANG BAI, CONG YAO: "An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, vol. 39, no. 11, 29 December 2016 (2016-12-29), pages 2298 - 2304, XP055405545, DOI: 10.1109/TPAMI.2016.2646371 *
REDDY SUSMITH: "Pre-Processing in OCR!!!", TOWARDS DATA SCIENCE, 25 March 2019 (2019-03-25), pages 1 - 18, XP093145084, Retrieved from the Internet <URL:https://towardsdatascience.com/pre-processing-in-ocr-fc231c6035a7> [retrieved on 20240325] *
WOJCIECH BIENIECKI ; SZYMON GRABOWSKI ; WOJCIECH ROZENBERG: "Image Preprocessing for Improving OCR Accuracy", PERSPECTIVE TECHNOLOGIES AND METHODS IN MEMS DESIGN, 2007. MEMSTECH 20 07. INTERNATIONAL CONFERENCE ON, 1 May 2007 (2007-05-01), Pi , pages 75 - 80, XP031122877, ISBN: 978-966-553-614-7 *

Similar Documents

Publication Publication Date Title
AU2020319589B2 (en) Region proposal networks for automated bounding box detection and text segmentation
US10867171B1 (en) Systems and methods for machine learning based content extraction from document images
US8733650B1 (en) Decoding barcodes from images with varying degrees of focus
Luo et al. Design and implementation of a card reader based on build-in camera
CN109117846B (en) Image processing method and device, electronic equipment and computer readable medium
US20110090253A1 (en) Augmented reality language translation system and method
WO2015195300A1 (en) Obtaining structural information from images
KR20130003006A (en) Image feature detection based on application of multiple feature detectors
CN112070649B (en) Method and system for removing specific character string watermark
JP2005346707A (en) Low-resolution ocr for document acquired by camera
US20220414335A1 (en) Region proposal networks for automated bounding box detection and text segmentation
Demilew et al. Ancient Geez script recognition using deep learning
WO2009114967A1 (en) Motion scan-based image processing method and device
Kantipudi et al. Scene text recognition based on bidirectional LSTM and deep neural network
CN114255337A (en) Method and device for correcting document image, electronic equipment and storage medium
CN106503112B (en) Video retrieval method and device
Asad et al. High performance OCR for camera-captured blurred documents with LSTM networks
CN118015644B (en) Social media keyword data analysis method and device based on pictures and characters
das Neves et al. HU‐PageScan: a fully convolutional neural network for document page crop
Natei et al. Extracting text from image document and displaying its related information
Hüsem et al. A survey on image super-resolution with generative adversarial networks
WO2024039362A1 (en) Methods and systems for text recognition with image preprocessing
Sourvanos et al. Challenges in input preprocessing for mobile OCR applications: A realistic testing scenario
CN112446297A (en) Electronic typoscope and intelligent mobile phone text auxiliary reading method applicable to same
CN115861663B (en) Document image content comparison method based on self-supervision learning model

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22955871

Country of ref document: EP

Kind code of ref document: A1