WO2021081562A2 - Multi-head text recognition model for multi-lingual optical character recognition - Google Patents

Multi-head text recognition model for multi-lingual optical character recognition Download PDF

Info

Publication number
WO2021081562A2
WO2021081562A2 PCT/US2021/014171 US2021014171W
Authority
WO
WIPO (PCT)
Prior art keywords
language
feature
data
image
training
Prior art date
Application number
PCT/US2021/014171
Other languages
French (fr)
Other versions
WO2021081562A3 (en)
Inventor
Kaiyu ZHANG
Yuan Lin
Original Assignee
Innopeak Technology, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Priority to PCT/US2021/014171 priority Critical patent/WO2021081562A2/en
Publication of WO2021081562A2 publication Critical patent/WO2021081562A2/en
Publication of WO2021081562A3 publication Critical patent/WO2021081562A3/en

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Definitions

  • the present application generally relates to artificial intelligence, particularly to methods and systems for using deep learning techniques to perform multi-lingual optical character recognition (OCR) on images having textual content.
  • the image may be a scanned copy or a snapshot of a paper document, a photo of a scene, or an image superimposed with subtitle text.
  • OCR is an important technique for extracting information from different types of images. Once an image containing text goes through OCR processing, the text of the image is recognized and can be subsequently edited. OCR is also widely used with supplemental functions of searching, positioning, translation, and recommendation. The accuracy of text recognition using OCR has improved greatly since the technique was first available.
  • the present application describes embodiments related to a text recognition model and, more particularly, to systems and methods for using a deep learning based text recognition model to recognize text in an input image that may correspond to different languages.
  • This text recognition model offers an end-to-end text recognition solution that shares a backbone and a body of deep learning with different classification heads and automatically detects and recognizes the text of the input image in different languages. Training data of different languages are shared to train the text recognition model jointly. In an OCR stage, a sparse mask is applied to select an output in a designated language of the text of the input image, while the text may be recognized in multiple languages.
  • a method for recognizing textual content in an image is implemented at a computer system having one or more processors and memory.
  • the method includes receiving the image and a language indicator, processing the image using a multilingual text recognition model that is applicable to a plurality of languages, and generating a feature sequence including a plurality of probability values corresponding to the textual content of the image.
  • The language indicator indicates that the textual content in the image corresponds to a first language.
  • the feature sequence includes a plurality of feature subsets each of which corresponds to a respective one of the plurality of languages. For each feature subset, each probability value indicates a probability that the textual content corresponds to a respective one of a plurality of characters in a dictionary of the language corresponding to the respective feature subset.
  • the method further includes constructing a sparse mask based on the first language and combining the feature sequence and the sparse mask to determine the textual content in the first language.
  • an electronic device includes one or more processing units, memory and a plurality of programs stored in the memory.
  • the programs, when executed by the one or more processing units, cause the electronic device to perform the method for recognizing textual content described above.
  • a non-transitory computer readable storage medium stores a plurality of programs for execution by an electronic apparatus having one or more processing units.
  • the programs, when executed by the one or more processing units, cause the electronic apparatus to perform the method for recognizing textual content described above.
  • Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
  • Figure 2 is a block diagram illustrating a data processing system, in accordance with some embodiments.
  • Figure 3 is another example data processing system for training and applying a neural network based data processing model for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments.
  • Figure 4A is an example neural network (NN) applied to process content data in an NN-based data processing model, in accordance with some embodiments
  • Figure 4B is an example node 420 in the NN, in accordance with some embodiments.
  • FIG. 5 is a block diagram of an exemplary text recognition system applying a multi-head text recognition model for optical character recognition (OCR), in accordance with some embodiments.
  • FIG. 6 illustrates a simplified process of feature encoding using bidirectional long short-term memory (BiLSTM) in a text recognition model for OCR, in accordance with some embodiments.
  • Figure 7 is a flowchart illustrating an exemplary OCR process for recognizing multiple languages with a multi-head text recognition model, in accordance with some embodiments.
  • Figure 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments.
  • the one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, or intelligent, multi-sensing, network-connected home devices (e.g., a camera).
  • Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface.
  • the collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102.
  • the one or more servers 102 provides system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, processes the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104.
  • the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
  • the one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102, and implement some data processing tasks that cannot be or are preferably not completed locally by the client devices 104.
  • the client devices 104 include a game console that executes an interactive online gaming application.
  • the game console receives a user instruction and sends it to a game server 102 with user data.
  • the game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.
  • the client devices 104 include a networked surveillance camera and a mobile phone 104C.
  • the networked surveillance camera collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera in real time and remotely.
  • the one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100.
  • the one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks and wide area networks such as the Internet.
  • the one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
  • a connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof.
  • the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
  • At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.
  • Deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video, image, audio, or textual data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data.
  • data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data.
  • both model training and data processing are implemented locally at each individual client device 104 (e.g., the client device 104C).
  • the client device 104C obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models. Subsequently to model training, the client device 104C obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the trained data processing models locally.
  • both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g., the client device 104A).
  • the server 102A obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models.
  • the client device 104A obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results from the server 102A, and presents the results on a user interface (e.g., associated with the application).
  • the client device 104A itself implements no or little data processing on the content data prior to sending them to the server 102A.
  • data processing is implemented locally at a client device 104 (e.g., the client device 104B), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104B.
  • the server 102B obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models.
  • the trained data processing models are optionally stored in the server 102B or storage 106.
  • the client device 104B imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface locally.
  • FIG. 2 is a block diagram illustrating a data processing system 200, in accordance with some embodiments.
  • the data processing system 200 includes a server 102, a client device 104, a storage 106, or a combination thereof.
  • the data processing system 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset).
  • the data processing system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls.
  • the client device 104 of the data processing system 200 uses a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard.
  • the client device 104 includes one or more cameras, scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices.
  • the data processing system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
  • the client device 104 includes a location detection device, such as a GPS (global positioning satellite) or other geo-location receiver, for determining the location of the client device 104.
  • Memory 206 includes high-speed random access memory, such as DRAM, and, optionally, includes non-volatile memory.
  • Memory 206 optionally, includes one or more storage devices remotely located from one or more processing units 202.
  • Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium.
  • memory 206, or the non-transitory computer readable storage medium of memory 206 stores the following programs, modules, and data structures, or a subset or superset thereof:
  • Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks
  • Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
  • User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
  • Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
  • Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
  • One or more user applications 224 for execution by the data processing system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications);
  • Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;
  • Data processing module 228 for processing content data using data processing models 240, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224;
  • Device and network parameters, e.g., IP address, subnet mask, default gateway, DNS server, and host name;
  • o Training data 238 for training one or more data processing models 240
  • o Data processing model(s) 240 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques
  • o Content data and results 242 that are obtained by and outputted to the client device 104 of the data processing system 200, respectively, where the content data is processed by the data processing models 240 locally at the client device 104 or remotely at the server 102 to provide the associated results 242 to be presented on client device 104.
  • the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the data processing system 200.
  • the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the data processing system 200. In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 240 are stored at the server 102 and storage 106, respectively.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, and various subsets of these modules may be combined or otherwise rearranged in various embodiments.
  • memory 206 optionally, stores a subset of the modules and data structures identified above.
  • memory 206 optionally, stores additional modules and data structures not described above.
  • FIG. 3 is another example data processing system 300 for training and applying a neural network based (NN-based) data processing model 240 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments.
  • the data processing system 300 includes a model training module 226 for establishing the data processing model 240 and a data processing module 228 for processing the content data using the data processing model 240.
  • both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104.
  • the training data source 304 is optionally a server 102 or storage 106.
  • both of the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300.
  • the training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106.
  • the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 240 to the client device 104.
  • the model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312.
  • the data processing model 240 is trained according to a type of the content data to be processed.
  • an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size.
  • an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform.
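A minimal sketch of the pre-processing described in the two items above, assuming OpenCV and NumPy are available; the ROI format, crop size, and function names are illustrative assumptions rather than the patent's implementation:

```python
# Minimal sketch only; module names, ROI format, and sizes are assumptions.
import numpy as np
import cv2  # assumed available for image handling


def preprocess_image(image: np.ndarray, roi: tuple, size=(256, 64)) -> np.ndarray:
    """Extract a region of interest and crop/resize it to a predefined image size."""
    x, y, w, h = roi
    cropped = image[y:y + h, x:x + w]
    return cv2.resize(cropped, size)  # size is (width, height)


def preprocess_audio(sequence: np.ndarray) -> np.ndarray:
    """Convert a training audio sequence to the frequency domain with a Fourier transform."""
    return np.abs(np.fft.rfft(sequence))
```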
  • the model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 240, and generates an output from each training data item.
  • the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item.
  • the model training engine 310 modifies the data processing model 240 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold).
  • the modified data processing model 240 is provided to the data processing module 228 to process the content data.
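A hedged sketch of the training loop implied by the model training engine 310 and loss control module 312, written in PyTorch; the optimizer choice, learning rate, and loss threshold are assumptions:

```python
# Sketch of training until the loss criterion is met; optimizer and threshold are assumptions.
import torch


def train(model, loader, loss_fn, loss_threshold=1e-3, max_epochs=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for inputs, ground_truth in loader:
            optimizer.zero_grad()
            output = model(inputs)                # output for each training data item
            loss = loss_fn(output, ground_truth)  # compare output with the ground truth
            loss.backward()
            optimizer.step()                      # modify the model to reduce the loss
            epoch_loss += loss.item()
        if epoch_loss / len(loader) < loss_threshold:  # loss criterion satisfied
            break
    return model
```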
  • the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
  • the data processing module 228 includes one or more data pre-processing modules 314, a model-based processing module 316, and a data post-processing module 318.
  • the data pre-processing modules 314 pre-process the content data based on the type of the content data. Functions of the data pre-processing modules 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of: video, image, audio, textual, and other types of data.
  • each image is pre- processed to extract an ROI or cropped to a predefined image size, and an audio clip is pre- processed to convert to a frequency domain using a Fourier transform.
  • the content data includes two or more types, e.g., video data and textual data.
  • the model-based processing module 316 applies the trained data processing model 240 provided by the model training module 226 to process the pre-processed content data.
  • the model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 240.
  • the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
  • Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 240, in accordance with some embodiments
  • Figure 4B is an example node 420 in the neural network (NN) 400, in accordance with some embodiments.
  • the data processing model 240 is established based on the neural network 400.
  • a corresponding model-based processing module 316 applies the data processing model 240 including the neural network 400 to process content data that has been converted to a predefined content format.
  • the neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs.
  • the node output is provided via one or more links 412 to one or more other nodes 420
  • a weight w associated with each link 412 is applied to the node output.
  • the one or more node inputs are combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function.
  • the propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.
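A small illustration of the node propagation just described, treated here as a non-linear activation applied to the linear weighted combination of the node inputs; ReLU and the sample values are assumptions, not the patent's choices:

```python
import numpy as np


def node_output(inputs: np.ndarray, weights: np.ndarray, bias: float = 0.0) -> float:
    """One node 420: a non-linear activation applied to the weighted combination of its inputs."""
    z = float(np.dot(weights, inputs)) + bias  # w1*x1 + w2*x2 + w3*x3 + w4*x4 + b
    return max(0.0, z)                         # ReLU chosen here as one example activation


# e.g., a node with four inputs combined by weights w1..w4
out = node_output(np.array([0.2, 0.5, -0.1, 0.8]), np.array([0.4, -0.3, 0.9, 0.1]))
```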
  • the collection of nodes 420 is organized into one or more layers in the neural network 400.
  • the one or more layers includes a single layer acting as both an input layer and an output layer.
  • the one or more layers includes an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404
  • a deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406.
  • each layer is only connected with its immediately preceding and/or immediately following layer.
  • a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer.
  • one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down-sampling or pooling the nodes 420 between these two layers. Particularly, max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
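A brief NumPy sketch of the max pooling just described, where each node of the following layer takes the maximum of the nodes connected to it; the pool size of 2 is an assumption:

```python
import numpy as np


def max_pool(node_values: np.ndarray, pool_size: int = 2) -> np.ndarray:
    """Each node of the following layer takes the maximum of pool_size connected nodes."""
    trimmed = node_values[: len(node_values) // pool_size * pool_size]
    return trimmed.reshape(-1, pool_size).max(axis=1)


print(max_pool(np.array([0.1, 0.7, 0.3, 0.2])))  # -> [0.7 0.3]
```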
  • a convolutional neural network is applied in a data processing model 240 to process content data (particularly, video and image data).
  • the CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406.
  • the one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product.
  • Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network.
  • Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN.
  • the pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map.
  • Alternatively and additionally, in some embodiments, a recurrent neural network (RNN) is applied in the data processing model 240 to process content data (particularly, textual and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior.
  • each node 420 of the RNN has a time-varying real-valued activation.
  • Examples of the RNN include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM) network, an echo state network, an independently recurrent neural network (IndRNN), a recursive neural network, and a neural history compressor.
  • the RNN can be used for handwriting or speech recognition. It is noted that in some embodiments, two or more types of content data are processed by the data processing module 228, and two or more types of neural networks (e.g., both CNN and RNN) are applied to process the content data jointly.
  • the training process is a process for calibrating all of the weights w_i for each layer of the learning model using a training data set which is provided in the input layer 402.
  • the training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied.
  • In forward propagation, the set of weights for different layers are applied to the input data and intermediate results from the previous layers.
  • In backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error.
  • the activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types.
  • a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied.
  • the network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data.
  • the result of the training includes the network bias parameter b for each layer.
  • the deep learning model 400 is trained to recognize textual content of an image that corresponds to one of a plurality of languages. More details on deep learning based text recognition are discussed below with reference to Figures 5-7.
  • FIG. 5 is a block diagram of an exemplary text recognition system 500 applying a multilingual text recognition model 550 for optical character recognition (OCR), in accordance with some embodiments.
  • the multilingual text recognition model 550 receives an image or video with textual content (e.g., an input image 501 with textual content) and implements text detection and recognition in an end-to-end manner to generate recognized textual content (e.g., digitized text) in a plurality of languages.
  • the text recognition model is regarded as a “multi-head” model because a user may choose any one of the multiple available heads (e.g., mask A 510a, mask B 510b, or mask C 510c) as the main head for efficient language classification and recognition.
  • These masks 510a-510c receive the recognized textual content provided by the text recognition model 550 and output a subset of the recognized textual content in a respective language (e.g., textual content in language A 512a, textual content in language B 512b, or textual content in language C 512c).
  • the multilingual text recognition model 550 includes (1) a preprocessing module 502 for converting the input image, (2) a convolutional neural network 504 for extracting features related to textual content from the input image, (3) a recurrent neural network 506 for labeling the extracted features, and (4) a dense layer 508 (also known as a fully-connected layer) for mapping an extracted feature vector to respective keys in a predefined dictionary.
  • the multilingual text recognition model 550 is coupled to a main head, for language classification, which includes a selected mask (also known as a sparse matrix) (e.g., mask A 510a, mask B 510b, or mask C 510c).
  • the input image may include redundant information that is non-essential to deep learning based OCR tasks, and the preprocessing module 502 removes such redundant information.
  • the preprocessing module 502 converts the input image 501 having three channels into a greyscale image and degrades a resolution of the input image 501.
  • the preprocessing module 502 crops the input image 501 to remove non-textual areas (e.g., areas that correspond to graphical elements or blank space).
  • the preprocessing module 502 determines an area in the input image 501 with textual information and extracts this area to be used as an input for the convolutional neural network 504. For example, the preprocessing module 502 may draw rectangular bounding boxes around areas of the detected text.
  • the size of the cropped images is adaptive to the font size or paragraph size of the textual content or is user adjustable.
  • the preprocessing module 502 allows a user to manually select text areas (e.g., by drawing rectangular boxes on the original input image) and crops the selected areas to be used as inputs for the convolutional neural network 504.
  • the preprocessing module 502 may split each of the cropped images into different frames (e.g., overlapping sub-images of a cropped image) to feed the CNN layers (e.g., using the Keras TimeDistributed wrapper).
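A hedged sketch of the preprocessing module 502 described above, assuming OpenCV is available; the bounding-box format, the target height of 32, and the function name are illustrative assumptions:

```python
# Sketch only; the bounding-box format and target height of 32 are assumptions.
import numpy as np
import cv2  # assumed available


def preprocess(image_bgr: np.ndarray, text_boxes, target_height: int = 32):
    """Convert the three-channel input image to greyscale and crop each text box,
    rescaling every crop to a fixed height while keeping its width adaptive."""
    grey = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    crops = []
    for (x, y, w, h) in text_boxes:            # rectangular bounding boxes around text
        crop = grey[y:y + h, x:x + w]
        scale = target_height / crop.shape[0]
        new_width = max(1, int(round(crop.shape[1] * scale)))
        crops.append(cv2.resize(crop, (new_width, target_height)))
    return crops
```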
  • the convolutional neural network 504 receives each of the cropped images to extract segment-wise visual features.
  • a convolutional neural network 504 is a common type of neural network used in computer vision to recognize objects and patterns in images and uses filters (e.g., matrices with randomized number values) within convolutional layers (e.g., an input is transformed before being passed to a next layer). Specifically, a textual region of the input image 501 is divided into a plurality of segments. The convolutional neural network 504 extracts a feature vector for each segment and arranges the feature vectors for the plurality of segments into an ordered feature sequence.
  • the convolutional neural network 504 does not include any fully connected layer, and a user can set an input shape with a specific height (e.g., 32) but with no fixed width, e.g., for each segment divided from the textual region of the input image 501.
  • the length of the input text can be adaptive and is adjustable for different scenarios.
  • different types of convolutional neural networks are chosen. Examples of the convolutional neural network 504 include DenseNet and ResNet50 for online models, and MobileNetV2, MobileNetV3, and GhostNet-light for offline models.
  • the convolutional neural network 504 outputs the feature sequence (e.g., feature sequence 602 in Figure 6) as an image descriptor of the input image.
  • the feature sequence is optionally sliced into a plurality of feature vectors (e.g., feature vector 604a) in a predefined direction (e.g., from left to right by column).
  • Each feature vector in the feature sequence corresponds to a rectangular segment of the input image and can be regarded as an image descriptor of that segment.
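An illustrative PyTorch sketch, not the patent's exact architecture, of a fully convolutional backbone with the properties described above: a fixed input height (e.g., 32), no fixed width, no fully connected layer, and one feature vector per column of the resulting feature map; layer counts and channel sizes are assumptions:

```python
import torch
import torch.nn as nn


class ConvFeatureExtractor(nn.Module):
    """Fully convolutional backbone: 1-channel input of height 32 and arbitrary width,
    output is an ordered feature sequence with one feature vector per (downsampled) column."""

    def __init__(self, feature_dim: int = 256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),    # height 32 -> 16
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),  # height 16 -> 8
            nn.Conv2d(128, feature_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),                                  # collapse the height
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, 32, width) -> feature sequence: (batch, width // 4, feature_dim)
        f = self.features(x)                   # (batch, feature_dim, 1, width // 4)
        return f.squeeze(2).permute(0, 2, 1)   # read the columns from left to right
```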
  • the output feature vectors from the convolutional neural network 504 are then fed to the recurrent neural network 506 for feature labeling.
  • a recurrent neural network 506 is designed to interpret spatial context information by receiving the feature vectors corresponding to the segments of the input image 501 and reusing activations of preceding or following segments in the input image 501 to determine a textual recognition output for each segment in the input image 501.
  • the recurrent neural network 506 maps the sliced feature vector containing textual information using a Bidirectional Long Short-Term Memory (BiLSTM).
  • a BiLSTM (including forward and backward directions) instead of a unidirectional Long Short-Term Memory (LSTM) is used, since in image-based sequences, contexts from both directions (e.g., the previous and the subsequent characters/words) are useful and complementary to each other.
  • multiple BiLSTMs are used jointly to create a deep BiLSTM.
  • the deep structure of joined BiLSTMs allows a higher level of abstractions than a shallow one and can result in significant performance improvements.
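A minimal sketch of the deep BiLSTM used for feature labeling, stacking two bidirectional LSTM layers over the CNN feature sequence; the feature and hidden sizes are assumptions:

```python
import torch
import torch.nn as nn

# Two stacked bidirectional LSTM layers over the CNN feature sequence; sizes are assumptions.
feature_dim, hidden = 256, 128
bilstm = nn.LSTM(input_size=feature_dim, hidden_size=hidden,
                 num_layers=2, bidirectional=True, batch_first=True)

feature_sequence = torch.randn(1, 40, feature_dim)  # (batch, segments, features)
labeled, _ = bilstm(feature_sequence)                # (1, 40, 2 * hidden): forward + backward context
```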
  • the convolutional neural network 504 automatically extracts a feature sequence having feature vectors associated with segments in the input image 501
  • the recurrent neural network 506 predicts textual content corresponding to the feature sequence of the input image 501 based on at least the spatial context of each segment in the input image 501.
  • the loss function used in training and deploying the text recognition model 500 is a connectionist temporal classification (CTC) loss function.
  • the CTC loss function assigns a probability for any output (Y) given an input (X) (e.g., probabilities that a given image contains certain characters).
  • the CTC loss function is alignment-free in that it does not require an alignment between the input and the output. To get the probability of an output given an input, the CTC loss function sums over the probabilities of all possible alignment between the input and the output.
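A hedged example of the CTC loss in PyTorch; the 8000-character dictionary size follows the example later in this text, and reserving index 0 for the CTC blank symbol is an assumption:

```python
import torch
import torch.nn as nn

# Index 0 is reserved here for the CTC blank symbol (an assumption); 8000 dictionary characters.
num_classes = 8000 + 1
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

log_probs = torch.randn(40, 2, num_classes, requires_grad=True).log_softmax(2)  # (T, batch, classes)
targets = torch.randint(1, num_classes, (2, 10))       # character indices of the ground-truth text
input_lengths = torch.full((2,), 40, dtype=torch.long)
target_lengths = torch.full((2,), 10, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)  # sums over all possible alignments
loss.backward()  # the model is trained by minimizing this CTC-based loss
```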
  • the dense layer 508 is implemented to receive an input from all the neurons present in the previous layer (e.g., the last layer in the recurrent neural network 506).
  • Because the dictionary is often very large (e.g., containing thousands of keys corresponding to different characters in different languages), the dense layer 508 is correspondingly large and can result in overfitting of the neural networks during deployment.
  • characters in different languages can look similar and may mislead the text recognition model.
  • the text recognition model 500 includes a mask for language selection by a user.
  • the mask (e.g., mask A 510a) coupled to the dense layer 508 helps to increase recognition accuracy by removing interference from unrelated languages.
  • the mask is a sparse matrix consisting of zeros and ones.
  • the output vector from the dense layer 508 may be a vector of length N.
  • Each index in the output vector corresponds to a respective key in a hash map, wherein the corresponding value represents a character in a specific language.
  • Each value of the output vector from the dense layer 508 represents a probability that the recognized character is the corresponding value in the hash map.
  • an output vector from the dense layer may be a vector of 8000 elements, defined as V_output = (v_1, v_2, ..., v_8000).
  • a corresponding dictionary is a hash map, defined as D = {1: u_1, 2: u_2, ..., 8000: u_8000}.
  • Each of the values in the dictionary (u_1 – u_8000) corresponds to a character in a particular language.
  • v_1 – v_4000 may correspond to 4000 different characters in Chinese,
  • v_4001 – v_6000 may correspond to 2000 different characters in Japanese,
  • v_6001 – v_7000 may correspond to 1000 different characters in Korean, and
  • v_7001 – v_8000 may correspond to 1000 different characters in Latin.
  • the output vector from the dense layer corresponds to a segment in the input image 501.
  • Each element in the output vector corresponds to a character of a respective language, and a value of the respective element indicates a probability of the textual content of the corresponding segment being that character.
  • Each mask is a sparse matrix with size 1 x 8000.
  • a first sparse matrix, when multiplied with V_output, preserves the first 4000 vector elements and sets the rest of the vector elements to zeros (e.g., the first mask is for the Chinese language).
  • a second sparse matrix, when multiplied with V_output, preserves the next 2000 vector elements and sets the rest of the vector elements to zeros (e.g., the second mask is for the Japanese language). Therefore, after the user applies a particular mask based on a language indicator obtained with the input image 501, the text recognition system 500 avoids interference caused by other languages and improves model accuracy using a mask or sparse matrix.
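A NumPy sketch of the sparse masks described above for an 8000-element output vector, using the language ranges from the example in this text (first 4000 Chinese, next 2000 Japanese, and so on); the helper name is illustrative:

```python
import numpy as np

DICT_SIZE = 8000
LANG_RANGES = {"chinese": (0, 4000), "japanese": (4000, 6000),
               "korean": (6000, 7000), "latin": (7000, 8000)}


def build_mask(language: str) -> np.ndarray:
    """Sparse mask of zeros and ones that keeps only the designated language's elements."""
    mask = np.zeros(DICT_SIZE)
    start, end = LANG_RANGES[language]
    mask[start:end] = 1.0
    return mask


v_output = np.random.rand(DICT_SIZE)        # probabilities from the dense layer for one segment
masked = v_output * build_mask("chinese")   # element-wise product removes other-language entries
```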
  • training data of different languages can be used to train the text recognition model 550 without separating training data of different languages and without training the model 550 for different languages separately.
  • FIG. 6 illustrates a simplified process 600 of feature encoding using bidirectional long short-term memory (BiLSTM) in a text recognition model 550 for OCR, in accordance with some embodiments.
  • the input image 606 is a grayscale cropped image, e.g., which is processed from the input image 501 by the preprocessing module 502 in Figure 5.
  • the input image 606 includes English text “STATE.”
  • the convolutional neural network 504 converts the input image 606 into a feature sequence 602 that represents an image descriptor of the input image 606. Specifically, the input image 606 is sliced or divided into a plurality of segments 610.
  • the convolutional neural network 504 extracts a feature vector 604 from each segment 610 in the input image 606, and arranges the feature vectors 604 of the plurality of segments 610 into an ordered feature sequence 602.
  • the feature sequence 602 is further processed by a recurrent neural network (e.g., recurrent neural network 506 in Figure 5) to update the feature vectors 604 based on a spatial context of each segment 610 in the input image 606.
  • the corresponding feature vector 604 includes an ordered sequence of vector elements corresponding to a plurality of characters in a dictionary. Each vector element represents a probability value of the specific segment 610 representing a corresponding character in the dictionary.
  • the vector elements can be grouped into a sequence of feature subsets, and each feature subset corresponds to a respective one of the languages in the dictionary.
  • the vector elements in each feature subset correspond to characters in the respective one of the languages in the dictionary. For example, if the dictionary has 8000 characters of 5 languages in total, the specific segment 610 is associated with 8000 vector elements and 8000 probability values for being each character in the dictionary via the corresponding feature vector 604.
  • the 8000 vector elements are divided into 5 feature subsets, and vector elements in each feature subset correspond to the characters of a corresponding language.
  • the first 2000 vector elements correspond to Chinese and are selected by a sparse matrix defined by a corresponding language indicator.
  • the vector element having the largest probability value is identified from the first 2000 vector elements, and the corresponding Chinese character in the dictionary is identified as the textual content of the specific segment 610 in the image 606.
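A hedged sketch of this decoding step for one segment: keep only the vector elements of the designated language (the first 2000 here, per the example), take the largest probability value, and look the winning index up in the dictionary hash map; the stub dictionary is purely illustrative:

```python
import numpy as np


def decode_segment(feature_vector: np.ndarray, mask: np.ndarray, dictionary: dict) -> str:
    masked = feature_vector * mask        # zero out characters of the other languages
    best_index = int(np.argmax(masked))   # vector element with the largest probability value
    return dictionary[best_index]         # corresponding character in the dictionary


dictionary = {i: f"char_{i}" for i in range(8000)}         # illustrative stub hash map
mask = np.concatenate([np.ones(2000), np.zeros(6000)])     # selects the first language's subset
segment_vector = np.random.rand(8000)
print(decode_segment(segment_vector, mask, dictionary))
```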
  • FIG. 7 is a flowchart illustrating an exemplary OCR process 700 for recognizing multiple languages with a multilingual text recognition model 550, in accordance with some embodiments.
  • the text recognition system 500 receives (702) an image 501 and a language indicator.
  • the language indicator indicates that the textual content in the image corresponds to a first language.
  • the image 501 is a three-channel image that includes both textual content and non-textual content.
  • the language indicator is entered by a user.
  • the language indicator is generated based on the image 501.
  • the text recognition system 500 implements (704) a multilingual text recognition model 550 that is applicable to a plurality of languages to process the image 501.
  • the text recognition system 500 generates (706) a feature sequence (e.g., a feature sequence 602 of Figure 6) including a plurality of probability values corresponding to the textual content of the image.
  • the feature sequence includes (708) a plurality of feature subsets each of which corresponds to a respective one of the plurality of languages. For each feature subset, each probability value indicates a probability that the textual content corresponds to a respective one of a plurality of characters in a dictionary of the language corresponding to the respective feature subset (710).
  • the text recognition system 500 constructs (712) a sparse mask based on the first language and combines (714) the feature sequence with the sparse matrix to determine the textual content in the first language.
  • the text recognition system 500 identifies in the image 501 a textual region (e.g., the image 606 in Figure 6) having a pictorial representation of the textual content in the first language (e.g., through the preprocessing module 502 of Figure 5) and slices the textual region into a plurality of segments 610 (e.g., respective fields).
  • the feature sequence 602 includes a plurality of feature vectors 604, and each segment 610 corresponds to a respective one of the plurality of feature vectors 604.
  • each segment 610 has a fixed text height and a respective text width that is adaptive and adjustable based on textual content of the respective segment 610.
  • the multilingual text recognition model is established based on at least a Bidirectional Long Short-Term Memory (BiLSTM).
  • Generating the feature sequence 602 further includes, for an intermediate segment: generating a respective one of the plurality of feature vectors 604 based on a first feature vector of a first segment and a second feature vector of a second segment.
  • the intermediate segment is located between the first and second segments in the textual region (e.g., immediately adjacent to the first and second segments). This allows textual content to be recognized for the intermediate segment based on context information of the intermediate segment.
  • the feature sequence and the sparse mask are combined by for each segment 610: identifying in the feature sequence a first feature subset corresponding to the first language and associated with the respective segment 610, determining a largest probability value among the probability values corresponding to the first feature subset, determining that the largest probability value corresponds to the respective one of the plurality of characters in the dictionary of the first language, and associating the respective segment 610 with the respective one of the plurality of characters in the first language.
  • the multilingual text recognition model 550 includes one or more neural networks, and the one or more neural networks include a convolutional neural network (CNN) followed by a recurrent neural network (RNN). Further, in some embodiments, the RNN includes a BiLSTM configured to determine the plurality of probability values in the feature sequence based on bidirectional context of the textual content. Additionally, in some embodiments, the BiLSTM is a deep BiLSTM including a plurality of BiLSTMs that are stacked on each other. Also, in some embodiments, the CNN is selected from DenseNet, ResNet50, MobileNetV2, MobileNetV3, and GhostNet-light based on speed and size requirements.
  • one or more neural networks are trained jointly using training data of the plurality of languages. Based on the training, the multilingual text recognition model 550 is established based on the one or more neural networks. Further, in some embodiments, the one or more neural networks are trained by minimizing a Connectionist Temporal Classification (CTC) based loss.
  • the training data includes a first data item in one of the plurality of languages.
  • the one or more neural networks are trained to recognize the first data item in each of the plurality of languages.
  • Computer-readable media may include computer-readable storage media, which corresponds to a tangible medium such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another, e.g., according to a communication protocol.
  • computer-readable media generally may correspond to (1) tangible computer- readable storage media which is non-transitory or (2) a communication medium such as a signal or carrier wave.
  • Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementation of the embodiments described in the present application.
  • a computer program product may include a computer-readable medium.
  • The terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, elements, and/or components, but do not preclude the presence or addition of one or more other features, elements, components, and/or groups thereof.
  • first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.
  • a first electrode could be termed a second electrode, and, similarly, a second electrode could be termed a first electrode, without departing from the scope of the embodiments.
  • the first electrode and the second electrode are both electrodes, but they are not the same electrode.

Abstract

This application is directed to performing optical character recognition (OCR) using deep learning techniques. An electronic device receives an image and a language indicator that indicates that the textual content in the image corresponds to a first language. The electronic device processes the image using a multilingual text recognition model applicable to a plurality of languages. The electronic device generates a feature sequence including a plurality of probability values corresponding to the textual content of the image. The feature sequence includes a plurality of feature subsets that correspond to the plurality of languages. For each feature subset, each probability value indicates a probability that a respective textual content corresponds to a respective character in a dictionary of the corresponding language. The electronic device constructs a sparse mask based on the first language and combines the feature sequence and the sparse mask to determine the textual content.

Description

Multi-Head Text Recognition Model for Multi-lingual Optical
Character Recognition
TECHNICAL FIELD
[0001] The present application generally relates to artificial intelligence, particularly to methods and systems for using deep learning techniques to perform multi-lingual optical character recognition (OCR) on images having textual content.
BACKGROUND
[0002] Optical character recognition (OCR) is the electronic or mechanical conversion of an image of typed, handwritten, or printed text into machine-encoded text. The image may be a scanned copy or a snapshot of a paper document, a photo of a scene, or an image superimposed with subtitle text. OCR is an important technique for extracting information from different types of images. Once an image containing text goes through OCR processing, the text of the image is recognized and can be subsequently edited. OCR is also widely used with supplemental functions of searching, positioning, translation, and recommendation. The accuracy of text recognition using OCR has improved greatly since the technique was first available.
[0003] Deep learning algorithms have been used to implement text recognition models in OCR recently. These deep learning algorithms are normally developed for individual languages. When multiple languages are involved, a deep learning algorithm can grow in size drastically and demand a large amount of computational resources and training data. In many situations, training data is limited for some languages that are not frequently applied, rendering the deep learning algorithms for recognizing content in those languages inaccurate. Thus, it is beneficial to have an OCR text recognition model that has a reasonable size and can recognize multiple languages using the same deep learning algorithm.
SUMMARY
[0004] The present application describes embodiments related to a text recognition model and, more particularly, to systems and methods for using a deep learning based text recognition model to recognize text in an input image that may correspond to different languages. This text recognition model offers an end-to-end text recognition solution that shares a backbone and a body of deep learning with different classification heads and automatically detects and recognizes the text of the input image in different languages. Training data of different languages are shared to train the text recognition model jointly. In an OCR stage, a sparse mask is applied to select an output in a designated language of the text of the input image, while the text may be recognized in multiple languages.
[0005] In one aspect, a method for recognizing textual content in an image is implemented at a computer system having one or more processors and memory. The method includes receiving the image and a language indicator, processing the image using a multilingual text recognition model that is applicable to a plurality of languages, and generating a feature sequence including a plurality of probability values corresponding to the textual content of the image. The language indicator indicates that the textual content in the image corresponds to a first language. The feature sequence includes a plurality of feature subsets each of which corresponds to a respective one of the plurality of languages. For each feature subset, each probability value indicates a probability that the textual content corresponds to a respective one of a plurality of characters in a dictionary of the language corresponding to the respective feature subset. The method further includes constructing a sparse mask based on the first language and combining the feature sequence and the sparse mask to determine the textual content in the first language.
[0006] According to another aspect of the present application, an electronic device includes one or more processing units, memory, and a plurality of programs stored in the memory. The programs, when executed by the one or more processing units, cause the electronic device to perform the method for recognizing textual content as described above.
[0007] According to another aspect of the present application, a non-transitory computer readable storage medium stores a plurality of programs for execution by an electronic apparatus having one or more processing units. The programs, when executed by the one or more processing units, cause the electronic apparatus to perform the method for recognizing textual content as described above.
BRIEF DESCRIPTION OF DRAWINGS
[0008] The accompanying drawings, which are included to provide a further understanding of the embodiments and are incorporated herein and constitute a part of the specification, illustrate the described embodiments and together with the description serve to explain the underlying principles. Like reference numerals refer to corresponding parts.
[0009] Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments. [0010] Figure 2 is a block diagram illustrating a data processing system, in accordance with some embodiments.
[0011] Figure 3 is another example data processing system for training and applying a neural network based data processing model for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments.
[0012] Figure 4A is an example neural network (NN) applied to process content data in an NN-based data processing model, in accordance with some embodiments, and Figure 4B is an example node 420 in the NN, in accordance with some embodiments.
[0013] Figure 5 is a block diagram of an exemplary text recognition system applying a multi-head text recognition model for optical character recognition (OCR), in accordance with some embodiments.
[0014] Figure 6 illustrates a simplified process of feature encoding using bidirectional long short-term memory (BiLSTM) in a text recognition model for OCR, in accordance with some embodiments.
[0015] Figure 7 is a flowchart illustrating an exemplary OCR process for recognizing multiple languages with a multi-head text recognition model, in accordance with some embodiments.
DETAILED DESCRIPTION
[0016] Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used without departing from the scope of claims and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on many types of electronic devices with digital video capabilities.
[0017] Figure 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments. The one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, or intelligent, multi-sensing, network-connected home devices (e.g., a camera). Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102. The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
[0018] The one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102, and implement some data processing tasks that cannot be or are preferably not completed locally by the client devices 104. For example, the client devices 104 include a game console that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console. In another example, the client devices 104 include a networked surveillance camera and a mobile phone 104C. The networked surveillance camera collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera in real time and remotely.
[0019] The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof. The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol. A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof. As such, the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.
[0020] Deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video, image, audio, or textual data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data. In these deep learning techniques, data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data. In some embodiments, both model training and data processing are implemented locally at each individual client device 104 (e.g., the client device 104C). The client device 104C obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models. Subsequent to model training, the client device 104C obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the trained data processing models locally.
Alternatively, in some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g., the client device 104A). The server 102A obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models. The client device 104A obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results from the server 102A, and presents the results on a user interface (e.g., associated with the application). The client device 104A itself implements no or little data processing on the content data prior to sending them to the server 102A. Additionally, in some embodiments, data processing is implemented locally at a client device 104 (e.g., the client device 104B), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104B. The server 102B obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models. The trained data processing models are optionally stored in the server 102B or storage 106. The client device 104B imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface locally.
[0021] Figure 2 is a block diagram illustrating a data processing system 200, in accordance with some embodiments. The data processing system 200 includes a server 102, a client device 104, a storage 106, or a combination thereof. The data processing system 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The data processing system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls. Furthermore, in some embodiments, the client device 104 of the data processing system 200 uses a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard. In some embodiments, the client device 104 includes one or more cameras, scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices. The data processing system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays. Optionally, the client device 104 includes a location detection device, such as a GPS (Global Positioning System) or other geo-location receiver, for determining the location of the client device 104.
[0022] Memory 206 includes high-speed random access memory, such as DRAM,
SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
• Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;
• Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
• User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
• Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
• Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
• One or more user applications 224 for execution by the data processing system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);
• Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;
• Data processing module 228 for processing content data using data processing models 240, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224;
• One or more databases 230 for storing at least data including one or more of:
o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
o Network parameters 236 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server and host name;
o Training data 238 for training one or more data processing models 240;
o Data processing model(s) 240 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques; and
o Content data and results 242 that are obtained by and outputted to the client device 104 of the data processing system 200, respectively, where the content data is processed by the data processing models 240 locally at the client device 104 or remotely at the server 102 to provide the associated results 242 to be presented on client device 104.
[0023] Optionally, the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the data processing system 200. Optionally, the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the data processing system 200. In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 240 are stored at the server 102 and storage 106, respectively.
[0024] Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.
[0025] Figure 3 is another example data processing system 300 for training and applying a neural network based (NN-based) data processing model 240 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments. The data processing system 300 includes a model training module 226 for establishing the data processing model 240 and a data processing module 228 for processing the content data using the data processing model 240. In some embodiments, both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104. The training data source 304 is optionally a server 102 or storage 106. Alternatively, in some embodiments, both of the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300. The training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106. Additionally, in some embodiments, the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 240 to the client device 104.
[0026] The model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312. The data processing model 240 is trained according to a type of the content data to be processed. The training data
306 is consistent with the type of the content data, and so is the data pre-processing module 308 applied to process the training data 306 consistent with the type of the content data. For example, an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size. Alternatively, an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform. The model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 240, and generates an output from each training data item. During this course, the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item. The model training engine 310 modifies the data processing model 240 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold). The modified data processing model 240 is provided to the data processing module 228 to process the content data.
[0027] In some embodiments, the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
[0028] The data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318. The data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of: video, image, audio, textual, and other types of data. For example, each image is pre-processed to extract an ROI or cropped to a predefined image size, and an audio clip is pre-processed to convert to a frequency domain using a Fourier transform. In some situations, the content data includes two or more types, e.g., video data and textual data. The model-based processing module 316 applies the trained data processing model 240 provided by the model training module 226 to process the pre-processed content data. The model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 240. In some embodiments, the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
[0029] Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 240, in accordance with some embodiments, and Figure 4B is an example node 420 in the neural network (NN) 400, in accordance with some embodiments. The data processing model 240 is established based on the neural network 400. A corresponding model-based processing module 316 applies the data processing model 240 including the neural network 400 to process content data that has been converted to a predefined content format. The neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs. As the node output is provided via one or more links 412 to one or more other nodes 420, a weight w associated with each link 412 is applied to the node output. Likewise, the one or more node inputs are combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function. In an example, the propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.
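As a concrete illustration of the propagation function just described, the following minimal sketch (not taken from the embodiments; the weights and inputs are made up) combines four node inputs with their weights and passes the result through a non-linear activation:

```python
import numpy as np

def node_output(inputs, weights, activation=np.tanh):
    # Linear weighted combination of the node inputs (w1*x1 + w2*x2 + ...),
    # followed by a non-linear activation function.
    z = np.dot(weights, inputs)
    return activation(z)

# Hypothetical node with four inputs and weights w1..w4
x = np.array([0.5, -1.0, 0.25, 2.0])
w = np.array([0.1, 0.4, -0.3, 0.2])
print(node_output(x, w))
```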
[0030] The collection of nodes 420 is organized into one or more layers in the neural network 400. Optionally, the one or more layers includes a single layer acting as both an input layer and an output layer. Optionally, the one or more layers includes an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404
(e.g., 404A and 404B) between the input and output layers 402 and 406. A deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer. In some embodiments, a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer. In some embodiments, one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down sampling or pooling the nodes 420 between these two layers. Particularly, max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
[0031] In some embodiments, a convolutional neural network (CNN) is applied in a data processing model 240 to process content data (particularly, video and image data). The CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406. The one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product. Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network. Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN. The pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map. By these means, video and image data can be processed by the CNN for video and image recognition, classification, analysis, imprinting, or synthesis.
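The convolutional abstraction described above can be sketched with a few library calls. The following example is illustrative only; the layer counts, channel sizes, and the 32 x 128 input are assumptions rather than the configuration of any embodiment:

```python
import torch
import torch.nn as nn

# Tiny convolutional stack: each convolution looks at a small receptive area,
# and max pooling down-samples between layers, as described above.
conv = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
)

image = torch.randn(1, 1, 32, 128)   # (batch, channels, height, width)
feature_map = conv(image)
print(feature_map.shape)             # torch.Size([1, 32, 8, 32])
```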
[0032] Alternatively and additionally, in some embodiments, a recurrent neural network
(RNN) is applied in the data processing model 240 to process content data (particularly, textual and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior. In an example, each node 420 of the RNN has a time-varying real-valued activation. Examples of the RNN include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM) network, an echo state network, an independently recurrent neural network (IndRNN), a recursive neural network, and a neural history compressor. In some embodiments, the RNN can be used for handwriting or speech recognition. It is noted that in some embodiments, two or more types of content data are processed by the data processing module 228, and two or more types of neural networks (e.g., both CNN and RNN) are applied to process the content data jointly.
[0033] The training process is a process for calibrating all of the weights wi for each layer of the learning model using a training data set which is provided in the input layer 402. The training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied. In the forward propagation, the set of weights for different layers are applied to the input data and intermediate results from the previous layers. In the backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error. The activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types. In some embodiments, a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied. The network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data. The result of the training includes the network bias parameter b for each layer.
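The forward/backward propagation loop described above can be sketched as follows. This is a generic illustration, not the training procedure of any embodiment; the model, data, and learning rate are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(64, 8)          # training inputs
y = torch.randn(64, 2)          # ground truth

for step in range(100):
    pred = model(x)             # forward propagation through all layers
    loss = loss_fn(pred, y)     # margin of error of the output
    optimizer.zero_grad()
    loss.backward()             # backward propagation of the error
    optimizer.step()            # adjust weights and bias terms to decrease the error
```

In practice the loop would iterate over mini-batches of the training data 306 and stop once the predefined convergence condition is satisfied.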
[0034] In some embodiments, the deep learning model 400 is trained to recognize textual content of an image that corresponds to one of a plurality of languages. More details on deep learning based text recognition are discussed below with reference to Figures 5-7.
[0035] Figure 5 is a block diagram of an exemplary text recognition system 500 applying a multilingual text recognition model 550 for optical character recognition (OCR), in accordance with some embodiments. The multilingual text recognition model 550 receives an image or video with textual content (e.g., an input image 501 with textual content) and implements text detection and recognition in an end-to-end manner to generate recognized textual content (e.g., digitized text) in a plurality of languages. The text recognition model is regarded as a “multi-head” model because a user may choose any one of the multiple available heads (e.g., mask A 510a, mask B 510b, or mask C 510c) as the main head for efficient language classification and recognition. These masks 510a-510c receive the recognized textual content provided by the text recognition model 550 and output a subset of the recognized textual content in a respective language (e.g., textual content in language A 512a, textual content in language B 512b, or textual content in language C 512c).
[0036] The multilingual text recognition model 550 includes (1) a preprocessing module 502 for converting the input image, (2) a convolutional neural network 504 for extracting features related to textual content from the input image, (3) a recurrent neural network 506 for labeling the extracted features, and (4) a dense layer 508 (also known as a fully-connected layer) for mapping an extracted feature vector to respective keys in a predefined dictionary. The multilingual text recognition model 550 is coupled to a main head, for language classification, which includes a selected mask (also known as a sparse matrix) (e.g., mask A 510a, mask B 510b, or mask C 510c).
[0037] The input image may include redundant information that is non-essential to deep learning based OCR tasks, and the preprocessing module 502 removes such redundant information to reduce computation complexity for the convolutional neural network 504. For example, the preprocessing module 502 converts the input image 501 having three channels into a greyscale image and degrades a resolution of the input image 501. In another example, the preprocessing module 502 crops the input image 501 to remove non-textual areas (e.g., areas that correspond to graphical elements or blank space). The preprocessing module 502 determines an area in the input image 501 with textual information and extracts this area to be used as an input for the convolutional neural network 504. For example, the preprocessing module 502 may draw rectangular bounding boxes around areas of the detected text. The size of the cropped images is adaptive to the font size or paragraph size of the textual content or is user adjustable. In another example, the preprocessing module 502 allows a user to manually select text areas (e.g., by drawing rectangular boxes on the original input image) and crops the selected areas to be used as inputs for the convolutional neural network 504. In another example, the preprocessing module 502 may split each of the cropped images into different frames (e.g., overlapping sub-images of a cropped image) to feed the CNN layers (e.g., using the Keras TimeDistributed wrapper).
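A hedged sketch of this preprocessing step is shown below. The helper name, the (x, y, w, h) box format, and the fixed height of 32 are assumptions used for illustration, not prescribed by the embodiments:

```python
import cv2

def preprocess(image_path, boxes, target_height=32):
    # Convert the three-channel input image to greyscale and crop the
    # detected (or user-selected) text boxes; each crop keeps a fixed
    # height and an adaptive width.
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    crops = []
    for (x, y, w, h) in boxes:
        crop = gray[y:y + h, x:x + w]
        scale = target_height / crop.shape[0]
        new_width = max(1, int(crop.shape[1] * scale))
        crops.append(cv2.resize(crop, (new_width, target_height)))
    return crops
```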
[0038] After the preprocessing module 502 turns the input image 501 with textual content into a plurality of cropped images (or frames) each containing a high concentration of line-segmented textual content, the convolutional neural network 504 receives each of the cropped images to extract segment-wise visual features. A convolutional neural network 504 is a common type of neural network used in computer vision to recognize objects and patterns in images and uses filters (e.g., matrices with randomized number values) within convolutional layers (e.g., an input is transformed before being passed to a next layer). Specifically, a textual region of the input image 501 is divided into a plurality of segments. The convolutional neural network 504 extracts a feature vector for each segment and arranges the feature vectors for the plurality of segments into an ordered feature sequence.
[0039] In some embodiments, the convolutional neural network 504 does not include any fully connected layer, and a user can set an input shape with a specific height (e.g., 32) but with no fixed width, e.g., for each segment divided from the textual region of the input image 501.
The length of the input text can be adaptive and is adjustable for different scenarios. Depending on the specific speed/size requirement for a text recognition task, different types of convolutional neural networks are chosen. Examples of convolutional neural networks include DenseNet and ResNet50 for online models, and MobileNetV2, MobileNetV3, and GhostNet-light for offline models.
[0040] The convolutional neural network 504 outputs the feature sequence (e.g., feature sequence 602 in Figure 6) as an image descriptor of the input image. The feature sequence is optionally sliced into a plurality of feature vectors (e.g., feature vector 604a) in a predefined direction (e.g., from left to right by column). Each feature vector in the feature sequence corresponds to a rectangular segment of the input image and can be regarded as an image descriptor of that segment.
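One way to realize such a column-wise feature sequence is sketched below. The backbone layers and sizes are assumptions for illustration; the key point is that the height dimension is collapsed so that each remaining column of the feature map becomes one feature vector of the sequence:

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(
    nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d((1, None)),          # collapse the height dimension only
)

crop = torch.randn(1, 1, 32, 200)             # greyscale crop: fixed height, adaptive width
fmap = backbone(crop)                         # shape (1, 256, 1, 50)
sequence = fmap.squeeze(2).permute(0, 2, 1)   # (1, 50, 256): one 256-d vector per column
print(sequence.shape)
```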
[0041] The output feature vectors from the convolutional neural network 504 are then fed to the recurrent neural network 506 for feature labeling. A recurrent neural network 506 is designed to interpret spatial context information by receiving the feature vectors corresponding to the segments of the input image 501 and reusing activations of preceding or following segments in the input image 501 to determine a textual recognition output for each segment in the input image 501. In some embodiments, the recurrent neural network 506 maps the sliced feature vectors containing textual information using a Bidirectional Long Short-Term Memory (BiLSTM). A BiLSTM (including forward and backward directions) instead of a Long Short-Term Memory (LSTM) is used since in image-based sequences, contexts from both directions (e.g., the previous and the subsequent characters/words) are useful and complementary to each other. In some embodiments, multiple BiLSTMs are used jointly to create a deep BiLSTM. The deep structure of joined BiLSTMs allows a higher level of abstraction than a shallow one and can result in significant performance improvements. As such, after the convolutional neural network 504 automatically extracts a feature sequence having feature vectors associated with segments in the input image 501, the recurrent neural network 506 predicts textual content corresponding to the feature sequence of the input image 501 based on at least the spatial context of each segment in the input image 501.
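A minimal BiLSTM labeling stage consistent with this description might look as follows; the sizes are illustrative, and stacking two layers gives the deep BiLSTM mentioned above:

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=256, hidden_size=256,
                 num_layers=2, bidirectional=True, batch_first=True)

sequence = torch.randn(1, 50, 256)      # feature sequence from the CNN backbone
context_features, _ = bilstm(sequence)  # (1, 50, 512): forward and backward states concatenated
print(context_features.shape)
```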
[0042] In some embodiments, the loss function used in training and deploying the text recognition model 550 is a connectionist temporal classification (CTC) loss function. The CTC loss function assigns a probability for any output (Y) given an input (X) (e.g., probabilities that a given image contains certain characters). The CTC loss function is alignment-free in that it does not require an alignment between the input and the output. To get the probability of an output given an input, the CTC loss function sums over the probabilities of all possible alignments between the input and the output.
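A sketch of how such a CTC loss could be wired up is given below; the dictionary size (8000 characters plus one blank), batch size, and label lengths are assumptions for illustration:

```python
import torch
import torch.nn as nn

num_classes = 8001                       # 8000 dictionary characters + CTC blank
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

T, N, L = 50, 4, 10                      # time steps (segments), batch size, label length
log_probs = torch.randn(T, N, num_classes, requires_grad=True).log_softmax(2)
targets = torch.randint(1, num_classes, (N, L))
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), L, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                          # no explicit input/label alignment is required
```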
[0043] Given that the convolutional neural network 504 and the recurrent neural network 506 implement a CTC loss function for training and deployment, a large dictionary is needed to cover all the characters inside a training ground truth for multi-language recognition. The dense layer 508 is implemented to receive an input from all the neurons present in the previous layer (e.g., the last layer in the recurrent neural network 506). However, given that the dictionary is often very large (e.g., containing thousands of keys corresponding to different characters in different languages), the dense layer 508 is correspondingly large and can result in overfitting of the neural networks during deployment. In addition, characters in different languages can look similar and may mislead the text recognition model. As a result, the text recognition system 500 includes a mask for language selection by a user.
[0044] The mask (e.g., mask A 510a) coupled to the dense layer 508 helps to increase recognition accuracy by removing interference from unrelated languages. The mask is a sparse matrix consisting of zeros and ones. When the mask is multiplied with the output vector of the dense layer 508, only certain numbers in the output vector (e.g., those multiplied by the ones in the sparse matrix) are preserved while the rest of the numbers are set to zero. For example, the output vector from the dense layer 508 may be a vector of length N. Each index in the output vector corresponds to a respective key in a hash map, wherein the corresponding value represents a character in a specific language. Each value of the output vector from the dense layer 508 represents a probability that the recognized character is the corresponding value in the hash map. [0045] For example, an output vector from the dense layer may be a vector of 8000 elements, defined as:
V_output = [p1, p2, p3, ..., p8000]
[0046] A corresponding dictionary is a hash map, defined as:
D = {1: v1, 2: v2, 3: v3, ..., 8000: v8000}
[0047] Each of the values in the dictionary (v1 to v8000) corresponds to a character in a particular language. For example, v1 to v4000 may correspond to 4000 different characters in Chinese, v4001 to v6000 may correspond to 2000 different characters in Japanese, v6001 to v7000 may correspond to 1000 different characters in Korean, and v7001 to v8000 may correspond to 1000 different characters in Latin. As such, the output vector from the dense layer corresponds to a segment in the input image 501. Each element in the output vector corresponds to a character of a respective language, and a value of the respective element indicates a probability of textual content of the segment in the input image 501 corresponding to the character of the respective language.
[0048] Each mask is a sparse matrix with size 1 x 8000. For example, a first sparse matrix, when multiplied with V_output, preserves the first 4000 vector elements and sets the rest of the vector elements to zeros (e.g., the first mask is for the Chinese language). A second sparse matrix, when multiplied with V_output, preserves the next 2000 vector elements and sets the rest of the vector elements to zeros (e.g., the second mask is for the Japanese language). Therefore, after the user applies a particular mask based on a language indicator obtained with the input image 501, the text recognition system 500 avoids interference caused by other languages and improves model accuracy using a mask or sparse matrix. Furthermore, by using a combination of a text recognition model 550 with a language mask 510, training data of different languages can be used to train the text recognition model 550 without separating training data of different languages and without training the model 550 for different languages separately.
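A sketch of constructing and applying such masks is shown below; the language ranges mirror the example above and are illustrative only:

```python
import numpy as np

DICT_SIZE = 8000
LANGUAGE_RANGES = {                      # index ranges assumed from the example above
    "chinese": (0, 4000),
    "japanese": (4000, 6000),
    "korean": (6000, 7000),
    "latin": (7000, 8000),
}

def build_mask(language):
    # Sparse row of zeros and ones: ones over the selected language's feature subset.
    mask = np.zeros(DICT_SIZE, dtype=np.float32)
    start, end = LANGUAGE_RANGES[language]
    mask[start:end] = 1.0
    return mask

v_output = np.random.rand(DICT_SIZE)         # stand-in for the dense layer output V_output
masked = v_output * build_mask("japanese")   # only Japanese probabilities survive
```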
[0049] Figure 6 illustrates a simplified process 600 of feature encoding using bidirectional long short-term memory (BiLSTM) in a text recognition model 550 for OCR, in accordance with some embodiments. The input image 606 is a grayscale cropped image, e.g., which is processed from the input image 501 by the preprocessing module 502 in Figure 5. The input image 606 includes the English text “STATE.” The convolutional neural network 504 converts the input image 606 into a feature sequence 602 that represents an image descriptor of the input image 606. Specifically, the input image 606 is sliced or divided into a plurality of segments 610. The convolutional neural network 504 extracts a feature vector 604 from each segment 610 in the input image 606, and arranges the feature vectors 604 of the plurality of segments 610 into an ordered feature sequence 602. The feature sequence 602 is further processed by a recurrent neural network (e.g., recurrent neural network 506 in Figure 5) to update the feature vectors 604 based on a spatial context of each segment 610 in the input image 606.
[0050] For a specific segment 610 in the input image 606, the corresponding feature vector 604 includes an ordered sequence of vector elements corresponding to a plurality of characters in a dictionary. Each vector element represents a probability value of the specific segment 610 representing a corresponding character in the dictionary. The vector elements can be grouped into a sequence of feature subsets, and each feature subset corresponds to a respective one of the languages in the dictionary. The vector elements in each feature subset correspond to characters in the respective one of the languages in the dictionary. For example, if the dictionary has 8000 characters of 5 languages in total, the specific segment 610 is associated with 8000 vector elements and 8000 probability values, one for each character in the dictionary, via the corresponding feature vector 604.
[0051] Further, in this example, the 8000 vector elements are divided into 5 feature subsets, and the vector elements in each feature subset correspond to the characters of a corresponding language. The first 2000 vector elements correspond to Chinese and are selected by a sparse matrix defined by a corresponding language indicator. In some embodiments, the vector element having the largest probability value is identified from the first 2000 vector elements, and the corresponding Chinese character in the dictionary is identified as the textual content of the specific segment 610 in the image 606.
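Decoding a single segment under a selected mask can be sketched as follows; the stand-in dictionary and the 2000-element subset follow the example above and are hypothetical:

```python
import numpy as np

dictionary = {i: f"char_{i}" for i in range(8000)}   # hypothetical index-to-character map

def decode_segment(feature_vector, mask, dictionary):
    # Zero out unrelated languages, then take the most likely character
    # within the remaining feature subset.
    masked = feature_vector * mask
    return dictionary[int(np.argmax(masked))]

segment_probs = np.random.rand(8000)
mask = np.zeros(8000)
mask[:2000] = 1.0                        # subset for the designated language
print(decode_segment(segment_probs, mask, dictionary))
```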
[0052] Figure 7 is a flowchart illustrating an exemplary OCR process 700 for recognizing multiple languages with a multilingual text recognition model 550, in accordance with some embodiments. For convenience, the process 700 is described as being implemented by a text recognition system 500. The text recognition system 500 receives (702) an image 501 and a language indicator. The language indicator indicates that the textual content in the image corresponds to a first language. In some embodiments, the image 501 is a three-channel image that includes both textual content and non-textual content. In some embodiments, the language indicator is entered by a user. In some embodiments, the language indicator is generated based on the image 501. The text recognition system 500 implements (704) a multilingual text recognition model 550 that is applicable to a plurality of languages to process the image 501. [0053] The text recognition system 500 generates (706) a feature sequence (e.g., a feature sequence 602 of Figure 6) including a plurality of probability values corresponding to the textual content of the image. The feature sequence includes (708) a plurality of feature subsets each of which corresponds to a respective one of the plurality of languages. For each feature subset, each probability value indicates a probability that the textual content corresponds to a respective one of a plurality of characters in a dictionary of the language corresponding to the respective feature subset (710). The text recognition system 500 constructs (712) a sparse mask based on the first language and combines (714) the feature sequence with the sparse matrix to determine the textual content in the first language.
[0054] In some embodiments, the text recognition system 500 identifies in the image 501 a textual region (e.g., the image 606 in Figure 6) having a pictorial representation of the textual content in the first language (e.g., through the preprocessing module 502 of Figure 5) and slices the textual region into a plurality of segments 610 (e.g., respective fields). The feature sequence 602 includes a plurality of feature vectors 604, and each segment 610 corresponds to a respective one of the plurality of feature vectors 604.
[0055] In some embodiments, each segment 610 has a fixed text height and a respective text width that is adaptive and adjustable based on textual content of the respective segment 610. [0056] In some embodiments, the multilingual text recognition model is established based on at least a Bidirectional Long Short-Term Memory (BiLSTM). Generating the feature sequence 602 further includes, for an intermediate segment: generating a respective one of the plurality of feature vectors 604 based on a first feature vector of a first segment and a second feature vector of a second segment. The intermediate segment is located between the first and second segments in the textual region (e.g., immediately adjacent to the first and second segments). This allows textual content to be recognized for the intermediate segment based on context information of the intermediate segment.
[0057] In some embodiments, the feature sequence and the sparse mask are combined by, for each segment 610: identifying in the feature sequence a first feature subset corresponding to the first language and associated with the respective segment 610, determining a largest probability value among the probability values corresponding to the first feature subset, determining that the largest probability value corresponds to the respective one of the plurality of characters in the dictionary of the first language, and associating the respective segment 610 with the respective one of the plurality of characters in the first language.
[0058] In some embodiments, the multilingual text recognition model 550 includes one or more neural networks, and the one or more neural networks include a convolutional neural network (CNN) followed by a recurrent neural network (RNN). Further, in some embodiments, the RNN includes a BiLSTM configured to determine the plurality of probability values in the feature sequence based on bidirectional context of the textual content. Additionally, in some embodiments, the BiLSTM is a deep BiLSTM including a plurality of BiLSTMs that are stacked to each other. Also, in some embodiments, the CNN is selected from DenseNet, ResNet50, MobileNetV2, MobileNetV3, and GhostNet-light based on speed and size requirements.
[0059] In some embodiments, one or more neural networks are trained jointly using training data of the plurality of languages. Based on the training, the multilingual text recognition model 550 is established based on the one or more neural networks. Further, in some embodiments, the one or more neural networks are trained by minimizing a Connectionist Temporal Classification (CTC) based loss. In some embodiments, the training data includes a first data item in one of the plurality of languages, and the one or more neural networks are trained to recognize the first data item in each of the plurality of languages.
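One simple way to pool training data of different languages, sketched below, is to encode every ground-truth string against a single shared dictionary so that one model and one CTC loss can be trained jointly; the helper and index scheme are illustrative assumptions:

```python
shared_dictionary = {}                   # character -> index (0 reserved for the CTC blank)

def encode_label(text):
    indices = []
    for ch in text:
        if ch not in shared_dictionary:
            shared_dictionary[ch] = len(shared_dictionary) + 1
        indices.append(shared_dictionary[ch])
    return indices

# Samples in different languages share the same index space
print(encode_label("STATE"))
print(encode_label("状态"))
```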
[0060] It should be understood that the particular order in which the operations in Figure
7 have been described is merely exemplary and is not intended to indicate that the described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to cache and distribute specific data as described herein. Additionally, it should be noted that details described above with respect to Figures 1-6 are also applicable in an analogous manner to the process 700 described above with respect to Figure 7. For brevity, these details are not repeated here.
[0061] In one or more examples, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over, as one or more instructions or code, a computer- readable medium and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media, which corresponds to a tangible medium such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another, e.g., according to a communication protocol. In this manner, computer-readable media generally may correspond to (1) tangible computer- readable storage media which is non-transitory or (2) a communication medium such as a signal or carrier wave. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementation of the embodiments described in the present application. A computer program product may include a computer-readable medium.
[0062] The terminology used in the description of the embodiments herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of claims. As used in the description of the embodiments and the appended claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, elements, and/or components, but do not preclude the presence or addition of one or more other features, elements, components, and/or groups thereof.
[0063] It will also be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first electrode could be termed a second electrode, and, similarly, a second electrode could be termed a first electrode, without departing from the scope of the embodiments. The first electrode and the second electrode are both electrodes, but they are not the same electrode.
[0064] The description of the present application has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications, variations, and alternative embodiments will be apparent to those of ordinary skill in the art having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others skilled in the art to understand the invention for various embodiments and to best utilize the underlying principles and various embodiments with various modifications as are suited to the particular use contemplated. Therefore, it is to be understood that the scope of claims is not to be limited to the specific examples of the embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims.

Claims

What is claimed is:
1. A method for identifying textual content in an image, comprising: receiving the image and a language indicator, the language indicator indicating that the textual content in the image corresponds to a first language; processing the image using a multilingual text recognition model that is applicable to a plurality of languages; generating a feature sequence including a plurality of probability values corresponding to the textual content of the image, wherein: the feature sequence includes a plurality of feature subsets each of which corresponds to a respective one of the plurality of languages; and for each feature subset, each probability value indicates a probability that the textual content corresponds to a respective one of a plurality of characters in a dictionary of the language corresponding to the respective feature subset; constructing a sparse mask based on the first language; and combining the feature sequence and the sparse mask to determine the textual content in the first language.
2. The method of claim 1, further comprising: identifying in the image a textual region having pictorial representation of the textual content in the first language; slicing the textual region into a plurality of segments, wherein the feature sequence includes a plurality of feature vectors, and each segment corresponds to a respective one of the plurality of feature vectors.
3. The method of claim 2, wherein each segment has a fixed text height and a respective text width that is adaptive and adjustable based on textual content of the respective segment.
4. The method of claim 2, wherein: the multilingual text recognition model is established based on at least a Bidirectional Long Short-Term Memory (BiLSTM), and generating the feature sequence further comprises, for an intermediate segment: generating a respective one of the plurality of feature vectors based on at least a first feature vector of a first segment and a second feature vector of a second segment, the intermediate segment located between the first and second segments in the textual region.
5. The method of claim 2, wherein combining the feature sequence and the sparse mask further comprises for each segment: identifying, in the feature sequence, a first feature subset corresponding to the first language and associated with the respective segment; determining a largest probability value among the probability values corresponding to the first feature subset; determining that the largest probability value corresponds to the respective one of the plurality of characters in the dictionary of the first language; and associating the respective segment with the respective one of the plurality of characters in the first language.
6. The method of any of the preceding claims, wherein the multilingual text recognition model includes one or more neural networks, and the one or more neural networks includes a convolution neural network (CNN) followed by a recurrent neural network (RNN).
7. The method of claim 6, wherein the RNN includes a BiLSTM configured to determine the plurality of probability values in the feature sequence based on bidirectional context of the textual content.
8. The method of claim 7, wherein the BiLSTM is a deep BiLSTM including a plurality of BiLSTMs that are stacked to each other.
9. The method of claim 6, wherein the CNN is selected from DenseNet, ResNet50, MobileNetV2, MobileNetV3, and GhostNet-light based on speed and size requirements.
10. The method of any of the preceding claims, further comprising: training one or more neural networks jointly using training data of the plurality of languages; and based on the training, establishing the multilingual text recognition model based on the one or more neural networks.
11. The method of claim 10, wherein training the one or more neural networks further comprises minimizing a Connectionist Temporal Classification (CTC) based loss.
12. The method of claim 10, wherein the training data includes a first data item in one of the plurality of languages, and training the one or more neural networks further comprises: training the one or more neural networks to recognize the first data item in each of the plurality of languages.
13. A computer system, comprising: one or more processors; and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform a method of any of claims 1-12.
14. A non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform a method of any of claims 1-12.
PCT/US2021/014171 2021-01-20 2021-01-20 Multi-head text recognition model for multi-lingual optical character recognition WO2021081562A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2021/014171 WO2021081562A2 (en) 2021-01-20 2021-01-20 Multi-head text recognition model for multi-lingual optical character recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2021/014171 WO2021081562A2 (en) 2021-01-20 2021-01-20 Multi-head text recognition model for multi-lingual optical character recognition

Publications (2)

Publication Number Publication Date
WO2021081562A2 true WO2021081562A2 (en) 2021-04-29
WO2021081562A3 WO2021081562A3 (en) 2021-12-09

Family

ID=75620895

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/014171 WO2021081562A2 (en) 2021-01-20 2021-01-20 Multi-head text recognition model for multi-lingual optical character recognition

Country Status (1)

Country Link
WO (1) WO2021081562A2 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6047251A (en) * 1997-09-15 2000-04-04 Caere Corporation Automatic language identification system for multilingual optical character recognition
US7499588B2 (en) * 2004-05-20 2009-03-03 Microsoft Corporation Low resolution OCR for camera acquired documents
US8515185B2 (en) * 2009-11-25 2013-08-20 Google Inc. On-screen guideline-based selective text recognition
US9436682B2 (en) * 2014-06-24 2016-09-06 Google Inc. Techniques for machine language translation of text from an image based on non-textual context information from the image

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021237227A1 (en) * 2021-07-01 2021-11-25 Innopeak Technology, Inc. Method and system for multi-language text recognition model with autonomous language classification
WO2023016163A1 (en) * 2021-08-13 2023-02-16 北京百度网讯科技有限公司 Method for training text recognition model, method for recognizing text, and apparatus
CN113762269A (en) * 2021-09-08 2021-12-07 深圳市网联安瑞网络科技有限公司 Chinese character OCR recognition method, system, medium and application based on neural network
CN113762269B (en) * 2021-09-08 2024-03-22 深圳市网联安瑞网络科技有限公司 Chinese character OCR recognition method, system and medium based on neural network
WO2023143107A1 (en) * 2022-01-30 2023-08-03 北京有竹居网络技术有限公司 Character recognition method and apparatus, device, and medium
CN114973224A (en) * 2022-04-12 2022-08-30 北京百度网讯科技有限公司 Character recognition method and device, electronic equipment and storage medium
CN114495114A (en) * 2022-04-18 2022-05-13 华南理工大学 Text sequence identification model calibration method based on CTC decoder
WO2023202197A1 (en) * 2022-04-18 2023-10-26 腾讯科技(深圳)有限公司 Text recognition method and related apparatus
CN114821566A (en) * 2022-05-13 2022-07-29 北京百度网讯科技有限公司 Text recognition method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2021081562A3 (en) 2021-12-09

Similar Documents

Publication Publication Date Title
WO2021081562A2 (en) Multi-head text recognition model for multi-lingual optical character recognition
Shi et al. Can a machine generate humanlike language descriptions for a remote sensing image?
EP3660733B1 (en) Method and system for information extraction from document images using conversational interface and database querying
CN111062871B (en) Image processing method and device, computer equipment and readable storage medium
WO2020192433A1 (en) Multi-language text detection and recognition method and device
WO2021184026A1 (en) Audio-visual fusion with cross-modal attention for video action recognition
WO2021077140A2 (en) Systems and methods for prior knowledge transfer for image inpainting
WO2021092631A9 (en) Weakly-supervised text-based video moment retrieval
WO2023101679A1 (en) Text-image cross-modal retrieval based on virtual word expansion
CN113434716B (en) Cross-modal information retrieval method and device
WO2021092632A2 (en) Weakly-supervised text-based video moment retrieval via cross attention modeling
WO2021237227A1 (en) Method and system for multi-language text recognition model with autonomous language classification
WO2021092600A2 (en) Pose-over-parts network for multi-person pose estimation
Alnuaim et al. Human-computer interaction with hand gesture recognition using resnet and mobilenet
Tasmere et al. Real time hand gesture recognition in depth image using cnn
CN112188306A (en) Label generation method, device, equipment and storage medium
WO2024027347A9 (en) Content recognition method and apparatus, device, storage medium, and computer program product
Rakesh et al. Sign language recognition using convolutional neural network
CN113076905A (en) Emotion recognition method based on context interaction relationship
Beltaief et al. Deep fcn for Arabic scene text detection
CN114842482A (en) Image classification method, device, equipment and storage medium
WO2023277877A1 (en) 3d semantic plane detection and reconstruction
RU2703270C1 (en) Optical character recognition using specialized confidence functions, implemented on the basis of neural networks
US20240087344A1 (en) Real-time scene text area detection
CN111797921A (en) Image data comparison method and device

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21720671

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 21720671

Country of ref document: EP

Kind code of ref document: A2