WO2022046486A1 - Scene text recognition model with text orientation or angle detection - Google Patents

Scene text recognition model with text orientation or angle detection

Info

Publication number
WO2022046486A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
image
orientation
computing system
input image
Prior art date
Application number
PCT/US2021/046490
Other languages
French (fr)
Inventor
Kaiyu ZHANG
Yuan Lin
Junxi YIN
Original Assignee
Innopeak Technology, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Priority to PCT/US2021/046490 priority Critical patent/WO2022046486A1/en
Publication of WO2022046486A1 publication Critical patent/WO2022046486A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/20 - Scenes; Scene-specific elements in augmented reality scenes
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/22 - Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V 10/24 - Aligning, centring, orientation detection or correction of the image
    • G06V 10/242 - Aligning, centring, orientation detection or correction of the image by image rotation, e.g. by 90 degrees
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 - Local feature extraction by matching or filtering
    • G06V 10/449 - Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 - Biologically inspired filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 - Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 - Character recognition

Definitions

  • the present disclosure relates, in general, to methods, systems, and apparatuses for implementing neural network, artificial intelligence (“AI”), machine learning, and/or deep learning applications, and, more particularly, to methods, systems, and apparatuses for implementing a scene text recognition model with text orientation detection or text angle detection.
  • Optical character recognition (“OCR”) is the electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo, or subtitle text superimposed on an image, and is an important technical method for extracting information from images.
  • OCR is widely used in industry for searching, positioning, translation, recommendation, and other applications, and has great commercial value.
  • for scene text recognition based on captured images, however, conventional techniques do not accurately and precisely determine text angles, especially when the text is rotated by 180 degrees, making precise text recognition difficult or ineffective. Conventional scene text recognition systems and techniques thus require additional corrective processes, which add time, space, and energy requirements.
  • the techniques of this disclosure generally relate to tools and techniques for implementing neural network, AI, machine learning, and/or deep learning applications, and, more particularly, to methods, systems, and apparatuses for implementing a scene text recognition model with text orientation detection or text angle detection.
  • a method may be provided for implementing a scene text recognition model with text orientation detection or text angle detection.
  • the method may comprise: performing, using a computing system, feature extraction on an input image using a convolutional layer of a convolutional neural network (“CNN”) to produce a feature map, the input image containing an image of text that is cropped from a captured image of a scene; performing, using the computing system, orientation or angle determination of the image of the text in the input image, using a first fully connected layer (“first dense layer”) of the CNN to process each value in the feature map; based on a determination that the image of the text in the input image is rotated compared with a normal orientation, rotating, using the computing system, the input image to the normal orientation; based on a determination that the image of the text in the input image is in the normal orientation or in response to the input image having been rotated to the normal orientation, performing, using the computing system, feature encoding on each value in the feature map, using a sequence layer of the CNN to produce an encoded feature map; and performing, using the computing system, text recognition on the image of text contained in the input image, using a second fully connected layer (“second dense layer”) of the CNN to process each encoded feature in the encoded feature map to produce a classification of text.
  • an apparatus may be provided for implementing a scene text recognition model with text orientation detection or text angle detection.
  • the apparatus might comprise at least one processor and a non-transitory computer readable medium communicatively coupled to the at least one processor.
  • the non-transitory computer readable medium might have stored thereon computer software comprising a set of instructions that, when executed by the at least one processor, causes the apparatus to: perform feature extraction on an input image using a convolutional layer of a convolutional neural network (“CNN”) to produce a feature map, the input image containing an image of text that is cropped from a captured image of a scene; perform orientation or angle determination of the image of the text in the input image, using a first fully connected layer (“first dense layer”) of the CNN to process each value in the feature map; based on a determination that the image of the text in the input image is rotated compared with a normal orientation, rotate the input image to the normal orientation; based on a determination that the image of the text in the input image is in the normal orientation or in response to the input image having been rotated to the normal orientation, perform feature encoding on each value in the feature map, using a sequence layer of the CNN to produce an encoded feature map; and perform text recognition on the image of text contained in the input image, using a second fully connected layer (“second dense layer”) of the CNN to process each encoded feature in the encoded feature map to produce a classification of text.
  • a system may be provided for implementing a scene text recognition model with text orientation detection or text angle detection.
  • the system might comprise a computing system, which might comprise at least one first processor and a first non-transitory computer readable medium communicatively coupled to the at least one first processor.
  • the first non-transitory computer readable medium might have stored thereon computer software comprising a first set of instructions that, when executed by the at least one first processor, causes the computing system to: perform feature extraction on an input image using a convolutional layer of a convolutional neural network (“CNN”) to produce a feature map, the input image containing an image of text that is cropped from a captured image of a scene; perform orientation or angle determination of the image of the text in the input image, using a first fully connected layer (“first dense layer”) of the CNN to process each value in the feature map; based on a determination that the image of the text in the input image is rotated compared with a normal orientation, rotate the input image to the normal orientation; based on a determination that the image of the text in the input image is in the normal orientation or in response to the input image having been rotated to the normal orientation, perform feature encoding on each value in the feature map, using a sequence layer of the CNN to produce an encoded feature map; and perform text recognition on the image of text contained in the input image, using a second fully connected layer (“second dense layer”) of the CNN to process each encoded feature in the encoded feature map to produce a classification of text.
  • Fig. 1 is a schematic diagram illustrating a system for implementing scene text recognition model with text orientation or text angle detection, in accordance with various embodiments.
  • Figs. 2A and 2B are schematic block flow diagrams illustrating non-limiting examples of a method for implementing scene text recognition and scene text recognition model training, in accordance with various embodiments.
  • Figs. 3A-3D are diagrams illustrating non-limiting examples of input image processing, text orientation or angle determination, text classification, and training during implementation of a scene text recognition model with text orientation or text angle detection, in accordance with various embodiments.
  • Figs. 4A-4E are flow diagrams illustrating a method for implementing a scene text recognition model with text orientation or text angle detection, in accordance with various embodiments.
  • Fig. 5 is a block diagram illustrating an example of computer or system hardware architecture, in accordance with various embodiments.
  • Fig. 6 is a block diagram illustrating a networked system of computers, computing systems, or system hardware architecture, which can be used in accordance with various embodiments.
  • Various embodiments provide tools and techniques for implementing neural network, AI, machine learning, and/or deep learning applications and, more particularly, methods, systems, and apparatuses for implementing a scene text recognition model with text orientation detection or text angle detection.
  • A scene text recognition (“STR”) model is a key module in optical character recognition (“OCR”) pipelines and directly affects the accuracy and performance of the pipeline.
  • the model should be robust and flexible enough to handle any size of input and multiple languages.
  • the OCR software applications (“apps”) or services may be constrained by model size and running speed, as well as by required accuracy and application scenarios.
  • the model according to the various embodiments can largely overcome these restrictions.
  • the various embodiments only need to add a dense layer after feature extraction by the convolutional neural network (“CNN”) for angle classification.
  • the scene text recognition model in accordance with the various embodiments has two outputs as a multi-task model, namely: (a) text orientation or angle; and (b) recognized text.
  • a computing system may perform feature extraction on an input image using a convolutional layer of a convolutional neural network ("CNN") to produce a feature map, the input image containing an image of text that is cropped from a captured image of a scene.
  • the computing system may perform orientation or angle determination of the image of the text in the input image, using a first fully connected layer (“first dense layer”) of the CNN to process each value in the feature map. Based on a determination that the image of the text in the input image is rotated compared with a normal orientation, the computing system may rotate the input image to the normal orientation.
  • the computing system may perform feature encoding on each value in the feature map, using a sequence layer of the CNN to produce an encoded feature map.
  • the computing system may perform text recognition on the image of text contained in the input image, using a second fully connected layer (“second dense layer”) of the CNN to process each encoded feature in the encoded feature map to produce a classification of text.
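  • By way of a non-limiting illustration only, the following PyTorch-style sketch shows one possible arrangement of the pipeline described above (a convolutional backbone, a first dense layer for orientation, a conditional 180-degree rotation, a sequence layer, and a second dense layer for classification); the specific layer sizes, backbone structure, and class count are placeholder assumptions rather than values taken from the disclosure.

```python
# Illustrative sketch only; layer sizes, backbone, and alphabet size are assumptions,
# not values from the disclosure.
import torch
import torch.nn as nn

class STRModel(nn.Module):
    """Multi-task scene text recognition: orientation head plus text recognition head."""
    def __init__(self, num_classes=63, hidden=256):
        super().__init__()
        # Convolutional feature extractor (stand-in for the CNN backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=(2, 1), padding=1), nn.ReLU(),
        )
        # First dense layer: binary orientation (0 = normal, 1 = rotated 180 degrees).
        self.orientation_head = nn.Linear(256, 2)
        # Sequence layer: bidirectional LSTM over width-wise feature slices.
        self.sequence = nn.LSTM(256, hidden, num_layers=2,
                                bidirectional=True, batch_first=True)
        # Second dense layer: per-slice character classification (including a CTC blank).
        self.recognition_head = nn.Linear(2 * hidden, num_classes)

    def features(self, image):
        return self.backbone(image)              # [batch, channel, height, width]

    def orientation(self, feature_map):
        pooled = feature_map.amax(dim=(2, 3))    # global max pool over spatial dims
        return self.orientation_head(pooled)     # [batch, 2]

    def recognize(self, feature_map):
        f = feature_map.mean(dim=2)               # collapse height -> [batch, channel, width]
        f = f.permute(0, 2, 1)                    # [batch, width, channel]
        encoded, _ = self.sequence(f)             # encoded feature map
        return self.recognition_head(encoded)     # [batch, width, num_classes]

    def forward(self, image):
        f = self.features(image)
        angle_logits = self.orientation(f)
        # If predicted rotated, flip the image 180 degrees and re-extract features.
        rotated = angle_logits.argmax(dim=1) == 1
        if rotated.any():
            image = image.clone()
            image[rotated] = torch.flip(image[rotated], dims=[2, 3])
            f = self.features(image)
        return angle_logits, self.recognize(f)
```

  • For example, calling such a model on a cropped text image of shape [1, 3, 41, 153] would return orientation logits of shape [1, 2] together with per-slice character logits.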
  • the computing system may comprise at least one of a scene text recognition computing system, a machine learning system, an artificial intelligence (“AI”) system, a deep learning system, a neural network, the CNN, a fully convolutional network (“FCN”), a recurrent neural network (“RNN”), a processor on the user device, one or more graphics processing units (“GPUs”), a server computer over a network, a cloud computing system, or a distributed computing system, and/or the like.
  • the computing system may analyze the captured image to identify each location of at least one image of text contained within the captured image, using a text detection system. For each image of text, the computing system may extract said image of text from the captured image, by cropping said image of text from the captured image. The computing system may input each cropped image of text to the convolutional layer of the CNN. In some cases, identifying each location of the at least one image of text contained within the captured image may comprise identifying, using the text detection system, coordinates for each of four corners defining a rectangular shape that encapsulates each image of text.
  • the computing system may apply a transform on the cropped image of text to a rectangular shape that has its length along a horizontal orientation, using a spatial transform network ("STN”), prior to inputting the cropped image of text to the convolutional layer of the CNN.
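  • For illustration only: the disclosure contemplates a spatial transform network (“STN”) for this rectification step; the sketch below instead uses a fixed OpenCV perspective warp, given the four corner coordinates from the text detection step, simply to show what mapping a cropped text region onto a horizontal rectangle can look like. The function name and parameters are hypothetical, not part of the disclosure.

```python
# Hedged sketch: a fixed perspective warp standing in for the STN-based rectification
# described in the disclosure, purely to illustrate the idea.
import cv2
import numpy as np

def rectify_text_region(image, corners, out_height=32):
    """corners: four (x, y) points ordered top-left, top-right, bottom-right, bottom-left."""
    corners = np.asarray(corners, dtype=np.float32)
    width = int(max(np.linalg.norm(corners[1] - corners[0]),
                    np.linalg.norm(corners[2] - corners[3])))
    height = int(max(np.linalg.norm(corners[3] - corners[0]),
                     np.linalg.norm(corners[2] - corners[1])))
    dst = np.array([[0, 0], [width - 1, 0],
                    [width - 1, height - 1], [0, height - 1]], dtype=np.float32)
    matrix = cv2.getPerspectiveTransform(corners, dst)
    warped = cv2.warpPerspective(image, matrix, (width, height))
    # Resize to a fixed height while preserving aspect ratio, as is common for STR input.
    scale = out_height / height
    return cv2.resize(warped, (max(1, int(width * scale)), out_height))
```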
  • the feature map may comprise text data including at least one of shape, texture, or color of the text, and/or the like.
  • the orientation or angle determination may output one of the normal orientation or rotated orientation, wherein the rotated orientation may be 180 degrees compared with the normal orientation.
  • performing feature encoding on each value in the feature map may comprise mapping each sliced feature with each word or character, using the sequence layer of the CNN.
  • mapping each sliced feature with each word or character may comprise using at least one of long short term memory (“LSTM”) techniques, bidirectional LSTM (“BiLSTM”) techniques, multiple or stacked BiLSTM techniques, gated recurrent unit (“GRU”) techniques, or bidirectional GRU techniques, and/or the like.
  • the text may comprise a word or character in at least one language among a plurality of human languages and in at least one font among a plurality of fonts.
  • the computing system may apply a loss function on the classification of text, the loss function comprising one of a connectionist temporal classification (“CTC") loss, a cross entropy (“CE”) loss, a combination CTC-CE loss, a binary CE (“BCE”) loss, or a combination CTC-BCE loss, and/or the like.
  • applying the loss function may comprise: generating an orientation or angle loss value based on the comparison of an output of the orientation or angle determination of the image of the text in the input image with a ground truth of an orientation of the image of the text in the input image; generating a text recognition loss value based on the comparison of the classification of text with a ground truth of the text; and generating an overall loss value based on a loss function that combines the orientation or angle loss value and the text recognition loss value.
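  • A minimal sketch of such a combined objective, assuming a cross entropy (“CE”) loss for the orientation output and a connectionist temporal classification (“CTC”) loss for the text output; the weighting coefficients and the blank index are illustrative assumptions, not values from the disclosure.

```python
# Hedged sketch of a combined orientation + recognition loss.
import torch
import torch.nn.functional as F

def str_loss(angle_logits, angle_gt, char_logits, text_gt, text_lengths,
             alpha=1.0, beta=1.0):
    # Orientation or angle loss (LossAng): classification against the orientation ground truth.
    loss_ang = F.cross_entropy(angle_logits, angle_gt)

    # Text recognition loss (LossRec): CTC over per-slice character distributions.
    # char_logits: [batch, width, num_classes]; CTC expects [width, batch, num_classes].
    log_probs = F.log_softmax(char_logits, dim=2).permute(1, 0, 2)
    input_lengths = torch.full((char_logits.size(0),), char_logits.size(1),
                               dtype=torch.long)
    loss_rec = F.ctc_loss(log_probs, text_gt, input_lengths, text_lengths,
                          blank=0, zero_infinity=True)

    # Overall loss = alpha * LossAng + beta * LossRec.
    return alpha * loss_ang + beta * loss_rec, loss_ang, loss_rec
```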
  • the various aspects described herein provide a scene text recognition (“STR”) model with text orientation detection or text angle detection.
  • the scene text recognition model provides at least the following benefits: (1) predicting image input orientation (or direction or angle) in addition to performing the text recognition task; (2) reducing the model size and required memory by compressing multiple models into one; and (3) reducing processing time.
  • the STR model can not only recognize scene texts in the input image, but can also produce other information, including, but not limited to, text orientation (or direction or angle), or the like, by adding the fully connected layer after the CNN backbone.
  • the STR model may also provide a template as a multi-task model for extracting more information given one image and one model.
  • the STR model compresses multiple models into one model, thereby greatly reducing the size of the STR model and the memory required.
  • processing time, in some cases, may also be reduced.
  • some embodiments can improve the functioning of user equipment or systems themselves (e.g., text recognition systems, text orientation or angle recognition systems, machine learning systems, deep learning systems, AI systems, etc.), for example, for training, by performing, using a computing system, feature extraction on an input image using a convolutional layer of a convolutional neural network (“CNN”) to produce a feature map, the input image containing an image of text that is cropped from a captured image of a scene; performing, using the computing system, orientation or angle determination of the image of the text in the input image, using a first fully connected layer (“first dense layer”) of the CNN to process each value in the feature map; based on a determination that the image of the text in the input image is rotated compared with a normal orientation, rotating, using the computing system, the input image to the normal orientation; based on a determination that the image of the text in the input image is in the normal orientation or in response to the input image having been rotated to the normal orientation, performing, using the computing system, feature encoding on each value in the feature map, using a sequence layer of the CNN to produce an encoded feature map; and performing, using the computing system, text recognition on the image of text contained in the input image, using a second fully connected layer (“second dense layer”) of the CNN to process each encoded feature in the encoded feature map to produce a classification of text, and/or the like.
  • These functionalities can produce tangible results outside of the implementing computer system, including, merely by way of example, optimized scene text recognition that (1) predicts image input orientation (or direction or angle) in addition to performing the text recognition, (2) reduces the model size and required memory by compressing multiple models into one, and (3) in some cases, reduces processing time (by virtue of the compression of multiple models into one, etc.), at least some of which may be observed or measured by users, scene text recognition system developers, and/or neural network/machine learning/deep learning researchers.
  • Figs. 1-6 illustrate some of the features of the methods, systems, and apparatuses for implementing neural network, artificial intelligence (“AI”), machine learning, and/or deep learning applications and, more particularly, for implementing a scene text recognition model with text orientation detection or text angle detection, as referred to above.
  • the methods, systems, and apparatuses illustrated by Figs. 1-6 refer to examples of different embodiments that include various components and steps, which can be considered alternatives or which can be used in conjunction with one another in the various embodiments.
  • the description of the illustrated methods, systems, and apparatuses shown in Figs. 1-6 is provided for purposes of illustration and should not be considered to limit the scope of the different embodiments.
  • Fig. 1 is a schematic diagram illustrating a system 100 for implementing scene text recognition model with text orientation or text angle detection, in accordance with various embodiments.
  • system 100 may comprise computing system 105 and an artificial intelligence (“AI”) system 110.
  • the computing system 105 and/or the AI system 110 may be part of a scene text recognition system 115, or may be separate, yet communicatively coupled with, the scene text recognition system 115.
  • the computing system 105 and the AI system 110 may be embodied as an integrated system.
  • the computing system 105 and the AI system 110 may be embodied as separate, yet communicatively coupled, systems.
  • computing system 105 may include, without limitation, at least one of a scene text recognition computing system, a machine learning system, an artificial intelligence (“AI”) system, a deep learning system, a neural network, a convolutional neural network (“CNN”), a fully convolutional network (“FCN”), a recurrent neural network (“RNN”), a processor on the user device, one or more graphics processing units (“GPUs”), a server computer over a network, a cloud computing system, or a distributed computing system, and/or the like.
  • System 100 may further comprise a user device 125 and one or more user devices 130a-130n (collectively, “user device 125,” “user devices 130,” or “user devices 125-130,” or the like) that communicatively couple with at least one of computing system 105, AI system 110, and/or scene text recognition system 115, via network(s) 155 and via wired (denoted by line connections in Fig. 1) and/or wireless communications links (denoted by lightning bolt symbols in Fig. 1).
  • the user devices 125-130 may each include, but is not limited to, a portable gaming device, a smart phone, a tablet computer, a laptop computer, an image sharing platform-compliant device, a web-based image sharing platform- compliant device, an app-based image sharing platform-compliant device, an image capture device, a video capture device, a law enforcement imaging device, a security system imaging device, a surveillance system imaging device, a military imaging device, and/or the like.
  • the user device 125 may include, without limitation, at least one of processor 125a, data store 125b, camera 125c, display device 125d, or communications system 125e, and/or the like.
  • the user devices 125 and/or 130a-130n and an object 135 with the text 140 may be disposed within scene or location 145.
  • text 140 contained on or within object 135 may be visible within a field of view ("FOV") of a person or an image capture device, or the like (such as FOV 150, or the like).
  • computing system 105, AI system 110, and/or scene text recognition system or processor 125a may perform feature extraction on an input image using a convolutional layer of a CNN to produce a feature map, the input image containing an image of text (e.g., text 140, or the like) that is cropped from a captured image (e.g., captured image 160 or image captured by camera 125c, or the like) of a scene (e.g., scene or location 145, or the like).
  • a basic CNN architecture may include a general CNN backbone without the last several dense or classification layers (such as, but not limited to, a residual neural network (“ResNet”), the Oxford visual geometry group network (“VGG-16”), GhostNet, EfficientNet, etc.), provided it can extract features at the pixel level.
  • the choice of CNN may depend on restrictions on speed and memory.
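  • As one hedged example of obtaining such a backbone, a torchvision ResNet can be truncated before its pooling and classification layers so that it still outputs a spatial feature map; the disclosure does not prescribe this particular backbone or library.

```python
# Hedged sketch: truncating a standard classification network into a pixel-level
# feature extractor. Requires torchvision >= 0.13 for the weights=None argument.
import torch
import torch.nn as nn
from torchvision import models

def make_backbone():
    resnet = models.resnet18(weights=None)
    # Keep everything up to the last convolutional stage; drop avgpool and fc so the
    # output remains a spatial feature map ([batch, channel, height, width]).
    return nn.Sequential(*list(resnet.children())[:-2])

backbone = make_backbone()
features = backbone(torch.randn(1, 3, 41, 153))   # e.g. torch.Size([1, 512, 2, 5])
```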
  • the computing system may perform orientation or angle determination of the image of the text in the input image, using a first fully connected layer (“first dense layer”) of the CNN to process each value in the feature map. Based on a determination that the image of the text in the input image is rotated compared with a normal orientation, the computing system may rotate the input image to the normal orientation.
  • normal orientation may refer to an upright orientation of text in which the top portions of the characters of the text face up and the bottom portions of said characters face down, without the text being rotated and with the text angled at 0 degrees.
  • the computing system may perform feature encoding on each value in the feature map, using a sequence layer of the CNN to produce an encoded feature map.
  • the computing system may perform text recognition on the image of text contained in the input image, using a second fully connected layer (“second dense layer”) of the CNN to process each encoded feature in the encoded feature map to produce a classification of text.
  • text recognition may comprise at least one of text orientation determination (e.g., text orientation determination 165a, or the like) and text classification (e.g., text classification 165b, or the like) - that is, whether the image of the text is oriented in the normal orientation and the output of the text recognition (after ensuring the image of text is in the normal orientation), respectively.
  • the feature map may comprise text data including at least one of shape, texture, or color of the text, and/or the like.
  • the orientation or angle determination may output one of the normal orientation or rotated orientation, wherein the rotated orientation may be 180 degrees compared with the normal orientation.
  • the computing system may analyze the captured image to identify each location of at least one image of text contained within the captured image, using text detection system. For each image of text, the computing system may extract said image of text from the captured image, by cropping said image of text from the captured image. The computing system may input each cropped image of text to the convolutional layer of the CNN. In some cases, identifying each location of the at least one image of text contained within the captured image may comprise identifying, using the text detection system, coordinates for each of four corners defining a rectangular shape that encapsulates each image of text.
  • the computing system may apply a transform on the cropped image of text to a rectangular shape that has its length along a horizontal orientation, using a spatial transform network ("STN”), prior to inputting the cropped image of text to the convolutional layer of the CNN.
  • performing feature encoding on each value in the feature map may comprise mapping each sliced feature with each word or character, using the sequence layer of the CNN.
  • mapping each sliced feature with each word or character may comprise using at least one of long short term memory (“LSTM”) techniques, bidirectional LSTM (“BiLSTM”) techniques, multiple or stacked BiLSTM techniques, gated recurrent unit (“GRU”) techniques, or bidirectional GRU techniques, and/or the like.
  • the text may comprise a word or character in at least one language among a plurality of human languages and in at least one font among a plurality of fonts.
  • the computing system may apply a loss function on the classification of text, the loss function comprising one of a connectionist temporal classification (“CTC") loss, a cross entropy (“CE”) loss, a combination CTC-CE loss, a binary CE (“BCE”) loss, or a combination CTC-BCE loss, and/or the like.
  • applying the loss function may comprise: generating an orientation or angle loss value (“LossAng”) based on the comparison of an output of the orientation or angle determination of the image of the text in the input image with a ground truth of an orientation of the image of the text in the input image; generating a text recognition loss value (“Loss Rec”) based on the comparison of the classification of text with a ground truth of the text; and generating an overall loss value based on a loss function that combines the orientation or angle loss value and the text recognition loss value.
  • the computing system may train the first dense layer and the second dense layer by updating these components of the CNN with weighted loss values “α × LossAng” and “β × LossRec”, respectively, where α and β are weighting coefficients.
  • Training may be repeated with various instances of characters, words, fonts, etc. until the loss values are minimized - in some cases, repeating training until a subsequently calculated overall loss value is reduced to a value that is less than a predetermined threshold value (either actual value or percentage value compared with the previous overall loss value, the predetermined threshold value including, but not limited to, one of 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, or 0.01, etc.).
  • Figs. 2A and 2B are schematic block flow diagrams illustrating non-limiting examples 200 and 200' of a method for implementing scene text recognition and scene text recognition model training, respectively, in accordance with various embodiments.
  • a convolutional neural network (“CNN”) system 205 may include, but is not limited to, a convolutional layer 210, a first fully connected layer (“first dense layer” or “dense layer 1”) 215, a text rotation layer 220, a sequence layer 225, and a second fully connected layer (“second dense layer” or “dense layer 2”) 230, or the like.
  • scene text recognition may be performed by CNN 205, as follows.
  • Convolutional layer 210 may receive an input image 235, the input image containing an image of text that is cropped from a captured image of a scene.
  • Convolutional layer 210 may perform feature extraction on the input image to produce a feature map 240.
  • the first dense layer 215 may then perform orientation or angle determination of the image of the text in the input image, to process each value in the feature map to produce text orientation 245.
  • the text rotation layer 220 may utilize text orientation 245 as a conditional input, such that: (0) if the text orientation 245 indicates that the image of the text is oriented in the normal orientation (i.e., an upright orientation of text in which the top portions of the characters face up and the bottom portions face down, without the text being rotated and with the text angled at 0 degrees, or the like) (in some cases, with a value of “0” or other value indicative of normal orientation, or the like), the text rotation layer 220 may relay the input image 235 (and the feature map 240) without performing rotation of the input image 235, and, in some cases, may output text rotation 250 that indicates that no text or image rotation has been performed (or that no text or image rotation need be performed); or (1) if the text orientation 245 indicates that the image of the text is oriented in a flipped or rotated orientation (i.e., an orientation of text that is 180 degrees compared with the normal orientation, or the like) (in some cases, with a value of “1” or other value indicative of rotated orientation, or the like), the text rotation layer 220 may rotate the input image 235 to the normal orientation and, in some cases, may output text rotation 250 that indicates that text or image rotation has been performed.
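  • A minimal sketch of this conditional rotation step, assuming the orientation output is a per-image flag of 0 (normal) or 1 (rotated 180 degrees); flipping both spatial dimensions of an image tensor is equivalent to a 180-degree rotation.

```python
# Hedged sketch of the text rotation step.
import torch

def rotate_if_flipped(image, orientation_flag):
    """image: [batch, channel, height, width]; orientation_flag: [batch] of 0 or 1."""
    flipped = orientation_flag.bool()
    out = image.clone()
    # Flipping height and width together is equivalent to a 180-degree rotation.
    out[flipped] = torch.flip(image[flipped], dims=[2, 3])
    return out
```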
  • the sequence layer 225 may perform feature encoding on each value in the feature map 240 to produce encoded features 255.
  • performing feature encoding on each value in the feature map 240 may comprise mapping each sliced feature with each word or character, in some cases, using at least one of long short term memory (“LSTM”) techniques, bidirectional LSTM (“BiLSTM”) techniques, multiple or stacked BiLSTM techniques, gated recurrent unit (“GRU”) techniques, or bidirectional GRU (“BiGRU”) techniques, and/or the like.
  • LSTM is directional, as it only uses past contexts. However, in image-based sequences, contexts from both directions are useful and complementary to each other. Therefore, in some cases, two LSTMs can be combined, one forward and one backward, into a bidirectional LSTM. Furthermore, multiple bidirectional LSTMs can be stacked, resulting in a deep bidirectional LSTM.
  • as an alternative to LSTM, other recurrent neural network units, such as the GRU, which is very similar to the LSTM, may be used; a GRU includes update and reset gates.
  • in some embodiments, a two-direction (bidirectional) GRU, such as a BiGRU module, may be used for feature encoding.
  • the deep structure allows for a higher level of abstractions compared with a shallow one and may achieve significant performance improvements in the task of text recognition.
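  • A brief sketch of these sequence-layer options, assuming width-wise feature slices with a fixed channel size; the hidden size and depth are placeholder values, not values from the disclosure.

```python
# Hedged sketch of the sequence layer: a stacked bidirectional LSTM (deep BiLSTM)
# or, alternatively, a bidirectional GRU.
import torch.nn as nn

def make_sequence_layer(input_size=256, hidden=256, kind="bilstm"):
    if kind == "bilstm":
        # Two stacked BiLSTM layers: forward and backward contexts at each depth.
        return nn.LSTM(input_size, hidden, num_layers=2,
                       bidirectional=True, batch_first=True)
    # Alternative: bidirectional GRU (update and reset gates instead of LSTM gates).
    return nn.GRU(input_size, hidden, num_layers=2,
                  bidirectional=True, batch_first=True)
```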
  • the text may comprise a word or character in at least one language among a plurality of human languages and in at least one font among a plurality of fonts.
  • the second dense layer 230 may then perform text recognition on the image of text contained in the input image 235 to process each encoded feature 255 in the encoded feature map to produce text classification 260.
  • a computing system 265 may apply a loss function on the text classification 260, the loss function including, but not limited to, one of a connectionist temporal classification (“CTC") loss, a cross entropy (“CE”) loss, a combination CTC-CE loss, a binary CE (“BCE”) loss, or a combination CTC-BCE loss, and/or the like.
  • computing system 265 may include, without limitation, an orientation or angle loss module 270a (e.g., a CE loss module or other suitable loss module for calculating loss for text orientation determination, or the like), a text recognition loss module 270b (e.g., a CTC loss module or other suitable loss module for calculating loss for text classification, or the like), a loss module 270 (or other suitable loss module for calculating overall or total loss, or the like), and one or more training modules (e.g., training modules 275a and 275b, or the like).
  • scene text recognition model training may be performed by computing system 265, as follows.
  • Orientation or angle loss module 270a may generate orientation or angle loss value 290a (i.e., "LossAng” or the like) based on the comparison of text orientation 245 with a ground truth of an orientation of the image of the text in the input image (i.e., "Orientation GT 280" or the like).
  • because the CNN features can represent the original image (or input image), they can be used to determine whether the direction of the input is flipped or not (as described above).
  • the output of the CNN feature extraction may include dimensions including, but not limited to, at least one of batch, channel, width, or height, or a combination of two or more of these (e.g., [batch, channel, width, height], or the like).
  • a global max pooling may be applied on the spatial dimensions (e.g., [width, height], or the like).
  • CNN feature extraction can reduce the dimension and remove padding or 0-value pixels.
  • the output has the dimension [batch, class].
  • the loss function may, in some cases, include a cross entropy loss, such as defined by: CE = -Σ_c y_c · log(p_c), where y_c is the ground-truth indicator for orientation class c and p_c is the predicted probability for class c.
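  • A hedged sketch of this orientation branch, assuming the feature map is laid out as [batch, channel, width, height]: global max pooling over the spatial dimensions, a dense layer producing [batch, class] logits, and a cross entropy loss against the orientation ground truth. Channel and class counts are placeholder assumptions.

```python
# Hedged sketch of the orientation classification branch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrientationHead(nn.Module):
    def __init__(self, channels=256, num_orientations=2):
        super().__init__()
        self.fc = nn.Linear(channels, num_orientations)

    def forward(self, feature_map):               # [batch, channel, width, height]
        pooled = feature_map.amax(dim=(2, 3))     # global max pool -> [batch, channel]
        return self.fc(pooled)                    # [batch, class]

def orientation_loss(logits, orientation_gt):
    # Cross entropy: negative log-probability of the ground-truth orientation class.
    return F.cross_entropy(logits, orientation_gt)
```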
  • Text recognition loss module 270b may generate text recognition loss value 290b (i.e., "LossRec” or the like) based on the comparison of text classification 260 with a ground truth of the text (i.e., "Classification GT 285" or the like).
  • Training module 275a and training module 275b may train the first dense layer 215 and the second dense layer 230 by updating these components of CNN 205 with weighted loss values “α × LossAng” 295a and “β × LossRec” 295b, respectively, where α and β are weighting coefficients. Training may be repeated with various instances of characters, words, fonts, etc., in some cases, until a subsequently calculated overall loss value is reduced to a value that is less than a predetermined threshold value (either an actual value or a percentage value compared with the previous overall loss value, the predetermined threshold value including, but not limited to, one of 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, or 0.01, etc.).
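  • A hedged training-loop sketch for this scheme, assuming a combined loss function such as the one sketched earlier; the optimizer, coefficients, and stopping threshold are placeholder assumptions rather than values from the disclosure.

```python
# Hedged sketch: weighted multi-task training with a simple loss threshold stop.
import torch

def train(model, loader, str_loss, alpha=1.0, beta=1.0, threshold=0.05, max_epochs=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for epoch in range(max_epochs):
        total = 0.0
        for image, angle_gt, text_gt, text_lengths in loader:
            angle_logits, char_logits = model(image)
            loss, loss_ang, loss_rec = str_loss(angle_logits, angle_gt,
                                                char_logits, text_gt, text_lengths,
                                                alpha=alpha, beta=beta)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        overall = total / max(1, len(loader))
        if overall < threshold:        # predetermined threshold value
            break
    return model
```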
  • Figs. 3A-3D are diagrams illustrating non-limiting examples 300, 300', 300", and 300'" of input image processing, text orientation or angle determination, text classification, and training, respectively, during implementation of a scene text recognition model with text orientation or text angle detection, in accordance with various embodiments.
  • the processes of input image processing (Fig. 3A), text orientation or angle determination (Fig. 3B), text classification (Fig. 3C), and training (Fig. 3D) successively proceed from one figure to the next via circular markers denoted, “A,” “B,” and “C,” respectively.
  • input image processing may be performed as follows.
  • a camera 305 (or other image capture device, or the like; similar to camera 125c of Fig. 1, or the like) may capture an image 345 of a scene that contains text (in this case, a highway sign indicating directions to the city of Los Angeles, although not limited to such images or types of images, or to such types of text).
  • the text may comprise a word or character in at least one language among a plurality of human languages and in at least one font among a plurality of fonts.
  • with mobile device cameras, it is possible (and, in some cases, likely) that the resultant captured image is flipped 180 degrees from the normal orientation (such as is the case with captured image 345 in Fig. 3A).
  • Computing system 310 may identify text (or location of text) within the captured image 345, and, once the text (or location of text) has been identified, may generate a bounding box 350 around the text (or location of text). Computing system 310 may then crop the image to produce a cropped image 355.
  • the computing system or a spatial transform network (“STN") 315 may apply a transform on the cropped image 355 of text to produce an input image 360 having a rectangular shape that has its length along a horizontal orientation (i.e., either 0 or 180 degrees relative to a horizontal plane, orientation, or direction). The process may then continue to text orientation or angle determination in Fig. 3B, following the circular marker denoted, "A.”
  • Convolutional neural network (“CNN") 320 may receive the input image 360 having characteristics 365 (in this case, a height of 41 units, a width of 153 units, and 3 channels (e.g., corresponding to red, green, blue (or "RGB”) channels, or the like).
  • Convolutional layer 325 of CNN 320 (similar to convolutional layer 210 of Fig. 2A, or the like) may perform feature extraction on the input image 360 to produce a feature map.
  • First dense layer 330 of the CNN may perform orientation or angle determination 375 of the image of the text in the input image, to process each value in the feature map.
  • the orientation or angle determination may indicate that the text in the input image is in a rotated orientation that is 180 degrees compared with the normal orientation (in some cases, with a rotated flag having a value of "1" or the like).
  • computing system 310 may rotate the input image 360 to the normal orientation, such that a rotated state 380 of the resultant image would indicate that the image is in the normal orientation (in some cases, with a rotated flag having a value of "0" or the like). The process may then continue to text classification in Fig. 3C, following the circular marker denoted, "B.”
  • text classification may be performed as follows. Based on a determination that the image of the text in the input image 360 is in the normal orientation or in response to the input image 360 having been rotated to the normal orientation, computing system 310 or sequence layer 335 of CNN 320 (similar to sequence layer 225 of Fig. 2, or the like) may perform feature encoding on each value in the feature map, to produce an encoded feature map 385.
  • performing feature encoding on each value in the feature map may comprise mapping each sliced feature with each word or character (in this case, producing sliced features: “L”; “o”; “s”; “A”; “n”; “g”; “e”; “l”; “e”; and “s”).
  • mapping each sliced feature with each word or character may comprise using at least one of long short term memory (“LSTM”) techniques, bidirectional LSTM (“BiLSTM”) techniques, multiple or stacked BiLSTM techniques, gated recurrent unit (“GRU”) techniques, or bidirectional GRU techniques, and/or the like.
  • Computing system 310 or second fully connected layer (“second dense layer") 340 of CNN 320 may perform text recognition on the image of text contained in the input image, to process each encoded feature in the encoded feature map to produce a text classification 390.
  • the text classification 390 may include classification based on index values in an index or dictionary in which each value corresponds to a number, a symbol, a character (e.g., English alphabetic character, a character in one of a plurality of other languages), or the like.
  • text classification 390 may have values “21,” “50,” “54,” “10,” “49,” “42,” “40,” “47,” “1,” “40,” and “54,” or the like, corresponding to “L,” “o,” “s,” “A,” “n,” “g,” “e,” “1,” “e,” and “s,” respectively, or “Los Angeles” (which incorrectly classifies the letter “l” as the number “1,” which is a common optical character recognition (“OCR”) error).
  • the index or dictionary may further include values representing each character in each of a plurality of non-English languages, where the second dense layer is updated or trained to recognize each of these characters regardless of any known typeset fonts or handwritten scripts, or the like.
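  • A small sketch of such index-to-character mapping together with simple greedy CTC-style decoding (merging repeats and dropping blanks); the alphabet and index values here are illustrative assumptions, not the disclosure's dictionary.

```python
# Hedged sketch: decoding per-slice class indices back to characters via an index dictionary.
import torch

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ "
INDEX_TO_CHAR = {i + 1: ch for i, ch in enumerate(ALPHABET)}  # index 0 = CTC blank

def greedy_decode(char_logits):
    """char_logits: [width, num_classes] for one image."""
    indices = char_logits.argmax(dim=1).tolist()
    chars, previous = [], None
    for idx in indices:
        if idx != 0 and idx != previous:      # drop blanks and merged repeats
            chars.append(INDEX_TO_CHAR.get(idx, "?"))
        previous = idx
    return "".join(chars)
```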
  • the process may then continue to training in Fig. 3D, following the circular marker denoted, "C.”
  • training may be performed as follows.
  • Computing system 310 may generate orientation or angle loss value 395a (in this case, with “LossAng” having a value of "0” or the like) based on the comparison of text orientation (i.e., with the rotated flag having a value of "1” or the like) with a ground truth of an orientation of the image of the text in the input image (i.e., with a ground truth (“GT”) rotated flag having a value of "1” or the like).
  • Computing system 310 may generate text recognition loss value 395b (in this case, with “LossRec” having a value of “…,” or the like) based on the comparison of text classification 390 (i.e., text classification 390 having values “21,” “50,” “54,” “10,” “49,” “42,” “40,” “47,” “1,” “40,” and “54,” or the like, corresponding to “L,” “o,” “s,” “A,” “n,” “g,” “e,” “1,” “e,” and “s,” respectively, or “Los Angeles” or the like) with a ground truth of the text (i.e., with Classification GT having values “21,” “50,” “54,” “10,” “49,” “42,” “40,” “47,” “47,” “40,” and “54,” or the like, corresponding to “L,” “o,” “s,” “A,” “n,” “g,” “e,” “l,” “e,” and “s,” respectively, or “Los Angeles” or the like).
  • Computing system 310 may train or update the scene text recognition model of the AI system - in particular, the first dense layer 330 and the second dense layer 340 - by updating these components of CNN 320 with weighted loss values “α × LossAng” and “β × LossRec”, respectively, where α and β are weighting coefficients. Training may be repeated with various instances of characters, words, fonts, etc., in some cases, until a subsequently calculated overall loss value is reduced to a value that is less than a predetermined threshold value (either an actual value or a percentage value compared with the previous overall loss value, the predetermined threshold value including, but not limited to, one of 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, or 0.01, etc.).
  • Figs. 4A-4E are flow diagrams illustrating a method 400 for implementing a scene text recognition model with text orientation or text angle detection, in accordance with various embodiments.
  • Method 400 of Fig. 4A continues onto Fig. 4D following the circular marker denoted, "A," and may return to Fig. 4A from Fig. 4B and/or 4D following the circular marker denoted, "B."
  • while the method 400 illustrated by Figs. 4A-4E can be implemented by or with (and, in some cases, is described below with respect to) the systems, examples, or embodiments 100, 200, 200', 300, 300', 300", and 300'" of Figs. 1, 2A, 2B, 3A, 3B, 3C, and 3D, respectively (or components thereof), such methods may also be implemented using any suitable hardware (or software) implementation.
  • similarly, while each of the systems, examples, or embodiments 100, 200, 200', 300, 300', 300", and 300'" of Figs. 1, 2A, 2B, 3A, 3B, 3C, and 3D, respectively (or components thereof), can operate according to the method 400 illustrated by Figs. 4A-4E (e.g., by executing instructions embodied on a computer readable medium), the systems, examples, or embodiments 100, 200, 200', 300, 300', 300", and 300'" of Figs. 1, 2A, 2B, 3A, 3B, 3C, and 3D can each also operate according to other modes of operation and/or perform other suitable procedures.
  • method 400 may comprise performing, using a computing system, feature extraction on an input image using a convolutional layer of a convolutional neural network ("CNN") to produce a feature map, the input image containing an image of text that is cropped from a captured image of a scene.
  • method 400 may comprise performing, using the computing system, orientation or angle determination of the image of the text in the input image, using a first fully connected layer (“first dense layer”) of the CNN to process each value in the feature map.
  • Method 400 may further comprise, at block 415, based on a determination that the image of the text in the input image is rotated compared with a normal orientation, rotating, using the computing system, the input image to the normal orientation.
  • Method 400, at block 420 may comprise, based on a determination that the image of the text in the input image is in the normal orientation or in response to the input image having been rotated to the normal orientation, performing, using the computing system, feature encoding on each value in the feature map, using a sequence layer of the CNN to produce an encoded feature map.
  • Method 400 may further comprise performing, using the computing system, text recognition on the image of text contained in the input image, using a second fully connected layer ("second dense layer") of the CNN to process each encoded feature in the encoded feature map to produce a classification of text (block 425).
  • the computing system may comprise at least one of a scene text recognition computing system, a machine learning system, an artificial intelligence (“AI”) system, a deep learning system, a neural network, the CNN, a fully convolutional network (“FCN”), a recurrent neural network (“RNN”), a processor on the user device, one or more graphics processing units (“GPUs”), a server computer over a network, a cloud computing system, or a distributed computing system, and/or the like.
  • “normal orientation” may refer to an upright orientation of text in which the top portions of the characters of the text face up and the bottom portions of said characters face down, without the text being rotated and with the text angled at 0 degrees.
  • the feature map may comprise text data including at least one of shape, texture, or color of the text, and/or the like.
  • the orientation or angle determination may output one of the normal orientation or rotated orientation, where the rotated orientation may be 180 degrees compared with the normal orientation.
  • Method 400 may continue onto the process at block 470 in Fig. 4D following the circular marker denoted, "A.”
  • method 400 may further comprise receiving, using the computing system, the captured image.
  • Method 400 may also comprise, at block 435, analyzing, using the computing system, the captured image to identify each location of at least one image of text contained within the captured image, using text detection system.
  • identifying each location of the at least one image of text contained within the captured image may comprise identifying, using the computing system and using the text detection system, coordinates for each of four corners defining a rectangular shape that encapsulates each image of text (block 440).
  • Method 400 may further comprise, for each image of text, extracting, using the computing system, said image of text from the captured image, by cropping said image of text from the captured image (block 445).
  • method 400 may further comprise determining whether a cropped image of text is embodied by at least one of a non-rectangular shape or a rectangular shape whose length is not oriented along a horizontal orientation. If so, method 400 may continue onto the process at block 455. If not, method 400 may continue onto the process at block 460.
  • method 400 may comprise, based on a determination that the cropped image of text is embodied by at least one of a non-rectangular shape or a rectangular shape whose length is not oriented along a horizontal orientation, applying, using the computing system, a transform on the cropped image of text to a rectangular shape that has its length along a horizontal orientation, using a spatial transform network ("STN").
  • Method 400 may continue onto the process at block 460.
  • method 400 may further comprise inputting, using the computing system, each cropped image of text to the convolutional layer of the CNN.
  • Method 400 may return to the process at block 405 in Fig. 4A following the circular marker denoted, "B.”
  • performing feature encoding on each value in the feature map may comprise mapping, using the computing system, each sliced feature with each word or character, using the sequence layer of the CNN (block 465).
  • mapping each sliced feature with each word or character may comprise using at least one of long short term memory (“LSTM”) techniques, bidirectional LSTM (“BiLSTM”) techniques, multiple or stacked BiLSTM techniques, gated recurrent unit (“GRU”) techniques, or bidirectional GRU techniques, and/or the like.
  • the text may comprise a word or character in at least one language among a plurality of human languages and in at least one font among a plurality of fonts.
  • method 400 may comprise, for training the CNN, applying, using the computing system, a loss function on the classification of text.
  • the loss function may include, without limitation, one of a connectionist temporal classification (“CTC") loss, a cross entropy (“CE”) loss, a combination CTC-CE loss, a binary CE (“BCE”) loss, or a combination CTC-BCE loss, and/or the like.
  • Method 400 may further comprise, at block 475, updating at least one of the first dense layer or the second dense layer based on the loss function.
  • Method 400 may return to the process at block 405 in Fig. 4A following the circular marker denoted, "B.”
  • applying the loss function may comprise generating, using the computing system, an orientation or angle loss value based on the comparison of an output of the orientation or angle determination of the image of the text in the input image with a ground truth of an orientation of the image of the text in the input image (block 480); generating, using the computing system, a text recognition loss value based on the comparison of the classification of text with a ground truth of the text (block 485); and generating, using the computing system, an overall or total loss value based on a loss function that combines the orientation or angle loss value and the text recognition loss value (block 490). Training may be repeated with various instances of characters, words, fonts, etc., in some cases, until a subsequently calculated overall loss value is reduced to a value that is less than a predetermined threshold value (either an actual value or a percentage value compared with the previous overall loss value, the predetermined threshold value including, but not limited to, one of 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, or 0.01, etc.).
  • Fig. 5 is a block diagram illustrating an example of computer or system hardware architecture, in accordance with various embodiments.
  • Fig. 5 provides a schematic illustration of one embodiment of a computer system 500 of the service provider system hardware that can perform the methods provided by various other embodiments, as described herein, and/or can perform the functions of computer or hardware system (i.e., computing systems 105, 265, and 310, artificial intelligence (“AI”) system 110, scene text recognition system 115, user devices 125 and 130a-130n, convolutional neural network (“CNN”) systems 205 and 320, etc.), as described above.
  • Fig. 5 is meant only to provide a generalized illustration of various components, of which one or more (or none) of each may be utilized as appropriate.
  • Fig. 5, therefore, broadly illustrates how individual system elements may be implemented in a relatively separated or relatively more integrated manner.
  • the computer or hardware system 500 - which might represent an embodiment of the computer or hardware system (i.e., computing systems 105, 265, and 310, AI system 110, scene text recognition system 115, user devices 125 and 130a-130n, CNN systems 205 and 320, etc.), described above with respect to Figs. 1-4 - is shown comprising hardware elements that can be electrically coupled via a bus 505 (or may otherwise be in communication, as appropriate).
  • the hardware elements may include one or more processors 510, including, without limitation, one or more general-purpose processors and/or one or more special-purpose processors (such as microprocessors, digital signal processing chips, graphics acceleration processors, and/or the like); one or more input devices 515, which can include, without limitation, a mouse, a keyboard, and/or the like; and one or more output devices 520, which can include, without limitation, a display device, a printer, and/or the like.
  • the computer or hardware system 500 may further include (and/or be in communication with) one or more storage devices 525, which can comprise, without limitation, local and/or network accessible storage, and/or can include, without limitation, a disk drive, a drive array, an optical storage device, solid-state storage device such as a random access memory (“RAM”) and/or a read-only memory (“ROM”), which can be programmable, flash-updateable, and/or the like.
  • Such storage devices may be configured to implement any appropriate data stores, including, without limitation, various file systems, database structures, and/or the like.
  • the computer or hardware system 500 might also include a communications subsystem 530, which can include, without limitation, a modem, a network card (wireless or wired), an infra-red communication device, a wireless communication device and/or chipset (such as a BluetoothTM device, an 802.11 device, a WiFi device, a WiMax device, a WWAN device, cellular communication facilities, etc.), and/or the like.
  • the communications subsystem 530 may permit data to be exchanged with a network (such as the network described below, to name one example), with other computer or hardware systems, and/or with any other devices described herein.
  • the computer or hardware system 500 will further comprise a working memory 535, which can include a RAM or ROM device, as described above.
  • the computer or hardware system 500 also may comprise software elements, shown as being currently located within the working memory 535, including an operating system 540, device drivers, executable libraries, and/or other code, such as one or more application programs 545, which may comprise computer programs provided by various embodiments (including, without limitation, hypervisors, VMs, and the like), and/or may be designed to implement methods, and/or configure systems, provided by other embodiments, as described herein.
  • one or more procedures described with respect to the method(s) discussed above might be implemented as code and/or instructions executable by a computer (and/or a processor within a computer); in an aspect, then, such code and/or instructions can be used to configure and/or adapt a general purpose computer (or other device) to perform one or more operations in accordance with the described methods.
  • a set of these instructions and/or code might be encoded and/or stored on a non-transitory computer readable storage medium, such as the storage device(s) 525 described above.
  • the storage medium might be incorporated within a computer system, such as the system 500.
  • the storage medium might be separate from a computer system (i.e., a removable medium, such as a compact disc, etc.), and/or provided in an installation package, such that the storage medium can be used to program, configure, and/or adapt a general purpose computer with the instructions/code stored thereon.
  • These instructions might take the form of executable code, which is executable by the computer or hardware system 500 and/or might take the form of source and/or installable code, which, upon compilation and/or installation on the computer or hardware system 500 (e.g., using any of a variety of generally available compilers, installation programs, compression/decompression utilities, etc.) then takes the form of executable code.
  • some embodiments may employ a computer or hardware system (such as the computer or hardware system 500) to perform methods in accordance with various embodiments of the invention.
  • some or all of the procedures of such methods are performed by the computer or hardware system 500 in response to processor 510 executing one or more sequences of one or more instructions (which might be incorporated into the operating system 540 and/or other code, such as an application program 545) contained in the working memory 535.
  • Such instructions may be read into the working memory 535 from another computer readable medium, such as one or more of the storage device(s) 525.
  • execution of the sequences of instructions contained in the working memory 535 might cause the processor(s) 510 to perform one or more procedures of the methods described herein.
  • "machine readable medium" and "computer readable medium," as used herein, refer to any medium that participates in providing data that causes a machine to operate in some fashion.
  • various computer readable media might be involved in providing instructions/code to processor(s) 510 for execution and/or might be used to store and/or carry such instructions/code (e.g., as signals).
  • a computer readable medium is a non-transitory, physical, and/or tangible storage medium.
  • a computer readable medium may take many forms, including, but not limited to, non-volatile media, volatile media, or the like.
  • Non-volatile media includes, for example, optical and/or magnetic disks, such as the storage device(s) 525.
  • Volatile media includes, without limitation, dynamic memory, such as the working memory 535.
  • a computer readable medium may take the form of transmission media, which includes, without limitation, coaxial cables, copper wire, and fiber optics, including the wires that comprise the bus 505, as well as the various components of the communication subsystem 530 (and/or the media by which the communications subsystem 530 provides communication with other devices).
  • transmission media can also take the form of waves (including without limitation radio, acoustic, and/or light waves, such as those generated during radiowave and infra-red data communications).
  • Common forms of physical and/or tangible computer readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read instructions and/or code.
  • Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to the processor(s) 510 for execution.
  • the instructions may initially be carried on a magnetic disk and/or optical disc of a remote computer.
  • a remote computer might load the instructions into its dynamic memory and send the instructions as signals over a transmission medium to be received and/or executed by the computer or hardware system 500.
  • These signals, which might be in the form of electromagnetic signals, acoustic signals, optical signals, and/or the like, are all examples of carrier waves on which instructions can be encoded, in accordance with various embodiments of the invention.
  • the communications subsystem 530 (and/or components thereof) generally will receive the signals, and the bus 505 then might carry the signals (and/or the data, instructions, etc. carried by the signals) to the working memory 535, from which the processor(s) 510 retrieves and executes the instructions.
  • the instructions received by the working memory 535 may optionally be stored on a storage device 525 either before or after execution by the processor(s) 510.
  • a set of embodiments comprises methods and systems for implementing neural network, artificial intelligence ("Al"), machine learning, and/or deep learning applications, and, more particularly, methods, systems, and apparatuses for implementing a scene text recognition model with text orientation detection or text angle detection.
  • Fig. 6 illustrates a schematic diagram of a system 600 that can be used in accordance with one set of embodiments.
  • the system 600 can include one or more user computers, user devices, or customer devices 605.
  • a user computer, user device, or customer device 605 can be a general purpose personal computer (including, merely by way of example, desktop computers, tablet computers, laptop computers, handheld computers, and the like, running any appropriate operating system, several of which are available from vendors such as Apple, Microsoft Corp., and the like), cloud computing devices, a server(s), and/or a workstation computer(s) running any of a variety of commercially-available UNIXTM or UNIX-like operating systems.
  • a user computer, user device, or customer device 605 can also have any of a variety of applications, including one or more applications configured to perform methods provided by various embodiments (as described above, for example), as well as one or more office applications, database client and/or server applications, and/or web browser applications.
  • a user computer, user device, or customer device 605 can be any other electronic device, such as a thin-client computer, Internet-enabled mobile telephone, and/or personal digital assistant, capable of communicating via a network (e.g., the network(s) 610 described below) and/or of displaying and navigating web pages or other types of electronic documents.
  • Although the system 600 is shown with two user computers, user devices, or customer devices 605, any number of user computers, user devices, or customer devices can be supported.
  • Some embodiments operate in a networked environment, which can include a network(s) 610.
  • the network(s) 610 can be any type of network familiar to those skilled in the art that can support data communications using any of a variety of commercially-available (and/or free or proprietary) protocols, including, without limitation, TCP/IP, SNATM, IPXTM, AppleTalkTM, and the like.
  • the network(s) 610 can include, without limitation, a local area network ("LAN"); a wide-area network ("WAN"); a wireless wide area network ("WWAN"); a virtual private network ("VPN"); a public switched telephone network ("PSTN"); a wireless network, including, without limitation, a network operating under any of the IEEE 802.11 suite of protocols, the BluetoothTM protocol known in the art, and/or any other wireless protocol; and/or any combination of these and/or other networks.
  • the network might include an access network of the service provider (e.g., an Internet service provider (“ISP”)).
  • the network might include a core network of the service provider, and/or the Internet.
  • Embodiments can also include one or more server computers 615.
  • Each of the server computers 615 may be configured with an operating system, including, without limitation, any of those discussed above, as well as any commercially (or freely) available server operating systems.
  • Each of the servers 615 may also be running one or more applications, which can be configured to provide services to one or more clients 605 and/or other servers 615.
  • one of the servers 615 might be a data server, a web server, a cloud computing device(s), or the like, as described above.
  • the data server might include (or be in communication with) a web server, which can be used, merely by way of example, to process requests for web pages or other electronic documents from user computers 605.
  • the web server can also run a variety of server applications, including HTTP servers, FTP servers, CGI servers, database servers, Java servers, and the like.
  • the web server may be configured to serve web pages that can be operated within a web browser on one or more of the user computers 605 to perform methods of the invention.
  • the server computers 615 might include one or more application servers, which can be configured with one or more applications accessible by a client running on one or more of the client computers 605 and/or other servers 615.
  • the server(s) 615 can be one or more general purpose computers capable of executing programs or scripts in response to the user computers 605 and/or other servers 615, including, without limitation, web applications (which might, in some cases, be configured to perform methods provided by various embodiments).
  • a web application can be implemented as one or more scripts or programs written in any suitable programming language, such as JavaTM, C, C#TM or C++, and/or any scripting language, such as Perl, Python, or TCL, as well as combinations of any programming and/or scripting languages.
  • the application server(s) can also include database servers, including, without limitation, those commercially available from OracleTM, MicrosoftTM, SybaseTM, IBMTM, and the like, which can process requests from clients (including, depending on the configuration, dedicated database clients, API clients, web browsers, etc.) running on a user computer, user device, or customer device 605 and/or another server 615.
  • an application server can perform one or more of the processes for implementing neural network, Al, machine learning, and/or deep learning applications and, more particularly, for implementing a scene text recognition model with text orientation detection or text angle detection, as described in detail above.
  • Data provided by an application server may be formatted as one or more web pages (comprising HTML, JavaScript, etc., for example) and/or may be forwarded to a user computer 605 via a web server (as described above, for example).
  • a web server might receive web page requests and/or input data from a user computer 605 and/or forward the web page requests and/or input data to an application server.
  • a web server may be integrated with an application server.
  • one or more servers 615 can function as a file server and/or can include one or more of the files (e.g., application code, data files, etc.) necessary to implement various disclosed methods, incorporated by an application running on a user computer 605 and/or another server 615.
  • a file server can include all necessary files, allowing such an application to be invoked remotely by a user computer, user device, or customer device 605 and/or server 615.
  • the system can include one or more databases 620a-620n (collectively, "databases 620").
  • The location of each of the databases 620 is discretionary: merely by way of example, a database 620a might reside on a storage medium local to (and/or resident in) a server 615a (and/or a user computer, user device, or customer device 605).
  • a database 620n can be remote from any or all of the computers 605, 615, so long as it can be in communication (e.g., via the network 610) with one or more of these.
  • a database 620 can reside in a storage-area network ("SAN") familiar to those skilled in the art.
  • the database 620 can be a relational database, such as an Oracle database, that is adapted to store, update, and retrieve data in response to SQL-formatted commands.
  • the database might be controlled and/or maintained by a database server, as described above, for example.
  • system 600 may further comprise a computing system 625 (similar to computing systems 105, 265, and 310 of Figs. 1-3, or the like) and an artificial intelligence (“Al") system 630 (similar to Al system 110 or convolutional neural network (“CNN”) systems 205 and 320 of Figs. 1-3, or the like), both of which may be part of a scene text recognition system 635 (similar to scene text recognition system 115 of Fig. 1, or the like).
  • System 600 may also comprise a database(s) 640 (similar to database(s) 120 of Fig. 1, or the like) communicatively coupled to the scene text recognition system 635.
  • System 600 may further comprise user device 645 (similar to user device 125 of Fig. 1, or the like), user devices 605 (including user devices 605a and 605b, or the like; similar to user devices 130a-130n of Fig. 1, or the like), and object 650 containing text 655 (which is visible within a field of view ("FOV") of a person or an image capture device, or the like (such as FOV 665, or the like); similar to objects 135, text 140, and FOVs 150 of Fig. 1, or the like), or the like.
  • user device 645 may include, without limitation, at least one of processor 645a, data store 645b, camera 645c, display device 645d, or communications system 645e, and/or the like (similar to processor 125a, data store 125b, camera 125c, display device 125d, or communications system 125e, respectively, of Fig. 1, or the like).
  • the user devices 605 and/or 645 and the object 650 with the text 655 may be disposed within scene or location 660 (similar to scene or location 145 of Fig. 1, or the like).
  • computing system 625, Al system 630, and/or scene text recognition system or processor 645a may perform feature extraction on an input image using a convolutional layer of a CNN to produce a feature map, the input image containing an image of text (e.g., text 655, or the like) that is cropped from a captured image (e.g., image captured by camera 645c, or the like) of a scene (e.g., scene or location 660, or the like).
  • the computing system may perform orientation or angle determination of the image of the text in the input image, using a first fully connected layer (“first dense layer”) of the CNN to process each value in the feature map.
  • the computing system may rotate the input image to the normal orientation.
  • "normal orientation” may refer to an upright orientation of text in which a top portion of characters of the text are facing up and a bottom portion of said characters of the text are facing down, without the text being rotated and where the text is angled at 0 degrees.
  • the computing system may perform feature encoding on each value in the feature map, using a sequence layer of the CNN to produce an encoded feature map.
  • the computing system may perform text recognition on the image of text contained in the input image, using a second fully connected layer (“second dense layer”) of the CNN to process each encoded feature in the encoded feature map to produce a classification of text.
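  • As a hedged illustration of how the per-time-step classification produced by the second dense layer can be turned into a character string, the sketch below performs a greedy (best-path) decode; the CTC-style convention that index 0 is the blank symbol is an assumption for this sketch only.

    import torch

    def greedy_ctc_decode(logits, charset, blank=0):
        """logits: (T, num_classes) scores from the second dense layer.
        charset: string whose i-th character corresponds to class i
        (position 0 is a placeholder for the blank)."""
        best_path = logits.argmax(dim=-1).tolist()  # most likely class at each time step
        decoded, previous = [], blank
        for idx in best_path:
            # CTC rule: drop blanks and collapse consecutive repeats
            if idx != blank and idx != previous:
                decoded.append(charset[idx])
            previous = idx
        return "".join(decoded)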
  • the feature map may comprise text data including at least one of shape, texture, or color of the text, and/or the like.
  • the orientation or angle determination may output one of the normal orientation or rotated orientation, wherein the rotated orientation may be 180 degrees compared with the normal orientation.
  • the computing system may analyze the captured image to identify each location of at least one image of text contained within the captured image, using a text detection system. For each image of text, the computing system may extract said image of text from the captured image, by cropping said image of text from the captured image. The computing system may input each cropped image of text to the convolutional layer of the CNN. In some cases, identifying each location of the at least one image of text contained within the captured image may comprise identifying, using the text detection system, coordinates for each of four corners defining a rectangular shape that encapsulates each image of text.
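  • Merely by way of example (OpenCV is used here only for illustration; the corner ordering and the output height are assumptions), a crop defined by four corner coordinates can be warped into an axis-aligned rectangle as follows:

    import cv2
    import numpy as np

    def crop_text_region(captured_image, corners, out_h=32):
        """Warp the quadrilateral given by four corners (ordered top-left, top-right,
        bottom-right, bottom-left) into an axis-aligned rectangular crop."""
        corners = np.asarray(corners, dtype=np.float32)
        width = int(max(np.linalg.norm(corners[1] - corners[0]),
                        np.linalg.norm(corners[2] - corners[3])))
        height = int(max(np.linalg.norm(corners[3] - corners[0]),
                         np.linalg.norm(corners[2] - corners[1])))
        target = np.array([[0, 0], [width - 1, 0],
                           [width - 1, height - 1], [0, height - 1]], dtype=np.float32)
        matrix = cv2.getPerspectiveTransform(corners, target)
        crop = cv2.warpPerspective(captured_image, matrix, (width, height))
        # Resize to the fixed height expected by the recognition network.
        scale = out_h / float(height)
        return cv2.resize(crop, (max(1, int(width * scale)), out_h))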
  • the computing system may apply a transform on the cropped image of text to a rectangular shape that has its length along a horizontal orientation, using a spatial transform network ("STN"), prior to inputting the cropped image of text to the convolutional layer of the CNN.
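  • A spatial transform network is typically realized as a small localization network that predicts transform parameters plus a differentiable sampler; the minimal affine sketch below (the layer sizes and the restriction to an affine transform are assumptions) is illustrative only and is not the specific STN of this disclosure.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinySTN(nn.Module):
        """Minimal affine spatial transformer: a small localization net predicts a
        2x3 affine matrix and the input crop is resampled accordingly."""
        def __init__(self):
            super().__init__()
            self.localization = nn.Sequential(
                nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(8, 6),
            )
            # Start from the identity transform so training begins with "no warp".
            self.localization[-1].weight.data.zero_()
            self.localization[-1].bias.data.copy_(
                torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

        def forward(self, x):  # x: (B, 3, H, W)
            theta = self.localization(x).view(-1, 2, 3)
            grid = F.affine_grid(theta, x.size(), align_corners=False)
            return F.grid_sample(x, grid, align_corners=False)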
  • performing feature encoding on each value in the feature map may comprise mapping each sliced feature with each word or character, using the sequence layer of the CNN.
  • mapping each sliced feature with each word or character may comprise using at least one of long short term memory (“LSTM”) techniques, bidirectional LSTM (“BiLSTM”) techniques, multiple or stacked BiLSTM techniques, gated recurrent unit (“GRU”) techniques, or bidirectional GRU techniques, and/or the like.
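  • As an illustrative sketch only (the channel count, hidden size, and number of stacked layers are assumptions), a stacked bidirectional LSTM over width-wise slices of the feature map might look like:

    import torch
    import torch.nn as nn

    class SequenceEncoder(nn.Module):
        """Each column (width position) of the CNN feature map becomes one time step
        fed to a stacked bidirectional LSTM."""
        def __init__(self, feature_channels=512, hidden=256, layers=2):
            super().__init__()
            self.rnn = nn.LSTM(feature_channels, hidden, num_layers=layers,
                               bidirectional=True, batch_first=True)

        def forward(self, feature_map):          # (B, C, H, W) from the conv backbone
            pooled = feature_map.mean(dim=2)     # collapse height -> (B, C, W)
            sequence = pooled.permute(0, 2, 1)   # (B, W, C): one slice per time step
            encoded, _ = self.rnn(sequence)      # (B, W, 2 * hidden)
            return encoded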
  • the text may comprise a word or character in at least one language among a plurality of human languages and in at least one font among a plurality of fonts.
  • the computing system may apply a loss function on the classification of text, the loss function comprising one of a connectionist temporal classification (“CTC") loss, a cross entropy (“CE”) loss, a combination CTC-CE loss, a binary CE (“BCE”) loss, or a combination CTC-BCE loss, and/or the like.
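  • For the text recognition branch, a CTC loss can be applied to the per-time-step classification, as in the hedged sketch below; the tensor shapes follow the torch.nn.CTCLoss convention and the batch contents shown are purely illustrative.

    import torch
    import torch.nn as nn

    ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    # log_probs: (T, B, num_classes) log-softmax output of the second dense layer
    T, B, num_classes = 26, 4, 37
    log_probs = torch.randn(T, B, num_classes).log_softmax(dim=-1)
    targets = torch.randint(1, num_classes, (B, 10))           # encoded ground-truth text
    input_lengths = torch.full((B,), T, dtype=torch.long)
    target_lengths = torch.full((B,), 10, dtype=torch.long)

    text_recognition_loss = ctc(log_probs, targets, input_lengths, target_lengths)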
  • applying the loss function may comprise: generating an orientation or angle loss value based on the comparison of an output of the orientation or angle determination of the image of the text in the input image with a ground truth of an orientation of the image of the text in the input image; generating a text recognition loss value based on the comparison of the classification of text with a ground truth of the text; and generating an overall or total loss value based on a loss function that combines the orientation or angle loss value and the text recognition loss value.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)

Abstract

Novel tools and techniques are provided for implementing scene text recognition model with text orientation detection or text angle detection. In various embodiments, a computing system may perform feature extraction on an input image, containing text, using a convolutional layer of a convolutional neural network ("CNN") to produce a feature map, and may perform orientation or angle determination of the text in the input image, using a first dense layer of the CNN. If the image of the text is determined to be in the normal orientation or in response to the input image having been rotated to the normal orientation, the computing system may perform feature encoding on values in the feature map, using a sequence layer of the CNN to produce an encoded feature map. The computing system may use a second dense layer of the CNN to process each encoded feature to produce a classification of text.

Description

SCENE TEXT RECOGNITION MODEL WITH TEXT ORIENTATION OR ANGLE DETECTION
COPYRIGHT STATEMENT
[0001] A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
FIELD
[0002] The present disclosure relates, in general, to methods, systems, and apparatuses for implementing neural network, artificial intelligence ("Al"), machine learning, and/or deep learning applications, and, more particularly, to methods, systems, and apparatuses for implementing scene text recognition model with text orientation detection or text angle detection.
BACKGROUND
[0003] Optical character recognition ("OCR") is the electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text, whether from a scanned document, from a photo of a document, from a scene photo, or from subtitle text superimposed on an image, which is an important technical method for extracting information from images. OCR is widely used in industry for searching, positioning, translation, recommendation, and the like, and has great commercial value. With scene text recognition based on captured images, however, although conventional techniques exist, such techniques do not accurately and precisely determine angles, especially ones that are rotated by 180 degrees, making precise text recognition difficult or ineffective. Conventional scene text recognition systems and techniques thus require additional corrective processes that have the drawbacks or requirements of additional time, space, and energy. Especially where real-time or near-real-time implementations are involved, such additional time for scene text recognition processing is detrimental. With mobile applications, space requirements (i.e., memory requirements) and energy requirements (i.e., battery requirements) are potentially greater than the limited memory storage and battery capacity of mobile devices, and thus the required additional space and additional energy for corrective processes are also non-ideal.
[0004] Hence, there is a need for more robust and scalable solutions for implementing neural network, artificial intelligence ("Al"), machine learning, and/or deep learning applications.
SUMMARY
[0005] The techniques of this disclosure generally relate to tools and techniques for implementing neural network, Al, machine learning, and/or deep learning applications, and, more particularly, to methods, systems, and apparatuses for implementing scene text recognition model with text orientation detection or text angle detection.
[0006] In an aspect, a method may be provided for implementing scene text recognition model with text orientation detection or text angle detection. The method may comprise: performing, using a computing system, feature extraction on an input image using a convolutional layer of a convolutional neural network ("CNN") to produce a feature map, the input image containing an image of text that is cropped from a captured image of a scene; performing, using the computing system, orientation or angle determination of the image of the text in the input image, using a first fully connected layer ("first dense layer") of the CNN to process each value in the feature map; based on a determination that the image of the text in the input image is rotated compared with a normal orientation, rotating, using the computing system, the input image to the normal orientation; based on a determination that the image of the text in the input image is in the normal orientation or in response to the input image having been rotated to the normal orientation, performing, using the computing system, feature encoding on each value in the feature map, using a sequence layer of the CNN to produce an encoded feature map; and performing, using the computing system, text recognition on the image of text contained in the input image, using a second fully connected layer ("second dense layer") of the CNN to process each encoded feature in the encoded feature map to produce a classification of text.
[0007] In another aspect, an apparatus may be provided for implementing scene text recognition model with text orientation detection or text angle detection. The apparatus might comprise at least one processor and a non-transitory computer readable medium communicatively coupled to the at least one processor. The non-transitory computer readable medium might have stored thereon computer software comprising a set of instructions that, when executed by the at least one processor, causes the apparatus to: perform feature extraction on an input image using a convolutional layer of a convolutional neural network ("CNN") to produce a feature map, the input image containing an image of text that is cropped from a captured image of a scene; perform orientation or angle determination of the image of the text in the input image, using a first fully connected layer ("first dense layer") of the CNN to process each value in the feature map; based on a determination that the image of the text in the input image is rotated compared with a normal orientation, rotate the input image to the normal orientation; based on a determination that the image of the text in the input image is in the normal orientation or in response to the input image having been rotated to the normal orientation, perform feature encoding on each value in the feature map, using a sequence layer of the CNN to produce an encoded feature map; and perform text recognition on the image of text contained in the input image, using a second fully connected layer ("second dense layer") of the CNN to process each encoded feature in the encoded feature map to produce a classification of text.
[0008] In yet another aspect, a system may be provided for implementing scene text recognition model with text orientation detection or text angle detection. The system might comprise a computing system, which might comprise at least one first processor and a first non-transitory computer readable medium communicatively coupled to the at least one first processor. The first non-transitory computer readable medium might have stored thereon computer software comprising a first set of instructions that, when executed by the at least one first processor, causes the computing system to: perform feature extraction on an input image using a convolutional layer of a convolutional neural network ("CNN") to produce a feature map, the input image containing an image of text that is cropped from a captured image of a scene; perform orientation or angle determination of the image of the text in the input image, using a first fully connected layer ("first dense layer") of the CNN to process each value in the feature map; based on a determination that the image of the text in the input image is rotated compared with a normal orientation, rotate the input image to the normal orientation; based on a determination that the image of the text in the input image is in the normal orientation or in response to the input image having been rotated to the normal orientation, perform feature encoding on each value in the feature map, using a sequence layer of the CNN to produce an encoded feature map; and perform text recognition on the image of text contained in the input image, using a second fully connected layer ("second dense layer") of the CNN to process each encoded feature in the encoded feature map to produce a classification of text.
[0009] Various modifications and additions can be made to the embodiments discussed without departing from the scope of the invention.
For example, while the embodiments described above refer to particular features, the scope of this invention also includes embodiments having different combinations of features and embodiments that do not include all of the above-described features.
[0010] The details of one or more aspects of the disclosure are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the techniques described in this disclosure will be apparent from the description and drawings, and from the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] A further understanding of the nature and advantages of particular embodiments may be realized by reference to the remaining portions of the specification and the drawings, in which like reference numerals are used to refer to similar components. In some instances, a sub-label is associated with a reference numeral to denote one of multiple similar components. When reference is made to a reference numeral without specification to an existing sub-label, it is intended to refer to all such multiple similar components.
[0012] Fig. 1 is a schematic diagram illustrating a system for implementing scene text recognition model with text orientation or text angle detection, in accordance with various embodiments.
[0013] Figs. 2A and 2B are schematic block flow diagrams illustrating a non-limiting example of a method for implementing scene text recognition and scene text recognition model training, in accordance with various embodiments.
[0014] Figs. 3A-3D are diagrams illustrating a non-limiting example of input image processing, text orientation or angle determination, text classification, and training during implementation of scene text recognition model with text orientation or text angle detection, in accordance with various embodiments.
[0015] Figs. 4A-4E are flow diagrams illustrating a method for implementing scene text recognition model with text orientation or text angle detection, in accordance with various embodiments.
[0016] Fig. 5 is a block diagram illustrating an example of computer or system hardware architecture, in accordance with various embodiments.
[0017] Fig. 6 is a block diagram illustrating a networked system of computers, computing systems, or system hardware architecture, which can be used in accordance with various embodiments.
DETAILED DESCRIPTION
[0018] Overview
[0019] Various embodiments provide tools and techniques for implementing neural network, Al, machine learning, and/or deep learning applications, and, more particularly, to methods, systems, and apparatuses for implementing scene text recognition model with text orientation detection or text angle detection.
[0020] The scene text recognition ("STR") model is a key module in optical character recognition ("OCR") pipelines and directly reflects overall model accuracy and performance. The model should be robust and flexible for any input size and for multiple languages. In most situations, OCR software applications ("apps") or services are constrained by model size and running speed, as well as by the required accuracy and application scenarios. The model according to the various embodiments can largely overcome these restrictions. By using multi-task training and inference techniques, the various embodiments only need to add a dense layer after feature extraction by the convolutional neural network ("CNN") for angle classification. In this way, the scene text recognition model, in accordance with the various embodiments, has two outputs as a multi-task model - namely, (a) text orientation or angle and (b) recognized text.
[0021] In various embodiments, a computing system may perform feature extraction on an input image using a convolutional layer of a convolutional neural network ("CNN") to produce a feature map, the input image containing an image of text that is cropped from a captured image of a scene. The computing system may perform orientation or angle determination of the image of the text in the input image, using a first fully connected layer ("first dense layer") of the CNN to process each value in the feature map. Based on a determination that the image of the text in the input image is rotated compared with a normal orientation, the computing system may rotate the input image to the normal orientation. Based on a determination that the image of the text in the input image is in the normal orientation or in response to the input image having been rotated to the normal orientation, the computing system may perform feature encoding on each value in the feature map, using a sequence layer of the CNN to produce an encoded feature map. The computing system may perform text recognition on the image of text contained in the input image, using a second fully connected layer ("second dense layer") of the CNN to process each encoded feature in the encoded feature map to produce a classification of text.
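As a minimal, hedged sketch of the multi-task arrangement described above (not the claimed model: the backbone layers, channel sizes, class count, and hidden size are assumptions introduced only for illustration, and the conditional 180-degree rotation between the orientation head and the sequence layer is noted but omitted for brevity), the two outputs might be organized as follows:

    import torch
    import torch.nn as nn

    class MultiTaskSTR(nn.Module):
        """Shared convolutional backbone, a first dense layer for orientation
        (normal vs. 180-degree rotated), a bidirectional LSTM sequence layer, and a
        second dense layer producing per-time-step text classes."""
        def __init__(self, num_classes=37, channels=256, hidden=256):
            super().__init__()
            self.backbone = nn.Sequential(                    # stand-in CNN backbone
                nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(64, channels, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
            )
            self.orientation_head = nn.Linear(channels, 2)    # "first dense layer"
            self.sequence_layer = nn.LSTM(channels, hidden, bidirectional=True,
                                          batch_first=True)
            self.recognition_head = nn.Linear(2 * hidden, num_classes)  # "second dense layer"

        def forward(self, image):                             # image: (B, 3, H, W)
            feature_map = self.backbone(image)                # (B, C, H', W')
            pooled = feature_map.mean(dim=2)                  # (B, C, W')
            orientation_logits = self.orientation_head(pooled.mean(dim=2))  # (B, 2)
            # In the full pipeline the input would be rotated at this point if the
            # rotated class is predicted; that conditional step is omitted here.
            sequence = pooled.permute(0, 2, 1)                # (B, W', C)
            encoded, _ = self.sequence_layer(sequence)
            text_logits = self.recognition_head(encoded)      # (B, W', num_classes)
            return orientation_logits, text_logits

During training, orientation_logits would be compared with the orientation ground truth and text_logits with the text ground truth, as described in the training paragraphs below.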
[0022] In some embodiments, the computing system may comprise at least one of a scene text recognition computing system, a machine learning system, an artificial intelligence ("Al") system, a deep learning system, a neural network, the CNN, a fully convolutional network ("FCN"), a recurrent neural network ("RNN"), a processor on the user device, one or more graphics processing units ("GPUs"), a server computer over a network, a cloud computing system, or a distributed computing system, and/or the like.
[0023] According to some embodiments, the computing system may analyze the captured image to identify each location of at least one image of text contained within the captured image, using text detection system. For each image of text, the computing system may extract said image of text from the captured image, by cropping said image of text from the captured image. The computing system may input each cropped image of text to the convolutional layer of the CNN. In some cases, identifying each location of the at least one image of text contained within the captured image may comprise identifying, using the text detection system, coordinates for each of four corners defining a rectangular shape that encapsulates each image of text. In some instances, based on a determination that a cropped image of text is embodied by at least one of a non-rectangular shape or a rectangular shape that has its length that is not oriented along a horizontal orientation, the computing system may apply a transform on the cropped image of text to a rectangular shape that has its length along a horizontal orientation, using a spatial transform network ("STN"), prior to inputting the cropped image of text to the convolutional layer of the CNN.
[0024] In some embodiments, the feature map may comprise text data including at least one of shape, texture, or color of the text, and/or the like. In some instances, the orientation or angle determination may output one of the normal orientation or rotated orientation, wherein the rotated orientation may be 180 degrees compared with the normal orientation.
[0025] According to some embodiments, performing feature encoding on each value in the feature map may comprise mapping each sliced feature with each word or character, using the sequence layer of the CNN. In some cases, mapping each sliced feature with each word or character may comprise using at least one of long short term memory ("LSTM") techniques, bidirectional LSTM ("BiLSTM") techniques, multiple or stacked BiLSTM techniques, gated recurrent unit ("GRU") techniques, or bidirectional GRU techniques, and/or the like. In some instances, the text may comprise a word or character in at least one language among a plurality of human languages and in at least one font among a plurality of fonts.
[0026] In some embodiments, for training the CNN, the computing system may apply a loss function on the classification of text, the loss function comprising one of a connectionist temporal classification ("CTC") loss, a cross entropy ("CE") loss, a combination CTC-CE loss, a binary CE ("BCE") loss, or a combination CTC-BCE loss, and/or the like. In some cases, applying the loss function may comprise: generating an orientation or angle loss value based on the comparison of an output of the orientation or angle determination of the image of the text in the input image with a ground truth of an orientation of the image of the text in the input image; generating a text recognition loss value based on the comparison of the classification of text with a ground truth of the text; and generating an overall loss value based on a loss function that combines the orientation or angle loss value and the text recognition loss value.
[0027] The various aspects described herein provide a scene text recognition ("STR") model with text orientation detection or text angle detection. The scene text recognition model provides at least the following benefits: (1) predicting image input orientation (or direction or angle) in addition to performing the text recognition task; (2) reducing the model size and required memory by compressing multiple models into one; and (3) reducing processing time. Regarding (1), the STR model can not only recognize scene texts in the input image, but can also produce other information, including, but not limited to, text orientation (or direction or angle), or the like, by adding the fully connected layer after the CNN backbone. The STR model may also provide a template as a multi-task model for extracting more information given one image and one model. Regarding (2), when a conventional multi-task text recognition service is used, all of its constituent models need to be running to avoid prolonged wait or processing times, resulting in high memory consumption. The STR model compresses the multiple models into one model, thereby substantially reducing the size of the STR model and the memory required. Regarding (3), as the STR model compresses the multiple models into one model, processing time, in some cases, may be reduced.
[0028] These and other aspects of the system and method for performing scene text recognition model with text orientation detection or text angle detection are described in greater detail with respect to the figures.
[0029] The following detailed description illustrates a few embodiments in further detail to enable one of skill in the art to practice such embodiments. The described examples are provided for illustrative purposes and are not intended to limit the scope of the invention.
[0030] In the following description, for the purposes of explanation, numerous details are set forth in order to provide a thorough understanding of the described embodiments. It will be apparent to one skilled in the art, however, that other embodiments of the present invention may be practiced without some of these details. In other instances, some structures and devices are shown in block diagram form. Several embodiments are described herein, and while various features are ascribed to different embodiments, it should be appreciated that the features described with respect to one embodiment may be incorporated with other embodiments as well. By the same token, however, no single feature or features of any described embodiment should be considered essential to every embodiment of the invention, as other embodiments of the invention may omit such features.
[0031] Unless otherwise indicated, all numbers used herein to express quantities, dimensions, and so forth used should be understood as being modified in all instances by the term "about." In this application, the use of the singular includes the plural unless specifically stated otherwise, and use of the terms "and" and "or" means "and/or" unless otherwise indicated. Moreover, the use of the term "including," as well as other forms, such as "includes" and "included," should be considered nonexclusive. Also, terms such as "element" or "component" encompass both elements and components comprising one unit and elements and components that comprise more than one unit, unless specifically stated otherwise.
[0032] Various embodiments as described herein - while embodying (in some cases) software products, computer-performed methods, and/or computer systems - represent tangible, concrete improvements to existing technological areas, including, without limitation, text recognition technology, text orientation or angle recognition technology, machine learning technology, deep learning technology, artificial intelligence ("Al") technology, and/or the like. In other aspects, some embodiments can improve the functioning of user equipment or systems themselves (e.g., text recognition systems, text orientation or angle recognition systems, machine learning systems, deep learning systems, Al systems, etc.), for example, for training, by performing, using a computing system, feature extraction on an input image using a convolutional layer of a convolutional neural network ("CNN") to produce a feature map, the input image containing an image of text that is cropped from a captured image of a scene; performing, using the computing system, orientation or angle determination of the image of the text in the input image, using a first fully connected layer ("first dense layer") of the CNN to process each value in the feature map; based on a determination that the image of the text in the input image is rotated compared with a normal orientation, rotating, using the computing system, the input image to the normal orientation; based on a determination that the image of the text in the input image is in the normal orientation or in response to the input image having been rotated to the normal orientation, performing, using the computing system, feature encoding on each value in the feature map, using a sequence layer of the CNN to produce an encoded feature map; and performing, using the computing system, text recognition on the image of text contained in the input image, using a second fully connected layer ("second dense layer") of the CNN to process each encoded feature in the encoded feature map to produce a classification of text; and/or the like.
[0033] In particular, to the extent any abstract concepts are present in the various embodiments, those concepts can be implemented as described herein by devices, software, systems, and methods that involve novel functionality (e.g., steps or operations), such as, using a CNN to implement scene text recognition model with text orientation detection or text angle detection and training thereof, to generate text orientation or direction determination and text classification of text captured in an image(s), and/or the like, to name a few examples, that extend beyond mere conventional computer processing operations. These functionalities can produce tangible results outside of the implementing computer system, including, merely by way of example, optimized scene text recognition that (1) predicts image input orientation (or direction or angle) in addition to the text recognition, (2) reduces the model size and required memory by compressing multiple models into one, and (3) in some cases, reduces processing time (by virtue of the compression of multiple models into one, etc.), at least some of which may be observed or measured by users, scene text recognition system developers, and/or neural network/machine learning/deep learning researchers.
[0034] Some Embodiments
[0035] We now turn to the embodiments as illustrated by the drawings. Figs. 1-6 illustrate some of the features of the method, system, and apparatus for implementing neural network, artificial intelligence ("Al"), machine learning, and/or deep learning applications, and, more particularly, to methods, systems, and apparatuses for implementing scene text recognition model with text orientation detection or text angle detection, as referred to above. The methods, systems, and apparatuses illustrated by Figs. 1-6 refer to examples of different embodiments that include various components and steps, which can be considered alternatives or which can be used in conjunction with one another in the various embodiments. The description of the illustrated methods, systems, and apparatuses shown in Figs. 1-6 is provided for purposes of illustration and should not be considered to limit the scope of the different embodiments.
[0036] With reference to the figures, Fig. 1 is a schematic diagram illustrating a system 100 for implementing scene text recognition model with text orientation or text angle detection, in accordance with various embodiments.
[0037] In the non- limiting embodiment of Fig. 1, system 100 may comprise computing system 105 and an artificial intelligence ("Al") system 110. The computing system 105 and/or the Al system 110 may be part of a scene text recognition system 115, or may be separate, yet communicatively coupled with, the scene text recognition system 115. In some instances, the computing system 105 and the Al system 110 may be embodied as an integrated system. Alternatively, the computing system 105 and the Al system 110 may be embodied as separate, yet communicatively coupled, systems. In some embodiments, computing system 105 may include, without limitation, at least one of a scene text recognition computing system, a machine learning system, an artificial intelligence ("Al") system, a deep learning system, a neural network, a convolutional neural network ("CNN"), a fully convolutional network ("FCN"), a recurrent neural network ("RNN"), a processor on the user device, one or more graphics processing units ("GPUs"), a server computer over a network, a cloud computing system, or a distributed computing system, and/or the like.
[0038] System 100 may further comprise a user device 125 and one or more user devices 130a-130n (collectively, "user device 125," "user devices 130," or "user devices 125-130," or the like) that communicatively couple with at least one of computing system 105, Al system 110, and/or scene text recognition system 115, via network(s) 155 and via wired (denoted by line connections in Fig. 1) and/or wireless communications links (denoted by lightning bolt symbols in Fig. 1). According to some embodiments, the user devices 125-130 may each include, but are not limited to, a portable gaming device, a smart phone, a tablet computer, a laptop computer, an image sharing platform-compliant device, a web-based image sharing platform-compliant device, an app-based image sharing platform-compliant device, an image capture device, a video capture device, a law enforcement imaging device, a security system imaging device, a surveillance system imaging device, a military imaging device, and/or the like. In some embodiments, the user device 125 may include, without limitation, at least one of processor 125a, data store 125b, camera 125c, display device 125d, or communications system 125e, and/or the like. The user devices 125 and/or 130a-130n and an object 135 with the text 140 may be disposed within scene or location 145. In some instances, text 140 contained on or within object 135 may be visible within a field of view ("FOV") of a person or an image capture device, or the like (such as FOV 150, or the like).
[0039] In operation, computing system 105, Al system 110, and/or scene text recognition system or processor 125a (collectively, "computing system" or the like) may perform feature extraction on an input image using a convolutional layer of a CNN to produce a feature map, the input image containing an image of text (e.g., text 140, or the like) that is cropped from a captured image (e.g., captured image 160 or image captured by camera 125c, or the like) of a scene (e.g., scene or location 145, or the like). In some embodiments, a basic CNN architecture may include a general CNN backbone without the last several dense or classification layers (such as, but not limited to, residual neural network ("ResNet"), Oxford visual geometry group network ("VGG-16"), GhostNet, EfficientNet, etc.), if it can extract the feature at the pixel level. In some instances, the type of CNN may depend on the restriction of speed and memory. The computing system may perform orientation or angle determination of the image of the text in the input image, using a first fully connected layer ("first dense layer") of the CNN to process each value in the feature map. Based on a determination that the image of the text in the input image is rotated compared with a normal orientation, the computing system may rotate the input image to the normal orientation. Herein, "normal orientation" may refer to an upright orientation of text in which a top portion of characters of the text are facing up and a bottom portion of said characters of the text are facing down, without the text being rotated and where the text is angled at 0 degrees. Based on a determination that the image of the text in the input image is in the normal orientation or in response to the input image having been rotated to the normal orientation, the computing system may perform feature encoding on each value in the feature map, using a sequence layer of the CNN to produce an encoded feature map. The computing system may perform text recognition on the image of text contained in the input image, using a second fully connected layer ("second dense layer") of the CNN to process each encoded feature in the encoded feature map to produce a classification of text. Herein, "fully connected layer" or "dense layer" refers to a layer whose nodes or neurons are connected with every node or neuron in the preceding layer.
[0040] In some cases, such as shown in Fig. 1, text recognition (e.g., text recognition 165 that is output by the scene text recognition system, or the like) may comprise at least one of text orientation determination (e.g., text orientation determination 165a, or the like) and text classification (e.g., text classification 165b, or the like) - that is, whether the image of the text is oriented in the normal orientation and the output of the text recognition (after ensuring the image of text is in the normal orientation), respectively. In some embodiments, the feature map may comprise text data including at least one of shape, texture, or color of the text, and/or the like. In some instances, the orientation or angle determination may output one of the normal orientation or rotated orientation, wherein the rotated orientation may be 180 degrees compared with the normal orientation.
[0041] According to some embodiments, the computing system may analyze the captured image to identify each location of at least one image of text contained within the captured image, using text detection system. For each image of text, the computing system may extract said image of text from the captured image, by cropping said image of text from the captured image. The computing system may input each cropped image of text to the convolutional layer of the CNN. In some cases, identifying each location of the at least one image of text contained within the captured image may comprise identifying, using the text detection system, coordinates for each of four corners defining a rectangular shape that encapsulates each image of text. In some instances, based on a determination that a cropped image of text is embodied by at least one of a non-rectangular shape or a rectangular shape that has its length that is not oriented along a horizontal orientation, the computing system may apply a transform on the cropped image of text to a rectangular shape that has its length along a horizontal orientation, using a spatial transform network ("STN"), prior to inputting the cropped image of text to the convolutional layer of the CNN.
[0042] In some instances, performing feature encoding on each value in the feature map may comprise mapping each sliced feature with each word or character, using the sequence layer of the CNN. In some cases, mapping each sliced feature with each word or character may comprise using at least one of long short term memory ("LSTM") techniques, bidirectional LSTM ("BiLSTM") techniques, multiple or stacked BiLSTM techniques, gated recurrent unit ("GRU") techniques, or bidirectional GRU techniques, and/or the like. In some instances, the text may comprise a word or character in at least one language among a plurality of human languages and in at least one font among a plurality of fonts.
[0043] In some embodiments, for training the CNN, the computing system may apply a loss function on the classification of text, the loss function comprising one of a connectionist temporal classification ("CTC") loss, a cross entropy ("CE") loss, a combination CTC-CE loss, a binary CE ("BCE") loss, or a combination CTC-BCE loss, and/or the like. In some cases, applying the loss function may comprise: generating an orientation or angle loss value ("LossAng") based on the comparison of an output of the orientation or angle determination of the image of the text in the input image with a ground truth of an orientation of the image of the text in the input image; generating a text recognition loss value ("LossRec") based on the comparison of the classification of text with a ground truth of the text; and generating an overall loss value based on a loss function that combines the orientation or angle loss value and the text recognition loss value. The computing system may train the first dense layer and the second dense layer by updating these components of the CNN with weighted loss values "α × LossAng" and "β × LossRec," respectively, where α and β are weighting coefficients. Training may be repeated with various instances of characters, words, fonts, etc. until the loss values are minimized - in some cases, repeating training until a subsequently calculated overall loss value is reduced to a value that is less than a predetermined threshold value (either an actual value or a percentage value compared with the previous overall loss value, the predetermined threshold value including, but not limited to, one of 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, or 0.01, etc.).
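As an illustrative sketch of the weighted combination described in the paragraph above (the particular loss choices, the CTC tensor conventions, and the values of the weighting coefficients are assumptions for this sketch only):

    import torch
    import torch.nn as nn

    orientation_criterion = nn.CrossEntropyLoss()    # orientation: normal vs. rotated
    ctc_criterion = nn.CTCLoss(blank=0, zero_infinity=True)
    alpha, beta = 1.0, 1.0                           # example weighting coefficients

    def overall_loss(orientation_logits, orientation_gt,
                     log_probs, targets, input_lengths, target_lengths):
        loss_ang = orientation_criterion(orientation_logits, orientation_gt)
        loss_rec = ctc_criterion(log_probs, targets, input_lengths, target_lengths)
        # Overall loss = alpha * LossAng + beta * LossRec; back-propagating it updates
        # the first and second dense layers (and, optionally, the shared backbone).
        return alpha * loss_ang + beta * loss_rec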
[0044] These and other functions of the system 100 (and its components) are described in greater detail below with respect to Figs. 2-4.
[0045] Figs. 2A and 2B (collectively, "Fig. 2") are schematic block flow diagrams illustrating non-limiting examples 200 and 200' of a method for implementing scene text recognition and scene text recognition model training, respectively, in accordance with various embodiments.
[0046] With reference to the non-limiting embodiment 200 of Fig. 2A, a convolutional neural network ("CNN") system 205 (or other AI, deep learning, or machine learning system, or the like) may include, but is not limited to, a convolutional layer 210, a first fully connected layer ("first dense layer" or "dense layer 1") 215, a text rotation layer 220, a sequence layer 225, and a second fully connected layer ("second dense layer" or "dense layer 2") 230, or the like.
[0047] In some embodiments, scene text recognition may be performed by CNN 205, as follows. Convolutional layer 210 may receive an input image 235, the input image containing an image of text that is cropped from a captured image of a scene. Convolutional layer 210 may perform feature extraction on the input image to produce a feature map 240. The first dense layer 215 may then perform orientation or angle determination of the image of the text in the input image, to process each value in the feature map to produce text orientation 245.
[0048] The text rotation layer 220 may utilize text orientation 245 as a conditional input, such that: (0) if the text orientation 245 indicates that the image of the text is oriented in the normal orientation (i.e., an upright orientation of text in which a top portion of characters of the text are facing up and a bottom portion of said characters of the text are facing down, without the text being rotated and where the text is angled at 0 degrees, or the like) (in some cases, with a value of "0" or other value indicative of normal orientation, or the like), the text rotation layer 220 may relay the input image 235 (and the feature map 240), without performing rotation of the input image 235, and, in some cases, may output text rotation 250 that indicates that no text or image rotation has been performed (or that no text or image rotation need be performed); or (1) if the text orientation 245 indicates that the image of the text is oriented in a flipped or rotated orientation (i.e., an orientation of text that is 180 degrees compared with the normal orientation, or the like) (in some cases, with a value of "1" or other value indicative of rotated orientation, or the like), the text rotation layer 220 may rotate the input image 235 to the normal orientation, and, in some cases, may output text rotation 250 that indicates that text or image rotation has been performed.
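As a non-limiting illustration, the conditional rotation performed by the text rotation layer might be sketched as follows (assuming PyTorch image tensors; the helper name is hypothetical, while the "0"/"1" flag convention follows the description above):

```python
# Illustrative sketch; PyTorch and the helper name are assumptions.
import torch

def rotate_if_flipped(input_image, orientation_flag):
    """input_image: [batch, channel, height, width] tensor.
    orientation_flag: 0 = normal orientation, 1 = rotated 180 degrees."""
    if orientation_flag == 1:
        # Rotate twice by 90 degrees over the (height, width) axes = 180 degrees.
        return torch.rot90(input_image, k=2, dims=(2, 3))
    return input_image
```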
[0049] Based on a determination that the image of the text in the input image 235 is in the normal orientation or in response to the input image 235 having been rotated to the normal orientation (in some cases, based on text rotation 250, or the like), the sequence layer 225 may perform feature encoding on each value in the feature map 240 to produce encoded features 255. In some embodiments, performing feature encoding on each value in the feature map 240 may comprise mapping each sliced feature with each word or character, in some cases, using at least one of long short term memory ("LSTM") techniques, bidirectional LSTM ("BiLSTM") techniques, multiple or stacked BiLSTM techniques, gated recurrent unit ("GRU") techniques, or bidirectional GRU ("BiGRU") techniques, and/or the like. An LSTM is unidirectional, as it only uses past contexts. However, in image-based sequences, contexts from both directions are useful and complementary to each other. Therefore, in some cases, two LSTMs, one forward and one backward, can be combined into a bidirectional LSTM. Furthermore, multiple bidirectional LSTMs can be stacked, resulting in a deep bidirectional LSTM. As an alternative to LSTM, other recurrent neural networks, such as GRU, which is very similar to LSTM, use update and reset gates; a two-direction GRU, such as a BiGRU module, may likewise be used for feature encoding. The deep structure allows for a higher level of abstraction compared with a shallow one and may achieve significant performance improvements in the task of text recognition. In some instances, the text may comprise a word or character in at least one language among a plurality of human languages and in at least one font among a plurality of fonts.
[0050] The second dense layer 230 may then perform text recognition on the image of text contained in the input image 235 to process each encoded feature 255 in the encoded feature map to produce text classification 260.
[0051] Turning to the non-limiting embodiment 200' of Fig. 2B, for training the CNN (i.e., to perform scene text recognition model training, or the like), a computing system 265 may apply a loss function on the text classification 260, the loss function including, but not limited to, one of a connectionist temporal classification ("CTC") loss, a cross entropy ("CE") loss, a combination CTC-CE loss, a binary CE ("BCE") loss, or a combination CTC-BCE loss, and/or the like. According to some embodiments, computing system 265 may include, without limitation, an orientation or angle loss module 270a (e.g., a CE loss module or other suitable loss module for calculating loss for text orientation determination, or the like), a text recognition loss module 270b (e.g., a CTC loss module or other suitable loss module for calculating loss for text classification, or the like), a loss module 270 (or other suitable loss module for calculating overall or total loss, or the like), and one or more training modules (e.g., training modules 275a and 275b, or the like). [0052] In some embodiments, scene text recognition model training may be performed by computing system 265, as follows. Orientation or angle loss module 270a (e.g., a CE loss module or other suitable loss module) may generate orientation or angle loss value 290a (i.e., "LossAng" or the like) based on the comparison of text orientation 245 with a ground truth of an orientation of the image of the text in the input image (i.e., "Orientation GT 280" or the like). In some instances, since the CNN features can represent the original image (or input image), they can be used to determine whether the direction of the input is flipped or not (as described above). The output of CNN features may include, but is not limited to, at least one of batch, channel, width, or height, or a combination of two or more of these outputs (e.g., [batch, channel, width, height], or the like). Before the first fully connected layer, one can add a global max pooling over the spatial dimensions (e.g., [width, height], or the like). CNN feature extraction can reduce the dimension and remove padding or 0-value pixels. After adding the global max pooling and fully connected layer, the output has the dimension [batch, class]. The loss function may, in some cases, include cross entropy loss, such as defined by:
Loss_CE = -Σ_i [t_i log(p_i) + (1 - t_i) log(1 - p_i)], (Eqn. 1)

where t_i is the truth value taking a value 0 or 1 and p_i is the probability for the i-th class. [0053] Text recognition loss module 270b (e.g., a CTC loss module or other suitable loss module) may generate text recognition loss value 290b (i.e., "LossRec" or the like) based on the comparison of text classification 260 with a ground truth of the text (i.e., "Classification GT 285" or the like).
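Merely by way of a non-limiting illustration, the orientation branch described above (a global max pooling over the spatial dimensions followed by a fully connected layer producing [batch, class] logits) might be sketched as follows (assuming PyTorch; the channel count and the two-class output are hypothetical):

```python
# Illustrative sketch; PyTorch and the channel/class counts are assumptions.
import torch
import torch.nn as nn

class OrientationHead(nn.Module):
    def __init__(self, channels=64, num_classes=2):
        super().__init__()
        self.pool = nn.AdaptiveMaxPool2d(1)         # global max pool over [width, height]
        self.fc = nn.Linear(channels, num_classes)  # first dense layer -> [batch, class]

    def forward(self, cnn_features):
        # cnn_features: [batch, channel, height, width]
        pooled = self.pool(cnn_features).flatten(1)  # [batch, channel]
        return self.fc(pooled)                       # logits for normal vs. rotated

logits = OrientationHead()(torch.randn(8, 64, 1, 512))
print(logits.shape)  # torch.Size([8, 2])
```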
[0054] Loss module 270 (or other suitable loss module) may generate an overall or total loss value 290 (i.e., "LossTotal" or the like) based on a loss function that combines the orientation or angle loss value 290a and the text recognition loss value 290b, for example, but not limited to, "LossTotal = α x LossAng + β x LossRec" or the like.
[0055] Training module 275a and training module 275b (which, in some cases, may be embodied as a single training module, or the like) may train the first dense layer 215 and the second dense layer 230 by updating these components of CNN 205 with weighted loss values "α x LossAng" 295a and "β x LossRec" 295b, respectively, where α and β are weighting coefficients. Training may be repeated with various instances of characters, words, fonts, etc. until the loss values are minimized - in some cases, repeating training until a subsequently calculated overall loss value is reduced to a value that is less than a predetermined threshold value (either an actual value or a percentage value compared with the previous overall loss value, the predetermined threshold value including, but not limited to, one of 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, or 0.01, etc.).
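As a non-limiting sketch of such a training loop with the threshold-based stopping rule (assuming PyTorch; the model interface, batch fields, threshold default, and the overall_loss helper from the earlier sketch are hypothetical placeholders):

```python
# Illustrative sketch; the model interface, batch layout, and threshold are assumed,
# and overall_loss refers to the hypothetical combined-loss helper sketched above.
def train_until_converged(model, batches, optimizer, threshold=0.05, max_epochs=100):
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for batch in batches:
            optimizer.zero_grad()
            angle_logits, rec_log_probs = model(batch["image"])
            loss = overall_loss(angle_logits, batch["angle_gt"], rec_log_probs,
                                batch["targets"], batch["input_lengths"],
                                batch["target_lengths"])
            loss.backward()     # weighted gradients update both dense layers
            optimizer.step()
            epoch_loss += loss.item()
        # Stop once the averaged overall loss falls below the predetermined threshold.
        if epoch_loss / max(len(batches), 1) < threshold:
            break
```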
[0056] Figs. 3A-3D (collectively, "Fig. 3") are diagrams illustrating non-limiting examples 300, 300', 300", and 300"' of input image processing, text orientation or angle determination, text classification, and training, respectively, during implementation of a scene text recognition model with text orientation or text angle detection, in accordance with various embodiments. The processes of input image processing (Fig. 3A), text orientation or angle determination (Fig. 3B), text classification (Fig. 3C), and training (Fig. 3D) successively proceed from one figure to the next via circular markers denoted "A," "B," and "C," respectively. The values used in this non-limiting example are arbitrary values used merely for illustrative purposes, and the various embodiments are not limited to such values, nor do the arbitrary values correspond to values arising from actual text recognition of the non-limiting example captured image (using the scene text recognition model according to the various embodiments).
[0057] With reference to the non-limiting example 300 of Fig. 3 A, input image processing may be performed as follows. A camera 305 (or other image capture device, or the like; similar to camera 125c of Fig. 1, or the like) may capture an image 345 of a scene that contains text (in this case, a highway sign indicating directions to the city of Los Angeles, although not limited to such images or types of images, or to such types of text). In some instances, the text may comprise a word or character in at least one language among a plurality of human languages and in at least one font among a plurality of fonts. When mobile device cameras are used to capture images, it is possible (and, in some cases, likely) that the resultant captured image is flipped 180 degrees from the normal orientation (such as the case with captured image 345 in Fig. 3A).
[0058] Computing system 310 (similar to computing system 105 or 265 of Fig. 1 or 2, or the like) may identify text (or location of text) within the captured image 345, and, once the text (or location of text) has been identified, may generate a bounding box 350 around the text (or location of text). Computing system 310 may then crop the image to produce a cropped image 355. In the case that the cropped image 355 of text is embodied by at least one of a non-rectangular shape or a rectangular shape that has its length not oriented along a horizontal orientation (in this case, a rectangular shape that is rotated relative to a horizontal plane, orientation, or direction by an angle θ that is non-zero, or the like), the computing system or a spatial transform network ("STN") 315 may apply a transform to the cropped image 355 of text to produce an input image 360 having a rectangular shape that has its length along a horizontal orientation (i.e., either 0 or 180 degrees relative to a horizontal plane, orientation, or direction). The process may then continue to text orientation or angle determination in Fig. 3B, following the circular marker denoted, "A."
[0059] Referring to the non-limiting example 300' of Fig. 3B, text orientation or angle determination may be performed as follows. Convolutional neural network ("CNN") 320 (similar to AI system 110 or CNN 205 of Fig. 1 or 2, or the like) may receive the input image 360 having characteristics 365 (in this case, a height of 41 units, a width of 153 units, and 3 channels (e.g., corresponding to red, green, blue (or "RGB") channels, or the like)). Convolutional layer 325 of CNN 320 (similar to convolutional layer 210 of Fig. 2, or the like) may perform feature extraction on the input image 360 to produce a feature map having extracted features 370 (in this case, transformed features having a height of 1, a width of 512, and 64 channels, or the like). First fully connected layer ("first dense layer") 330 of the CNN (similar to first dense layer 215 of Fig. 2, or the like) may perform orientation or angle determination 375 of the image of the text in the input image, to process each value in the feature map. In this case, the orientation or angle determination may indicate that the text in the input image is in a rotated orientation that is 180 degrees compared with the normal orientation (in some cases, with a rotated flag having a value of "1" or the like). Based on a determination that the image of the text in the input image is rotated compared with the normal orientation (i.e., rotated by 180 degrees compared with the normal orientation), computing system 310 may rotate the input image 360 to the normal orientation, such that a rotated state 380 of the resultant image would indicate that the image is in the normal orientation (in some cases, with a rotated flag having a value of "0" or the like). The process may then continue to text classification in Fig. 3C, following the circular marker denoted, "B."
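To make the quoted shapes concrete, a toy sketch follows (assuming PyTorch; this is not the actual convolutional stack of the embodiments, and the adaptive pooling layer is used only to fix the output size) that maps a 41 x 153 x 3 input to a feature map with height 1, width 512, and 64 channels:

```python
# Toy sketch only; the real convolutional backbone is not specified here.
import torch
import torch.nn as nn

backbone = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveMaxPool2d((1, 512)),   # collapse height to 1, width to 512 slices
)

features = backbone(torch.randn(1, 3, 41, 153))   # [batch, 3, 41, 153] input image
print(features.shape)                             # torch.Size([1, 64, 1, 512])
```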
[0060] Turning to the non-limiting example 300" of Fig. 3C, text classification may be performed as follows. Based on a determination that the image of the text in the input image 360 is in the normal orientation or in response to the input image 360 having been rotated to the normal orientation, computing system 310 or sequence layer 335 of CNN 320 (similar to sequence layer 225 of Fig. 2, or the like) may perform feature encoding on each value in the feature map, to produce an encoded feature map 385. In some embodiments, performing feature encoding on each value in the feature map may comprise mapping each sliced feature with each word or character (in this case, producing sliced features: "L"; "o"; "s"; "A"; "n"; "g"; "e"; "1"; "e"; and "s"). In some cases, mapping each sliced feature with each word or character may comprise using at least one of long short term memory ("LSTM") techniques, bidirectional LSTM ("BiLSTM") techniques, multiple or stacked BiLSTM techniques, gated recurrent unit ("GRU") techniques, or bidirectional GRU techniques, and/or the like.
[0061] Computing system 310 or second fully connected layer ("second dense layer") 340 of CNN 320 (similar to second dense layer 230 of Fig. 2, or the like) may perform text recognition on the image of text contained in the input image, to process each encoded feature in the encoded feature map to produce a text classification 390. According to some embodiments, the text classification 390 may include classification based on index values in an index or dictionary in which each value corresponds to a number, a symbol, a character (e.g., an English alphabetic character, a character in one of a plurality of other languages), or the like. In this non-limiting example - assuming an index in which the numbers 0 through 9 are represented by values "0" through "9," respectively, with upper case English alphabetic characters being represented by values "10" through "35," and with lower case English alphabetic characters being represented by values "36" through "61," or the like, although the various embodiments are not limited to such an index, but may allow for any suitable index with any suitable order of entries, or the like - text classification 390 may have values "21," "50," "54," "10," "49," "42," "40," "47," "1," "40," and "54," or the like, corresponding to "L," "o," "s," "A," "n," "g," "e," "1," "e," and "s," respectively, or "Los Angeles" (which incorrectly classifies the letter "l" as the number "1," a common optical character recognition ("OCR") error). In some instances, the index or dictionary may further include values representing each character in each of a plurality of non-English languages, where the second dense layer is updated or trained to recognize each of these characters regardless of any known typeset fonts or handwritten scripts, or the like. For training purposes, the process may then continue to training in Fig. 3D, following the circular marker denoted, "C."
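By way of a non-limiting illustration, decoding index values under the assumed 62-entry index (digits at values 0-9, upper case letters at 10-35, lower case letters at 36-61) might be sketched as follows (Python; the helper name is hypothetical):

```python
# Illustrative sketch of decoding index values under the assumed 62-entry index.
import string

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase  # 62 entries

def decode(index_values):
    return "".join(ALPHABET[i] for i in index_values)

print(decode([21, 50, 54]))  # "Los"
print(decode([1]))           # "1" -- the digit commonly confused with the letter "l" (value 47)
```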
[0062] With respect to the non-limiting example 300"' of Fig. 3D, training may be performed as follows. Computing system 310 may generate orientation or angle loss value 395a (in this case, with "LossAng" having a value of "0" or the like) based on the comparison of text orientation (i.e., with the rotated flag having a value of "1" or the like) with a ground truth of an orientation of the image of the text in the input image (i.e., with a ground truth ("GT") rotated flag having a value of "1" or the like). Computing system 310 may generate text recognition loss value 395b (in this case, with "LossRec" having a value of ". . ." or the like) based on the comparison of text classification (in this case, text classification 390 having values "21," "50," "54," "10," "49," "42," "40," "47," "1," "40," and "54," or the like, corresponding to "L," "o," "s," "A," "n," "g," "e," "1," "e," and "s," respectively, or "Los Angeles" or the like) with a ground truth of the text (i.e., with Classification GT having values "21," "50," "54," "10," "49," "42," "40," "47," "47," "40," and "54," or the like, corresponding to "L," "o," "s," "A," "n," "g," "e," "l," "e," and "s," respectively, or "Los Angeles" or the like). Computing system 310 may generate an overall or total loss value 395c based on a loss function that combines the orientation or angle loss value 395a and the text recognition loss value 395b, for example, but not limited to, "LossTotal = α x LossAng + β x LossRec" or the like.
[0063] Computing system 310 may train or update the scene text recognition model of the AI system - in particular, the first dense layer 330 and the second dense layer 340 - by updating these components of CNN 320 with weighted loss values "α x LossAng" and "β x LossRec", respectively, where α and β are weighting coefficients. Training may be repeated with various instances of characters, words, fonts, etc. until the loss values are minimized - in some cases, repeating training until a subsequently calculated overall loss value is reduced to a value that is less than a predetermined threshold value (either an actual value or a percentage value compared with the previous overall loss value, the predetermined threshold value including, but not limited to, one of 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, or 0.01, etc.). In this case, training may be repeated until the scene text recognition model correctly distinguishes and recognizes the letter "l" as being distinct from the number "1," and also accurately and precisely recognizes text and characters in its index or dictionary, regardless of language or font.
[0064] Figs. 4A-4E (collectively, "Fig. 4") are flow diagrams illustrating a method 400 for implementing scene text recognition model with text orientation or text angle detection, in accordance with various embodiments.
[0065] Method 400 of Fig. 4A continues onto Fig. 4D following the circular marker denoted, "A," and may return to Fig. 4A from Fig. 4B and/or 4D following the circular marker denoted, "B."
[0066] While the techniques and procedures are depicted and/or described in a certain order for purposes of illustration, it should be appreciated that certain procedures may be reordered and/or omitted within the scope of various embodiments. Moreover, while the method 400 illustrated by Fig. 4 can be implemented by or with (and, in some cases, are described below with respect to) the systems, examples, or embodiments 100, 200, 200', 300, 300', 300", and 300'" of Figs. 1, 2A, 2B, 3A, 3B, 3C, and 3D, respectively (or components thereof), such methods may also be implemented using any suitable hardware (or software) implementation. Similarly, while each of the systems, examples, or embodiments 100, 200, 200', 300, 300', 300", and 300'" of Figs. 1, 2A, 2B, 3A, 3B, 3C, and 3D, respectively (or components thereof), can operate according to the method 400 illustrated by Fig. 4 (e.g., by executing instructions embodied on a computer readable medium), the systems, examples, or embodiments 100, 200, 200', 300, 300', 300", and 300'" of Figs. 1, 2A, 2B, 3A, 3B, 3C, and 3D can each also operate according to other modes of operation and/or perform other suitable procedures.
[0067] In the non- limiting embodiment of Fig. 4A, method 400, at block 405, may comprise performing, using a computing system, feature extraction on an input image using a convolutional layer of a convolutional neural network ("CNN") to produce a feature map, the input image containing an image of text that is cropped from a captured image of a scene. At block 410, method 400 may comprise performing, using the computing system, orientation or angle determination of the image of the text in the input image, using a first fully connected layer ("first dense layer") of the CNN to process each value in the feature map. Method 400 may further comprise, at block 415, based on a determination that the image of the text in the input image is rotated compared with a normal orientation, rotating, using the computing system, the input image to the normal orientation. Method 400, at block 420, may comprise, based on a determination that the image of the text in the input image is in the normal orientation or in response to the input image having been rotated to the normal orientation, performing, using the computing system, feature encoding on each value in the feature map, using a sequence layer of the CNN to produce an encoded feature map. Method 400 may further comprise performing, using the computing system, text recognition on the image of text contained in the input image, using a second fully connected layer ("second dense layer") of the CNN to process each encoded feature in the encoded feature map to produce a classification of text (block 425).
[0068] In some embodiments, the computing system may comprise at least one of a scene text recognition computing system, a machine learning system, an artificial intelligence ("Al") system, a deep learning system, a neural network, the CNN, a fully convolutional network ("FCN"), a recurrent neural network ("RNN"), a processor on the user device, one or more graphics processing units ("GPUs"), a server computer over a network, a cloud computing system, or a distributed computing system, and/or the like. Herein, "normal orientation" may refer to an upright orientation of text in which a top portion of characters of the text are facing up and a bottom portion of said characters of the text are facing down, without the text being rotated and where the text is angled at 0 degrees.
[0069] According to some embodiments, the feature map may comprise text data including at least one of shape, texture, or color of the text, and/or the like. In some instances, the orientation or angle determination may output one of the normal orientation or rotated orientation, where the rotated orientation may be 180 degrees compared with the normal orientation.
[0070] Method 400 may continue onto the process at block 470 in Fig. 4D following the circular marker denoted, "A."
[0071] With reference to Fig. 4B, method 400 may further comprise receiving, using the computing system, the captured image. Method 400 may also comprise, at block 435, analyzing, using the computing system, the captured image to identify each location of at least one image of text contained within the captured image, using a text detection system. In some cases, identifying each location of the at least one image of text contained within the captured image (at block 435) may comprise identifying, using the computing system and using the text detection system, coordinates for each of four corners defining a rectangular shape that encapsulates each image of text (block 440). Method 400 may further comprise, for each image of text, extracting, using the computing system, said image of text from the captured image, by cropping said image of text from the captured image (block 445). At block 450, method 400 may further comprise determining whether a cropped image of text is embodied by at least one of a non-rectangular shape or a rectangular shape that has its length not oriented along a horizontal orientation. If so, method 400 may continue onto the process at block 455. If not, method 400 may continue onto the process at block 460.
[0072] At block 455, method 400 may comprise, based on a determination that the cropped image of text is embodied by at least one of a non-rectangular shape or a rectangular shape that has its length that is not oriented along a horizontal orientation, applying, using the computing system, a transform on the cropped image of text to a rectangular shape that has its length along a horizontal orientation, using a spatial transform network ("STN"). Method 400 may continue onto the process at block 460. At block 460, method 400 may further comprise inputting, using the computing system, each cropped image of text to the convolutional layer of the CNN.
[0073] Method 400 may return to the process at block 405 in Fig. 4A following the circular marker denoted, "B."
[0074] Turning to Fig. 4C, performing feature encoding on each value in the feature map (at block 420) may comprise mapping, using the computing system, each sliced feature with each word or character, using the sequence layer of the CNN (block 465). In some cases, mapping each sliced feature with each word or character (at block 465) may comprise using at least one of long short term memory ("LSTM") techniques, bidirectional LSTM ("BiLSTM") techniques, multiple or stacked BiLSTM techniques, gated recurrent unit ("GRU") techniques, or bidirectional GRU techniques, and/or the like. In some instances, the text may comprise a word or character in at least one language among a plurality of human languages and in at least one font among a plurality of fonts. [0075] At block 470 in Fig. 4D (following the circular marker denoted, "A," in Fig. 4A), method 400 may comprise, for training the CNN, applying, using the computing system, a loss function on the classification of text. In some embodiments, the loss function may include, without limitation, one of a connectionist temporal classification ("CTC") loss, a cross entropy ("CE") loss, a combination CTC-CE loss, a binary CE ("BCE") loss, or a combination CTC-BCE loss, and/or the like. Method 400 may further comprise, at block 475, updating at least one of the first dense layer or the second dense layer based on the loss function.
[0076] Method 400 may return to the process at block 405 in Fig. 4A following the circular marker denoted, "B."
[0077] Referring to Fig. 4E, applying the loss function (at block 470) may comprise generating, using the computing system, an orientation or angle loss value based on the comparison of an output of the orientation or angle determination of the image of the text in the input image with a ground truth of an orientation of the image of the text in the input image (block 480); generating, using the computing system, a text recognition loss value based on the comparison of the classification of text with a ground truth of the text (block 485); and generating, using the computing system, an overall or total loss value based on a loss function that combines the orientation or angle loss value and the text recognition loss value (block 490). Training may be repeated with various instances of characters, words, fonts, etc. until the loss values are minimized - in some cases, repeating training until a subsequently calculated overall loss value is reduced to a value that is less than a predetermined threshold value (either actual value or percentage value compared with the previous overall loss value, the predetermined threshold value including, but not limited to, one of 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, or 0.01, etc.).
[0078] Examples of System and Hardware Implementation
[0079] Fig. 5 is a block diagram illustrating an example of computer or system hardware architecture, in accordance with various embodiments. Fig. 5 provides a schematic illustration of one embodiment of a computer system 500 of the service provider system hardware that can perform the methods provided by various other embodiments, as described herein, and/or can perform the functions of computer or hardware system (i.e., computing systems 105, 265, and 310, artificial intelligence ("Al") system 110, scene text recognition system 115, user devices 125 and 130a-130n, convolutional neural network ("CNN") systems 205 and 320, etc.), as described above. It should be noted that Fig. 5 is meant only to provide a generalized illustration of various components, of which one or more (or none) of each may be utilized as appropriate. Fig. 5, therefore, broadly illustrates how individual system elements may be implemented in a relatively separated or relatively more integrated manner.
[0080] The computer or hardware system 500 - which might represent an embodiment of the computer or hardware system (i.e., computing systems 105, 265, and 310, Al system 110, scene text recognition system 115, user devices 125 and 130a-130n, CNN system 205 and 320, etc.), described above with respect to Figs. 1-4 - is shown comprising hardware elements that can be electrically coupled via a bus 505 (or may otherwise be in communication, as appropriate). The hardware elements may include one or more processors 510, including, without limitation, one or more general-purpose processors and/or one or more special-purpose processors (such as microprocessors, digital signal processing chips, graphics acceleration processors, and/or the like); one or more input devices 515, which can include, without limitation, a mouse, a keyboard, and/or the like; and one or more output devices 520, which can include, without limitation, a display device, a printer, and/or the like.
[0081] The computer or hardware system 500 may further include (and/or be in communication with) one or more storage devices 525, which can comprise, without limitation, local and/or network accessible storage, and/or can include, without limitation, a disk drive, a drive array, an optical storage device, solid-state storage device such as a random access memory ("RAM") and/or a read-only memory ("ROM"), which can be programmable, flash-updateable, and/or the like. Such storage devices may be configured to implement any appropriate data stores, including, without limitation, various file systems, database structures, and/or the like. [0082] The computer or hardware system 500 might also include a communications subsystem 530, which can include, without limitation, a modem, a network card (wireless or wired), an infra-red communication device, a wireless communication device and/or chipset (such as a Bluetooth™ device, an 802.11 device, a WiFi device, a WiMax device, a WWAN device, cellular communication facilities, etc.), and/or the like. The communications subsystem 530 may permit data to be exchanged with a network (such as the network described below, to name one example), with other computer or hardware systems, and/or with any other devices described herein. In many embodiments, the computer or hardware system 500 will further comprise a working memory 535, which can include a RAM or ROM device, as described above.
[0083] The computer or hardware system 500 also may comprise software elements, shown as being currently located within the working memory 535, including an operating system 540, device drivers, executable libraries, and/or other code, such as one or more application programs 545, which may comprise computer programs provided by various embodiments (including, without limitation, hypervisors, VMs, and the like), and/or may be designed to implement methods, and/or configure systems, provided by other embodiments, as described herein. Merely by way of example, one or more procedures described with respect to the method(s) discussed above might be implemented as code and/or instructions executable by a computer (and/or a processor within a computer); in an aspect, then, such code and/or instructions can be used to configure and/or adapt a general purpose computer (or other device) to perform one or more operations in accordance with the described methods.
[0084] A set of these instructions and/or code might be encoded and/or stored on a non-transitory computer readable storage medium, such as the storage device(s) 525 described above. In some cases, the storage medium might be incorporated within a computer system, such as the system 500. In other embodiments, the storage medium might be separate from a computer system (i.e., a removable medium, such as a compact disc, etc.), and/or provided in an installation package, such that the storage medium can be used to program, configure, and/or adapt a general purpose computer with the instructions/code stored thereon. These instructions might take the form of executable code, which is executable by the computer or hardware system 500 and/or might take the form of source and/or installable code, which, upon compilation and/or installation on the computer or hardware system 500 (e.g., using any of a variety of generally available compilers, installation programs, compression/decompression utilities, etc.) then takes the form of executable code.
[0085] It will be apparent to those skilled in the art that substantial variations may be made in accordance with particular requirements. For example, customized hardware (such as programmable logic controllers, field-programmable gate arrays, application-specific integrated circuits, and/or the like) might also be used, and/or particular elements might be implemented in hardware, software (including portable software, such as applets, etc.), or both. Further, connection to other computing devices such as network input/output devices may be employed.
[0086] As mentioned above, in one aspect, some embodiments may employ a computer or hardware system (such as the computer or hardware system 500) to perform methods in accordance with various embodiments of the invention.
According to a set of embodiments, some or all of the procedures of such methods are performed by the computer or hardware system 500 in response to processor 510 executing one or more sequences of one or more instructions (which might be incorporated into the operating system 540 and/or other code, such as an application program 545) contained in the working memory 535. Such instructions may be read into the working memory 535 from another computer readable medium, such as one or more of the storage device(s) 525. Merely by way of example, execution of the sequences of instructions contained in the working memory 535 might cause the processor(s) 510 to perform one or more procedures of the methods described herein. [0087] The terms "machine readable medium" and "computer readable medium," as used herein, refer to any medium that participates in providing data that causes a machine to operate in some fashion. In an embodiment implemented using the computer or hardware system 500, various computer readable media might be involved in providing instructions/code to processor(s) 510 for execution and/or might be used to store and/or carry such instructions/code (e.g., as signals). In many implementations, a computer readable medium is a non-transitory, physical, and/or tangible storage medium. In some embodiments, a computer readable medium may take many forms, including, but not limited to, non-volatile media, volatile media, or the like. Non-volatile media includes, for example, optical and/or magnetic disks, such as the storage device(s) 525. Volatile media includes, without limitation, dynamic memory, such as the working memory 535. In some alternative embodiments, a computer readable medium may take the form of transmission media, which includes, without limitation, coaxial cables, copper wire, and fiber optics, including the wires that comprise the bus 505, as well as the various components of the communication subsystem 530 (and/or the media by which the communications subsystem 530 provides communication with other devices). In an alternative set of embodiments, transmission media can also take the form of waves (including without limitation radio, acoustic, and/or light waves, such as those generated during radiowave and infra-red data communications).
[0088] Common forms of physical and/or tangible computer readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read instructions and/or code.
[0089] Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to the processor(s) 510 for execution. Merely by way of example, the instructions may initially be carried on a magnetic disk and/or optical disc of a remote computer. A remote computer might load the instructions into its dynamic memory and send the instructions as signals over a transmission medium to be received and/or executed by the computer or hardware system 500. These signals, which might be in the form of electromagnetic signals, acoustic signals, optical signals, and/or the like, are all examples of carrier waves on which instructions can be encoded, in accordance with various embodiments of the invention.
[0090] The communications subsystem 530 (and/or components thereof) generally will receive the signals, and the bus 505 then might carry the signals (and/or the data, instructions, etc. carried by the signals) to the working memory 535, from which the processor(s) 510 retrieves and executes the instructions. The instructions received by the working memory 535 may optionally be stored on a storage device 525 either before or after execution by the processor(s) 510.
[0091] As noted above, a set of embodiments comprises methods and systems for implementing neural network, artificial intelligence ("Al"), machine learning, and/or deep learning applications, and, more particularly, to methods, systems, and apparatuses for implementing scene text recognition model with text orientation detection or text angle detection. Fig. 6 illustrates a schematic diagram of a system 600 that can be used in accordance with one set of embodiments. The system 600 can include one or more user computers, user devices, or customer devices 605. A user computer, user device, or customer device 605 can be a general purpose personal computer (including, merely by way of example, desktop computers, tablet computers, laptop computers, handheld computers, and the like, running any appropriate operating system, several of which are available from vendors such as Apple, Microsoft Corp., and the like), cloud computing devices, a server(s), and/or a workstation computer(s) running any of a variety of commercially-available UNIX™ or UNIX-like operating systems. A user computer, user device, or customer device 605 can also have any of a variety of applications, including one or more applications configured to perform methods provided by various embodiments (as described above, for example), as well as one or more office applications, database client and/or server applications, and/or web browser applications. Alternatively, a user computer, user device, or customer device 605 can be any other electronic device, such as a thin- client computer, Internet-enabled mobile telephone, and/or personal digital assistant, capable of communicating via a network (e.g., the network(s) 610 described below) and/or of displaying and navigating web pages or other types of electronic documents. Although the system 600 is shown with two user computers, user devices, or customer devices 605, any number of user computers, user devices, or customer devices can be supported.
[0092] Some embodiments operate in a networked environment, which can include a network(s) 610. The network(s) 610 can be any type of network familiar to those skilled in the art that can support data communications using any of a variety of commercially-available (and/or free or proprietary) protocols, including, without limitation, TCP/IP, SNA™, IPX™, AppleTalk™, and the like. Merely by way of example, the network(s) 610 (similar to network(s) 155 of Fig. 1, or the like) can each include a local area network ("LAN"), including, without limitation, a fiber network, an Ethernet network, a Token-Ring™ network, and/or the like; a wide-area network ("WAN"); a wireless wide area network ("WWAN"); a virtual network, such as a virtual private network ("VPN"); the Internet; an intranet; an extranet; a public switched telephone network ("PSTN"); an infra-red network; a wireless network, including, without limitation, a network operating under any of the IEEE 802.11 suite of protocols, the Bluetooth™ protocol known in the art, and/or any other wireless protocol; and/or any combination of these and/or other networks. In a particular embodiment, the network might include an access network of the service provider (e.g., an Internet service provider ("ISP")). In another embodiment, the network might include a core network of the service provider, and/or the Internet.
[0093] Embodiments can also include one or more server computers 615. Each of the server computers 615 may be configured with an operating system, including, without limitation, any of those discussed above, as well as any commercially (or freely) available server operating systems. Each of the servers 615 may also be running one or more applications, which can be configured to provide services to one or more clients 605 and/or other servers 615.
[0094] Merely by way of example, one of the servers 615 might be a data server, a web server, a cloud computing device(s), or the like, as described above. The data server might include (or be in communication with) a web server, which can be used, merely by way of example, to process requests for web pages or other electronic documents from user computers 605. The web server can also run a variety of server applications, including HTTP servers, FTP servers, CGI servers, database servers, Java servers, and the like. In some embodiments of the invention, the web server may be configured to serve web pages that can be operated within a web browser on one or more of the user computers 605 to perform methods of the invention.
[0095] The server computers 615, in some embodiments, might include one or more application servers, which can be configured with one or more applications accessible by a client running on one or more of the client computers 605 and/or other servers 615. Merely by way of example, the server(s) 615 can be one or more general purpose computers capable of executing programs or scripts in response to the user computers 605 and/or other servers 615, including, without limitation, web applications (which might, in some cases, be configured to perform methods provided by various embodiments). Merely by way of example, a web application can be implemented as one or more scripts or programs written in any suitable programming language, such as Java™, C, C#™ or C++, and/or any scripting language, such as Perl, Python, or TCL, as well as combinations of any programming and/or scripting languages. The application server(s) can also include database servers, including, without limitation, those commercially available from Oracle™, Microsoft™, Sybase™, IBM™, and the like, which can process requests from clients (including, depending on the configuration, dedicated database clients, API clients, web browsers, etc.) running on a user computer, user device, or customer device 605 and/or another server 615. In some embodiments, an application server can perform one or more of the processes for implementing neural network, Al, machine learning, and/or deep learning applications, and, more particularly, to methods, systems, and apparatuses for implementing scene text recognition model with text orientation detection or text angle detection, as described in detail above. Data provided by an application server may be formatted as one or more web pages (comprising HTML, JavaScript, etc., for example) and/or may be forwarded to a user computer 605 via a web server (as described above, for example). Similarly, a web server might receive web page requests and/or input data from a user computer 605 and/or forward the web page requests and/or input data to an application server. In some cases, a web server may be integrated with an application server.
[0096] In accordance with further embodiments, one or more servers 615 can function as a file server and/or can include one or more of the files (e.g., application code, data files, etc.) necessary to implement various disclosed methods, incorporated by an application running on a user computer 605 and/or another server 615. Alternatively, as those skilled in the art will appreciate, a file server can include all necessary files, allowing such an application to be invoked remotely by a user computer, user device, or customer device 605 and/or server 615.
[0097] It should be noted that the functions described with respect to various servers herein (e.g., application server, database server, web server, file server, etc.) can be performed by a single server and/or a plurality of specialized servers, depending on implementation- specific needs and parameters.
[0098] In some embodiments, the system can include one or more databases 620a- 620n (collectively, "databases 620"). The location of each of the databases 620 is discretionary: merely by way of example, a database 620a might reside on a storage medium local to (and/or resident in) a server 615a (and/or a user computer, user device, or customer device 605). Alternatively, a database 620n can be remote from any or all of the computers 605, 615, so long as it can be in communication (e.g., via the network 610) with one or more of these. In a particular set of embodiments, a database 620 can reside in a storage-area network ("SAN") familiar to those skilled in the art. (Likewise, any necessary files for performing the functions attributed to the computers 605, 615 can be stored locally on the respective computer and/or remotely, as appropriate.) In one set of embodiments, the database 620 can be a relational database, such as an Oracle database, that is adapted to store, update, and retrieve data in response to SQL-formatted commands. The database might be controlled and/or maintained by a database server, as described above, for example.
[0099] According to some embodiments, system 600 may further comprise a computing system 625 (similar to computing systems 105, 265, and 310 of Figs. 1-3, or the like) and an artificial intelligence ("Al") system 630 (similar to Al system 110 or convolutional neural network ("CNN") systems 205 and 320 of Figs. 1-3, or the like), both of which may be part of a scene text recognition system 635 (similar to scene text recognition system 115 of Fig. 1, or the like). System 600 may also comprise a database(s) 640 (similar to database(s) 120 of Fig. 1, or the like) communicatively coupled to the scene text recognition system 635. System 600 may further comprise user device 645 (similar to user device 125 of Fig. 1, or the like), user devices 605 (including user devices 605a and 605b, or the like; similar to user devices 130a-130n of Fig. 1, or the like), an object 650 containing text 655 (which is visible within a field of view ("FOV") of a person or an image capture device, or the like (such as FOV 665, or the like); similar to objects 135, text 140, and FOVs 150 of Fig. 1, or the like), or the like. In some instances, user device 645 may include, without limitation, at least one of processor 645a, data store 645b, camera 645c, display device 645d, or communications system 645e, and/or the like (similar to processor 125a, data store 125b, camera 125c, display device 125d, or communications system 125e, respectively, of Fig. 1, or the like). The user devices 605 and/or 645 and the object 650 with the text 655 may be disposed within scene or location 660 (similar to scene or location 145 of Fig. 1, or the like).
[0100] In operation, computing system 625, Al system 630, and/or scene text recognition system or processor 645a (collectively, "computing system" or the like) may perform feature extraction on an input image using a convolutional layer of a CNN to produce a feature map, the input image containing an image of text (e.g., text 655, or the like) that is cropped from a captured image (e.g., image captured by camera 645c, or the like) of a scene (e.g., scene or location 660, or the like). The computing system may perform orientation or angle determination of the image of the text in the input image, using a first fully connected layer ("first dense layer") of the CNN to process each value in the feature map. Based on a determination that the image of the text in the input image is rotated compared with a normal orientation, the computing system may rotate the input image to the normal orientation. Herein, "normal orientation" may refer to an upright orientation of text in which a top portion of characters of the text are facing up and a bottom portion of said characters of the text are facing down, without the text being rotated and where the text is angled at 0 degrees. Based on a determination that the image of the text in the input image is in the normal orientation or in response to the input image having been rotated to the normal orientation, the computing system may perform feature encoding on each value in the feature map, using a sequence layer of the CNN to produce an encoded feature map. The computing system may perform text recognition on the image of text contained in the input image, using a second fully connected layer ("second dense layer") of the CNN to process each encoded feature in the encoded feature map to produce a classification of text.
[0101] In some embodiments, the feature map may comprise text data including at least one of shape, texture, or color of the text, and/or the like. In some instances, the orientation or angle determination may output one of the normal orientation or rotated orientation, wherein the rotated orientation may be 180 degrees compared with the normal orientation.
[0102] According to some embodiments, the computing system may analyze the captured image to identify each location of at least one image of text contained within the captured image, using text detection system. For each image of text, the computing system may extract said image of text from the captured image, by cropping said image of text from the captured image. The computing system may input each cropped image of text to the convolutional layer of the CNN. In some cases, identifying each location of the at least one image of text contained within the captured image may comprise identifying, using the text detection system, coordinates for each of four corners defining a rectangular shape that encapsulates each image of text. In some instances, based on a determination that a cropped image of text is embodied by at least one of a non-rectangular shape or a rectangular shape that has its length that is not oriented along a horizontal orientation, the computing system may apply a transform on the cropped image of text to a rectangular shape that has its length along a horizontal orientation, using a spatial transform network ("STN"), prior to inputting the cropped image of text to the convolutional layer of the CNN. [0103] In some instances, performing feature encoding on each value in the feature map may comprise mapping each sliced feature with each word or character, using the sequence layer of the CNN. In some cases, mapping each sliced feature with each word or character may comprise using at least one of long short term memory ("LSTM") techniques, bidirectional LSTM ("BiLSTM") techniques, multiple or stacked BiLSTM techniques, gated recurrent unit ("GRU") techniques, or bidirectional GRU techniques, and/or the like. In some instances, the text may comprise a word or character in at least one language among a plurality of human languages and in at least one font among a plurality of fonts.
[0104] In some embodiments, for training the CNN, the computing system may apply a loss function on the classification of text, the loss function comprising one of a connectionist temporal classification ("CTC") loss, a cross entropy ("CE") loss, a combination CTC-CE loss, a binary CE ("BCE") loss, or a combination CTC-BCE loss, and/or the like. In some cases, applying the loss function may comprise: generating an orientation or angle loss value based on the comparison of an output of the orientation or angle determination of the image of the text in the input image with a ground truth of an orientation of the image of the text in the input image; generating a text recognition loss value based on the comparison of the classification of text with a ground truth of the text; and generating an overall or total loss value based on a loss function that combines the orientation or angle loss value and the text recognition loss value.
[0105] These and other functions of the system 600 (and its components) are described in greater detail above with respect to Figs. 1-4.
[0106] While particular features and aspects have been described with respect to some embodiments, one skilled in the art will recognize that numerous modifications are possible. For example, the methods and processes described herein may be implemented using hardware components, software components, and/or any combination thereof. Further, while various methods and processes described herein may be described with respect to particular structural and/or functional components for ease of description, methods provided by various embodiments are not limited to any particular structural and/or functional architecture but instead can be implemented on any suitable hardware, firmware and/or software configuration. Similarly, while particular functionality is ascribed to particular system components, unless the context dictates otherwise, this functionality need not be limited to such and can be distributed among various other system components in accordance with the several embodiments. [0107] Moreover, while the procedures of the methods and processes described herein are described in a particular order for ease of description, unless the context dictates otherwise, various procedures may be reordered, added, and/or omitted in accordance with various embodiments. Moreover, the procedures described with respect to one method or process may be incorporated within other described methods or processes; likewise, system components described according to a particular structural architecture and/or with respect to one system may be organized in alternative structural architectures and/or incorporated within other described systems. Hence, while various embodiments are described with — or without — particular features for ease of description and to illustrate some aspects of those embodiments, the various components and/or features described herein with respect to a particular embodiment can be substituted, added and/or subtracted from among other described embodiments, unless the context dictates otherwise. Consequently, although several embodiments are described above, it will be appreciated that the invention is intended to cover all modifications and equivalents within the scope of the following claims.

Claims

WHAT IS CLAIMED IS:
1. A method for implementing scene text recognition model with text orientation detection or text angle detection, comprising: performing, using a computing system, feature extraction on an input image using a convolutional layer of a convolutional neural network ("CNN") to produce a feature map, the input image containing an image of text that is cropped from a captured image of a scene; performing, using the computing system, orientation or angle determination of the image of the text in the input image, using a first fully connected layer ("first dense layer") of the CNN to process each value in the feature map; based on a determination that the image of the text in the input image is rotated compared with a normal orientation, rotating, using the computing system, the input image to the normal orientation; based on a determination that the image of the text in the input image is in the normal orientation or in response to the input image having been rotated to the normal orientation, performing, using the computing system, feature encoding on each value in the feature map, using a sequence layer of the CNN to produce an encoded feature map; and performing, using the computing system, text recognition on the image of text contained in the input image, using a second fully connected layer ("second dense layer") of the CNN to process each encoded feature in the encoded feature map to produce a classification of text.
2. The method of claim 1, wherein the computing system comprises at least one of a scene text recognition computing system, a machine learning system, an artificial intelligence ("AI") system, a deep learning system, a neural network, the CNN, a fully convolutional network ("FCN"), a recurrent neural network ("RNN"), a processor on a user device, one or more graphics processing units ("GPUs"), a server computer over a network, a cloud computing system, or a distributed computing system.
3. The method of claim 1 or 2, further comprising:
analyzing, using the computing system, the captured image to identify each location of at least one image of text contained within the captured image, using a text detection system;
for each image of text, extracting, using the computing system, said image of text from the captured image, by cropping said image of text from the captured image; and
inputting, using the computing system, each cropped image of text to the convolutional layer of the CNN.
4. The method of claim 3, wherein identifying each location of the at least one image of text contained within the captured image comprises identifying, using the computing system and using the text detection system, coordinates for each of four corners defining a rectangular shape that encapsulates each image of text.
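As a non-limiting illustration of the cropping recited in claims 3 and 4 (the helper name and array layout are assumptions), an axis-aligned rectangle derived from the four detected corner coordinates can be cut out of the captured image as follows:

import numpy as np

def crop_from_corners(image, corners):
    # image: H x W x C array of the captured scene
    # corners: four (x, y) corner coordinates returned by the text detection system
    pts = np.asarray(corners, dtype=np.float64)
    x0, y0 = np.floor(pts.min(axis=0)).astype(int)
    x1, y1 = np.ceil(pts.max(axis=0)).astype(int)
    return image[y0:y1 + 1, x0:x1 + 1]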
5. The method of claim 3, further comprising: based on a determination that a cropped image of text is embodied by at least one of a non-rectangular shape or a rectangular shape whose length is not oriented along a horizontal orientation, applying, using the computing system, a transform to the cropped image of text to produce a rectangular shape that has its length along a horizontal orientation, using a spatial transform network ("STN"), prior to inputting the cropped image of text to the convolutional layer of the CNN.
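Claim 5 recites a spatial transform network ("STN") for rectifying a skewed or non-rectangular crop; the sketch below is only a simplified stand-in that uses a fixed OpenCV perspective warp to illustrate the rectification step itself. The output size and corner ordering are assumptions.

import cv2
import numpy as np

def rectify_quad(image, quad, out_w=128, out_h=32):
    # quad: four (x, y) corners of the text region, ordered clockwise from top-left
    src = np.asarray(quad, dtype=np.float32)
    dst = np.array([[0, 0], [out_w - 1, 0],
                    [out_w - 1, out_h - 1], [0, out_h - 1]], dtype=np.float32)
    M = cv2.getPerspectiveTransform(src, dst)
    # Warp the quadrilateral onto a horizontal rectangle before it is fed to the CNN.
    return cv2.warpPerspective(image, M, (out_w, out_h))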
6. The method of any of claims 1-5, wherein the feature map comprises text data including at least one of shape, texture, or color of the text.
7. The method of any of claims 1-6, wherein the orientation or angle determination outputs one of the normal orientation or a rotated orientation, wherein the rotated orientation is rotated 180 degrees relative to the normal orientation.
8. The method of any of claims 1-7, wherein performing feature encoding on each value in the feature map comprises: mapping, using the computing system, each sliced feature to each word or character, using the sequence layer of the CNN.
9. The method of claim 8, wherein mapping each sliced feature to each word or character comprises using at least one of long short-term memory ("LSTM") techniques, bidirectional LSTM ("BiLSTM") techniques, multiple or stacked BiLSTM techniques, gated recurrent unit ("GRU") techniques, or bidirectional GRU techniques.
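As a non-limiting illustration of the sequence layer options recited in claims 8 and 9 (all dimensions are assumptions), a stacked bidirectional LSTM maps each sliced feature along the width axis to an encoded feature, which the second dense layer then classifies per slice; nn.GRU could be substituted for the GRU-based variants.

import torch
import torch.nn as nn

feature_dim, hidden, num_classes = 256, 256, 97
sequence = nn.LSTM(feature_dim, hidden, num_layers=2,
                   bidirectional=True, batch_first=True)   # stacked BiLSTM
classifier = nn.Linear(2 * hidden, num_classes)            # second dense layer

sliced = torch.randn(1, 32, feature_dim)     # (batch, width slices, features)
encoded, _ = sequence(sliced)                # (1, 32, 2 * hidden) encoded feature map
per_slice_logits = classifier(encoded)       # (1, 32, num_classes) classification of text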
10. The method of any of claims 1-9, wherein the text comprises a word or character in at least one language among a plurality of human languages and in at least one font among a plurality of fonts.
11. The method of any of claims 1-10, further comprising: for training the CNN, applying, using the computing system, a loss function on the classification of text, the loss function comprising one of a connectionist temporal classification ("CTC") loss, a cross entropy ("CE") loss, a combination CTC-CE loss, a binary CE ("BCE") loss, or a combination CTC-BCE loss.
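Of the candidate losses recited in claim 11, the connectionist temporal classification ("CTC") loss is a common choice for this kind of per-slice text classification; the following is a minimal, non-limiting sketch with assumed shapes, in which class index 0 is reserved for the CTC blank.

import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
logits = torch.randn(32, 4, 97)                        # (time slices, batch, classes)
log_probs = logits.log_softmax(2)
targets = torch.randint(1, 97, (4, 10))                # ground-truth label indices (no blanks)
input_lengths = torch.full((4,), 32, dtype=torch.long)
target_lengths = torch.full((4,), 10, dtype=torch.long)
recognition_loss = ctc(log_probs, targets, input_lengths, target_lengths)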
12. The method of claim 11, wherein applying the loss function comprises:
generating, using the computing system, an orientation or angle loss value based on a comparison of an output of the orientation or angle determination of the image of the text in the input image with a ground truth of an orientation of the image of the text in the input image;
generating, using the computing system, a text recognition loss value based on a comparison of the classification of text with a ground truth of the text; and
generating, using the computing system, an overall loss value based on a loss function that combines the orientation or angle loss value and the text recognition loss value.
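A non-limiting sketch of the combined objective of claim 12 follows; the cross-entropy orientation loss, the CTC recognition loss, and the weighting factor alpha are assumptions made for illustration, not values taken from the disclosure.

import torch
import torch.nn.functional as F

def overall_loss(orient_logits, orient_gt, recognition_loss, alpha=0.1):
    # Orientation or angle loss value: compare the orientation head output with
    # the ground-truth orientation labels (0 = normal, 1 = rotated 180 degrees).
    orientation_loss = F.cross_entropy(orient_logits, orient_gt)
    # Overall loss value: weighted combination of the two loss values.
    return recognition_loss + alpha * orientation_loss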
13. An apparatus for implementing a scene text recognition model with text orientation detection or text angle detection, comprising:
at least one processor; and
a non-transitory computer readable medium communicatively coupled to the at least one processor, the non-transitory computer readable medium having stored thereon computer software comprising a set of instructions that, when executed by the at least one processor, causes the apparatus to:
perform feature extraction on an input image using a convolutional layer of a convolutional neural network ("CNN") to produce a feature map, the input image containing an image of text that is cropped from a captured image of a scene;
perform orientation or angle determination of the image of the text in the input image, using a first fully connected layer ("first dense layer") of the CNN to process each value in the feature map;
based on a determination that the image of the text in the input image is rotated compared with a normal orientation, rotate the input image to the normal orientation;
based on a determination that the image of the text in the input image is in the normal orientation or in response to the input image having been rotated to the normal orientation, perform feature encoding on each value in the feature map, using a sequence layer of the CNN to produce an encoded feature map; and
perform text recognition on the image of text contained in the input image, using a second fully connected layer ("second dense layer") of the CNN to process each encoded feature in the encoded feature map to produce a classification of text.
14. A system for implementing a scene text recognition model with text orientation detection or text angle detection, comprising:
a computing system, comprising:
at least one first processor; and
a first non-transitory computer readable medium communicatively coupled to the at least one first processor, the first non-transitory computer readable medium having stored thereon computer software comprising a first set of instructions that, when executed by the at least one first processor, causes the computing system to:
perform feature extraction on an input image using a convolutional layer of a convolutional neural network ("CNN") to produce a feature map, the input image containing an image of text that is cropped from a captured image of a scene;
perform orientation or angle determination of the image of the text in the input image, using a first fully connected layer ("first dense layer") of the CNN to process each value in the feature map;
based on a determination that the image of the text in the input image is rotated compared with a normal orientation, rotate the input image to the normal orientation;
based on a determination that the image of the text in the input image is in the normal orientation or in response to the input image having been rotated to the normal orientation, perform feature encoding on each value in the feature map, using a sequence layer of the CNN to produce an encoded feature map; and
perform text recognition on the image of text contained in the input image, using a second fully connected layer ("second dense layer") of the CNN to process each encoded feature in the encoded feature map to produce a classification of text.
15. The system of claim 14, wherein the computing system comprises at least one of a scene text recognition computing system, a machine learning system, an artificial intelligence ("AI") system, a deep learning system, a neural network, the CNN, a fully convolutional network ("FCN"), a recurrent neural network ("RNN"), a processor on a user device, one or more graphics processing units ("GPUs"), a server computer over a network, a cloud computing system, or a distributed computing system.
16. The system of claim 14 or 15, wherein the feature map comprises text data including at least one of shape, texture, or color of the text.
17. The system of any of claims 14-16, wherein the orientation or angle determination outputs one of the normal orientation or a rotated orientation, wherein the rotated orientation is rotated 180 degrees relative to the normal orientation.
18. The system of any of claims 14-17, wherein the text comprises a word or character in at least one language among a plurality of human languages and in at least one font among a plurality of fonts.
19. The system of any of claims 14-18, wherein the first set of instructions, when executed by the at least one first processor, further causes the computing system to: for training the CNN, apply a loss function on the classification of text, the loss function comprising one of a connectionist temporal classification ("CTC") loss, a cross entropy ("CE") loss, a combination CTC-CE loss, a binary CE ("BCE") loss, or a combination CTC-BCE loss.
20. The system of claim 19, wherein applying the loss function comprises:
generating, using the computing system, an orientation or angle loss value based on a comparison of an output of the orientation or angle determination of the image of the text in the input image with a ground truth of an orientation of the image of the text in the input image;
generating, using the computing system, a text recognition loss value based on a comparison of the classification of text with a ground truth of the text; and
generating, using the computing system, an overall loss value based on a loss function that combines the orientation or angle loss value and the text recognition loss value.
PCT/US2021/046490 2021-08-18 2021-08-18 Scene text recognition model with text orientation or angle detection WO2022046486A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2021/046490 WO2022046486A1 (en) 2021-08-18 2021-08-18 Scene text recognition model with text orientation or angle detection

Publications (1)

Publication Number Publication Date
WO2022046486A1 true WO2022046486A1 (en) 2022-03-03

Family

ID=80353815

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/046490 WO2022046486A1 (en) 2021-08-18 2021-08-18 Scene text recognition model with text orientation or angle detection

Country Status (1)

Country Link
WO (1) WO2022046486A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180373947A1 (en) * 2017-06-22 2018-12-27 StradVision, Inc. Method for learning text recognition, method for recognizing text using the same, and apparatus for learning text recognition, apparatus for recognizing text using the same
US20200226400A1 (en) * 2019-01-11 2020-07-16 Microsoft Technology Licensing, Llc Compositional model for text recognition
US20200250459A1 (en) * 2019-01-11 2020-08-06 Capital One Services, Llc Systems and methods for text localization and recognition in an image of a document

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XUEBO LIU; DING LIANG; SHI YAN; DAGUI CHEN; YU QIAO; JUNJIE YAN: "FOTS: Fast Oriented Text Spotting with a Unified Network", ARXIV.ORG, 5 January 2018 (2018-01-05), pages 1 - 10, XP080850692 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023236614A1 (en) * 2022-06-07 2023-12-14 华为云计算技术有限公司 Cloud computing technology-based image recognition method, apparatus, and related device
CN114758332A (en) * 2022-06-13 2022-07-15 北京万里红科技有限公司 Text detection method and device, computing equipment and storage medium
CN114758332B (en) * 2022-06-13 2022-09-02 北京万里红科技有限公司 Text detection method and device, computing equipment and storage medium
CN115858791A (en) * 2023-02-17 2023-03-28 成都信息工程大学 Short text classification method and device, electronic equipment and storage medium
CN115858791B (en) * 2023-02-17 2023-09-15 成都信息工程大学 Short text classification method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
EP3437019B1 (en) Optical character recognition in structured documents
CN110414499B (en) Text position positioning method and system and model training method and system
CN111488826B (en) Text recognition method and device, electronic equipment and storage medium
US10679085B2 (en) Apparatus and method for detecting scene text in an image
US20170004374A1 (en) Methods and systems for detecting and recognizing text from images
WO2022046486A1 (en) Scene text recognition model with text orientation or angle detection
US9858492B2 (en) System and method for scene text recognition
US20200380263A1 (en) Detecting key frames in video compression in an artificial intelligence semiconductor solution
WO2019129032A1 (en) Remote sensing image recognition method and apparatus, storage medium and electronic device
US20210358170A1 (en) Determining camera parameters from a single digital image
AU2021354030B2 (en) Processing images using self-attention based neural networks
US20220092353A1 (en) Method and device for training image recognition model, equipment and medium
CN112749695A (en) Text recognition method and device
WO2022099325A1 (en) Transformer-based scene text detection
US10839251B2 (en) Method and system for implementing image authentication for authenticating persons or items
KR102138747B1 (en) Method and system for simultaneously deal with both horizontal and vertical characters
CN112396060B (en) Identification card recognition method based on identification card segmentation model and related equipment thereof
CN115620315A (en) Handwritten text detection method, device, server and storage medium
CN114973271A (en) Text information extraction method, extraction system, electronic device and storage medium
US20230091374A1 (en) Systems and Methods for Improved Computer Vision in On-Device Applications
CN114429628A (en) Image processing method and device, readable storage medium and electronic equipment
US11574456B2 (en) Processing irregularly arranged characters
CN116311271B (en) Text image processing method and device
RU2777354C2 (en) Image recognition system: beorg smart vision
CN115797920A (en) License plate recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE