US20180373947A1 - Method for learning text recognition, method for recognizing text using the same, and apparatus for learning text recognition, apparatus for recognizing text using the same - Google Patents


Info

Publication number
US20180373947A1
Authority
US
United States
Prior art keywords
training
testing
image
text
vector
Legal status
Granted
Application number
US15/630,188
Other versions
US10163022B1
Inventor
Hojin Cho
Current Assignee
Stradvision Inc
Original Assignee
Stradvision Inc
Application filed by StradVision, Inc.
Priority to US15/630,188
Assigned to StradVision, Inc. Assignors: CHO, HOJIN
Application granted
Publication of US10163022B1
Publication of US20180373947A1
Status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/10Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G06K9/18
    • G06F17/21
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06K9/72
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N99/005
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63Scene text, e.g. street names
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/18Extraction of features or characteristics of the image
    • G06V30/1801Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
    • G06V30/18019Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections by matching or filtering
    • G06V30/18038Biologically-inspired filters, e.g. difference of Gaussians [DoG], Gabor filters
    • G06V30/18048Biologically-inspired filters, e.g. difference of Gaussians [DoG], Gabor filters with interaction between the responses of different filters, e.g. cortical complex cells
    • G06V30/18057Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/22Character recognition characterised by the type of writing
    • G06V30/224Character recognition characterised by the type of writing of printed characters having additional code marks or containing code marks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/26Techniques for post-processing, e.g. correcting the recognition result
    • G06V30/262Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Definitions

  • the processor may perform a process of generating or allowing another device to generate a support vector for testing, i.e., a residual guidance for testing, to be used subsidiarily to recognize a specific character image for testing, by executing at least one computation with feature vectors ci+j for testing of one or more neighboring character images for testing.
  • the specific character image for testing and the neighboring character images for testing are included in the segmented character images for testing, and index j is not equal to 0 and −m≤j≤n.
  • the processor may perform a process of obtaining or allowing another device to obtain a merged vector for testing or its processed value by executing a computation with the support vector for testing and a feature vector ci for testing of the specific character image for testing.
  • the processor may perform a process of classifying or allowing another device to classify the specific character image for testing as a letter included in the predetermined set of letters by referring to the merged vector for testing or its processed value.
  • the training apparatus may adjust parameters by performing a backpropagation training technique while the testing apparatus may not perform this process.
  • the present invention has an effect of providing a text recognition method with high efficiency in identifying similarly shaped characters by performing the operation of adding the feature vector of the specific character, as a subject to be identified, and the residual guidance determined by referring to feature information of one or more of the neighboring characters adjacent to the specific character.
  • the embodiments of the present invention as explained above can be implemented in the form of executable program commands through a variety of computer means recordable to computer-readable media.
  • the computer-readable media may include, solely or in combination, program commands, data files, and data structures.
  • the program commands recorded to the media may be components specially designed for the present invention or may be known and usable to those skilled in the field of computer software.
  • computer-readable record media include magnetic media such as hard disks, floppy disks, and magnetic tape, optical media such as CD-ROM and DVD, magneto-optical media such as floptical disks, and hardware devices, such as ROM, RAM, and flash memory, specially designed to store and execute programs.
  • program commands include not only machine-language code produced by a compiler but also high-level code that can be executed by a computer via an interpreter, etc.
  • the aforementioned hardware devices can work as one or more software modules to perform the actions of the present invention, and vice versa.


Abstract

A method for learning parameters used to recognize characters included in a text in a scene text image of a training set is provided. The method includes steps of: (a) a training apparatus generating each feature vector corresponding to each of the segmented character images; (b) the training apparatus processing feature vectors ci+j of neighboring character images to thereby generate a support vector to be used for a recognition of a specific character image; (c) the training apparatus obtaining a merged vector by executing a computation with the support vector and a feature vector ci of the specific character image; and (d) the training apparatus (i) performing a classification of the specific character image as a letter included in a predetermined set of letters by referring to the merged vector; and (ii) adjusting the parameters by referring to a result of the classification.

Description

    FIELD OF THE INVENTION
  • The present invention relates to a method and an apparatus for learning one or more parameters used to recognize one or more characters included in a text in a scene text image of a training set, and more particularly, to the method and the apparatus for performing processes of (1) generating or allowing another device to generate each feature vector corresponding to each of segmented character images, if the segmented character images are acquired by dividing an image of the text into separate images of the characters; (2) processing or allowing another device to process feature vectors ci+j of neighboring character images to thereby generate a support vector to be used for recognizing a specific character image, on the condition that the specific character image and the neighboring character images are included in the segmented character images, wherein index j is not equal to 0 and −m≤j≤n; (3) obtaining or allowing another device to obtain a merged vector or its processed value by executing a computation with the support vector and a feature vector ci of the specific character image; and (4) (i) performing or allowing another device to perform a classification of the specific character image as a letter included in a predetermined set of letters by referring to the merged vector or its processed value; and (ii) adjusting or allowing another device to adjust the parameters by referring to a result of the classification; and a method and an apparatus using the same for recognizing one or more characters for testing included in a text for testing in a scene text image of a testing set.
  • BACKGROUND OF THE INVENTION
  • Today, a variety of algorithms for text detection and text recognition have been devised and applied to various fields of application. Technologies for detecting or recognizing texts in natural images have gained a lot of attention in recent years as a key component for reading texts in those natural images, and related patent applications have been filed as well.
  • With images of a training set and devised training algorithms, these technologies train an apparatus, and the trained apparatus then applies various text recognition algorithms to identify texts.
  • Given a natural image as an input, a technology for detecting texts may find the position and size of each text in the natural image, and a technology for recognizing texts may identify the set of characters located at that position. A text in an image could be detected by a device itself, such as an ADAS (advanced driver-assistance system), or input by a user through a touch interface. Thus, the technology for detecting texts may be implemented more easily than the technology for recognizing texts.
  • The conventional text recognition methods may be categorized into two types. FIGS. 1A and 1B are respective drawings illustrating each type of the methods.
  • FIG. 1A is a drawing illustrating a method of segmenting an input image by each of words in the input image and holistically recognizing each of the words in each of corresponding word-level bounding boxes. And, FIG. 1B is a drawing illustrating a method of segmenting an input image by each of characters in the input image, recognizing each of the characters in each of corresponding character-level bounding boxes and combining the recognized characters to determine an appropriate word with a certain meaning.
  • However, the conventional word-level processing method of FIG. 1A may be vulnerable to variations in text length, variations in spacing between characters, and languages such as Chinese or Japanese that have no spaces in their text. And the conventional character-level processing method of FIG. 1B may suffer from ambiguity between similar-shaped characters, e.g., {I,l,1}, {0,O}, {5,S}.
  • As such, all the conventional text recognition approaches have the drawbacks mentioned above. Thus, the applicant has come up with a robust and novel scene text recognition method. Particularly, a novel text recognition method with high efficiency in identifying similarly shaped characters is devised by reflecting a numerical value, determined by referring to feature information of one or more neighboring characters adjacent to a specific character as a subject to be identified, in a numerical feature value of the specific character.
  • SUMMARY OF THE INVENTION
  • It is an object of the present invention to solve all the problems mentioned above.
  • It is another object of the present invention to provide a text recognition method with a high efficiency in identifying similar-shaped characters by performing operations on a feature vector of a specific character as a subject to be identified with a feature vector determined by referring to feature information of at least one or more of neighboring characters adjacent to the specific character.
  • In accordance with one aspect of the present invention, there is provided a method for learning one or more parameters used to recognize one or more characters included in a text in a scene text image of a training set, including steps of: (a) a training apparatus, if segmented character images are acquired by dividing an image of the text into separate images of the characters, generating or allowing another device to generate each feature vector corresponding to each of the segmented character images; (b) the training apparatus, on the condition that a specific character image and its neighboring character images are included in the segmented character images, processing or allowing another device to process feature vectors ci+j of at least part of the neighboring character images by executing at least one computation to thereby generate a support vector to be used for a recognition of the specific character image, wherein index j is not equal to 0 and −m≤j≤n; (c) the training apparatus obtaining or allowing another device to obtain a merged vector or its processed value by executing a computation with the support vector and a feature vector ci of the specific character image; and (d) the training apparatus (i) performing or allowing another device to perform a classification of the specific character image as a letter included in a predetermined set of letters by referring to the merged vector or its processed value; and (ii) adjusting or allowing another device to adjust the parameters by referring to a result of the classification.
  • In accordance with another aspect of the present invention, there is provided a method for recognizing one or more characters for testing included in a text for testing in a scene text image of a testing set, including steps of: (a) a testing apparatus generating or allowing another device to generate each feature vector for testing corresponding to each of segmented character images for testing if the segmented character images are acquired by dividing an image of the text for testing into separate images of the characters for testing, on the condition that (i) a first process of generating each feature vector for training corresponding to each of segmented character images for training if the segmented character images for training are acquired by dividing an image of a text for training into separate images of characters included in the text for training; (ii) a second process of processing feature vectors ci+j for training of at least part of neighboring character images for training by executing at least one computation to thereby generate a support vector for training to be used for recognizing a specific character image for training, wherein the specific character image for training and its neighboring character images for training are included in the segmented character images for training and wherein index j is not equal to 0 and −m≤j≤n; (iii) a third process of obtaining a merged vector for training of the specific character image for training or its processed value by executing a computation with the support vector for training and a feature vector ci for training of the specific character image for training; and (iv) a fourth process of performing a classification of the specific character image for training as a letter included in a predetermined set of letters by referring to the merged vector for training or its processed value, and adjusting one or more parameters by referring to a result of the classification have been executed; (b) the testing apparatus, on the condition that a specific character image for testing and its neighboring character images are included in the segmented character images for testing, processing or allowing another device to process feature vectors ci+j for testing of at least part of the neighboring character images for testing by executing at least one computation to thereby generate a support vector for testing to be used for recognizing the specific character image for testing, wherein index j is not equal to 0 and −m≤j≤n; (c) the testing apparatus obtaining or allowing another device to obtain a merged vector for testing or its processed value by executing a computation with the support vector for testing and a feature vector ci for testing of the specific character image for testing; and (d) the testing apparatus performing a classification or allowing another device to perform a classification of the specific character image for testing as a letter included in a predetermined set of letters by referring to the merged vector for testing or its processed value.
  • In accordance with still another aspect of the present invention, there is provided a training apparatus for learning one or more parameters used to recognize one or more characters included in a text in a scene text image of a training set, including: a communication part for acquiring (i) segmented character images obtained by dividing an image of the text in the scene text image into separate images of the characters, (ii) the image of the text or (iii) the scene text image; and a processor for performing processes of (i) generating or allowing another device to generate each feature vector corresponding to each of the segmented character images, (ii) generating or allowing another device to generate a support vector to be used for recognizing a specific character image by executing at least one computation with feature vectors ci+j of one or more neighboring character images, wherein the specific character image and the neighboring character images are included in the segmented character images, and wherein index j is not equal to 0 and −m≤j≤n; (iii) obtaining or allowing another device to obtain a merged vector or its processed value by executing a computation with the support vector and a feature vector ci of the specific character image; and (iv) (iv-1) classifying or allowing another device to classify the specific character image as a letter included in a predetermined set of letters by referring to the merged vector or its processed value, and (iv-2) adjusting or allowing another device to adjust the parameters by referring to a result of the classification.
  • In accordance with still yet another aspect of the present invention, there is provided a testing apparatus for recognizing one or more characters for testing included in a text for testing in a scene text image of a testing set, including: a communication part for acquiring (i) segmented character images for testing obtained by dividing an image of the text for testing in the scene text image into separate images of the characters for testing, (ii) the image of the text for testing or (iii) the scene text image, on the condition that (1) a first process of generating each feature vector for training corresponding to each of segmented character images for training if the segmented character images for training are acquired by dividing an image of a text for training into separate images of characters included in the text for training; (2) a second process of generating a support vector for training to be used for recognizing a specific character image for training by executing at least one computation with feature vectors ci+j for training of one or more neighboring character images for training, wherein the specific character image for training and the neighboring character images for training are included in the segmented character images for training, and wherein index j is not equal to 0 and −m≤j≤n; (3) a third process of obtaining a merged vector for training of the specific character image for training or its processed value by executing a computation with the support vector for training and a feature vector ci for training of the specific character image for training; and (4) a fourth process of classifying the specific character image for training as a letter included in a predetermined set of letters by referring to the merged vector for training or its processed value, and adjusting one or more parameters by referring to a result of the classification have been executed; and a processor for performing processes of (i) generating or allowing another device to generate each feature vector for testing corresponding to each of the segmented character images for testing; (ii) generating or allowing another device to generate a support vector for testing to be used for recognizing a specific character image for testing by executing at least one computation with feature vectors ci+j for testing of one or more neighboring character images for testing, wherein the specific character image for testing and the neighboring character images for testing are included in the segmented character images for testing, and wherein index j is not equal to 0 and −m≤j≤n; (iii) obtaining or allowing another device to obtain a merged vector for testing or its processed value by executing a computation with the support vector for testing and a feature vector ci for testing of the specific character image for testing; and (iv) classifying or allowing another device to classify the specific character image for testing as a letter included in a predetermined set of letters by referring to the merged vector for testing or its processed value.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The above and other objects of the present invention will become apparent from the following description of preferred embodiments given in conjunction with the accompanying drawings, in which:
  • FIG. 1A is a drawing schematically illustrating a conventional word-level processing method for recognizing texts.
  • FIG. 1B is a drawing schematically illustrating another conventional character-level processing method for recognizing texts.
  • FIG. 2 is a block diagram showing a configuration of a training apparatus for recognizing texts in a scene text image in accordance with one example embodiment of the present invention.
  • FIG. 3 is a drawing illustrating a method for training an apparatus to recognize a text in an image of training set by learning syntactic relationships between characters in the text in accordance with one example embodiment of the present invention.
  • DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
  • To make purposes, technical solutions, and advantages of the present invention clear, reference is made to the accompanying drawings that show, by way of illustration, more detailed example embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention.
  • It is to be understood that the various embodiments of the present invention, although different, are not necessarily mutually exclusive. For example, a particular feature, structure, or characteristic described herein in connection with one embodiment may be implemented within other embodiments without departing from the spirit and scope of the present invention. In addition, it is to be understood that the position or arrangement of individual elements within each disclosed embodiment may be modified without departing from the spirit and scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims, appropriately interpreted, along with the full range of equivalents to which the claims are entitled. In the drawings, like numerals refer to the same or similar functionality throughout the several views.
  • FIG. 2 is a block diagram showing a configuration of a training apparatus for recognizing texts in a scene text image in accordance with one example embodiment of the present invention.
  • Referring to FIG. 2, a training apparatus 200 for recognizing texts in a scene text image includes a communication unit 210 and a processor 220.
  • The communication unit 210 may be configured to communicate with external devices. Particularly, the communication unit 210 may be configured to receive a scene text image of a training set, in which texts as subjects to be recognized are included. The processor 220 described below may be configured to detect and extract a text including characters in the scene text image. As another example, the processor 220 may be configured to segment an image of the text into a set of images of the characters, thereby acquiring segmented character images. As still another example, the present invention does not exclude a case in which the communication unit 210 is configured to receive the image of the text or to receive the segmented character images obtained by dividing the image of the text into separate images of the characters.
  • For reference, a method for generating the segmented character images is described below. Given a scene text image, after a text including characters is extracted from the scene text image, a synthetic data generator based on an image degradation model or an equivalent component may divide an image of the extracted text into separate images of the characters, i.e., the segmented character images. Certainly, it is not limited thereto. Generally, the segmented character images are normalized images to be used for calculating feature vectors thereof.
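  • For illustration only, a minimal segmentation sketch in Python is given below. The patent obtains the segmented character images from a synthetic data generator based on an image degradation model whose details are not disclosed here, so the equal-width slicing and the function name are assumptions, not the patented method.

    import numpy as np

    def segment_characters(text_img: np.ndarray, num_chars: int) -> list:
        """Naively split a text image into equal-width character images.

        Placeholder sketch: real segmentation would come from the image
        degradation model mentioned above, not fixed-width slicing.
        """
        width = text_img.shape[1]
        step = width // num_chars
        return [text_img[:, i * step:(i + 1) * step] for i in range(num_chars)]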
  • The processor 220 may perform a process of calculating or allowing another device to calculate each feature vector corresponding to each of the segmented character images if the segmented character images are acquired.
  • Herein, "calculating each feature vector corresponding to each of the segmented character images" means that the feature information of each character in a segmented character image is expressed as multi-dimensional values. That is, one or more embedding functions may apply operations to each of the segmented character images to map the features of the character into the multi-dimensional numeric representation. Said features may include not only classic features such as Haar-like features, HOG (Histogram of Oriented Gradients), or LBP (Local Binary Pattern) but also features acquired from a CNN (convolutional neural network).
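  • As a hedged example of the classic-feature option above, the sketch below computes an HOG descriptor per normalized character image; the 32×32 normalization size (borrowed from the input size in Table 1) and the HOG parameters are assumptions.

    import numpy as np
    from skimage.feature import hog
    from skimage.transform import resize

    def character_feature_vector(char_img: np.ndarray) -> np.ndarray:
        """Map one segmented character image to a multi-dimensional feature
        vector using HOG; a CNN embedding could be substituted here."""
        normalized = resize(char_img, (32, 32), anti_aliasing=True)
        return hog(normalized, orientations=9, pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2), feature_vector=True)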
  • The processor 220 may perform a process of acquiring or allowing another device to acquire a support vector, i.e., a residual guidance as shown in FIG. 3, to be used subsidiarily to recognize a specific character image, by executing at least one computation with feature vectors ci+j of one or more neighboring character images.
  • Herein, the specific character image refers to the character image to be recognized among the segmented character images. The neighboring character images refer to character images adjacent to the specific character image within a predetermined distance among the segmented character images. For example, the neighboring character images may be determined as character images within the same distance on both sides of the specific character, character images within a certain distance on the forward side thereof, or character images within a certain distance on the backward side thereof.
  • Moreover, the value range of the index j stands for the context window size and determines the number of the adjacent character images to be utilized by the computations.
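  • A short sketch of how the index range translates into neighbor selection follows, assuming out-of-range positions are simply dropped (boundary handling is not specified in the text):

    def neighbor_indices(i: int, m: int, n: int, length: int) -> list:
        """Positions of the neighboring character images for position i:
        j in [-m, n] with j != 0, clipped to the text boundaries."""
        return [i + j for j in range(-m, n + 1)
                if j != 0 and 0 <= i + j < length]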
  • In accordance with one example embodiment of the present invention, the computations may include a projection operation for acquiring a projected vector by utilizing the feature vectors ci+j of at least part of the neighboring character images, and a transformation operation for acquiring the support vector, i.e., the residual guidance, by applying at least one of a normalization process or a scale process.
  • For example, in accordance with one example embodiment of the present invention, the projection operation includes an operation of multiplying the elements of each feature vector ci+j by its corresponding weighted value and an operation of averaging the elements element-wise across the feature vectors ci+j.
  • Herein, the weighted value mentioned above may be set differently for each of the feature vectors ci+j of the neighboring images involved in the operation. In this case, the weighted value may be set higher as the distance between the location of a given neighboring character image and that of the specific character image becomes smaller. Alternatively, the weighted value may be set equal for all of the feature vectors ci+j involved.
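  • The distance-based weighting could be sketched as below; the inverse-distance scheme is one assumption consistent with "higher weight for closer neighbors", and equal weights would be the alternative mentioned above.

    import numpy as np

    def projection(neighbors: list, distances: list) -> np.ndarray:
        """Projected vector: weighted element-wise average of the neighboring
        feature vectors c_{i+j}; distances are the nonzero offsets |j|."""
        weights = np.array([1.0 / d for d in distances])
        weights /= weights.sum()              # normalize the weights
        return np.average(np.stack(neighbors), axis=0, weights=weights)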
  • The processor 220 may perform a process of obtaining or allowing other device to obtain a merged vector or its processed value by executing a computation with the support vector and a feature vector ci of the specific character image.
  • For example, the merged vector may be obtained by adding the support vector and the feature vector ci of the specific character, and it serves as an input to a classifier for determining the identity of the specific character. For reference, the number of character classes depends on the recognition target language. For example, the number of classes is either 26 (case-insensitive) or 52 (case-sensitive) for English, and it is 10 for digits.
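  • A minimal sketch of the merge step and of label sets matching the class counts above (the names are illustrative, not from the patent):

    import string
    import numpy as np

    def merged_vector(c_i: np.ndarray, support: np.ndarray) -> np.ndarray:
        """Merged vector: element-wise sum of the specific character's
        feature vector c_i and the support vector (residual guidance)."""
        return c_i + support

    ENGLISH_CASE_INSENSITIVE = string.ascii_uppercase  # 26 classes
    ENGLISH_CASE_SENSITIVE = string.ascii_letters      # 52 classes
    DIGITS = string.digits                             # 10 classes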
  • Additionally, the processor 220 may perform a process of classifying or allowing other device to classify the specific character image as a letter in a predetermined set of letters by referring to the merged vector or its processed value.
  • Finally, the processor 220 may perform a process of adjusting or allowing another device to adjust trainable parameters by referring to a result of the classification. To be more specific, the parameters are adjusted by performing a backpropagation training technique. That is, as part of the training process, if the letter predicted by the classifier for the specific character image is different from the known desired letter, i.e., the ground-truth output, for the specific character image, the classifier will adjust its parameters to reduce the error by referring to difference information acquired by comparing the result of the classification to the ground-truth output.
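  • One backpropagation update could look like the PyTorch sketch below; the framework, loss, and optimizer are assumptions, since the text only states that parameters are adjusted from the difference between the prediction and the ground-truth output.

    import torch.nn.functional as F

    def training_step(model, optimizer, char_images, gt_labels):
        """Adjust trainable parameters by backpropagating the classification
        error; `model` maps character images to class logits."""
        optimizer.zero_grad()
        logits = model(char_images)                # (batch, num_classes)
        loss = F.cross_entropy(logits, gt_labels)  # compare with ground truth
        loss.backward()                            # backpropagate the error
        optimizer.step()                           # update the parameters
        return loss.item()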
  • FIG. 3 is a drawing illustrating the method for training the apparatus to recognize a text in an image of the training set by learning syntactic relationships between characters in the text in accordance with one example embodiment of the present invention.
  • Referring to FIG. 3, a method with high efficiency in identifying similarly shaped characters is provided.
  • Specifically, referring to FIG. 3, the apparatus may comprise the following three parts: a feature extraction layer, an RBOC (residual bag-of-character) layer, and a classifier. In detail, the feature extraction layer may acquire respective images of characters, extract features from the respective images of the characters, and output information on the features as multi-dimensional vectors, i.e., feature vectors. The RBOC layer may execute operations to acquire the support vector, i.e., the residual guidance, by utilizing feature vectors of at least part of the characters other than the specific character to be identified, and then an operation of adding the residual guidance and the feature vector of the specific character to thereby output the merged vector. And the classifier may output a predicted letter as the result of the classification determined by referring to the merged vector.
  • In accordance with one example embodiment of the present invention, the feature extraction layer may be implemented by a convolutional neural network (CNN). In detail, CNN parameters that generate the ground-truth output are learned from a given training data set, and the learned CNN parameters may be applied to character images.
  • For example, the CNN may be configured to include five convolutional sublayers, where each sublayer may include the following components in order: ordinary 2D convolution, exponential linear unit (ELU), 1D convolution, rectified linear unit (ReLU), and batch normalization (BN). The ELU component is placed between the two convolutions to alleviate the vanishing gradient problem. Character images are normalized and quantized as vectors before being fed to the first CNN sublayer, and each subsequent CNN sublayer takes the output of the previous sublayer as its input.
  • The detailed configuration of the CNN in accordance with the present invention is summarized in Table 1 below. As can be seen from Table 1, the CNN uses a small number of layers and channels, which is efficient in terms of computation and model size.
  • TABLE 1: CNN configuration (rows run from the last sublayer at the top to the input image at the bottom)

    Index  Type         Configuration
    5      Batch norm.  m = 0.9
           Convolution  k = 1, c = 128, s = 1, p = 0, ReLU
           Convolution  k = 3, c = 128, s = 2, p = 1, ELU
    4      Batch norm.  m = 0.9
           Convolution  k = 1, c = 64, s = 1, p = 0, ReLU
           Convolution  k = 3, c = 64, s = 2, p = 1, ELU
    3      Batch norm.  m = 0.9
           Convolution  k = 1, c = 32, s = 1, p = 0, ReLU
           Convolution  k = 3, c = 32, s = 2, p = 1, ELU
    2      Batch norm.  m = 0.9
           Convolution  k = 1, c = 16, s = 1, p = 0, ReLU
           Convolution  k = 3, c = 16, s = 2, p = 1, ELU
    1      Batch norm.  m = 0.9
           Convolution  k = 1, c = 8, s = 1, p = 0, ReLU
           Convolution  k = 3, c = 8, s = 2, p = 1, ELU
    —      Image        32 × 32 × 1

    k: kernel size, c: channels, s: stride, p: padding size, m: momentum factor
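  • Reading Table 1 bottom-up, each sublayer could be assembled as in the PyTorch sketch below; treating the k = 1 "1D convolution" as a 1×1 2D convolution and mapping the momentum factor m = 0.9 to PyTorch's momentum = 0.1 are interpretation assumptions.

    import torch.nn as nn

    def sublayer(in_ch: int, out_ch: int) -> nn.Sequential:
        """One Table 1 sublayer: 3x3 conv (stride 2) + ELU, 1x1 conv + ReLU,
        then batch normalization."""
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.ELU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=1, stride=1, padding=0),
            nn.ReLU(),
            nn.BatchNorm2d(out_ch, momentum=0.1),
        )

    # A 32x32x1 input shrinks 32 -> 16 -> 8 -> 4 -> 2 -> 1 over five sublayers,
    # leaving 128 channels, i.e., a 128-dimensional feature vector per character.
    feature_extractor = nn.Sequential(
        sublayer(1, 8), sublayer(8, 16), sublayer(16, 32),
        sublayer(32, 64), sublayer(64, 128), nn.Flatten(),
    )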
  • In detail, the RBOC layer may generate the residual guidance and add it to the feature vector of the specific character. Herein, the residual guidance, which represents syntactic and semantic relationships between characters, may be acquired through several computations. In accordance with one example embodiment of the present invention, the computations may include a projection operation and a transformation operation.
  • For example, the projection operation may be implemented by a 1D convolution with a filter size of 2k+1, where the variable k refers to the context window size. As an example, if k is set to 2, the convolution kernel for the projection operation may be set to [0.25, 0.25, 0, 0.25, 0.25], where the weight for the specific character is 0. In this case, the 1D convolution operation to which this kernel is applied represents an average operation over the feature vectors of the four neighboring character images adjacent to the specific character.
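  • The k = 2 case could be reproduced as below; zero padding at the text boundaries is an assumption the text does not state.

    import numpy as np

    # Center weight 0 ignores the specific character; the four neighbors are averaged.
    KERNEL = np.array([0.25, 0.25, 0.0, 0.25, 0.25])

    def project_sequence(features: np.ndarray) -> np.ndarray:
        """Apply the 1D projection convolution along the character axis;
        `features` has shape (num_chars, dim)."""
        padded = np.pad(features, ((2, 2), (0, 0)))  # zero-pad the ends
        return np.stack([padded[i:i + 5].T @ KERNEL
                         for i in range(features.shape[0])])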
  • Meanwhile, the transformation operation may be a fully-connected layer or a 1×1 convolution. Herein, the fully-connected layer may be implemented as an inner product layer that multiplies by a weight matrix W and adds a bias B. As one example, the size of the weight matrix W may be 128×128 and that of the bias B may be 1×128.
  • The residual guidance may be acquired by applying both the projection operation and the transformation operation, or by applying the projection operation only.
  • After the residual guidance is acquired, executing an add operation of the residual guidance ci^r and the feature vector ci of the specific character may generate the ultimately computed feature vector c̃i (ci with a tilde) of the specific character to be identified.
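  • Putting the two operations together, a minimal sketch of the RBOC layer under the assumptions above (128-dimensional feature vectors, a fully-connected transformation with a 128×128 weight matrix W and a 1×128 bias B, and the project() sketch given earlier):

    import torch
    import torch.nn as nn

    class RBOC(nn.Module):
        # Residual bag-of-character layer: project the neighboring feature
        # vectors, transform the projected vector with a learned inner
        # product layer (weight W, bias B), and add the resulting residual
        # guidance ci^r onto each character's own feature vector ci.
        def __init__(self, dim=128, k=2):
            super().__init__()
            self.k = k
            self.transform = nn.Linear(dim, dim)   # W: 128x128, B: 1x128

        def forward(self, feats):                  # feats: (L, dim)
            guidance = self.transform(project(feats, self.k))
            return feats + guidance                # merged vectors, one per character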
  • The classifier may be implemented by a conventional Support Vector Machine (SVM) or as a linear classifier, but it is not limited thereto. If the classifier is implemented as a linear classifier, the weight and the bias of the linear classifier are learned by using the training data set. For example, the linear classifier may receive c̃i as input and output a predicted letter, e.g., "A", for the specific character image as a result of the classification.
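  • If the linear-classifier option is chosen, a minimal sketch is a single fully-connected layer over the merged vectors; the letter set LETTERS below is a hypothetical placeholder, since the predetermined set of letters is not enumerated herein.

    import torch.nn as nn

    LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"  # hypothetical letter set

    # The weight and bias of this layer are the classifier parameters
    # learned from the training data set.
    classifier = nn.Linear(128, len(LETTERS))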
  • The method of FIG. 3 is supervised learning in an end-to-end fashion. Optimal values of the parameters for precise recognition of texts may be learned via this method.
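  • As an illustration of the end-to-end learning, one training step might look as follows, reusing the FeatureExtractor, RBOC, and classifier sketches above; the SGD optimizer and cross-entropy loss are illustrative assumptions rather than choices stated herein.

    import torch

    extractor, rboc = FeatureExtractor(), RBOC()
    params = [*extractor.parameters(), *rboc.parameters(), *classifier.parameters()]
    optimizer = torch.optim.SGD(params, lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    def train_step(char_images, labels):
        # char_images: (L, 1, 32, 32) segmented character crops of one word;
        # labels: (L,) indices of the ground-truth letters in LETTERS.
        logits = classifier(rboc(extractor(char_images)))
        loss = loss_fn(logits, labels)   # compare classification with ground truth
        optimizer.zero_grad()
        loss.backward()                  # backpropagation adjusts all parameters
        optimizer.step()
        return loss.item()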
  • Meanwhile, the processor 220 controls the data flow between the communication part 210 described above and the other components. In short, the processor 220 controls the individual unique functions of the communication part 210 and the other components by controlling the data flow among the components of the training apparatus 200 for recognizing texts.
  • The processor 220 may include hardware features such as a micro processing unit (MPU) or a central processing unit (CPU), cache memory, and a data bus. Moreover, it may further include software features such as an operating system and applications that serve particular purposes.
  • Hereinafter, the configuration and corresponding functions of a testing apparatus (not shown) for recognizing texts in a scene text image of a testing set will be described. The testing apparatus adopts the parameters learned through the method illustrated in FIG. 3 to recognize the texts in the testing images. The testing apparatus may be the same apparatus as the aforementioned training apparatus, or it may be a different one. Descriptions that duplicate those of the training apparatus set forth above may be omitted.
  • The testing apparatus (not shown) in accordance with another example embodiment of the present invention may also include a communication unit and a processor.
  • The communication unit may be configured to communicate with external devices. Particularly, the communication unit may be configured to receive a scene text image of the testing set, in which a text for recognition is included. The processor described below may be configured to detect and extract the characters for recognition which are included in the text in the scene text image. As another example, the processor may be configured to segment an image of the text for recognition into a set of images of the characters for recognition, i.e., segmented character images. Certainly, the present invention does not exclude the cases in which the communication unit is configured to receive the image of the text for recognition or to receive the segmented character images. Herein, the text for recognition in a scene text image of the testing set will be referred to as "the text for testing", and the characters for recognition in such a text will be referred to as "the characters for testing". The separate images of the characters for testing, obtained by segmenting an image of the text for testing, will be referred to as "segmented character images for testing".
  • The processor may perform a process of acquiring or allowing another device to acquire each feature vector for testing corresponding to each of the segmented character images for testing, if the segmented character images for testing are acquired. Herein, each feature vector for testing refers to the feature vector of each character image included in the segmented character images for testing.
  • The processor may perform a process of generating or allowing another device to generate a support vector for testing, i.e., residual guidance for testing, to be used subsidiarily to recognize a specific character image for testing, by executing at least one of the computations with feature vectors ci+j for testing of one or more neighboring character images for testing. Herein, the specific character image for testing and the neighboring character images for testing are included in the segmented character images for testing, where index j is not equal to 0 and −m≤j≤n.
  • Furthermore, the processor may perform a process of obtaining or allowing another device to obtain a merged vector for testing, or its processed value, by executing a computation with the support vector for testing and a feature vector ci for testing of the specific character image for testing.
  • Additionally, the processor may perform a process of classifying or allowing another device to classify the specific character image for testing as a letter included in the predetermined set of letters by referring to the merged vector for testing or its processed value.
  • As a reference, the aforementioned training apparatus may adjust the parameters by performing a backpropagation training technique, while the testing apparatus may not perform this process.
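  • Under the same assumptions, testing-time recognition reduces to the forward pass with the learned parameters and no backward pass, e.g.:

    import torch

    @torch.no_grad()                     # no gradients, hence no backpropagation
    def recognize(char_images):
        # char_images: (L, 1, 32, 32) segmented character images for testing.
        extractor.eval(); rboc.eval()    # use the learned batch-norm statistics
        logits = classifier(rboc(extractor(char_images)))
        return "".join(LETTERS[i] for i in logits.argmax(dim=1).tolist())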
  • The present invention has the following effects:
  • The present invention has an effect of providing a text recognition method that is highly effective in distinguishing similarly shaped characters, by performing the operation of adding the feature vector of the specific character to be identified and the residual guidance determined by referring to feature information of one or more of the neighboring characters adjacent to the specific character.
  • The embodiments of the present invention as explained above can be implemented in the form of executable program commands through a variety of computer means recordable to computer readable media. The computer readable media may include, solely or in combination, program commands, data files, and data structures. The program commands recorded to the media may be components specially designed for the present invention or may be usable to those skilled in the field of computer software or related fields. Computer readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM and DVD; magneto-optical media such as floptical disks; and hardware devices such as ROM, RAM, and flash memory specially designed to store and carry out programs. Program commands include not only machine language code produced by a compiler but also high-level code that can be executed by a computer through an interpreter, etc. The aforementioned hardware device may be configured to work as one or more software modules to perform the processes of the present invention, and vice versa.
  • As seen above, the present invention has been explained by specific matters such as detailed components, limited embodiments, and drawings. While the invention has been shown and described with respect to the preferred embodiments, it will, however, be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the following claims.
  • Accordingly, the spirit of the present invention must not be confined to the explained embodiments, and the following patent claims as well as everything including variations equal or equivalent to the patent claims pertain to the scope of the present invention.

Claims (18)

1. A method for learning one or more parameters of a Convolutional Neural Networks (CNN) used to recognize one or more characters included in a text in a scene text image of training set, comprising steps of:
a training apparatus, if segmented character images are acquired by dividing an image of the text into separate images of the characters, generating or allowing another device to generate each multidimensional feature vector corresponding to each of the segmented character images;
the training apparatus, on the condition that a specific character image and its neighboring character images are included in the segmented character images, processing or allowing another device to process multidimensional feature vectors ci+j of at least part of the neighboring character images by executing at least one of computations to thereby generate a support vector to be used for recognizing the specific character image, wherein index j is not equal to 0 and −m≤j≤n, and wherein m and n are positive integers;
the training apparatus obtaining or allowing another device to obtain a merged vector or its processed value by executing a computation with the support vector and a multidimensional feature vector ci of the specific character image;
the training apparatus (i) determining or allowing another device to determine that the specific character image is a specific letter included in a predetermined set of letters by referring to the merged vector or its processed value; and (ii) adjusting or allowing another device to adjust the parameters by referring to a result of the classification.
2. The method of claim 1, wherein the training apparatus adjusts or allows another device to adjust the parameters by referring to difference information acquired by comparing a value of a ground truth output and the result of the classification.
3. The method of claim 1, before the step of generating or allowing another device to generate each multidimensional feature vector corresponding to each of the segmented character images, further comprising a step of: the training apparatus, if the scene text image is inputted, detecting and extracting or allowing another device to detect and extract the image of the text from the scene text image, and segmenting or allowing another device to segment the image of the text.
4. The method of claim 1, wherein, at the step of processing or allowing another device to process the multidimensional feature vectors ci+j of at least part of the neighboring character images, the computations include a projection operation for acquiring a projected vector by utilizing the multidimensional feature vectors ci+j of at least part of the neighboring character images, and a transformation operation for acquiring the support vector by applying at least one of normalization process or scale process.
5. The method of claim 4, wherein the projection operation includes an operation of multiplying each weighted value to elements in each of the multidimensional feature vectors ci+j and an operation of averaging the elements element-wisely across the multidimensional feature vectors ci+j.
6. The method of claim 5, wherein the weighted value is set differently for each of the multidimensional feature vectors ci+j.
7. The method of claim 5, wherein the weighted value is set to be higher as a distance value between a location of a certain neighboring character image selected from the neighboring character images and that of the specific character image becomes smaller.
8. The method of claim 1, wherein, at the step of determining or allowing another device to determine that the specific character image is the specific letter included in the predetermined set of letters, the parameters are adjusted by performing a backpropagation training technique.
9. A method for recognizing one or more characters for testing included in a text for testing in a scene text image of testing set, comprising steps of:
(a) a testing apparatus generating or allowing another device to generate each multidimensional feature vector for testing corresponding to each of segmented character images for testing if the segmented character images are acquired by dividing an image of the text for testing into separate images of the characters for testing, on the condition that (i) a first process of generating each multidimensional feature vector for training corresponding to each of segmented character images for training if the segmented character images for training are acquired by dividing an image of a text for training into separate images of characters included in the text for training; (ii) a second process of processing multidimensional feature vectors ci+j for training of at least part of neighboring character images for training by executing at least one of computations to thereby generate a support vector for training to be used for recognizing a specific character image for training, wherein the specific character image for training and its neighboring character images for training are included in the segmented character images for training, wherein index j is not equal to 0 and −m≤j≤n, and wherein m and n are positive integers; (iii) a third process of obtaining a merged vector for training of the specific character image for training or its processed value by executing a computation with the support vector for training and a multidimensional feature vector ci for training of the specific character image for training; and (iv) a fourth process of determining that the specific character image for training is a specific letter included in a predetermined set of letters by referring to the merged vector for training or its processed value, and adjusting one or more parameters by referring to a result of the classification have been executed;
(b) the testing apparatus, on the condition that a specific character image for testing and its neighboring character images are included in the segmented character images for testing, processing or allowing another device to process multidimensional feature vectors ci+j for testing of at least part of the neighboring character images for testing by executing at least one of computations to thereby generate a support vector for testing to be used for recognizing the specific character image for testing, wherein index j is not equal to 0 and −m≤j≤n;
(c) the testing apparatus obtaining or allowing another device to obtain a merged vector for testing or its processed value by executing a computation with the support vector for testing and a multidimensional feature vector ci for testing of the specific character image for testing; and
(d) the testing apparatus performing a classification or allowing another device to perform a classification of the specific character image for testing as a letter included in a predetermined set of letters by referring to the merged vector for testing or its processed value.
10. A training apparatus for learning one or more parameters of a Convolutional Neural Networks (CNN) used to recognize one or more characters included in a text in a scene text image of training set, comprising:
a communication part for acquiring (i) segmented character images obtained by dividing an image of the text in the scene text image into separate images of the characters, (ii) the image of the text or (iii) the scene text image; and
a processor for performing processes of (i) generating or allowing another device to generate each multidimensional feature vector corresponding to each of the segmented character images, (ii) generating or allowing another device to generate a support vector to be used for recognizing a specific character image by executing at least one of computations with multidimensional feature vectors ci+j of one or more neighboring character images, wherein the specific character image and the neighboring character images are included in the segmented character images, wherein index j is not equal to 0 and −m≤j≤n, and wherein m and n are positive integers; (iii) obtaining or allowing another device to obtain a merged vector or its processed value by executing a computation with the support vector and a multidimensional feature vector ci of the specific character image; and (iv) (iv-1) determining or allowing another device to determine that the specific character image is a specific letter included in a predetermined set of letters by referring to the merged vector or its processed value, and (iv- 2) adjusting or allowing another device to adjust the parameters by referring to a result of the classification.
11. The training apparatus of claim 10, wherein the processor is configured to adjust or to allow another device to adjust the parameters by referring to difference information acquired by comparing a value of a ground truth output and the result of the classification.
12. The training apparatus of claim 10, wherein the processor, before executing the process of (i), is further configured to detect and extract or allow another device to detect and extract the image of the text from the scene text image if the scene text image is acquired, and to segment or allow another device to segment the image of the text.
13. The training apparatus of claim 10, wherein the computations in the process of (ii) include a projection operation for acquiring a projected vector by utilizing the multidimensional feature vectors ci+j of at least part of the neighboring character images, and a transformation operation for acquiring the support vector by applying at least one of normalization process or scale process.
14. The training apparatus of claim 13, wherein the projection operation includes an operation of multiplying each weighted value to elements in each of the multidimensional feature vectors ci+j and an operation of averaging the elements element-wisely across the multidimensional feature vectors ci+j.
15. The training apparatus of claim 14, wherein the weighted value is set differently for each of the multidimensional feature vectors ci+j.
16. The training apparatus of claim 14, wherein the weighted value is set to be higher as a distance value between a location of a certain neighboring character image selected from the neighboring character images and that of the specific character image becomes smaller.
17. The training apparatus of claim 10, wherein the parameters are adjusted by performing a backpropagation training technique.
18. A testing apparatus for recognizing one or more characters for testing included in a text for testing in a scene text image of testing set, comprising:
a communication part for acquiring (i) segmented character images for testing obtained by dividing an image of the text for testing in the scene text image into separate images of the characters for testing, (ii) the image of the text for testing or (iii) the scene text image, on the condition that (1) a first process of generating each multidimensional feature vector for training corresponding to each of segmented character images for training if the segmented character images for training are acquired by dividing an image of a text for training into separate images of characters included in the text for training; (2) a second process of generating a support vector for training to be used for recognizing a specific character image for training by executing at least one of computations with multidimensional feature vectors ci+j for training of one or more neighboring character images for training, wherein the specific character image for training and the neighboring character images for training are included in the segmented character images for training, wherein index j is not equal to 0 and −m≤j≤n, and wherein m and n are positive integers; (3) a third process of obtaining a merged vector for training of the specific character image for training or its processed value by executing a computation with the support vector for training and a multidimensional feature vector ci for training of the specific character image for training; and (4) a fourth process of determining that the specific character image for training is a specific letter included in a predetermined set of letters by referring to the merged vector for training or its processed value, and adjusting one or more parameters by referring to a result of the classification have been executed; and
a processor for performing processes of (i) generating or allowing another device to generate each multidimensional feature vector for testing corresponding to each of the segmented character images for testing; (ii) generating or allowing another device to generate a support vector for testing to be used for recognizing a specific character image for testing by executing at least one of computations with multidimensional feature vectors ci+j for testing of one or more neighboring character images for testing, wherein the specific character image for testing and the neighboring character images for testing are included in the segmented character images for testing, and wherein index j is not equal to 0 and −m≤j≤n; (iii) obtaining or allowing another device to obtain a merged vector for testing or its processed value by executing a computation with the support vector for testing and a multidimensional feature vector ci for testing of the specific character image for testing; and (iv) classifying or allow another device to classify the specific character image for testing as a letter included in a predetermined set of letters by referring to the merged vector for testing or its processed value.
US15/630,188 2017-06-22 2017-06-22 Method for learning text recognition, method for recognizing text using the same, and apparatus for learning text recognition, apparatus for recognizing text using the same Active US10163022B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/630,188 US10163022B1 (en) 2017-06-22 2017-06-22 Method for learning text recognition, method for recognizing text using the same, and apparatus for learning text recognition, apparatus for recognizing text using the same

Publications (2)

Publication Number Publication Date
US10163022B1 US10163022B1 (en) 2018-12-25
US20180373947A1 (en) 2018-12-27

Family

ID=64692327

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163050B (en) * 2018-07-23 2022-09-27 腾讯科技(深圳)有限公司 Video processing method and device, terminal equipment, server and storage medium
US10997463B2 (en) * 2018-11-08 2021-05-04 Adobe Inc. Training text recognition systems
CN111444255B (en) * 2018-12-29 2023-09-22 杭州海康存储科技有限公司 Training method and device for data model
US10467500B1 (en) 2018-12-31 2019-11-05 Didi Research America, Llc Method and system for semantic segmentation involving multi-task convolutional neural network
WO2020176064A1 (en) 2018-12-31 2020-09-03 Didi Research America, Llc Method and system of annotation densification for semantic segmentation
CN109740618B (en) * 2019-01-14 2022-11-04 河南理工大学 Test paper score automatic statistical method and device based on FHOG characteristics
CN109977942B (en) * 2019-02-02 2021-07-23 浙江工业大学 Scene character recognition method based on scene classification and super-resolution
CN109871448B (en) * 2019-03-12 2023-08-15 苏州大学 Short text classification method and system
CN111695385B (en) * 2019-03-15 2023-09-26 杭州海康威视数字技术股份有限公司 Text recognition method, device and equipment
CN110210484A (en) * 2019-04-19 2019-09-06 成都三零凯天通信实业有限公司 System and method for detecting and identifying poor text of view image based on deep learning
CN110210581B (en) * 2019-04-28 2023-11-24 平安科技(深圳)有限公司 Handwriting text recognition method and device and electronic equipment
CN110378350A (en) * 2019-07-23 2019-10-25 中国工商银行股份有限公司 A kind of method, apparatus and system of Text region
CN110837838B (en) * 2019-11-06 2023-07-11 创新奇智(重庆)科技有限公司 End-to-end vehicle frame number identification system and identification method based on deep learning
CN113495533B (en) * 2020-04-01 2022-10-21 中国科学院沈阳自动化研究所 Automatic process tracing method and system for cast tube production line
CN111723798B (en) * 2020-05-27 2022-08-16 西安交通大学 Multi-instance natural scene text detection method based on relevance hierarchy residual errors
CN111832082B (en) * 2020-08-20 2023-02-24 支付宝(杭州)信息技术有限公司 Image-text integrity detection method and device
CN113469092B (en) * 2021-07-13 2023-09-08 深圳思谋信息科技有限公司 Character recognition model generation method, device, computer equipment and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7308451B1 (en) * 2001-09-04 2007-12-11 Stratify, Inc. Method and system for guided cluster based processing on prototypes
US7711663B2 (en) * 2006-03-27 2010-05-04 Board Of Trustees Of Michigan State University Multi-layer development network having in-place learning
US7724957B2 (en) * 2006-07-31 2010-05-25 Microsoft Corporation Two tiered text recognition
US8442319B2 (en) * 2009-07-10 2013-05-14 Palo Alto Research Center Incorporated System and method for classifying connected groups of foreground pixels in scanned document images according to the type of marking
US8649600B2 (en) * 2009-07-10 2014-02-11 Palo Alto Research Center Incorporated System and method for segmenting text lines in documents
US9245191B2 (en) * 2013-09-05 2016-01-26 Ebay, Inc. System and method for scene text recognition
US9449239B2 (en) * 2014-05-30 2016-09-20 Apple Inc. Credit card auto-fill
US20150347860A1 (en) * 2014-05-30 2015-12-03 Apple Inc. Systems And Methods For Character Sequence Recognition With No Explicit Segmentation

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150347861A1 (en) * 2014-05-30 2015-12-03 Apple Inc. Object-Of-Interest Detection And Recognition With Split, Full-Resolution Image Processing Pipeline
CN104809481A (en) * 2015-05-21 2015-07-29 中南大学 Natural scene text detection method based on adaptive color clustering
US10007863B1 (en) * 2015-06-05 2018-06-26 Gracenote, Inc. Logo recognition in images and videos
US20170140240A1 (en) * 2015-07-27 2017-05-18 Salesforce.Com, Inc. Neural network combined image and text evaluator and classifier
US20170293638A1 (en) * 2016-04-12 2017-10-12 Microsoft Technology Licensing, Llc Multi-stage image querying
US20180024968A1 (en) * 2016-07-22 2018-01-25 Xerox Corporation System and method for domain adaptation using marginalized stacked denoising autoencoders with domain prediction regularization
US20180137349A1 (en) * 2016-11-14 2018-05-17 Kodak Alaris Inc. System and method of character recognition using fully convolutional neural networks

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Krishnan et al. "Deep feature embedding for accurate recognition and retrieval of handwritten text." Frontiers in Handwriting Recognition (ICFHR), 2016 15th International Conference on. IEEE, 2016. *
Petroski Such, Felipe. "Deep Learning Architectures for Novel Problems." Feb 2017 *
Poznanski et al. "Cnn-n-gram for handwriting word recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. *
Sakaguchi et al. "Robsut Wrod Reocginiton via Semi-Character Recurrent Neural Network." AAAI. Feb. 2017. *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11295083B1 (en) * 2018-09-26 2022-04-05 Amazon Technologies, Inc. Neural models for named-entity recognition
WO2020177378A1 (en) * 2019-03-06 2020-09-10 平安科技(深圳)有限公司 Text information feature extraction method and device, computer apparatus, and storage medium
CN110147788A (en) * 2019-05-27 2019-08-20 东北大学 A kind of metal plate and belt Product labelling character recognition method based on feature enhancing CRNN
CN110276279A (en) * 2019-06-06 2019-09-24 华东师范大学 A kind of arbitrary shape scene text detection method based on image segmentation
CN110502655A (en) * 2019-07-31 2019-11-26 武汉大学 A kind of image nature descriptive statement generation method being embedded in scene text information
CN111274961A (en) * 2020-01-20 2020-06-12 华南理工大学 Character recognition and information analysis method for flexible IC substrate
WO2021147817A1 (en) * 2020-01-21 2021-07-29 第四范式(北京)技术有限公司 Text positioning method and system, and text positioning model training method and system
US20220301244A1 (en) * 2021-03-22 2022-09-22 Adobe Inc. Customizing font bounding boxes for variable fonts
US11501477B2 (en) * 2021-03-22 2022-11-15 Adobe Inc. Customizing font bounding boxes for variable fonts
WO2022046486A1 (en) * 2021-08-18 2022-03-03 Innopeak Technology, Inc. Scene text recognition model with text orientation or angle detection

Similar Documents

Publication Publication Date Title
US10163022B1 (en) Method for learning text recognition, method for recognizing text using the same, and apparatus for learning text recognition, apparatus for recognizing text using the same
Borisyuk et al. Rosetta: Large scale system for text detection and recognition in images
KR102036963B1 (en) Method and system for robust face dectection in wild environment based on cnn
JP6397986B2 (en) Image object region recognition method and apparatus
Fergus et al. Object class recognition by unsupervised scale-invariant learning
US8606022B2 (en) Information processing apparatus, method and program
Khayyat et al. Learning-based word spotting system for Arabic handwritten documents
KR20070055653A (en) Method for recognizing face and apparatus thereof
KR101175597B1 (en) Method, apparatus, and computer-readable recording medium for detecting location of face feature point using adaboost learning algorithm
Ye et al. Scene text detection via integrated discrimination of component appearance and consensus
El Kaddouhi et al. Eye detection based on the Viola-Jones method and corners points
Krishnan et al. Conditional distance based matching for one-shot gesture recognition
Alsawwaf et al. In your face: person identification through ratios and distances between facial features
Kohlakala et al. Ear-based biometric authentication through the detection of prominent contours
CN115497124A (en) Identity recognition method and device and storage medium
Booysens et al. Ear biometrics using deep learning: A survey
KR20200101521A (en) Semantic matchaing apparatus and method
EP2998928B1 (en) Apparatus and method for extracting high watermark image from continuously photographed images
Naseer et al. Meta‐feature based few‐shot Siamese learning for Urdu optical character recognition
KR20190107480A (en) Aparatus and method for face recognition
Sert et al. Recognizing facial expressions of emotion using action unit specific decision thresholds
JP2017084006A (en) Image processor and method thereof
Kobchaisawat et al. A method for multi-oriented Thai text localization in natural scene images using Convolutional Neural Network
Moumen et al. Real-time Arabic scene text detection using fully convolutional neural networks
Pramanik et al. Finding the optimum classifier: Classification of segmentable components in offline handwritten Devanagari words

Legal Events

Date Code Title Description
AS Assignment

Owner name: STRADVISION, INC., KOREA, REPUBLIC OF

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:CHO, HOJIN;REEL/FRAME:042787/0385

Effective date: 20170619

STCF Information on status: patent grant

Free format text: PATENTED CASE

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YR, SMALL ENTITY (ORIGINAL EVENT CODE: M2551); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

Year of fee payment: 4