CN108009539B - Novel text recognition method based on counting focusing model - Google Patents
- Publication number
- CN108009539B CN108009539B CN201711431988.7A CN201711431988A CN108009539B CN 108009539 B CN108009539 B CN 108009539B CN 201711431988 A CN201711431988 A CN 201711431988A CN 108009539 B CN108009539 B CN 108009539B
- Authority
- CN
- China
- Prior art keywords
- counting
- focusing
- scalars
- decoder
- level feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/22—Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
- G06V30/153—Segmentation of character regions using recognition of characters or words
Abstract
The invention relates to a novel text recognition method based on a counting focusing model. The counting focusing model comprises an encoder and a decoder, and the recognition method comprises the following steps: S1, extracting high-level features of an input image with an encoder based on a convolutional neural network to obtain a high-level feature map; S2, decoding characters from the high-level feature map in left-to-right order with a decoder based on a long short-term memory network and a focusing mechanism.
Description
Technical Field
The invention belongs to the field of optical character recognition, and particularly relates to a novel text recognition method based on a counting focusing model.
Background
OCR single-line text recognition is the task of recognizing the text content of an input image that contains a single line of text. One of the mainstream models currently used for this task is the attention model (referred to here as the focusing model), whose recognition procedure is:
1) first, extract a high-level feature map of the input image with a convolutional neural network (CNN);
2) use a long short-term memory network (LSTM) to "focus" (attend) on the high-level feature map multiple times, computing the focusing weights (attention weights);
3) take a weighted average of the high-level feature map with the focusing weights, and predict the output text character from the resulting feature vector.
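The generic attention loop described above can be sketched as follows. This is a minimal sketch in which a random score vector stands in for the LSTM-based weight module of step 2, and the shapes W and D are purely illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # shift for numerical stability
    return e / e.sum()

# Stand-in for step 1: a CNN feature map of width W with D-dimensional columns.
W, D = 8, 16
rng = np.random.default_rng(0)
feature_map = rng.standard_normal((W, D))

def attend(feature_map, scores):
    """Steps 2-3: normalize scores into weights, then take a weighted average."""
    weights = softmax(scores)        # attention (focusing) weights over the W columns
    context = weights @ feature_map  # feature vector used to predict one character
    return weights, context

scores = rng.standard_normal(W)      # stand-in for the score module's output
weights, context = attend(feature_map, scores)
```

In the real model, `scores` would depend on the feature map, the previous weights, and the previous LSTM state, as described in the next paragraph.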
In existing focusing models, the inputs to the module that computes the focusing weights generally include: the feature map extracted by the CNN, the focusing weights of the previous step, and the LSTM state vector of the previous step.
Existing focusing models make no assumption about the relative order of successive focus positions, so they are better suited to the image captioning problem, which is more general than OCR text recognition. Image captioning is the task of producing a textual description of an input image, and OCR text recognition can be regarded as a specific image captioning task. The ordering of focus positions in image captioning can be very flexible, whereas in OCR text recognition it is generally directional (left to right or top to bottom). Existing focusing models do not explicitly model this directionality, so the model must learn during training to focus from left to right or from top to bottom.
Meanwhile, the module that computes the focusing weights in existing focusing models is overly complex in design and demanding to implement in code.
Disclosure of Invention
The invention provides a novel text recognition method based on focusing weights, aiming to solve the technical defects that the prior art makes no assumption about the relative order of successive focus positions, that training is complicated because the model must learn to focus from left to right or from top to bottom, and that the module for computing the focusing weights is overly complex in design.
To achieve this purpose, the technical scheme is as follows:
a novel text recognition method based on a counting focus model, the counting focus model comprising an encoder and a decoder, the recognition method comprising the steps of:
s1, extracting high-level features of an input image by adopting a convolutional neural network-based encoder to obtain a high-level feature map;
S2, decoding characters from the high-level feature map in left-to-right order with a decoder based on a long short-term memory network and a focusing mechanism, specifically as described in steps S21-S30:
S21, splitting the high-level feature map from left to right along the horizontal dimension to obtain W content vectors v_1, v_2, …, v_W, where W is the width of the high-level feature map;
S22, inputting the content vector sequence into an LSTM module to obtain the corresponding W state vectors s_1, s_2, …, s_W;
S23, inputting the state vector sequence into a fully connected layer and applying a linear rectification function to keep the values non-negative, obtaining W counting accumulation scalars c_1, c_2, …, c_W;
S24, setting an initial counting scalar k_0;
S25, successively adding the accumulation scalars obtained in step S23 to the counting scalar in left-to-right order to obtain W counting scalars, namely k_w = k_{w-1} + c_w, where 1 ≤ w ≤ W;
S26, setting a maximum decoding length L, i.e. the number of characters the decoder needs to decode from the high-level feature map;
S27, when decoding the q-th character, where q ≤ L, comparing the index q with every counting scalar and taking the negative of the absolute value of their difference to obtain the focusing score score_w, namely: score_w = -|k_w - q|, where 1 ≤ w ≤ W;
S28, normalizing the W focusing scores with a softmax function to obtain the focusing weights a_w:
a_w = e^(score_w) / [e^(score_1) + e^(score_2) + … + e^(score_W)];
S29, taking a weighted sum of the content vectors with the focusing weights to obtain the feature vector o_q corresponding to the q-th character: o_q = a_1 v_1 + a_2 v_2 + … + a_W v_W;
S30, predicting the probability distribution of the q-th character from o_q with a fully connected layer.
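Steps S21-S30 can be sketched numerically. The following is a minimal sketch, assuming random stand-ins for the content vectors (which the real encoder produces from the image) and for the counting accumulation scalars (which the real decoder produces with an LSTM, a fully connected layer, and a ReLU); the values of W, D, and L are illustrative, and the final fully connected prediction layer of S30 is omitted:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
W, D, L = 10, 16, 4
v = rng.standard_normal((W, D))   # S21: content vectors v_1..v_W (stand-in)
c = relu(rng.standard_normal(W))  # S23: non-negative counting accumulation scalars
k = np.cumsum(c)                  # S24-S25: counting scalars k_1..k_W, with k_0 = 0

outputs = []
for q in range(1, L + 1):         # S26: decode up to L characters
    score = -np.abs(k - q)        # S27: focusing score, score_w = -|k_w - q|
    a = softmax(score)            # S28: focusing weights
    o_q = a @ v                   # S29: weighted sum of content vectors
    outputs.append(o_q)           # S30 would map o_q to a character distribution
```

Because the increments `c` are non-negative, `k` is nondecreasing, so the peak of the focusing weights moves monotonically to the right as `q` increases.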
Compared with the prior art, the invention has the beneficial effects that:
1) The invention abandons the usual form that existing focusing models use to compute the focusing weights in the decoding stage, and gives a simplified design for the sequential focusing characteristic of the OCR text recognition task: an LSTM module accumulates counting variables, and the focusing weights are obtained by comparing the character index with the counting variables. This is an improvement over the existing calculation scheme.
2) The invention uses a linear rectification function to keep the counting accumulation scalars non-negative, so that the counting scalars increase monotonically. This guarantees a left-to-right ordering of the focus positions from the very start of model training, whereas previous focusing models do not achieve this, and their focus positions have no directionality at the start of training.
3) The method is designed simply and specifically for the OCR single-line text recognition problem, and is less demanding to implement in code.
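Benefit 2 can be checked numerically: clipping the fully connected outputs with a ReLU makes every counting increment non-negative, so the running sum (the counting scalars) can never decrease. A small illustration with arbitrary made-up values:

```python
import numpy as np

# Raw fully connected outputs may be negative; ReLU clips them to zero,
# so the running sum of increments is monotonically nondecreasing.
raw = np.array([0.8, -0.5, 1.2, -0.1, 0.6])
c = np.maximum(raw, 0.0)   # counting increments, c_w >= 0
k = np.cumsum(c)           # counting scalars: [0.8, 0.8, 2.0, 2.0, 2.6]
```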
Drawings
FIG. 1 is an overall schematic diagram of the process.
Fig. 2 is a calculation flow chart of the decoder.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
the invention is further illustrated below with reference to the figures and examples.
Example 1
The overall framework of the counting focusing model is the same as that of a conventional focusing model and consists of two parts: an encoder based on a convolutional neural network (CNN) extracts high-level features from the input image to obtain a high-level feature map; a decoder based on a long short-term memory network (LSTM) and a focusing mechanism (attention mechanism) decodes characters from the high-level feature map in left-to-right order. This is shown in FIG. 1.
The encoder is an ordinary CNN, and its process of extracting high-level features to obtain the high-level feature map is unchanged from the prior art; the main improvement of the recognition method provided by the invention lies in the computation flow of the decoder. As shown in FIG. 2, the computation flow of the decoder is as follows:
S21, splitting the high-level feature map from left to right along the horizontal dimension to obtain W content vectors v_1, v_2, …, v_W, where W is the width of the high-level feature map;
S22, inputting the content vector sequence into an LSTM module to obtain the corresponding W state vectors s_1, s_2, …, s_W;
S23, inputting the state vector sequence into a fully connected layer and applying a linear rectification function to keep the values non-negative, obtaining W counting accumulation scalars c_1, c_2, …, c_W;
S24, setting an initial counting scalar k_0;
S25, successively adding the accumulation scalars obtained in step S23 to the counting scalar in left-to-right order to obtain W counting scalars, namely k_w = k_{w-1} + c_w, where 1 ≤ w ≤ W;
S26, setting a maximum decoding length L, i.e. the number of characters the decoder needs to decode from the high-level feature map;
S27, when decoding the q-th character, where q ≤ L, comparing the index q with every counting scalar and taking the negative of the absolute value of their difference to obtain the focusing score score_w, namely: score_w = -|k_w - q|, where 1 ≤ w ≤ W;
S28, normalizing the W focusing scores with a softmax function to obtain the focusing weights a_w:
a_w = e^(score_w) / [e^(score_1) + e^(score_2) + … + e^(score_W)];
S29, taking a weighted sum of the content vectors with the focusing weights to obtain the feature vector o_q corresponding to the q-th character: o_q = a_1 v_1 + a_2 v_2 + … + a_W v_W;
S30, predicting the probability distribution of the q-th character from o_q with a fully connected layer.
In FIG. 2, the block marked M represents the matching operation of step S27. The training and use of the model are no different from those of existing focusing models.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaust all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.
Claims (1)
1. A method of text recognition based on a counting focus model, the counting focus model comprising an encoder and a decoder, characterized by: the identification method comprises the following steps:
s1, extracting high-level features of an input image by adopting a convolutional neural network-based encoder to obtain a high-level feature map;
S2, decoding characters from the high-level feature map in left-to-right order with a decoder based on the long short-term memory network and the focusing mechanism, specifically as described in steps S21-S30:
S21, splitting the high-level feature map from left to right along the horizontal dimension to obtain W content vectors v_1, v_2, …, v_W, where W is the width of the high-level feature map;
S22, inputting the content vector sequence into an LSTM module to obtain the corresponding W state vectors s_1, s_2, …, s_W;
S23, inputting the state vector sequence into a fully connected layer and applying a linear rectification function to keep the values non-negative, obtaining W counting accumulation scalars c_1, c_2, …, c_W;
S24, setting an initial counting scalar k_0;
S25, successively adding the accumulation scalars obtained in step S23 to the counting scalar in left-to-right order to obtain W counting scalars, namely k_w = k_{w-1} + c_w, where 1 ≤ w ≤ W;
S26, setting a maximum decoding length L, i.e. the number of characters the decoder needs to decode from the high-level feature map;
S27, when decoding the q-th character, where q ≤ L, comparing the index q with every counting scalar and taking the negative of the absolute value of their difference to obtain the focusing score score_w, namely: score_w = -|k_w - q|, where 1 ≤ w ≤ W;
S28, normalizing the W focusing scores with a softmax function to obtain the focusing weights a_w:
a_w = e^(score_w) / [e^(score_1) + e^(score_2) + … + e^(score_W)];
S29, taking a weighted sum of the content vectors with the focusing weights to obtain the feature vector o_q corresponding to the q-th character: o_q = a_1 v_1 + a_2 v_2 + … + a_W v_W;
S30, predicting the probability distribution of the q-th character from o_q with a fully connected layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711431988.7A CN108009539B (en) | 2017-12-26 | 2017-12-26 | Novel text recognition method based on counting focusing model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108009539A CN108009539A (en) | 2018-05-08 |
CN108009539B true CN108009539B (en) | 2021-11-02 |
Family
ID=62061449
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108615036B (en) * | 2018-05-09 | 2021-10-01 | 中国科学技术大学 | Natural scene text recognition method based on convolution attention network |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105740909A (en) * | 2016-02-02 | 2016-07-06 | 华中科技大学 | Text recognition method under natural scene on the basis of spatial transformation |
CN105844239A (en) * | 2016-03-23 | 2016-08-10 | 北京邮电大学 | Method for detecting riot and terror videos based on CNN and LSTM |
CN106570456A (en) * | 2016-10-13 | 2017-04-19 | 华南理工大学 | Handwritten Chinese character recognition method based on full-convolution recursive network |
CN107509031A (en) * | 2017-08-31 | 2017-12-22 | 广东欧珀移动通信有限公司 | Image processing method, device, mobile terminal and computer-readable recording medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7120297B2 (en) * | 2002-04-25 | 2006-10-10 | Microsoft Corporation | Segmented layered image system |
US8036415B2 (en) * | 2007-01-03 | 2011-10-11 | International Business Machines Corporation | Method and system for nano-encoding and decoding information related to printed texts and images on paper and other surfaces |
Non-Patent Citations (1)
Title |
---|
User age recognition method based on dual-channel LSTM; Chen Jing et al.; Journal of Shandong University (Natural Science); 2017-07-31; Vol. 52, No. 7; pp. 91-96, 110 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||