CN108009539B - Novel text recognition method based on counting focusing model - Google Patents

Novel text recognition method based on counting focusing model

Info

Publication number
CN108009539B
CN108009539B
Authority
CN
China
Prior art keywords
counting
focusing
scalars
decoder
level feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711431988.7A
Other languages
Chinese (zh)
Other versions
CN108009539A (en)
Inventor
郑华滨 (Zheng Huabin)
潘嵘 (Pan Rong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201711431988.7A priority Critical patent/CN108009539B/en
Publication of CN108009539A publication Critical patent/CN108009539A/en
Application granted granted Critical
Publication of CN108009539B publication Critical patent/CN108009539B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words

Abstract

The invention relates to a novel text recognition method based on a counting focusing model. The counting focusing model comprises an encoder and a decoder, and the recognition method comprises the following steps: S1, extracting high-level features of an input image with a convolutional-neural-network-based encoder to obtain a high-level feature map; S2, decoding characters in sequence, from left to right, from the high-level feature map with a decoder based on a long short-term memory network and a focusing mechanism.

Description

Novel text recognition method based on counting focusing model
Technical Field
The invention belongs to the field of optical character recognition, and particularly relates to a novel text recognition method based on a counting focusing model.
Background
OCR single-line text recognition is the task of recognizing the text content of an input image containing a single line of text. One of the mainstream models currently used on this task is the attention (focusing) model, whose recognition procedure is:
1) first, a convolutional neural network (CNN) extracts a high-level feature map of the input image;
2) a long short-term memory network (LSTM) then "focuses" (attends) on the high-level feature map multiple times, computing focusing weights (attention weights) at each step;
3) the high-level feature map is averaged, weighted by the focusing weights, and the output text character is predicted from the resulting feature vector.
In existing focusing models, the module that computes the focusing weights generally takes as input: the feature map extracted by the CNN, the focusing weights of the previous step, and the LSTM state vector of the previous step.
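As a rough illustration (not the patent's own code), the conventional focusing-weight computation described above might be sketched as follows. All array sizes, projection matrices, and the additive scoring form are assumptions made for this sketch; only the three inputs named in the text are taken from the document:

```python
import numpy as np

rng = np.random.default_rng(0)
W, D, H = 10, 16, 32  # assumed sizes: map width, feature channels, LSTM state

feat = rng.normal(size=(W, D))       # CNN high-level feature map, one column per position
prev_w = np.full(W, 1.0 / W)         # focusing weights from the previous step
prev_state = rng.normal(size=(H,))   # LSTM state vector from the previous step

# Hypothetical additive scoring: project the three inputs into a shared
# space, combine them, and reduce to one scalar score per position.
Pf = rng.normal(size=(D, H))
Pa = rng.normal(size=(1, H))
Ps = rng.normal(size=(H, H))
u = rng.normal(size=(H,))
scores = np.tanh(feat @ Pf + prev_w[:, None] @ Pa + prev_state @ Ps) @ u

weights = np.exp(scores - scores.max())
weights /= weights.sum()             # softmax: focusing weights sum to 1

context = weights @ feat             # weighted average of the feature map
print(context.shape)                 # feature vector used to predict a character
```

The point of the sketch is how many moving parts the generic module needs; the counting model below replaces all of the scoring machinery with a single comparison against a counter.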
Existing focusing models make no assumption about the relative order of successive focus positions, so they are better suited to the more general image captioning problem than to OCR text recognition specifically. Image captioning is the task of producing a textual description of an input image, and OCR text recognition can be regarded as a special case of it. In image captioning the order of focus positions can be very flexible, whereas in OCR text recognition it is generally directional (left to right or top to bottom). Existing focusing models do not explicitly model this directionality, so the model must learn to focus from left to right or top to bottom during training.
Meanwhile, the module that computes the focusing weights in existing focusing models is overly complex in design and places high demands on code implementation.
Disclosure of Invention
The invention provides a novel text recognition method based on counting focusing weights, aiming to remedy the technical defects of the prior art: no assumption is made about the relative order of successive focus positions, training is complicated because the model must learn to focus from left to right or top to bottom, and the module that computes the focusing weights is overly complex in design.
To achieve this aim, the technical scheme is as follows:
A novel text recognition method based on a counting focusing model, the counting focusing model comprising an encoder and a decoder, the recognition method comprising the steps of:
S1, extracting high-level features of an input image with a convolutional-neural-network-based encoder to obtain a high-level feature map;
S2, decoding characters in sequence, from left to right, from the high-level feature map with a decoder based on a long short-term memory network and a focusing mechanism, specifically as in steps S21-S30:
s21, segmenting the high-level feature map from left to right along a transverse dimension to obtain W content vectors v _1, v _2, … and v _ W, wherein W is the width of the high-level feature map;
s22, respectively inputting the content vector sequences into a long LSTM module to obtain corresponding W state vectors s _1, s _2, … and s _ W;
s23, inputting the state vector sequence into a full connection layer, and ensuring the non-negative numerical value of the state vector sequence by using a linear rectification function to obtain W counting accumulation scalars c _1, c _2, … and c _ W;
s24, setting an initial counting scalar k _ 0;
s25, continuously superposing the accumulated scalar obtained in the step S23 on the counting scalar according to the direction from left to right to obtain W counting scalars, namely k _ W = k _ { W-1 } + c _ k, wherein W is more than or equal to 1 and less than or equal to W;
s26, setting a maximum decoding length L representing the number of characters needing to be decoded from the high-level feature diagram by the decoder;
s27, decoding the q-th character, wherein q is less than or equal to L, respectively comparing the index q with all counting scalars, and calculating the inverse of the absolute value of the difference value of the index q and all counting scalars to obtain a focusing fraction s _ w, namely: s _ W = - | k _ W-q |, W is more than or equal to 1 and is less than or equal to W;
s28, normalizing the W focusing scores by using a softmax function to obtain a focusing weight a _ W:
a_w = e (s_w) / [e (s_1) + e (s_2) + … + e (s_W)];
s29, carrying out weighted summation on the content vectors by using the focusing weight to obtain a characteristic vector o _ q corresponding to the qth character: o _ q = a _1 v _1+ a _2 v _2+ … + a _ W v _ W;
and S30, predicting the probability distribution of the q character from the o _ q by utilizing the full connection layer.
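The decoding procedure of steps S21-S30 can be sketched end to end. This is a minimal NumPy sketch under stated assumptions: the feature-map columns are random stand-ins for a real CNN output, the per-column state vectors come from a simple projection where a real implementation would run an LSTM, the layer sizes are arbitrary, and k_0 = 0 is assumed since the patent does not fix the initial value:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: W feature-map columns, D channels, L decoded characters.
W, D, L = 12, 16, 5

# S21: split the feature map into W content vectors v_1..v_W (random here).
v = rng.normal(size=(W, D))

# S22 stand-in: per-column state vectors s_1..s_W. A real implementation
# would run an LSTM over the columns; a fixed projection suffices here.
s = np.tanh(v @ rng.normal(size=(D, D)))

# S23: fully connected layer + ReLU gives non-negative increments c_1..c_W.
fc = rng.normal(size=(D,))
c = np.maximum(0.0, s @ fc)

# S24-S25: cumulative counting scalars k_w = k_{w-1} + c_w, with k_0 = 0.
k = np.cumsum(c)  # non-decreasing by construction

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# S26-S30: decode up to L characters.
outputs = []
for q in range(1, L + 1):
    score = -np.abs(k - q)   # S27: focusing score -|k_w - q|
    a = softmax(score)       # S28: focusing weights
    o_q = a @ v              # S29: weighted sum of content vectors
    outputs.append(o_q)      # S30 would apply a classifier to o_q

outputs = np.stack(outputs)
print(outputs.shape)  # one feature vector per decoded character position
```

The mechanism is that each weight peaks at the column whose counter k_w is closest to the character index q, so positions are visited strictly left to right as q grows.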
Compared with the prior art, the invention has the following beneficial effects:
1) For the computation of focusing weights in the decoding stage, the invention abandons the generic form used by existing focusing models and adopts a simplified design that exploits the sequential focusing property of the OCR text recognition task: an LSTM module accumulates counting variables, and the focusing weights are obtained by comparing the character index with the counting variables. This improves on the existing computation scheme.
2) The invention uses a linear rectification function to keep the counting increment scalars non-negative, so the counting scalars are non-decreasing. This guarantees that the focus positions are ordered from left to right from the very start of training, whereas previous focusing models do not achieve this and the order of focus positions has no directionality at the start of training.
3) The method is designed specifically and simply for OCR single-line text recognition, so it places lower demands on code implementation.
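The monotonicity claim in point 2) can be checked in a few lines: because ReLU makes every increment non-negative, the running sum of counting scalars can never decrease, no matter what the fully connected layer outputs. The sample values below are arbitrary stand-ins for that layer's outputs:

```python
import numpy as np

raw = np.array([0.7, -1.2, 0.4, -0.3, 0.9])  # arbitrary fully-connected outputs
c = np.maximum(0.0, raw)                     # ReLU: counting increments c_w >= 0
k = np.cumsum(c)                             # counting scalars k_w = k_{w-1} + c_w

# Non-negative increments imply a non-decreasing counting sequence.
assert np.all(np.diff(k) >= 0)
print(k)
```

This holds at initialization, before any training, which is exactly why the left-to-right ordering is guaranteed from the start.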
Drawings
FIG. 1 is an overall schematic diagram of the method.
FIG. 2 is a flow chart of the decoder computation.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent.
the invention is further illustrated below with reference to the figures and examples.
Example 1
The overall framework of the counting focusing model is the same as that of a conventional focusing model and consists of two parts: an encoder based on a convolutional neural network (CNN) extracts high-level features from the input image to obtain a high-level feature map; a decoder based on a long short-term memory network (LSTM) and a focusing mechanism (attention mechanism) decodes characters from the high-level feature map in sequence, from left to right. This is shown in FIG. 1.
The encoder is an ordinary CNN, and the process of extracting high-level features to obtain the high-level feature map is unchanged from the prior art. The main improvement of the proposed recognition method lies in the computation flow of the decoder, shown in FIG. 2, which is as follows:
s21, segmenting the high-level feature map from left to right along a transverse dimension to obtain W content vectors v _1, v _2, … and v _ W, wherein W is the width of the high-level feature map;
s22, respectively inputting the content vector sequences into a long LSTM module to obtain corresponding W state vectors s _1, s _2, … and s _ W;
s23, inputting the state vector sequence into a full connection layer, and ensuring the non-negative numerical value of the state vector sequence by using a linear rectification function to obtain W counting accumulation scalars c _1, c _2, … and c _ W;
s24, setting an initial counting scalar k _ 0;
s25, continuously superposing the accumulated scalar obtained in the step S23 on the counting scalar according to the direction from left to right to obtain W counting scalars, namely k _ W = k _ { W-1 } + c _ k, wherein W is more than or equal to 1 and less than or equal to W;
s26, setting a maximum decoding length L representing the number of characters needing to be decoded from the high-level feature diagram by the decoder;
s27, decoding the q-th character, wherein q is less than or equal to L, respectively comparing the index q with all counting scalars, and calculating the inverse of the absolute value of the difference value of the index q and all counting scalars to obtain a focusing fraction s _ w, namely: s _ W = - | k _ W-q |, W is more than or equal to 1 and is less than or equal to W;
s28, normalizing the W focusing scores by using a softmax function to obtain a focusing weight a _ W:
a_w = e (s_w) / [e (s_1) + e (s_2) + … + e (s_W)];
s29, carrying out weighted summation on the content vectors by using the focusing weight to obtain a characteristic vector o _ q corresponding to the qth character: o _ q = a _1 v _1+ a _2 v _2+ … + a _ W v _ W;
and S30, predicting the probability distribution of the q character from the o _ q by utilizing the full connection layer.
In FIG. 2, the block marked M represents the matching (comparison) operation of step S27. The training and usage of the model are no different from those of existing focusing models.
It should be understood that the above-described embodiments are merely examples for clearly illustrating the invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention falls within the protection scope of its claims.

Claims (1)

1. A method of text recognition based on a counting focusing model, the counting focusing model comprising an encoder and a decoder, characterized in that the recognition method comprises the following steps:
S1, extracting high-level features of an input image with a convolutional-neural-network-based encoder to obtain a high-level feature map;
S2, decoding characters in sequence, from left to right, from the high-level feature map with a decoder based on a long short-term memory network and a focusing mechanism, specifically as in steps S21-S30:
S21, splitting the high-level feature map from left to right along the horizontal dimension to obtain W content vectors v_1, v_2, …, v_W, where W is the width of the high-level feature map;
S22, feeding the content vector sequence into an LSTM module to obtain the corresponding W state vectors s_1, s_2, …, s_W;
S23, feeding the state vector sequence into a fully connected layer and applying a linear rectification function to keep the values non-negative, obtaining W counting increment scalars c_1, c_2, …, c_W;
S24, setting an initial counting scalar k_0;
S25, accumulating the increment scalars from step S23 onto the counting scalar from left to right to obtain W counting scalars, i.e. k_w = k_{w-1} + c_w, where 1 ≤ w ≤ W;
S26, setting a maximum decoding length L, the number of characters the decoder needs to decode from the high-level feature map;
S27, when decoding the q-th character (q ≤ L), comparing the index q with every counting scalar by taking the negative of the absolute value of their difference, yielding focusing scores score_w, i.e. score_w = -|k_w - q|, where 1 ≤ w ≤ W;
S28, normalizing the W focusing scores with a softmax function to obtain focusing weights a_w:
a_w = exp(score_w) / [exp(score_1) + exp(score_2) + … + exp(score_W)];
S29, taking the weighted sum of the content vectors with the focusing weights to obtain the feature vector o_q corresponding to the q-th character: o_q = a_1·v_1 + a_2·v_2 + … + a_W·v_W;
S30, predicting the probability distribution of the q-th character from o_q with a fully connected layer.
CN201711431988.7A 2017-12-26 2017-12-26 Novel text recognition method based on counting focusing model Active CN108009539B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711431988.7A CN108009539B (en) 2017-12-26 2017-12-26 Novel text recognition method based on counting focusing model

Publications (2)

Publication Number Publication Date
CN108009539A CN108009539A (en) 2018-05-08
CN108009539B (en) 2021-11-02

Family

ID=62061449

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711431988.7A Active CN108009539B (en) 2017-12-26 2017-12-26 Novel text recognition method based on counting focusing model

Country Status (1)

Country Link
CN (1) CN108009539B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108615036B (en) * 2018-05-09 2021-10-01 中国科学技术大学 Natural scene text recognition method based on convolution attention network

Citations (4)

Publication number Priority date Publication date Assignee Title
CN105740909A (en) * 2016-02-02 2016-07-06 华中科技大学 Text recognition method under natural scene on the basis of spatial transformation
CN105844239A (en) * 2016-03-23 2016-08-10 北京邮电大学 Method for detecting riot and terror videos based on CNN and LSTM
CN106570456A (en) * 2016-10-13 2017-04-19 华南理工大学 Handwritten Chinese character recognition method based on full-convolution recursive network
CN107509031A (en) * 2017-08-31 2017-12-22 广东欧珀移动通信有限公司 Image processing method, device, mobile terminal and computer-readable recording medium

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US7120297B2 (en) * 2002-04-25 2006-10-10 Microsoft Corporation Segmented layered image system
US8036415B2 (en) * 2007-01-03 2011-10-11 International Business Machines Corporation Method and system for nano-encoding and decoding information related to printed texts and images on paper and other surfaces


Non-Patent Citations (1)

Title
User age identification method based on dual-channel LSTM; Chen Jing et al.; Journal of Shandong University (Natural Science); 2017-07-31; Vol. 52, No. 7; pp. 91-96, 110 *


Similar Documents

Publication Publication Date Title
CN109543667B (en) Text recognition method based on attention mechanism
CN103942550B (en) A kind of scene text recognition methods based on sparse coding feature
CN111738251B (en) Optical character recognition method and device fused with language model and electronic equipment
CN111916067A (en) Training method and device of voice recognition model, electronic equipment and storage medium
CN109543181B (en) Named entity model and system based on combination of active learning and deep learning
CN108229582A (en) Entity recognition dual training method is named in a kind of multitask towards medical domain
CN109977918A (en) A kind of target detection and localization optimization method adapted to based on unsupervised domain
CN109543722A (en) A kind of emotion trend forecasting method based on sentiment analysis model
Ahuja et al. Convolutional neural network based american sign language static hand gesture recognition
CN109214001A (en) A kind of semantic matching system of Chinese and method
CN110197279B (en) Transformation model training method, device, equipment and storage medium
CN103984943A (en) Scene text identification method based on Bayesian probability frame
CN110135461B (en) Hierarchical attention perception depth measurement learning-based emotion image retrieval method
CN109993164A (en) A kind of natural scene character recognition method based on RCRNN neural network
CN110796018A (en) Hand motion recognition method based on depth image and color image
CN104778230A (en) Video data segmentation model training method, video data segmenting method, video data segmentation model training device and video data segmenting device
CN116205290A (en) Knowledge distillation method and device based on intermediate feature knowledge fusion
Rao et al. Selfie sign language recognition with multiple features on adaboost multilabel multiclass classifier
CN112818680A (en) Corpus processing method and device, electronic equipment and computer-readable storage medium
CN108009539B (en) Novel text recognition method based on counting focusing model
CN109190471B (en) Attention model method for video monitoring pedestrian search based on natural language description
CN107346207A (en) A kind of dynamic gesture cutting recognition methods based on HMM
CN112084788A (en) Automatic marking method and system for implicit emotional tendency of image captions
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN112131879A (en) Relationship extraction system, method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant