CN108009539B - Novel text recognition method based on counting focusing model - Google Patents
- Publication number
- CN108009539B CN108009539B CN201711431988.7A CN201711431988A CN108009539B CN 108009539 B CN108009539 B CN 108009539B CN 201711431988 A CN201711431988 A CN 201711431988A CN 108009539 B CN108009539 B CN 108009539B
- Authority
- CN
- China
- Prior art keywords
- counting
- focusing
- scalars
- decoder
- level feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/22—Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
- G06V30/153—Segmentation of character regions using recognition of characters or words
Abstract
The invention relates to a novel text recognition method based on a counting focusing model. The counting focusing model comprises an encoder and a decoder, and the recognition method comprises the following steps: S1, extracting high-level features of an input image with an encoder based on a convolutional neural network to obtain a high-level feature map; S2, decoding characters from the high-level feature map in left-to-right order with a decoder based on a long short-term memory network and a focusing mechanism.
Description
Technical Field
The invention belongs to the field of optical character recognition, and particularly relates to a novel text recognition method based on a counting focusing model.
Background
OCR single-line text recognition is the task of recognizing the text content of an input image that contains a single line of text. One of the mainstream models currently used for this task is the attention model (referred to here as the focusing model), whose recognition procedure is:
1) first, extract a high-level feature map of the input image with a convolutional neural network (CNN);
2) use a long short-term memory network (LSTM) to "focus" (attend) on the high-level feature map multiple times, computing the focusing weights (attention weights);
3) take a weighted average of the high-level feature map with the focusing weights, and predict the output text character from the resulting feature vector.
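The generic attention loop described above can be sketched as follows. This is a minimal sketch in which a random score vector stands in for the LSTM-based weight module of step 2, and the shapes W and D are purely illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # shift for numerical stability
    return e / e.sum()

# Stand-in for step 1: a CNN feature map of width W with D-dimensional columns.
W, D = 8, 16
rng = np.random.default_rng(0)
feature_map = rng.standard_normal((W, D))

def attend(feature_map, scores):
    """Steps 2-3: normalize scores into weights, then take a weighted average."""
    weights = softmax(scores)        # attention (focusing) weights over the W columns
    context = weights @ feature_map  # feature vector used to predict one character
    return weights, context

scores = rng.standard_normal(W)      # stand-in for the score module's output
weights, context = attend(feature_map, scores)
```

In the real model, `scores` would depend on the feature map, the previous weights, and the previous LSTM state, as described in the next paragraph.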
In existing focusing models, the inputs to the module that computes the focusing weights generally include: the feature map extracted by the CNN, the focusing weights of the previous step, and the LSTM state vector of the previous step.
Existing focusing models make no assumption about the relative order of successive focus positions, so they are better suited to the image captioning problem, which is more general than OCR text recognition. Image captioning is the task of producing a textual description of an input image, and OCR text recognition can be regarded as a specific image captioning task. The ordering of focus positions in image captioning can be very flexible, whereas in OCR text recognition it is generally directional (left to right or top to bottom). Existing focusing models do not explicitly model this directionality, so the model must learn during training to focus from left to right or from top to bottom.
Meanwhile, the module that computes the focusing weights in existing focusing models is overly complex in design and demanding to implement in code.
Disclosure of Invention
The invention provides a novel text recognition method based on focusing weights, aiming to solve the technical defects that the prior art makes no assumption about the relative order of successive focus positions, that training is complicated because the model must learn to focus from left to right or from top to bottom, and that the module for computing the focusing weights is overly complex in design.
To achieve this purpose, the technical scheme is as follows:
a novel text recognition method based on a counting focus model, the counting focus model comprising an encoder and a decoder, the recognition method comprising the steps of:
s1, extracting high-level features of an input image by adopting a convolutional neural network-based encoder to obtain a high-level feature map;
S2, decoding characters from the high-level feature map in left-to-right order with a decoder based on a long short-term memory network and a focusing mechanism, specifically as described in steps S21-S30:
S21, splitting the high-level feature map from left to right along the horizontal dimension to obtain W content vectors v_1, v_2, …, v_W, where W is the width of the high-level feature map;
S22, inputting the content vector sequence into an LSTM module to obtain the corresponding W state vectors s_1, s_2, …, s_W;
S23, inputting the state vector sequence into a fully connected layer and applying a linear rectification function to keep the values non-negative, obtaining W counting accumulation scalars c_1, c_2, …, c_W;
S24, setting an initial counting scalar k_0;
S25, successively adding the accumulation scalars obtained in step S23 to the counting scalar in left-to-right order to obtain W counting scalars, namely k_w = k_{w-1} + c_w, where 1 ≤ w ≤ W;
S26, setting a maximum decoding length L, i.e. the number of characters the decoder needs to decode from the high-level feature map;
S27, when decoding the q-th character, where q ≤ L, comparing the index q with every counting scalar and taking the negative of the absolute value of their difference to obtain the focusing score score_w, namely: score_w = -|k_w - q|, where 1 ≤ w ≤ W;
S28, normalizing the W focusing scores with a softmax function to obtain the focusing weights a_w:
a_w = e^(score_w) / [e^(score_1) + e^(score_2) + … + e^(score_W)];
S29, taking a weighted sum of the content vectors with the focusing weights to obtain the feature vector o_q corresponding to the q-th character: o_q = a_1 v_1 + a_2 v_2 + … + a_W v_W;
S30, predicting the probability distribution of the q-th character from o_q with a fully connected layer.
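Steps S21-S30 can be sketched numerically. The following is a minimal sketch, assuming random stand-ins for the content vectors (which the real encoder produces from the image) and for the counting accumulation scalars (which the real decoder produces with an LSTM, a fully connected layer, and a ReLU); the values of W, D, and L are illustrative, and the final fully connected prediction layer of S30 is omitted:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
W, D, L = 10, 16, 4
v = rng.standard_normal((W, D))   # S21: content vectors v_1..v_W (stand-in)
c = relu(rng.standard_normal(W))  # S23: non-negative counting accumulation scalars
k = np.cumsum(c)                  # S24-S25: counting scalars k_1..k_W, with k_0 = 0

outputs = []
for q in range(1, L + 1):         # S26: decode up to L characters
    score = -np.abs(k - q)        # S27: focusing score, score_w = -|k_w - q|
    a = softmax(score)            # S28: focusing weights
    o_q = a @ v                   # S29: weighted sum of content vectors
    outputs.append(o_q)           # S30 would map o_q to a character distribution
```

Because the increments `c` are non-negative, `k` is nondecreasing, so the peak of the focusing weights moves monotonically to the right as `q` increases.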
Compared with the prior art, the invention has the beneficial effects that:
1) The invention abandons the usual form that existing focusing models use to compute the focusing weights in the decoding stage, and gives a simplified design for the sequential focusing characteristic of the OCR text recognition task: an LSTM module accumulates counting variables, and the focusing weights are obtained by comparing the character index with the counting variables. This is an improvement over the existing calculation scheme.
2) The invention uses a linear rectification function to keep the counting accumulation scalars non-negative, so that the counting scalars increase monotonically. This guarantees a left-to-right ordering of the focus positions from the very start of model training, whereas previous focusing models do not achieve this, and their focus positions have no directionality at the start of training.
3) The method is designed simply and specifically for the OCR single-line text recognition problem, and is less demanding to implement in code.
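Benefit 2 can be checked numerically: clipping the fully connected outputs with a ReLU makes every counting increment non-negative, so the running sum (the counting scalars) can never decrease. A small illustration with arbitrary made-up values:

```python
import numpy as np

# Raw fully connected outputs may be negative; ReLU clips them to zero,
# so the running sum of increments is monotonically nondecreasing.
raw = np.array([0.8, -0.5, 1.2, -0.1, 0.6])
c = np.maximum(raw, 0.0)   # counting increments, c_w >= 0
k = np.cumsum(c)           # counting scalars: [0.8, 0.8, 2.0, 2.0, 2.6]
```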
Drawings
FIG. 1 is an overall schematic diagram of the process.
Fig. 2 is a calculation flow chart of the decoder.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
the invention is further illustrated below with reference to the figures and examples.
Example 1
The overall framework of the counting focusing model is the same as that of a conventional focusing model and consists of two parts: an encoder based on a convolutional neural network (CNN) extracts high-level features from the input image to obtain a high-level feature map; a decoder based on a long short-term memory network (LSTM) and a focusing mechanism (attention mechanism) decodes characters from the high-level feature map in left-to-right order. This is shown in FIG. 1.
The encoder is an ordinary CNN, and its process of extracting high-level features to obtain the high-level feature map is unchanged from the prior art; the main improvement of the recognition method provided by the invention lies in the computation flow of the decoder. As shown in FIG. 2, the computation flow of the decoder is as follows:
S21, splitting the high-level feature map from left to right along the horizontal dimension to obtain W content vectors v_1, v_2, …, v_W, where W is the width of the high-level feature map;
S22, inputting the content vector sequence into an LSTM module to obtain the corresponding W state vectors s_1, s_2, …, s_W;
S23, inputting the state vector sequence into a fully connected layer and applying a linear rectification function to keep the values non-negative, obtaining W counting accumulation scalars c_1, c_2, …, c_W;
S24, setting an initial counting scalar k_0;
S25, successively adding the accumulation scalars obtained in step S23 to the counting scalar in left-to-right order to obtain W counting scalars, namely k_w = k_{w-1} + c_w, where 1 ≤ w ≤ W;
S26, setting a maximum decoding length L, i.e. the number of characters the decoder needs to decode from the high-level feature map;
S27, when decoding the q-th character, where q ≤ L, comparing the index q with every counting scalar and taking the negative of the absolute value of their difference to obtain the focusing score score_w, namely: score_w = -|k_w - q|, where 1 ≤ w ≤ W;
S28, normalizing the W focusing scores with a softmax function to obtain the focusing weights a_w:
a_w = e^(score_w) / [e^(score_1) + e^(score_2) + … + e^(score_W)];
S29, taking a weighted sum of the content vectors with the focusing weights to obtain the feature vector o_q corresponding to the q-th character: o_q = a_1 v_1 + a_2 v_2 + … + a_W v_W;
S30, predicting the probability distribution of the q-th character from o_q with a fully connected layer.
In FIG. 2, the block marked M represents the matching operation of step S27. The training and use of the model are no different from those of existing focusing models.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaust all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.
Claims (1)
1. A method of text recognition based on a counting focus model, the counting focus model comprising an encoder and a decoder, characterized by: the identification method comprises the following steps:
s1, extracting high-level features of an input image by adopting a convolutional neural network-based encoder to obtain a high-level feature map;
S2, decoding characters from the high-level feature map in left-to-right order with a decoder based on the long short-term memory network and the focusing mechanism, specifically as described in steps S21-S30:
S21, splitting the high-level feature map from left to right along the horizontal dimension to obtain W content vectors v_1, v_2, …, v_W, where W is the width of the high-level feature map;
S22, inputting the content vector sequence into an LSTM module to obtain the corresponding W state vectors s_1, s_2, …, s_W;
S23, inputting the state vector sequence into a fully connected layer and applying a linear rectification function to keep the values non-negative, obtaining W counting accumulation scalars c_1, c_2, …, c_W;
S24, setting an initial counting scalar k_0;
S25, successively adding the accumulation scalars obtained in step S23 to the counting scalar in left-to-right order to obtain W counting scalars, namely k_w = k_{w-1} + c_w, where 1 ≤ w ≤ W;
S26, setting a maximum decoding length L, i.e. the number of characters the decoder needs to decode from the high-level feature map;
S27, when decoding the q-th character, where q ≤ L, comparing the index q with every counting scalar and taking the negative of the absolute value of their difference to obtain the focusing score score_w, namely: score_w = -|k_w - q|, where 1 ≤ w ≤ W;
S28, normalizing the W focusing scores with a softmax function to obtain the focusing weights a_w:
a_w = e^(score_w) / [e^(score_1) + e^(score_2) + … + e^(score_W)];
S29, taking a weighted sum of the content vectors with the focusing weights to obtain the feature vector o_q corresponding to the q-th character: o_q = a_1 v_1 + a_2 v_2 + … + a_W v_W;
S30, predicting the probability distribution of the q-th character from o_q with a fully connected layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711431988.7A CN108009539B (en) | 2017-12-26 | 2017-12-26 | Novel text recognition method based on counting focusing model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108009539A CN108009539A (en) | 2018-05-08 |
CN108009539B true CN108009539B (en) | 2021-11-02 |
Family
ID=62061449
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108615036B (en) * | 2018-05-09 | 2021-10-01 | 中国科学技术大学 | Natural scene text recognition method based on convolution attention network |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105740909A (en) * | 2016-02-02 | 2016-07-06 | 华中科技大学 | Text recognition method under natural scene on the basis of spatial transformation |
CN105844239A (en) * | 2016-03-23 | 2016-08-10 | 北京邮电大学 | Method for detecting riot and terror videos based on CNN and LSTM |
CN106570456A (en) * | 2016-10-13 | 2017-04-19 | 华南理工大学 | Handwritten Chinese character recognition method based on full-convolution recursive network |
CN107509031A (en) * | 2017-08-31 | 2017-12-22 | 广东欧珀移动通信有限公司 | Image processing method, device, mobile terminal and computer-readable recording medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7120297B2 (en) * | 2002-04-25 | 2006-10-10 | Microsoft Corporation | Segmented layered image system |
US8036415B2 (en) * | 2007-01-03 | 2011-10-11 | International Business Machines Corporation | Method and system for nano-encoding and decoding information related to printed texts and images on paper and other surfaces |
Non-Patent Citations (1)
Title |
---|
User age recognition method based on dual-channel LSTM; Chen Jing et al.; Journal of Shandong University (Natural Science); 2017-07-31; Vol. 52, No. 7; pp. 91-96, 110 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||