CN114299510A - Handwritten English line recognition system


Info

Publication number
CN114299510A
Authority
CN
China
Prior art keywords
module
semantic
decoding
channel
visual
Prior art date
Legal status
Pending
Application number
CN202210217783.3A
Other languages
Chinese (zh)
Inventor
许信顺
谭玉慧
马磊
陈义学
Current Assignee
SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Original Assignee
SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Priority date
Filing date
Publication date
Application filed by SHANDONG SHANDA OUMA SOFTWARE CO Ltd filed Critical SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Priority to CN202210217783.3A priority Critical patent/CN114299510A/en
Publication of CN114299510A publication Critical patent/CN114299510A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a handwritten English line recognition system, and belongs to the technical field of text recognition. The system comprises a vision module, a semantic module and a fusion module. The vision module is used for extracting the spatial features of a text image of a handwritten English line with a ResNet network, decoding with CTC-based and Attention-based models, and outputting character-level decoding and word-level decoding. The semantic module is used for taking the output probabilities of the Attention-based character-level decoding in the vision module as input and explicitly learning the latent semantic information in an English line by correcting the prediction sequence under a gradient-truncation strategy. The fusion module is used for combining the visual information extracted by the vision module and the semantic information extracted by the semantic module with a gate mechanism so as to generate the prediction result.

Description

Handwritten English line recognition system
Technical Field
The invention relates to the technical field of text recognition, in particular to a handwritten English line recognition system.
Background
Text recognition is a very active area of research in computer vision and pattern recognition. Storing scanned handwritten text as images requires a very large amount of storage space, whereas transcribing the image content with text recognition technology makes storage far more convenient; similarly, handwritten text sometimes has to be entered into a system manually, and automatic entry based on text recognition saves a great deal of human labor.
Text recognition methods fall into two main categories: segmentation-based methods and segmentation-free methods. A segmentation-based method first locates the position of each character in the text image, then recognizes each character with a character classifier, and finally combines all characters into the final result. This approach has clear limitations: each character must be located accurately, so the final result depends heavily on the quality of the segmented characters, and because every character is treated as an independent unit, additional information shared between characters cannot be exploited. A segmentation-free method treats the whole text image as a single unit and aims to learn a mapping from the text image to the target character sequence, which avoids character segmentation altogether. Within this category, decoding can be further divided into CTC-based and Attention-based recognition methods. A CTC-based method searches over all possible alignments during prediction, so the text image and the output sequence do not need to be aligned in advance for training; an Attention-based method can selectively focus on the relevant part of the encoded features at each decoding step, learning the alignment between the text image and the output sequence from the target character history and the encoded features, which makes decoding more flexible.
Existing text recognition methods are mainly designed for natural scenes; when recognizing English lines they fail to fully exploit the different granularities of information within a line, so information in the text is wasted. Moreover, unlike single English words, English lines contain rich semantic information. Existing methods rely only on the visual information of the text image and therefore perform poorly when recognizing handwritten English lines.
Disclosure of Invention
Aiming at the problems in the prior art, the invention aims to provide a handwritten English line recognition system which fuses explicitly modeled semantic information with visual information before making the final prediction, thereby effectively improving the recognition of handwritten English lines.
In order to achieve the purpose, the invention is realized by the following technical scheme:
A handwritten English line recognition system, comprising: a vision module, a semantic module and a fusion module;
the visual module is used for extracting the spatial characteristics of a text image of a handwritten English line by using a ResNet network, decoding by using a model based on CTC and Attention, and outputting character-level decoding and word-level decoding as visual information;
the semantic module is used for using the output probability of the Attention-based character-level decoding output in the visual module as input and explicitly learning potential semantic information in an English line in a mode of correcting a prediction sequence by using a gradient truncation strategy;
and the fusion module is used for combining the visual information extracted by the visual module and the semantic information extracted by the semantic module by using a gate mechanism, and predicting with a preset formula to generate a prediction result.
Further, the vision module comprises: a preprocessing unit, an image feature encoding unit and a decoding unit;
the preprocessing unit is used for preprocessing the text image and the label of the text image;
the image feature encoding unit is used for extracting features with a ResNet network whose shortcut connections allow shallow layers to be updated effectively, adding a channel attention module to the ResNet network that performs Squeeze and Excitation operations to extract a global feature representation of the image, and finally extracting the time sequence features in the text image with a two-layer bidirectional LSTM network;
the decoding unit is used for decoding the time sequence characteristics by using the CTC-based model and the Attention-based model so as to obtain corresponding characters and words.
Further, the semantic module comprises an encoder based on a bidirectional LSTM network and a decoder based on an LSTM network; it uses the output probability vectors of the Attention-based character-level decoding as input and, by truncating the gradient flow, models the latent semantic relationships in handwritten English lines while correcting the predicted text.
Further, the fusion module is specifically configured to:
automatically learning the alignment between the visual information and the semantic information by using a door mechanism;
the adopted preset formula is specifically as follows:
$$F = G \odot f_v + (1 - G) \odot f_s, \qquad G = \delta\big(W_g\,[\,f_v \,;\, f_s\,]\big)$$

wherein $f_v$ and $f_s$ respectively represent the visual features and the semantic features, $W_g$ is a learnable linear mapping, $\delta$ is the Sigmoid function, and F is the fused feature; the final prediction result is then obtained through a fully connected layer and softmax.
Further, the preprocessing unit is specifically configured to:
respectively setting the width and the height of the text image as a width preset value and a height preset value, and carrying out normalization processing on the text image;
converting the text image into a gray graph form, so that each pixel point only has one component;
for the labels of text images, dividing the labels into character level and word level according to the different granularities in an English line, and simultaneously constructing a character dictionary containing all upper-case and lower-case letters, digits and punctuation marks, and a word dictionary containing all words in the data set;
and mapping the labels of the image according to the character dictionary and the word dictionary to obtain two kinds of labels and using the labels as the supervision information of the model.
Further, the label of the text image has a fixed length, and when the label does not reach the fixed length, it is padded with an 'End' symbol.
Further, the Squeeze operation includes: extracting a global feature representation of the text image, using global average pooling to obtain channel-level global features of the feature map; for an H × W × C feature map F, where H, W and C represent the height, width and number of channels of the feature map respectively, global average pooling of each H × W channel map yields a 1 × 1 × C feature; the formula used is as follows:

$$S_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} f_c(i, j)$$

wherein $f_c$ denotes the H × W feature of feature map F on the c-th channel, and $S_c$ denotes the 1 × 1 receptive field obtained on the c-th channel after global average pooling.
Further, the Excitation operation includes:
predicting the importance of each channel with fully connected layers to obtain the correlation between channels; the formula used is as follows:

$$E = \delta\big(W_2\, \sigma(W_1 S)\big)$$

wherein σ and δ respectively represent the ReLU and Sigmoid activation functions, $W_1$ and $W_2$ represent fully connected layers ($W_1$ reduces the dimension from C to C/r and $W_2$ restores it), and r is the dimensionality-reduction coefficient, a hyper-parameter; the final dimension of E is 1 × C, representing the weight values of the C channels, with different weight values indicating the importance of the corresponding channels;
finally, after the weight value of each channel is obtained, the final result is obtained by weighting the channels with the following formula:

$$\tilde{f}_c = E_c \cdot f_c$$

wherein $f_c$ denotes the feature of the c-th channel and $E_c$ denotes the weight of the c-th channel; multiplying the two gives the channel-weighted feature map $\tilde{F}$.
Compared with the prior art, the invention has the beneficial effects that:
1. By adding a channel attention module in the feature extraction process, the invention can learn the importance of each channel of the feature map and give higher weights to channels that play a key role in recognition;
2. By combining different decoding modes in the feature decoding process, the invention gives full play to the advantages of each decoding mode, so that the model is optimized in a more correct direction;
3. The invention makes full use of the information of different granularities in an English line, namely character level and word level, during feature decoding; word-level decoding assists the alignment of character-level decoding in a weakly supervised manner, so that character-level decoding pays more attention to detail information;
4. The invention not only uses the visual information of the text image but also explicitly models the latent semantic information in the English line, giving strong interpretability;
5. The invention uses a gate mechanism to fuse the visual features and semantic features of different modalities, thereby obtaining more accurate recognition results.
Therefore, compared with the prior art, the invention has prominent substantive features and remarkable progress, and the beneficial effects of the implementation are also obvious.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a system block diagram of an embodiment of the present invention.
FIG. 2 is a schematic diagram of a channel attention module in accordance with an embodiment of the present invention.
Fig. 3 is a schematic diagram of a bi-directional LSTM network in accordance with an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made with reference to the accompanying drawings.
A handwritten English line recognition system as shown in FIG. 1 comprises a vision module, a semantic module and a fusion module.
The visual module is used for extracting the spatial characteristics of a text image of a handwritten English line by using a ResNet network, decoding by using a model based on CTC and Attention, and outputting character-level decoding and word-level decoding as visual information.
Specifically, the vision module comprises a preprocessing unit, an image feature encoding unit and a decoding unit.
The preprocessing unit is used for preprocessing the text image and the label of the text image. Because English lines differ in length, the width and height of each image differ as well; in order to process text images in batches, the preprocessing unit first unifies the width and height of all text images: the width and height of each text image are set to a preset width value and a preset height value respectively, and the text image is normalized. Meanwhile, a handwritten text image is a three-channel color image in which the color of each pixel is determined by the three components R, G and B; each component has 256 possible values, so a single pixel can take more than 16 million different colors. Given the particularity of handwritten text images, the values of the three components of each pixel in the scanned image are equal, so the handwritten text image is converted into grayscale form, in which each pixel has only one component; this reduces the subsequent amount of image computation without affecting the overall effect.
The preprocessing unit divides the labels of the text image into character level and word level according to the different granularities in an English line, and simultaneously constructs a character dictionary containing all upper-case and lower-case letters, digits and punctuation marks, together with a word dictionary containing all words in the data set. The labels of the image are then mapped according to the character dictionary and the word dictionary, yielding two kinds of labels that are used as the supervision information of the model. As with the images, the labels are unified to a fixed length for batch processing, and labels that are too short are padded with a special 'End' symbol.
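As a non-limiting illustration of the preprocessing described above, the following Python sketch resizes and normalizes a line image, converts it to grayscale, and maps a label to fixed-length character indices padded with an 'End' symbol; the preset sizes, the dictionary contents and all function names are assumptions rather than part of this disclosure.

```python
# Illustrative preprocessing sketch (PIL + NumPy); sizes and names are assumptions.
import numpy as np
from PIL import Image

TARGET_W, TARGET_H = 1024, 64          # assumed preset width / height values
CHARS = list("abcdefghijklmnopqrstuvwxyz"
             "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,;:'\"!?-()")  # assumed character dictionary
char_dict = {c: i for i, c in enumerate(CHARS)}
END = len(char_dict)                    # special 'End' padding symbol

def preprocess_image(path):
    """Resize to the preset width/height, convert to grayscale, normalize to [0, 1]."""
    img = Image.open(path).convert("L")            # one component per pixel
    img = img.resize((TARGET_W, TARGET_H))
    return np.asarray(img, dtype=np.float32) / 255.0

def encode_label(text, max_len=128):
    """Map a line label to character-level indices, padded to a fixed length with 'End'."""
    ids = [char_dict[c] for c in text if c in char_dict]
    return np.asarray(ids[:max_len] + [END] * max(0, max_len - len(ids)), dtype=np.int64)

word_dict = {}                                      # built by scanning all words in the data set
def encode_word_label(text):
    """Map a line label to word-level indices."""
    return [word_dict.setdefault(w, len(word_dict)) for w in text.split()]
```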
The image feature encoding unit is used for extracting features with a ResNet network whose shortcut connections allow shallow layers to be updated effectively, adding a channel attention module to the ResNet network that performs Squeeze and Excitation operations to extract a global feature representation of the image, and finally extracting the time sequence features in the text image with a two-layer bidirectional LSTM network.
Generally, a convolutional neural network captures detailed image information in its shallow layers and high-level abstract features in its deep layers. To fully extract the various kinds of information in the image, a deeper network needs to be designed. However, an ordinary convolutional neural network suffers from vanishing gradients as the number of layers increases: the gradient shrinks by several orders of magnitude during back-propagation, so the gradient reaching the shallow layers is tiny and the shallow weights cannot be trained well. To solve this problem, the image feature encoding unit uses ResNet as its backbone; the shortcut connections create a shortcut for gradient propagation between layers, so the gradient can be passed directly to the shallow layers and the shallow network can be updated well.
In order to improve the accuracy of information extraction, the image feature encoding unit adds a channel attention module (SE) to the ResNet network, and the structure of the SE is shown in fig. 2. The ordinary convolution operation carries out summation operation on convolution results of all channels, so that the characteristic relation of the channel level and the spatial relation learned by the convolution kernel are mixed together, and the information of the channel level cannot be utilized independently. To this end, the relationship between the channels and the different channel weights may be learned using the SE module. Specifically, the SE module is divided into the operations of Squeeze and Excitation:
Squeeze operation: this operation extracts a global feature representation of the image by applying Global Average Pooling, so that the feature map obtains channel-level global features. For an H × W × C feature map F, where H, W and C denote the height, width and number of channels respectively, global average pooling of each H × W channel map yields a 1 × 1 × C feature, and the receptive field becomes wider, i.e. a 1 × 1 global receptive field is obtained on every channel. The formula is as follows:

$$S_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} f_c(i, j)$$

wherein $f_c$ denotes the H × W feature of feature map F on the c-th channel, and $S_c$ denotes the 1 × 1 receptive field obtained on the c-th channel after global average pooling.
Excitation operation: after the global feature representation is obtained by the Squeeze operation, the Excitation operation is used to capture the relationship between channels. Fully connected layers predict the importance of each channel and obtain the correlation between channels. The formula is as follows:

$$E = \delta\big(W_2\, \sigma(W_1 S)\big)$$

wherein σ and δ respectively represent the ReLU and Sigmoid activation functions, $W_1$ and $W_2$ represent fully connected layers ($W_1$ reduces the dimension from C to C/r and $W_2$ restores it), and r is the dimensionality-reduction coefficient, a hyper-parameter; the final dimension of E is 1 × C, representing the weight values of the C channels, with different weight values indicating how important the corresponding channels are.
Finally, after the weight value of each channel is obtained, the channels are weighted to obtain the final result. The formula is as follows:

$$\tilde{f}_c = E_c \cdot f_c$$

wherein $f_c$ denotes the feature of the c-th channel and $E_c$ denotes the weight of the c-th channel; multiplying the two gives the channel-weighted feature map $\tilde{F}$. In this way the model pays more attention to channel features that carry much information and play a key role in recognition, while suppressing unimportant channel features. Embedding the SE module in the ResNet network improves the model's sensitivity to channel features, and the module is very lightweight, bringing a performance gain at very little computational cost.
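As an illustrative sketch of the channel attention module embedded in a ResNet block (a possible PyTorch implementation; the layer sizes, class names and the basic-block structure are assumptions), the Squeeze, Excitation and channel-weighting steps may be written as:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel attention: Squeeze (global average pooling) + Excitation (two FC layers)."""
    def __init__(self, channels, r=16):               # r: dimensionality-reduction coefficient
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # Squeeze: (H, W, C) -> (1, 1, C)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),                     # sigma
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),                              # delta
        )

    def forward(self, f):                              # f: (N, C, H, W)
        s = self.pool(f).flatten(1)                    # S: (N, C)
        e = self.fc(s).view(f.size(0), -1, 1, 1)       # E: per-channel weights in (0, 1)
        return f * e                                   # weight each channel of the feature map

class SEResidualBlock(nn.Module):
    """ResNet basic block with a shortcut connection and an embedded SE module."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.se = SEBlock(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.se(self.bn2(self.conv2(out)))
        return self.relu(out + x)                      # shortcut lets gradients reach shallow layers
```

Placing the SE module after the second convolution and before the shortcut addition mirrors the description above; the coefficient r is the hyper-parameter mentioned in the Excitation operation.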
English contains important time sequence information, such as the order of characters and words, which is very helpful for the subsequent recognition process, but a convolutional neural network has limited ability to express such information; a recurrent neural network is therefore added after the convolutional neural network to extract the temporal information of the text. Meanwhile, similar to an English cloze test, filling a blank requires considering not only the content before it but also the content after it, so the image feature encoding unit uses a bidirectional Long Short-Term Memory network (LSTM).
The structure of the bidirectional LSTM is shown in FIG. 3: $\{f_0, f_1, \ldots, f_n\}$ are the feature vectors extracted by the convolutional neural network, $\{h_0, h_1, \ldots, h_n\}$ are the hidden states at each time step of the forward (front-to-back) pass of the bidirectional LSTM, $\{h'_0, h'_1, \ldots, h'_n\}$ are the hidden states at each time step of the backward (back-to-front) pass, and $\{m_0, m_1, \ldots, m_n\}$ are the feature vectors carrying both forward and backward time sequence information after bidirectional LSTM encoding. To obtain more refined features, the image feature encoding unit uses a two-layer bidirectional LSTM structure. The specific formula is as follows:

$$m_i = [\,h_i \,;\, h'_i\,]$$

That is, the final feature vector $m_i$ combines the hidden states $h_i$ and $h'_i$ of the two directions, so the model can use more complete information when making predictions and obtain more accurate results.
A decoding unit for decoding the time-series characteristics using the CTC-based and Attention-based models to obtain corresponding characters and words.
The decoding unit is specifically configured to decode the obtained time sequence features. Two decoding approaches are generally used: one is based on Connectionist Temporal Classification (CTC), and the other is based on the Attention mechanism.
The CTC-based decoding method searches for all possible alignments. Its output dimension is the number of character classes plus one extra 'blank' character, and a mapping function B converts a path π into the final sequence l by removing repeated characters and removing 'blank'. For example, the paths π₁ = '--stta-t--e' and π₂ = 'sst-aa-t-e' (where '-' stands for 'blank') are both mapped to the sequence l = 'state'. CTC therefore needs to find all paths π that yield the sequence l under the transformation B. The specific formulas are as follows:

$$p(\pi \mid x) = \prod_{t=1}^{T} y_{\pi_t}^{t}$$

wherein $y_{\pi_t}^{t}$ denotes the probability of emitting character $\pi_t$ at time step t; multiplying these probabilities over all time steps gives the probability $p(\pi \mid x)$ of the whole path π.

$$p(l \mid x) = \sum_{\pi \in B^{-1}(l)} p(\pi \mid x)$$

wherein $B^{-1}(l)$ denotes the set of paths π that are mapped to the sequence l by the transformation B; summing over all such paths gives the total probability $p(l \mid x)$ of the sequence l. The model is then optimized by minimizing the negative log-likelihood of the conditional probability of the ground-truth label sequence l.
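As an illustration, the CTC objective described above corresponds to the standard CTC loss available in PyTorch; the tensor shapes below are assumptions chosen only for the example:

```python
import torch
import torch.nn as nn

# Assumed sizes: T = 100 time steps, N = 4 batch items, 79 character classes + 1 'blank' (index 0).
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

log_probs = torch.randn(100, 4, 80).log_softmax(2)     # (T, N, C+1) per-step log probabilities
targets = torch.randint(1, 80, (4, 30))                 # ground-truth label sequences l
input_lengths = torch.full((4,), 100, dtype=torch.long)
target_lengths = torch.full((4,), 30, dtype=torch.long)

# -log p(l | x), where p(l | x) sums over all paths mapped to l by the transformation B
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```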
The Attention-based decoding method is usually embedded in an LSTM. An ordinary LSTM decoder uses the same fixed-length time sequence features at every decoding step and therefore cannot mine deeper information. With Attention added, each decoding step can selectively focus on the relevant part of the time sequence features, mining information that helps improve recognition accuracy and allowing the model to converge faster. The specific formulas are as follows:

$$e_{t,j} = \mathrm{score}(s_{t-1}, h_j), \qquad \alpha_{t,j} = \frac{\exp(e_{t,j})}{\sum_{k}\exp(e_{t,k})}, \qquad c_t = \sum_{j} \alpha_{t,j}\, h_j$$

wherein $s_{t-1}$ and $h_j$ denote the feature vectors during decoding and encoding respectively, $e_{t,j}$ is their alignment score, and $\alpha_{t,j}$ is the attention weight indicating how much attention should be paid to the j-th encoded feature at the t-th decoding step; the attention weights $\alpha_{t,j}$ are then multiplied by the encoded feature vectors $h_j$ to obtain the context vector $c_t$, from which the final prediction $y_t$ is produced.
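A simplified sketch of one Attention-based decoding step (the score function is realized here as a learned linear layer over the concatenated states; all class names and dimensions are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoder(nn.Module):
    """One Attention-based decoder; at each step it attends over the encoded features h_j."""
    def __init__(self, enc_dim, hidden, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.cell = nn.LSTMCell(enc_dim + hidden, hidden)
        self.score = nn.Linear(enc_dim + hidden, 1)        # e_{t,j} = score(s_{t-1}, h_j)
        self.out = nn.Linear(hidden, vocab_size)

    def step(self, prev_token, state, enc):                # enc: (N, T, enc_dim)
        # state: tuple (s_{t-1}, cell_{t-1}); e.g. two zero tensors of shape (N, hidden) at t = 0
        s_prev = state[0]
        e = self.score(torch.cat(
            [enc, s_prev.unsqueeze(1).expand(-1, enc.size(1), -1)], dim=-1)).squeeze(-1)
        alpha = F.softmax(e, dim=-1)                       # attention weights alpha_{t,j}
        c = torch.bmm(alpha.unsqueeze(1), enc).squeeze(1)  # context vector c_t
        state = self.cell(torch.cat([self.embed(prev_token), c], dim=-1), state)
        return self.out(state[0]), state                   # logits for y_t and new state
```

Other alignment scores (e.g. additive attention) fit the same interface; the single linear scoring layer is only one possible choice.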
At present, a great deal of English recognition work uses Attention-based methods and achieves good results because they can capture some extra context information; however, the flexibility of Attention sometimes causes the attention drift problem, in which the decoder fails to align correctly. Meanwhile, the decoding process of the CTC-based method is simple; it imposes a left-to-right constraint, but only a single character is considered at each decoding step and context information cannot be modeled. The decoding unit therefore combines the CTC-based and Attention-based decoding modes, so the advantages of each can be fully exploited and more accurate recognition results obtained. The decoding unit also adds word-level decoding, i.e. two Attention-based decoders are used, one decoding characters and the other decoding words. Collecting and labeling handwritten English data from real scenes is itself a rather time-consuming task, and this design makes maximal use of the information of different granularities in the English lines. Word-level decoding assists the alignment of character-level decoding in a weakly supervised manner: for example, if the word-level prediction is 'I have an apre' while the ground truth is 'I have an apple', the model judges the whole word 'apre' to be wrong rather than only the character 'l', which gives the character-level decoder a more explicit error signal, so that in subsequent decoding it pays more attention to detail information during recognition. Adding the word-level decoder branch therefore plays a good auxiliary role, making the training of the character-level decoder branch more effective and its predictions more accurate.
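Continuing the previous sketch, character-level and word-level decoding can be realized as two such decoders sharing the same encoded features (the vocabulary sizes below are assumptions):

```python
# Two Attention-based decoders over the same encoded features: one over the
# character dictionary, one over the word dictionary (sizes are assumptions).
char_decoder = AttentionDecoder(enc_dim=512, hidden=256, vocab_size=90)
word_decoder = AttentionDecoder(enc_dim=512, hidden=256, vocab_size=10000)
# Each branch is trained with its own cross-entropy loss (L_c and L_w in the objective below);
# word-level errors act as a weakly supervised signal that helps align character-level decoding.
```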
And the semantic module is used for using the output probability of the Attention-based character-level decoding output in the visual module as input and explicitly learning potential semantic information in the English line in a mode of correcting a prediction sequence by using a gradient truncation strategy.
Long texts such as English lines contain rich semantic information, such as the grammar of sentences and the positional relationships between characters and words. Similarly, when people read English and meet an unfamiliar word, they judge its meaning from the idea of the whole sentence, so semantic information plays an essential role in predicting the sequence correctly. Although the Attention-based decoding method can capture some context-related information in a simple way, that information is coupled inside the model: whether it is learned, and to what degree, is unknown, so the interpretability is weak. The invention therefore provides a semantic module that performs explicit semantic modeling. The module consists of a bidirectional-LSTM-based encoder and an LSTM-based decoder; it takes the output probability vectors of the Attention-based character-level decoding in the vision module as input and uses a gradient-flow truncation strategy, forcing the module to model the latent semantic relationships in the English line while correcting the predicted text. Because the semantic module is modeled explicitly, what it learns is knowable and highly interpretable.
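A minimal sketch of such a semantic module, where detaching the input probability vectors realizes the gradient-truncation strategy (layer sizes, names and the simplified decoder are assumptions):

```python
import torch
import torch.nn as nn

class SemanticModule(nn.Module):
    """BiLSTM encoder + LSTM decoder over the probability vectors of the Attention-based
    character decoder; the input is detached so gradients do not flow back into the
    visual module (gradient-truncation strategy)."""
    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(vocab_size, hidden, bidirectional=True, batch_first=True)
        self.decoder = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, char_probs):          # char_probs: (N, L, vocab_size) output probabilities
        x = char_probs.detach()             # truncate the gradient flow here
        enc, _ = self.encoder(x)
        dec, _ = self.decoder(enc)
        return self.out(dec)                # corrected prediction sequence / semantic features
```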
The fusion module is used for combining the visual information extracted by the vision module and the semantic information extracted by the semantic module by using a gate mechanism, and predicting with a preset formula to generate the prediction result.
Because both the visual features and the semantic features extracted by the model are very important for making accurate predictions, the fusion module fuses the two kinds of information so that both can be used at prediction time. To align the two kinds of information, a gate mechanism (Gated Mechanism) is used, through which the model can automatically learn the alignment between the visual and semantic features. The specific formula is as follows:
$$F = G \odot f_v + (1 - G) \odot f_s, \qquad G = \delta\big(W_g\,[\,f_v \,;\, f_s\,]\big)$$

wherein $f_v$ and $f_s$ respectively represent the visual features and the semantic features, $W_g$ is a learnable linear mapping, $\delta$ is the Sigmoid function, and F is the fused feature; the final prediction result is then obtained through a fully connected layer and softmax.
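A possible realization of the gate mechanism and the final prediction head, under the assumption that the gate is a Sigmoid over a linear mapping of the concatenated features (the exact gate form, names and sizes are assumptions):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Gate mechanism combining visual features f_v and semantic features f_s."""
    def __init__(self, dim, vocab_size):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.classifier = nn.Linear(dim, vocab_size)

    def forward(self, f_v, f_s):
        g = torch.sigmoid(self.gate(torch.cat([f_v, f_s], dim=-1)))
        fused = g * f_v + (1.0 - g) * f_s                 # F = G * f_v + (1 - G) * f_s
        return self.classifier(fused).softmax(dim=-1)     # fully connected layer + softmax
```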
Additionally, as an example, the present system may also be configured with a model optimization module that implements model optimization. It is specifically used for performing end-to-end training with the following objective function:

$$L = \lambda_t L_t + \lambda_c L_c + \lambda_w L_w + \lambda_s L_s + \lambda_{vs} L_{vs}$$

The objective function consists of five parts, where $L_t$ denotes the loss obtained with the CTC-based decoder, and $L_c$, $L_w$, $L_s$ and $L_{vs}$ denote the decoding losses of the Attention-based character-level decoding, the word-level decoding, the semantic decoding and the feature fusion respectively; $\lambda_t$, $\lambda_c$, $\lambda_w$, $\lambda_s$ and $\lambda_{vs}$ are the corresponding weights.
It can be seen that the present system consists of three main parts: the vision module, the semantic module and the fusion module. The vision module first extracts the spatial features of the text image with ResNet, with a channel attention SE module added inside the ResNet blocks; during decoding it combines the advantages of the CTC and Attention decoding modes, makes full use of the information of different granularities in the English line in the Attention-based decoding, and outputs character-level and word-level decoding. The semantic module takes the output probabilities of the Attention-based character-level decoding in the vision module as input; by correcting the prediction sequence under a gradient-truncation strategy, it learns the latent semantic information in the English line in an explicit way and is therefore highly interpretable. The fusion module combines the visual information extracted by the vision module and the semantic information extracted by the semantic module with a gate mechanism, so information of different modalities can be used at prediction time, which helps obtain more accurate prediction results.
In the embodiments provided by the present invention, it should be understood that the disclosed systems and methods can be implemented in other ways. For example, the above-described system embodiments are merely illustrative; the division of the units is only one logical functional division, and other divisions may be used in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, systems or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each module may exist alone physically, or two or more modules are integrated into one unit.
Similarly, each processing unit in the embodiments of the present invention may be integrated into one functional module, or each processing unit may exist alone physically, or two or more processing units may be integrated into one functional module.
The invention is further described with reference to the accompanying drawings and specific embodiments. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and these equivalents also fall within the scope of the present application.

Claims (8)

1. A handwritten English line recognition system, comprising: a vision module, a semantic module and a fusion module;
the visual module is used for extracting the spatial characteristics of a text image of a handwritten English line by using a ResNet network, decoding by using a model based on CTC and Attention, and outputting character-level decoding and word-level decoding as visual information;
the semantic module is used for using the output probability of the Attention-based character-level decoding output in the visual module as input and explicitly learning potential semantic information in an English line in a mode of correcting a prediction sequence by using a gradient truncation strategy;
and the fusion module is used for combining the visual information extracted by the visual module and the semantic information extracted by the semantic module by using a gate mechanism, and predicting with a preset formula to generate a prediction result.
2. The handwritten English line recognition system according to claim 1, wherein said vision module comprises: a preprocessing unit, an image feature encoding unit and a decoding unit;
the preprocessing unit is used for preprocessing the text image and the label of the text image;
the image feature encoding unit is used for extracting features with a ResNet network whose shortcut connections allow shallow layers to be updated effectively, adding a channel attention module to the ResNet network that performs Squeeze and Excitation operations to extract a global feature representation of the image, and finally extracting the time sequence features in the text image with a two-layer bidirectional LSTM network;
the decoding unit is used for decoding the time sequence characteristics by using the CTC-based model and the Attention-based model so as to obtain corresponding characters and words.
3. The handwritten English line recognition system according to claim 1, wherein the semantic module comprises an encoder based on a bidirectional LSTM network and a decoder based on an LSTM network; and the semantic module uses the output probability vectors of the Attention-based character-level decoding as input and, by truncating the gradient flow, models the latent semantic relationships in handwritten English lines while correcting the predicted text.
4. The handwritten English line recognition system according to claim 1, wherein the fusion module is specifically configured to:
automatically learning the alignment between the visual information and the semantic information by using a gate mechanism;
the adopted preset formula is specifically as follows:
$$F = G \odot f_v + (1 - G) \odot f_s, \qquad G = \delta\big(W_g\,[\,f_v \,;\, f_s\,]\big)$$

wherein $f_v$ and $f_s$ respectively represent the visual features and the semantic features, $W_g$ is a learnable linear mapping, $\delta$ is the Sigmoid function, and F is the fused feature; the final prediction result is then obtained through a fully connected layer and softmax.
5. The handwritten english line recognition system of claim 2, wherein the preprocessing unit is specifically configured to:
respectively setting the width and the height of the text image as a width preset value and a height preset value, and carrying out normalization processing on the text image;
converting the text image into a gray graph form, so that each pixel point only has one component;
for the labels of text images, dividing the labels into character level and word level according to the different granularities in an English line, and simultaneously constructing a character dictionary containing all upper-case and lower-case letters, digits and punctuation marks, and a word dictionary containing all words in the data set;
and mapping the labels of the image according to the character dictionary and the word dictionary to obtain two kinds of labels and using the labels as the supervision information of the model.
6. The handwritten English line recognition system according to claim 5, wherein the label of said text image has a fixed length, and when the label does not reach the fixed length, it is padded with an 'End' symbol.
7. The handwritten English line recognition system according to claim 2, wherein said Squeeze operation comprises: extracting a global feature representation of the text image, using global average pooling to obtain channel-level global features of the feature map; for an H × W × C feature map F, where H, W and C represent the height, width and number of channels of the feature map respectively, global average pooling of each H × W channel map yields a 1 × 1 × C feature; the formula used is as follows:

$$S_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} f_c(i, j)$$

wherein $f_c$ denotes the H × W feature of feature map F on the c-th channel, and $S_c$ denotes the 1 × 1 receptive field obtained on the c-th channel after global average pooling.
8. The handwritten English line recognition system according to claim 7, wherein said Excitation operation comprises:
predicting the importance of each channel with fully connected layers to obtain the correlation between channels; the formula used is as follows:

$$E = \delta\big(W_2\, \sigma(W_1 S)\big)$$

wherein σ and δ respectively represent the ReLU and Sigmoid activation functions, $W_1$ and $W_2$ represent fully connected layers ($W_1$ reduces the dimension from C to C/r and $W_2$ restores it), and r is the dimensionality-reduction coefficient, a hyper-parameter; the final dimension of E is 1 × C, representing the weight values of the C channels, with different weight values indicating the importance of the corresponding channels;
finally, after the weight value of each channel is obtained, the final result is obtained by weighting the channels with the following formula:

$$\tilde{f}_c = E_c \cdot f_c$$

wherein $f_c$ denotes the feature of the c-th channel and $E_c$ denotes the weight of the c-th channel; multiplying the two gives the channel-weighted feature map $\tilde{F}$.
CN202210217783.3A 2022-03-08 2022-03-08 Handwritten English line recognition system Pending CN114299510A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210217783.3A CN114299510A (en) 2022-03-08 2022-03-08 Handwritten English line recognition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210217783.3A CN114299510A (en) 2022-03-08 2022-03-08 Handwritten English line recognition system

Publications (1)

Publication Number Publication Date
CN114299510A true CN114299510A (en) 2022-04-08

Family

ID=80978519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210217783.3A Pending CN114299510A (en) 2022-03-08 2022-03-08 Handwritten English line recognition system

Country Status (1)

Country Link
CN (1) CN114299510A (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107608943A (en) * 2017-09-08 2018-01-19 中国石油大学(华东) Image caption generation method and system fusing visual attention and semantic attention
CN110647612A (en) * 2019-09-18 2020-01-03 合肥工业大学 Visual conversation generation method based on double-visual attention network
CN111368088A (en) * 2020-03-31 2020-07-03 成都信息工程大学 Text emotion classification method based on deep learning
CN111967272A (en) * 2020-06-23 2020-11-20 合肥工业大学 Visual dialog generation system based on semantic alignment
CN112257426A (en) * 2020-10-14 2021-01-22 北京一览群智数据科技有限责任公司 Character recognition method, system, training method, storage medium and equipment
CN112633079A (en) * 2020-12-02 2021-04-09 山东山大鸥玛软件股份有限公司 Handwritten English word recognition method and system
CN113609285A (en) * 2021-08-09 2021-11-05 福州大学 Multi-mode text summarization system based on door control fusion mechanism
CN114092930A (en) * 2022-01-07 2022-02-25 中科视语(北京)科技有限公司 Character recognition method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MOHAMMAD MERAJ KHAN ET AL: "A squeeze and excitation ResNeXt-based deep learning model for Bangla handwritten compound character recognition", 《JOURNAL OF KING SAUD UNIVERSITY》 *
唐伟成: "Handwritten English Character Recognition System", Wanfang Dissertation Full-text Database *
汪洪涛 et al.: "Research on English Text Recognition in Natural Scenes Based on STN-CRNN", Journal of Wuhan University of Technology (Information & Management Engineering Edition) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114581906A (en) * 2022-05-06 2022-06-03 山东大学 Text recognition method and system for natural scene image
CN114581906B (en) * 2022-05-06 2022-08-05 山东大学 Text recognition method and system for natural scene image


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20220408)