CN114299510A - Handwritten English line recognition system


Info

Publication number
CN114299510A
Authority
CN
China
Prior art keywords
module
semantic
decoding
channel
visual
Prior art date
Legal status
Pending
Application number
CN202210217783.3A
Other languages
Chinese (zh)
Inventor
许信顺
谭玉慧
马磊
陈义学
Current Assignee
SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Original Assignee
SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Priority date
Filing date
Publication date
Application filed by SHANDONG SHANDA OUMA SOFTWARE CO Ltd filed Critical SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Priority to CN202210217783.3A priority Critical patent/CN114299510A/en
Publication of CN114299510A publication Critical patent/CN114299510A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a handwritten English line recognition system, and belongs to the technical field of text recognition. The system comprises a vision module, a semantic module and a fusion module. The vision module is used for extracting the spatial features of a text image of a handwritten English line with a ResNet network, decoding with CTC-based and Attention-based models, and outputting character-level decoding and word-level decoding. The semantic module is used for taking the output probabilities of the Attention-based character-level decoding in the vision module as input and explicitly learning the latent semantic information in an English line by correcting the prediction sequence under a gradient-truncation strategy. The fusion module is used for combining the visual information extracted by the vision module and the semantic information extracted by the semantic module with a gate mechanism so as to generate the prediction result.

Description

Handwritten English line recognition system
Technical Field
The invention relates to the technical field of text recognition, in particular to a handwritten English line recognition system.
Background
Text recognition is a very active area of research in computer vision and pattern recognition. Storing scanned handwritten text as images requires a very large amount of storage space, whereas transcribing the image content with text recognition technology makes storage far more convenient; similarly, handwritten text sometimes has to be entered into a system manually, and automatic entry based on text recognition saves a great deal of human labor.
Text recognition methods fall into two main categories: segmentation-based methods and segmentation-free methods. A segmentation-based method first locates the position of each character in the text image, then recognizes each character with a character classifier, and finally combines all characters into the final result. This approach has clear limitations: each character must be located accurately, so the final result depends heavily on the quality of the segmented characters, and because every character is treated as an independent unit, additional information shared between characters cannot be exploited. A segmentation-free method treats the whole text image as a single unit and aims to learn a mapping from the text image to the target character sequence, which avoids character segmentation altogether. Within this category, decoding can be further divided into CTC-based and Attention-based recognition methods. A CTC-based method searches over all possible alignments during prediction, so the text image and the output sequence do not need to be aligned in advance for training; an Attention-based method can selectively focus on the relevant part of the encoded features at each decoding step, learning the alignment between the text image and the output sequence from the target character history and the encoded features, which makes decoding more flexible.
Existing text recognition methods are mainly designed for natural scenes; when recognizing English lines they fail to fully exploit the different granularities of information within a line, so information in the text is wasted. Moreover, unlike single English words, English lines contain rich semantic information. Existing methods rely only on the visual information of the text image and therefore perform poorly when recognizing handwritten English lines.
Disclosure of Invention
Aiming at the problems in the prior art, the invention aims to provide a handwritten English line recognition system which fuses explicitly modeled semantic information with visual information before making the final prediction, thereby effectively improving the recognition of handwritten English lines.
In order to achieve the purpose, the invention is realized by the following technical scheme:
A handwritten English line recognition system, comprising: a vision module, a semantic module and a fusion module;
the visual module is used for extracting the spatial characteristics of a text image of a handwritten English line by using a ResNet network, decoding by using a model based on CTC and Attention, and outputting character-level decoding and word-level decoding as visual information;
the semantic module is used for using the output probability of the Attention-based character-level decoding output in the visual module as input and explicitly learning potential semantic information in an English line in a mode of correcting a prediction sequence by using a gradient truncation strategy;
and the fusion module is used for combining the visual information extracted by the visual module and the semantic information extracted by the semantic module by using a gate mechanism, and predicting with a preset formula to generate a prediction result.
Further, the vision module comprises: a preprocessing unit, an image feature encoding unit and a decoding unit;
the preprocessing unit is used for preprocessing the text image and the label of the text image;
the image feature encoding unit is used for extracting features with a ResNet network whose shortcut connections allow shallow layers to be updated effectively, adding a channel attention module to the ResNet network that performs Squeeze and Excitation operations to extract a global feature representation of the image, and finally extracting the time sequence features in the text image with a two-layer bidirectional LSTM network;
the decoding unit is used for decoding the time sequence characteristics by using the CTC-based model and the Attention-based model so as to obtain corresponding characters and words.
Further, the semantic module comprises an encoder based on a bidirectional LSTM network and a decoder based on an LSTM network; it uses the output probability vectors of the Attention-based character-level decoding as input and, by truncating the gradient flow, models the latent semantic relationships in handwritten English lines while correcting the predicted text.
Further, the fusion module is specifically configured to:
automatically learning the alignment between the visual information and the semantic information by using a door mechanism;
the adopted preset formula is specifically as follows:
$$F = G \odot f_v + (1 - G) \odot f_s, \qquad G = \delta\big(W_g\,[\,f_v \,;\, f_s\,]\big)$$

wherein $f_v$ and $f_s$ respectively represent the visual features and the semantic features, $W_g$ is a learnable linear mapping, $\delta$ is the Sigmoid function, and F is the fused feature; the final prediction result is then obtained through a fully connected layer and softmax.
Further, the preprocessing unit is specifically configured to:
respectively setting the width and the height of the text image as a width preset value and a height preset value, and carrying out normalization processing on the text image;
converting the text image into a gray graph form, so that each pixel point only has one component;
for the labels of text images, dividing the labels into character level and word level according to the different granularities in an English line, and simultaneously constructing a character dictionary containing all upper-case and lower-case letters, digits and punctuation marks, and a word dictionary containing all words in the data set;
and mapping the labels of the image according to the character dictionary and the word dictionary to obtain two kinds of labels and using the labels as the supervision information of the model.
Further, the label of the text image has a fixed length, and when the label does not reach the fixed length, it is padded with an 'End' symbol.
Further, the Squeeze operation includes: extracting a global feature representation of the text image, using global average pooling to obtain channel-level global features of the feature map; for an H × W × C feature map F, where H, W and C represent the height, width and number of channels of the feature map respectively, global average pooling of each H × W channel map yields a 1 × 1 × C feature; the formula used is as follows:

$$S_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} f_c(i, j)$$

wherein $f_c$ denotes the H × W feature of feature map F on the c-th channel, and $S_c$ denotes the 1 × 1 receptive field obtained on the c-th channel after global average pooling.
Further, the Excitation operation includes:
predicting the importance of each channel with fully connected layers to obtain the correlation between channels; the formula used is as follows:

$$E = \delta\big(W_2\, \sigma(W_1 S)\big)$$

wherein σ and δ respectively represent the ReLU and Sigmoid activation functions, $W_1$ and $W_2$ represent fully connected layers ($W_1$ reduces the dimension from C to C/r and $W_2$ restores it), and r is the dimensionality-reduction coefficient, a hyper-parameter; the final dimension of E is 1 × C, representing the weight values of the C channels, with different weight values indicating the importance of the corresponding channels;
finally, after the weight value of each channel is obtained, the final result is obtained by weighting the channels with the following formula:

$$\tilde{f}_c = E_c \cdot f_c$$

wherein $f_c$ denotes the feature of the c-th channel and $E_c$ denotes the weight of the c-th channel; multiplying the two gives the channel-weighted feature map $\tilde{F}$.
Compared with the prior art, the invention has the beneficial effects that:
1. By adding a channel attention module in the feature extraction process, the invention can learn the importance of each channel of the feature map and give higher weights to channels that play a key role in recognition;
2. By combining different decoding modes in the feature decoding process, the invention gives full play to the advantages of each decoding mode, so that the model is optimized in a more correct direction;
3. The invention makes full use of the information of different granularities in an English line, namely character level and word level, during feature decoding; word-level decoding assists the alignment of character-level decoding in a weakly supervised manner, so that character-level decoding pays more attention to detail information;
4. The invention not only uses the visual information of the text image but also explicitly models the latent semantic information in the English line, giving strong interpretability;
5. The invention uses a gate mechanism to fuse the visual features and semantic features of different modalities, thereby obtaining more accurate recognition results.
Therefore, compared with the prior art, the invention has prominent substantive features and remarkable progress, and the beneficial effects of the implementation are also obvious.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a system block diagram of an embodiment of the present invention.
FIG. 2 is a schematic diagram of a channel attention module in accordance with an embodiment of the present invention.
Fig. 3 is a schematic diagram of a bi-directional LSTM network in accordance with an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made with reference to the accompanying drawings.
A handwritten English line recognition system as shown in FIG. 1 comprises a vision module, a semantic module and a fusion module.
The visual module is used for extracting the spatial characteristics of a text image of a handwritten English line by using a ResNet network, decoding by using a model based on CTC and Attention, and outputting character-level decoding and word-level decoding as visual information.
Specifically, the vision module comprises a preprocessing unit, an image feature encoding unit and a decoding unit.
The preprocessing unit is used for preprocessing the text image and the label of the text image. Because English lines differ in length, the width and height of each image differ as well; in order to process text images in batches, the preprocessing unit first unifies the width and height of all text images: the width and height of each text image are set to a preset width value and a preset height value respectively, and the text image is normalized. Meanwhile, a handwritten text image is a three-channel color image in which the color of each pixel is determined by the three components R, G and B; each component has 256 possible values, so a single pixel can take more than 16 million different colors. Given the particularity of handwritten text images, the values of the three components of each pixel in the scanned image are equal, so the handwritten text image is converted into grayscale form, in which each pixel has only one component; this reduces the subsequent amount of image computation without affecting the overall effect.
The preprocessing unit divides the labels of the text image into character level and word level according to the different granularities in an English line, and simultaneously constructs a character dictionary containing all upper-case and lower-case letters, digits and punctuation marks, together with a word dictionary containing all words in the data set. The labels of the image are then mapped according to the character dictionary and the word dictionary, yielding two kinds of labels that are used as the supervision information of the model. As with the images, the labels are unified to a fixed length for batch processing, and labels that are too short are padded with a special 'End' symbol.
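As a non-limiting illustration of the preprocessing described above, the following Python sketch resizes and normalizes a line image, converts it to grayscale, and maps a label to fixed-length character indices padded with an 'End' symbol; the preset sizes, the dictionary contents and all function names are assumptions rather than part of this disclosure.

```python
# Illustrative preprocessing sketch (PIL + NumPy); sizes and names are assumptions.
import numpy as np
from PIL import Image

TARGET_W, TARGET_H = 1024, 64          # assumed preset width / height values
CHARS = list("abcdefghijklmnopqrstuvwxyz"
             "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,;:'\"!?-()")  # assumed character dictionary
char_dict = {c: i for i, c in enumerate(CHARS)}
END = len(char_dict)                    # special 'End' padding symbol

def preprocess_image(path):
    """Resize to the preset width/height, convert to grayscale, normalize to [0, 1]."""
    img = Image.open(path).convert("L")            # one component per pixel
    img = img.resize((TARGET_W, TARGET_H))
    return np.asarray(img, dtype=np.float32) / 255.0

def encode_label(text, max_len=128):
    """Map a line label to character-level indices, padded to a fixed length with 'End'."""
    ids = [char_dict[c] for c in text if c in char_dict]
    return np.asarray(ids[:max_len] + [END] * max(0, max_len - len(ids)), dtype=np.int64)

word_dict = {}                                      # built by scanning all words in the data set
def encode_word_label(text):
    """Map a line label to word-level indices."""
    return [word_dict.setdefault(w, len(word_dict)) for w in text.split()]
```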
The image feature encoding unit is used for extracting features with a ResNet network whose shortcut connections allow shallow layers to be updated effectively, adding a channel attention module to the ResNet network that performs Squeeze and Excitation operations to extract a global feature representation of the image, and finally extracting the time sequence features in the text image with a two-layer bidirectional LSTM network.
Generally, a convolutional neural network captures detailed image information in its shallow layers and high-level abstract features in its deep layers. To fully extract the various kinds of information in the image, a deeper network needs to be designed. However, an ordinary convolutional neural network suffers from vanishing gradients as the number of layers increases: the gradient shrinks by several orders of magnitude during back-propagation, so the gradient reaching the shallow layers is tiny and the shallow weights cannot be trained well. To solve this problem, the image feature encoding unit uses ResNet as its backbone; the shortcut connections create a shortcut for gradient propagation between layers, so the gradient can be passed directly to the shallow layers and the shallow network can be updated well.
In order to improve the accuracy of information extraction, the image feature encoding unit adds a channel attention module (SE) to the ResNet network, and the structure of the SE is shown in fig. 2. The ordinary convolution operation carries out summation operation on convolution results of all channels, so that the characteristic relation of the channel level and the spatial relation learned by the convolution kernel are mixed together, and the information of the channel level cannot be utilized independently. To this end, the relationship between the channels and the different channel weights may be learned using the SE module. Specifically, the SE module is divided into the operations of Squeeze and Excitation:
Squeeze operation: this operation extracts a global feature representation of the image by applying Global Average Pooling, so that the feature map obtains channel-level global features. For an H × W × C feature map F, where H, W and C denote the height, width and number of channels respectively, global average pooling of each H × W channel map yields a 1 × 1 × C feature, and the receptive field becomes wider, i.e. a 1 × 1 global receptive field is obtained on every channel. The formula is as follows:

$$S_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} f_c(i, j)$$

wherein $f_c$ denotes the H × W feature of feature map F on the c-th channel, and $S_c$ denotes the 1 × 1 receptive field obtained on the c-th channel after global average pooling.
Excitation operation: after the global feature representation is obtained by the Squeeze operation, the Excitation operation is used to capture the relationship between channels. Fully connected layers predict the importance of each channel and obtain the correlation between channels. The formula is as follows:

$$E = \delta\big(W_2\, \sigma(W_1 S)\big)$$

wherein σ and δ respectively represent the ReLU and Sigmoid activation functions, $W_1$ and $W_2$ represent fully connected layers ($W_1$ reduces the dimension from C to C/r and $W_2$ restores it), and r is the dimensionality-reduction coefficient, a hyper-parameter; the final dimension of E is 1 × C, representing the weight values of the C channels, with different weight values indicating how important the corresponding channels are.
Finally, after the weight value of each channel is obtained, the channels are weighted to obtain the final result. The formula is as follows:

$$\tilde{f}_c = E_c \cdot f_c$$

wherein $f_c$ denotes the feature of the c-th channel and $E_c$ denotes the weight of the c-th channel; multiplying the two gives the channel-weighted feature map $\tilde{F}$. In this way the model pays more attention to channel features that carry much information and play a key role in recognition, while suppressing unimportant channel features. Embedding the SE module in the ResNet network improves the model's sensitivity to channel features, and the module is very lightweight, bringing a performance gain at very little computational cost.
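As an illustrative sketch of the channel attention module embedded in a ResNet block (a possible PyTorch implementation; the layer sizes, class names and the basic-block structure are assumptions), the Squeeze, Excitation and channel-weighting steps may be written as:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel attention: Squeeze (global average pooling) + Excitation (two FC layers)."""
    def __init__(self, channels, r=16):               # r: dimensionality-reduction coefficient
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # Squeeze: (H, W, C) -> (1, 1, C)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),                     # sigma
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),                              # delta
        )

    def forward(self, f):                              # f: (N, C, H, W)
        s = self.pool(f).flatten(1)                    # S: (N, C)
        e = self.fc(s).view(f.size(0), -1, 1, 1)       # E: per-channel weights in (0, 1)
        return f * e                                   # weight each channel of the feature map

class SEResidualBlock(nn.Module):
    """ResNet basic block with a shortcut connection and an embedded SE module."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.se = SEBlock(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.se(self.bn2(self.conv2(out)))
        return self.relu(out + x)                      # shortcut lets gradients reach shallow layers
```

Placing the SE module after the second convolution and before the shortcut addition mirrors the description above; the coefficient r is the hyper-parameter mentioned in the Excitation operation.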
English contains important time sequence information, such as the order of characters and words, which is very helpful for the subsequent recognition process, but a convolutional neural network has limited ability to express such information; a recurrent neural network is therefore added after the convolutional neural network to extract the temporal information of the text. Meanwhile, similar to an English cloze test, filling a blank requires considering not only the content before it but also the content after it, so the image feature encoding unit uses a bidirectional Long Short-Term Memory network (LSTM).
The structure of the bidirectional LSTM is shown in FIG. 3: $\{f_0, f_1, \ldots, f_n\}$ are the feature vectors extracted by the convolutional neural network, $\{h_0, h_1, \ldots, h_n\}$ are the hidden states at each time step of the forward (front-to-back) pass of the bidirectional LSTM, $\{h'_0, h'_1, \ldots, h'_n\}$ are the hidden states at each time step of the backward (back-to-front) pass, and $\{m_0, m_1, \ldots, m_n\}$ are the feature vectors carrying both forward and backward time sequence information after bidirectional LSTM encoding. To obtain more refined features, the image feature encoding unit uses a two-layer bidirectional LSTM structure. The specific formula is as follows:

$$m_i = [\,h_i \,;\, h'_i\,]$$

That is, the final feature vector $m_i$ combines the hidden states $h_i$ and $h'_i$ of the two directions, so the model can use more complete information when making predictions and obtain more accurate results.
A decoding unit for decoding the time-series characteristics using the CTC-based and Attention-based models to obtain corresponding characters and words.
The decoding unit is specifically configured to decode the obtained time sequence features. Two decoding approaches are generally used: one is based on Connectionist Temporal Classification (CTC), and the other is based on the Attention mechanism.
The CTC-based decoding method searches for all possible alignments. Its output dimension is the number of character classes plus one extra 'blank' character, and a mapping function B converts a path π into the final sequence l by removing repeated characters and removing 'blank'. For example, the paths π₁ = '--stta-t--e' and π₂ = 'sst-aa-t-e' (where '-' stands for 'blank') are both mapped to the sequence l = 'state'. CTC therefore needs to find all paths π that yield the sequence l under the transformation B. The specific formulas are as follows:

$$p(\pi \mid x) = \prod_{t=1}^{T} y_{\pi_t}^{t}$$

wherein $y_{\pi_t}^{t}$ denotes the probability of emitting character $\pi_t$ at time step t; multiplying these probabilities over all time steps gives the probability $p(\pi \mid x)$ of the whole path π.

$$p(l \mid x) = \sum_{\pi \in B^{-1}(l)} p(\pi \mid x)$$

wherein $B^{-1}(l)$ denotes the set of paths π that are mapped to the sequence l by the transformation B; summing over all such paths gives the total probability $p(l \mid x)$ of the sequence l. The model is then optimized by minimizing the negative log-likelihood of the conditional probability of the ground-truth label sequence l.
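As an illustration, the CTC objective described above corresponds to the standard CTC loss available in PyTorch; the tensor shapes below are assumptions chosen only for the example:

```python
import torch
import torch.nn as nn

# Assumed sizes: T = 100 time steps, N = 4 batch items, 79 character classes + 1 'blank' (index 0).
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

log_probs = torch.randn(100, 4, 80).log_softmax(2)     # (T, N, C+1) per-step log probabilities
targets = torch.randint(1, 80, (4, 30))                 # ground-truth label sequences l
input_lengths = torch.full((4,), 100, dtype=torch.long)
target_lengths = torch.full((4,), 30, dtype=torch.long)

# -log p(l | x), where p(l | x) sums over all paths mapped to l by the transformation B
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```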
The Attention-based decoding method is usually embedded in an LSTM. An ordinary LSTM decoder uses the same fixed-length time sequence features at every decoding step and therefore cannot mine deeper information. With Attention added, each decoding step can selectively focus on the relevant part of the time sequence features, mining information that helps improve recognition accuracy and allowing the model to converge faster. The specific formulas are as follows:

$$e_{t,j} = \mathrm{score}(s_{t-1}, h_j), \qquad \alpha_{t,j} = \frac{\exp(e_{t,j})}{\sum_{k}\exp(e_{t,k})}, \qquad c_t = \sum_{j} \alpha_{t,j}\, h_j$$

wherein $s_{t-1}$ and $h_j$ denote the feature vectors during decoding and encoding respectively, $e_{t,j}$ is their alignment score, and $\alpha_{t,j}$ is the attention weight indicating how much attention should be paid to the j-th encoded feature at the t-th decoding step; the attention weights $\alpha_{t,j}$ are then multiplied by the encoded feature vectors $h_j$ to obtain the context vector $c_t$, from which the final prediction $y_t$ is produced.
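A simplified sketch of one Attention-based decoding step (the score function is realized here as a learned linear layer over the concatenated states; all class names and dimensions are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoder(nn.Module):
    """One Attention-based decoder; at each step it attends over the encoded features h_j."""
    def __init__(self, enc_dim, hidden, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.cell = nn.LSTMCell(enc_dim + hidden, hidden)
        self.score = nn.Linear(enc_dim + hidden, 1)        # e_{t,j} = score(s_{t-1}, h_j)
        self.out = nn.Linear(hidden, vocab_size)

    def step(self, prev_token, state, enc):                # enc: (N, T, enc_dim)
        # state: tuple (s_{t-1}, cell_{t-1}); e.g. two zero tensors of shape (N, hidden) at t = 0
        s_prev = state[0]
        e = self.score(torch.cat(
            [enc, s_prev.unsqueeze(1).expand(-1, enc.size(1), -1)], dim=-1)).squeeze(-1)
        alpha = F.softmax(e, dim=-1)                       # attention weights alpha_{t,j}
        c = torch.bmm(alpha.unsqueeze(1), enc).squeeze(1)  # context vector c_t
        state = self.cell(torch.cat([self.embed(prev_token), c], dim=-1), state)
        return self.out(state[0]), state                   # logits for y_t and new state
```

Other alignment scores (e.g. additive attention) fit the same interface; the single linear scoring layer is only one possible choice.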
At present, a great deal of English recognition work uses Attention-based methods and achieves good results because they can capture some extra context information; however, the flexibility of Attention sometimes causes the attention drift problem, in which the decoder fails to align correctly. Meanwhile, the decoding process of the CTC-based method is simple; it imposes a left-to-right constraint, but only a single character is considered at each decoding step and context information cannot be modeled. The decoding unit therefore combines the CTC-based and Attention-based decoding modes, so the advantages of each can be fully exploited and more accurate recognition results obtained. The decoding unit also adds word-level decoding, i.e. two Attention-based decoders are used, one decoding characters and the other decoding words. Collecting and labeling handwritten English data from real scenes is itself a rather time-consuming task, and this design makes maximal use of the information of different granularities in the English lines. Word-level decoding assists the alignment of character-level decoding in a weakly supervised manner: for example, if the word-level prediction is 'I have an apre' while the ground truth is 'I have an apple', the model judges the whole word 'apre' to be wrong rather than only the character 'l', which gives the character-level decoder a more explicit error signal, so that in subsequent decoding it pays more attention to detail information during recognition. Adding the word-level decoder branch therefore plays a good auxiliary role, making the training of the character-level decoder branch more effective and its predictions more accurate.
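Continuing the previous sketch, character-level and word-level decoding can be realized as two such decoders sharing the same encoded features (the vocabulary sizes below are assumptions):

```python
# Two Attention-based decoders over the same encoded features: one over the
# character dictionary, one over the word dictionary (sizes are assumptions).
char_decoder = AttentionDecoder(enc_dim=512, hidden=256, vocab_size=90)
word_decoder = AttentionDecoder(enc_dim=512, hidden=256, vocab_size=10000)
# Each branch is trained with its own cross-entropy loss (L_c and L_w in the objective below);
# word-level errors act as a weakly supervised signal that helps align character-level decoding.
```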
And the semantic module is used for using the output probability of the Attention-based character-level decoding output in the visual module as input and explicitly learning potential semantic information in the English line in a mode of correcting a prediction sequence by using a gradient truncation strategy.
Long texts such as English lines contain rich semantic information, such as the grammar of sentences and the positional relationships between characters and words. Similarly, when people read English and meet an unfamiliar word, they judge its meaning from the idea of the whole sentence, so semantic information plays an essential role in predicting the sequence correctly. Although the Attention-based decoding method can capture some context-related information in a simple way, that information is coupled inside the model: whether it is learned, and to what degree, is unknown, so the interpretability is weak. The invention therefore provides a semantic module that performs explicit semantic modeling. The module consists of a bidirectional-LSTM-based encoder and an LSTM-based decoder; it takes the output probability vectors of the Attention-based character-level decoding in the vision module as input and uses a gradient-flow truncation strategy, forcing the module to model the latent semantic relationships in the English line while correcting the predicted text. Because the semantic module is modeled explicitly, what it learns is knowable and highly interpretable.
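A minimal sketch of such a semantic module, where detaching the input probability vectors realizes the gradient-truncation strategy (layer sizes, names and the simplified decoder are assumptions):

```python
import torch
import torch.nn as nn

class SemanticModule(nn.Module):
    """BiLSTM encoder + LSTM decoder over the probability vectors of the Attention-based
    character decoder; the input is detached so gradients do not flow back into the
    visual module (gradient-truncation strategy)."""
    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(vocab_size, hidden, bidirectional=True, batch_first=True)
        self.decoder = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, char_probs):          # char_probs: (N, L, vocab_size) output probabilities
        x = char_probs.detach()             # truncate the gradient flow here
        enc, _ = self.encoder(x)
        dec, _ = self.decoder(enc)
        return self.out(dec)                # corrected prediction sequence / semantic features
```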
The fusion module is used for combining the visual information extracted by the vision module and the semantic information extracted by the semantic module by using a gate mechanism, and predicting with a preset formula to generate the prediction result.
Because both the visual features and the semantic features extracted by the model are very important for making accurate predictions, the fusion module fuses the two kinds of information so that both can be used at prediction time. To align the two kinds of information, a gate mechanism (Gated Mechanism) is used, through which the model can automatically learn the alignment between the visual and semantic features. The specific formula is as follows:
$$F = G \odot f_v + (1 - G) \odot f_s, \qquad G = \delta\big(W_g\,[\,f_v \,;\, f_s\,]\big)$$

wherein $f_v$ and $f_s$ respectively represent the visual features and the semantic features, $W_g$ is a learnable linear mapping, $\delta$ is the Sigmoid function, and F is the fused feature; the final prediction result is then obtained through a fully connected layer and softmax.
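A possible realization of the gate mechanism and the final prediction head, under the assumption that the gate is a Sigmoid over a linear mapping of the concatenated features (the exact gate form, names and sizes are assumptions):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Gate mechanism combining visual features f_v and semantic features f_s."""
    def __init__(self, dim, vocab_size):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.classifier = nn.Linear(dim, vocab_size)

    def forward(self, f_v, f_s):
        g = torch.sigmoid(self.gate(torch.cat([f_v, f_s], dim=-1)))
        fused = g * f_v + (1.0 - g) * f_s                 # F = G * f_v + (1 - G) * f_s
        return self.classifier(fused).softmax(dim=-1)     # fully connected layer + softmax
```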
Additionally, as an example, the present system may also be configured with a model optimization module that implements model optimization. It is specifically used for performing end-to-end training with the following objective function:

$$L = \lambda_t L_t + \lambda_c L_c + \lambda_w L_w + \lambda_s L_s + \lambda_{vs} L_{vs}$$

The objective function consists of five parts, where $L_t$ denotes the loss obtained with the CTC-based decoder, and $L_c$, $L_w$, $L_s$ and $L_{vs}$ denote the decoding losses of the Attention-based character-level decoding, the word-level decoding, the semantic decoding and the feature fusion respectively; $\lambda_t$, $\lambda_c$, $\lambda_w$, $\lambda_s$ and $\lambda_{vs}$ are the corresponding weights.
It can be seen that the present system consists of three main parts: the vision module, the semantic module and the fusion module. The vision module first extracts the spatial features of the text image with ResNet, with a channel attention SE module added inside the ResNet blocks; during decoding it combines the advantages of the CTC and Attention decoding modes, makes full use of the information of different granularities in the English line in the Attention-based decoding, and outputs character-level and word-level decoding. The semantic module takes the output probabilities of the Attention-based character-level decoding in the vision module as input; by correcting the prediction sequence under a gradient-truncation strategy, it learns the latent semantic information in the English line in an explicit way and is therefore highly interpretable. The fusion module combines the visual information extracted by the vision module and the semantic information extracted by the semantic module with a gate mechanism, so information of different modalities can be used at prediction time, which helps obtain more accurate prediction results.
In the embodiments provided by the present invention, it should be understood that the disclosed systems and methods can be implemented in other ways. For example, the above-described system embodiments are merely illustrative; the division of the units is only one logical functional division, and other divisions may be used in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, systems or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each module may exist alone physically, or two or more modules are integrated into one unit.
Similarly, each processing unit in the embodiments of the present invention may be integrated into one functional module, or each processing unit may exist alone physically, or two or more processing units may be integrated into one functional module.
The invention is further described with reference to the accompanying drawings and specific embodiments. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and these equivalents also fall within the scope of the present application.

Claims (8)

1. A handwritten English line recognition system, comprising: a vision module, a semantic module and a fusion module;
the visual module is used for extracting the spatial characteristics of a text image of a handwritten English line by using a ResNet network, decoding by using a model based on CTC and Attention, and outputting character-level decoding and word-level decoding as visual information;
the semantic module is used for using the output probability of the Attention-based character-level decoding output in the visual module as input and explicitly learning potential semantic information in an English line in a mode of correcting a prediction sequence by using a gradient truncation strategy;
and the fusion module is used for combining the visual information extracted by the visual module and the semantic information extracted by the semantic module by using a gate mechanism, and predicting with a preset formula to generate a prediction result.
2. The handwritten English line recognition system according to claim 1, wherein said vision module comprises: a preprocessing unit, an image feature encoding unit and a decoding unit;
the preprocessing unit is used for preprocessing the text image and the label of the text image;
the image feature encoding unit is used for extracting features with a ResNet network whose shortcut connections allow shallow layers to be updated effectively, adding a channel attention module to the ResNet network that performs Squeeze and Excitation operations to extract a global feature representation of the image, and finally extracting the time sequence features in the text image with a two-layer bidirectional LSTM network;
the decoding unit is used for decoding the time sequence characteristics by using the CTC-based model and the Attention-based model so as to obtain corresponding characters and words.
3. The handwritten English line recognition system according to claim 1, wherein the semantic module comprises an encoder based on a bidirectional LSTM network and a decoder based on an LSTM network; and the semantic module uses the output probability vectors of the Attention-based character-level decoding as input and, by truncating the gradient flow, models the latent semantic relationships in handwritten English lines while correcting the predicted text.
4. The handwritten English line recognition system according to claim 1, wherein the fusion module is specifically configured to:
automatically learning the alignment between the visual information and the semantic information by using a gate mechanism;
the adopted preset formula is specifically as follows:
$$F = G \odot f_v + (1 - G) \odot f_s, \qquad G = \delta\big(W_g\,[\,f_v \,;\, f_s\,]\big)$$

wherein $f_v$ and $f_s$ respectively represent the visual features and the semantic features, $W_g$ is a learnable linear mapping, $\delta$ is the Sigmoid function, and F is the fused feature; the final prediction result is then obtained through a fully connected layer and softmax.
5. The handwritten english line recognition system of claim 2, wherein the preprocessing unit is specifically configured to:
respectively setting the width and the height of the text image as a width preset value and a height preset value, and carrying out normalization processing on the text image;
converting the text image into a gray graph form, so that each pixel point only has one component;
for the labels of text images, dividing the labels into character level and word level according to the different granularities in an English line, and simultaneously constructing a character dictionary containing all upper-case and lower-case letters, digits and punctuation marks, and a word dictionary containing all words in the data set;
and mapping the labels of the image according to the character dictionary and the word dictionary to obtain two kinds of labels and using the labels as the supervision information of the model.
6. The handwritten English line recognition system according to claim 5, wherein the label of said text image has a fixed length, and when the label does not reach the fixed length, it is padded with an 'End' symbol.
7. The handwritten English line recognition system according to claim 2, wherein said Squeeze operation comprises: extracting a global feature representation of the text image, using global average pooling to obtain channel-level global features of the feature map; for an H × W × C feature map F, where H, W and C represent the height, width and number of channels of the feature map respectively, global average pooling of each H × W channel map yields a 1 × 1 × C feature; the formula used is as follows:

$$S_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} f_c(i, j)$$

wherein $f_c$ denotes the H × W feature of feature map F on the c-th channel, and $S_c$ denotes the 1 × 1 receptive field obtained on the c-th channel after global average pooling.
8. The handwritten English line recognition system according to claim 7, wherein said Excitation operation comprises:
predicting the importance of each channel with fully connected layers to obtain the correlation between channels; the formula used is as follows:

$$E = \delta\big(W_2\, \sigma(W_1 S)\big)$$

wherein σ and δ respectively represent the ReLU and Sigmoid activation functions, $W_1$ and $W_2$ represent fully connected layers ($W_1$ reduces the dimension from C to C/r and $W_2$ restores it), and r is the dimensionality-reduction coefficient, a hyper-parameter; the final dimension of E is 1 × C, representing the weight values of the C channels, with different weight values indicating the importance of the corresponding channels;
finally, after the weight value of each channel is obtained, the final result is obtained by weighting the channels with the following formula:

$$\tilde{f}_c = E_c \cdot f_c$$

wherein $f_c$ denotes the feature of the c-th channel and $E_c$ denotes the weight of the c-th channel; multiplying the two gives the channel-weighted feature map $\tilde{F}$.
CN202210217783.3A 2022-03-08 2022-03-08 Handwritten English line recognition system Pending CN114299510A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210217783.3A CN114299510A (en) 2022-03-08 2022-03-08 Handwritten English line recognition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210217783.3A CN114299510A (en) 2022-03-08 2022-03-08 Handwritten English line recognition system

Publications (1)

Publication Number Publication Date
CN114299510A true CN114299510A (en) 2022-04-08

Family

ID=80978519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210217783.3A Pending CN114299510A (en) 2022-03-08 2022-03-08 Handwritten English line recognition system

Country Status (1)

Country Link
CN (1) CN114299510A (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107608943A (en) * 2017-09-08 2018-01-19 中国石油大学(华东) Image caption generation method and system fusing visual attention and semantic attention
CN110647612A (en) * 2019-09-18 2020-01-03 合肥工业大学 Visual conversation generation method based on double-visual attention network
CN111368088A (en) * 2020-03-31 2020-07-03 成都信息工程大学 Text emotion classification method based on deep learning
CN111967272A (en) * 2020-06-23 2020-11-20 合肥工业大学 Visual dialog generation system based on semantic alignment
CN112257426A (en) * 2020-10-14 2021-01-22 北京一览群智数据科技有限责任公司 Character recognition method, system, training method, storage medium and equipment
CN112633079A (en) * 2020-12-02 2021-04-09 山东山大鸥玛软件股份有限公司 Handwritten English word recognition method and system
CN113609285A (en) * 2021-08-09 2021-11-05 福州大学 Multi-mode text summarization system based on door control fusion mechanism
CN114092930A (en) * 2022-01-07 2022-02-25 中科视语(北京)科技有限公司 Character recognition method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MOHAMMAD MERAJ KHAN ET AL: "A squeeze and excitation ResNeXt-based deep learning model for Bangla handwritten compound character recognition", 《JOURNAL OF KING SAUD UNIVERSITY》 *
唐伟成: "Handwritten English Character Recognition System", Wanfang Dissertation Full-text Database *
汪洪涛 et al.: "Research on English Text Recognition in Natural Scenes Based on STN-CRNN", Journal of Wuhan University of Technology (Information & Management Engineering Edition) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114581906A (en) * 2022-05-06 2022-06-03 山东大学 Text recognition method and system for natural scene image
CN114581906B (en) * 2022-05-06 2022-08-05 山东大学 Text recognition method and system for natural scene image


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20220408)