CN115797952B - Deep learning-based handwriting English line recognition method and system - Google Patents

Info

Publication number: CN115797952B (application number CN202310084850.3A)
Authority: CN (China)
Prior art keywords: module, input end, english, handwriting, depth
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN115797952A (en)
Inventors: 许信顺, 初宛晴, 马磊, 陈义学, 李溢欢
Current Assignee: SHANDONG SHANDA OUMA SOFTWARE CO Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Application filed by SHANDONG SHANDA OUMA SOFTWARE CO Ltd; priority to CN202310084850.3A; publication of CN115797952A; application granted; publication of CN115797952B
Current legal status: Active

Landscapes

  • Image Analysis (AREA)
  • Character Discrimination (AREA)
Abstract

The invention relates to the technical field of image processing, in particular to a handwriting English line recognition method and system based on deep learning. The method comprises: acquiring a handwritten English image to be recognized; preprocessing the handwritten English image to be recognized; and processing the preprocessed image with a trained handwriting English line recognition model to obtain a handwriting English line recognition result. The trained handwriting English line recognition model extracts features of the preprocessed image to obtain preliminary visual features, extracts depth features from the preliminary visual features to obtain depth visual features, and decodes the depth visual features to obtain the recognition result of the handwritten English line. The invention produces accurate recognition results for handwritten English lines.

Description

Deep learning-based handwriting English line recognition method and system
Technical Field
The invention relates to the technical field of image processing, in particular to a handwriting English line recognition method and system based on deep learning.
Background
The statements in this section merely relate to the background of the present disclosure and may not necessarily constitute prior art.
Text, as a form of conveying information between humans and as the visual coding of language, is self-evidently necessary and ubiquitous in daily human life. The growing diversity of visual text forms, coupled with the tremendous differences in human handwriting styles, gives handwritten text its complex nature.
Given this situation, processing text intelligently is increasingly important. Automatically transcribing and storing text efficiently saves time, manpower, and material resources, and facilitates subsequent processing and application.
Text recognition is an important computer vision task developed to address the above need. Current text recognition technology mostly adopts a segmentation-free, single-stage recognition approach: the text picture is treated as a whole, and a sequence transcription is sought that performs fine-grained alignment between the source sequence in the original picture and the output target sequence.
One main trend uses the "encoder-decoder" architecture with attention mechanisms: the encoder first maps the text picture as a whole into a representation vector, and the decoder then transcribes it into a sequence of consecutive characters based on this representation. A weight matrix is obtained through neural network learning, whose values represent the importance of the corresponding context information to the prediction at the current time step, thereby realizing selective alignment between the encoded representation sequence and the decoded sequence. However, this architecture has several problems:
(1) Missing or redundant characters in the text picture may cause misalignment to accumulate between the tag sequence and the attention predictions, misleading the training process, making it hard for training to learn from the beginning, and yielding poor results in long-text-line scenarios;
(2) The attention mechanism relies on a complex attention module, which introduces additional network parameters and runtime as well as a large memory requirement.
Disclosure of Invention
In order to solve the defects of the prior art, the invention provides a handwriting English line recognition method and a handwriting English line recognition system based on deep learning.
In a first aspect, the present invention provides a handwriting English line recognition method based on deep learning, which enables rapid and accurate recognition of the handwritten English line; the method includes:
acquiring a handwritten English image to be identified;
preprocessing a handwritten English image to be recognized;
processing the preprocessed image by adopting a trained handwriting English line recognition model to obtain a handwriting English line recognition result;
wherein, the trained handwriting English line recognition model comprises: extracting features of the preprocessed image to obtain preliminary visual features; extracting depth features from the preliminary visual features to obtain depth visual features; and decoding the depth visual characteristics to obtain the recognition result of the handwritten English line.
In a second aspect, the invention provides a handwriting English line recognition system based on deep learning; the invention can realize the rapid and accurate recognition of the handwritten English line, and the system comprises:
an acquisition module configured to: acquiring a handwritten English image to be identified;
a preprocessing module configured to: preprocessing a handwritten English image to be recognized;
an identification module configured to: processing the preprocessed image by adopting a trained handwriting English line recognition model to obtain a handwriting English line recognition result;
wherein, the trained handwriting English line recognition model comprises: extracting features of the preprocessed image to obtain preliminary visual features; extracting depth features from the preliminary visual features to obtain depth visual features; and decoding the depth visual characteristics to obtain the recognition result of the handwritten English line.
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention provides a simple, fully convolutional neural network architecture, which uses only efficient separable convolutions in place of traditional regular convolutions and only feedforward connections with no recurrent connections, achieving high data and computational efficiency. It can be trained on text images of variable size using line-level transcription labels of variable length, without preprocessing such as character segmentation or horizontal normalization;
(2) The invention provides an effective iterative estimation method for the target sequence and a novel gate unit, which fully extracts features through unit stacking while exploiting the advantages of the gate mechanism for feature processing and fusion, helping the model obtain more accurate recognition results;
(3) The model provided by the invention adopts multiple feature transformation modes and can extract strongly informative feature representations from the training data;
(4) The model provided by the invention regularizes the network with several normalization modes and Dropout, and accelerates model convergence with several regularization methods, thereby effectively alleviating overfitting;
(5) The model provided by the invention uses no fully connected layer; its main computation block is the separable convolution. The model obtains comparable performance by using separable convolutions instead of traditional regular convolutions, greatly reducing the parameter and computation amount, accelerating training and convergence of the model, and saving storage space;
(6) The invention provides a novel statistical loss function that gives the model more supervision information and assists CTC in jointly optimizing the objective function, which is conducive to correct recognition.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a flow chart of a method according to a first embodiment;
fig. 2 is a diagram showing an internal network structure of a trained handwriting english line recognition model according to the first embodiment;
FIG. 3 is an internal structure diagram of an encoding module according to the first embodiment;
FIG. 4 is a block diagram of the interior of the stacked gate module according to the first embodiment;
FIG. 5 is a diagram showing an internal structure of a decoding module according to the first embodiment;
FIG. 6 is an internal block diagram of a first depth separable convolutional network of the first embodiment;
fig. 7 is a diagram showing the internal structure of the first gate unit according to the first embodiment.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the present invention. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well. Furthermore, the terms "comprises" and "comprising" and any variations thereof are intended to cover non-exclusive inclusion: processes, methods, systems, products, or devices that comprise a series of steps or units are not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to such processes, methods, products, or devices.
Embodiments of the invention and features of the embodiments may be combined with each other without conflict.
All data acquisition in the embodiment is legal application of the data on the basis of meeting laws and regulations and agreements of users.
This embodiment builds on the CTC (Connectionist Temporal Classification) architecture, which computes the probability distribution over all possible output sequences by considering every possible alignment path for each input frame during prediction, and then greedily selects the output sequence with the highest probability. CTC alignment is therefore accurate, model training converges quickly, and the method suits one-dimensional sequence prediction tasks without additional processing. Considering that the application scenario of this embodiment targets long text lines, and taking the above together, CTC is adopted as the decoder of the model in this embodiment.
Currently, many deep learning techniques have been proposed for the text recognition task, but methods based on Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) have long been dominant and have become a general processing architecture. Nevertheless, this structure has a drawback: the sequential processing nature of RNNs introduces latency into model prediction, so RNNs are not always a good choice. In view of this, this embodiment uses only a fully convolutional CNN architecture as the model's feature extractor.
Following the two preceding paragraphs, this embodiment uses a CNN+CTC architecture; but unlike existing work, it uses a novel CNN architecture that can process sequences of arbitrary length without preprocessing such as character segmentation or horizontal normalization, and achieves advanced performance on handwritten English line pictures.
Example 1
The embodiment provides a handwriting English line recognition method based on deep learning;
as shown in fig. 1, the handwriting english line recognition method based on deep learning includes:
s101: acquiring a handwritten English image to be identified;
s102: preprocessing a handwritten English image to be recognized;
s103: processing the preprocessed image by adopting a trained handwriting English line recognition model to obtain a handwriting English line recognition result;
wherein, the trained handwriting English line recognition model comprises:
extracting features of the preprocessed image to obtain preliminary visual features;
extracting depth features from the preliminary visual features to obtain depth visual features;
and decoding the depth visual characteristics to obtain the recognition result of the handwritten English line.
Further, the step S101: acquiring the handwritten English image to be recognized — the handwritten English text is photographed with a camera, and the handwritten English image to be recognized is acquired by photographing.
Further, the step S102: preprocessing a handwritten English image to be recognized, specifically comprising the following steps:
s102-1: performing size normalization processing on the handwritten English image to be identified;
s102-2: and carrying out graying treatment on the handwritten English image subjected to the size normalization treatment.
It should be understood that, since English lines differ in length, the length and width of the images are inconsistent; to enable batch processing of text images, this embodiment first normalizes the length and width of all text images uniformly. Meanwhile, the handwritten text image is a three-channel color image in which the color of each pixel is determined by the three components R, G, B; given the specificity of scanned handwritten text images, the values of each pixel are equal across the three components, so this embodiment converts the image into gray-scale form. Each pixel then has only one component, which reduces the subsequent image computation without affecting the overall effect.
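For illustration, a minimal preprocessing sketch in Python is given below. The target size of 64×1024 pixels and the use of OpenCV are assumptions made for the example; the embodiment does not fix concrete values.

```python
import cv2
import numpy as np

def preprocess(image_path: str, target_h: int = 64, target_w: int = 1024) -> np.ndarray:
    """S102-1/S102-2: size-normalize a text-line image and convert it to gray scale."""
    img = cv2.imread(image_path)                  # three-channel color image (BGR)
    img = cv2.resize(img, (target_w, target_h))   # unify length and width for batching
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # one component per pixel
    return gray.astype(np.float32) / 255.0        # scale to [0, 1] for the network
```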
Further, as shown in fig. 2, the trained handwriting english line recognition model has a network structure including:
the coding module, the stacking gate module and the decoding module are connected in sequence;
as shown in fig. 3, the encoding module includes: a first depth separable convolutional network (Depthwise Separable Convolution), a first layer normalization module LN (Layer Normalization), and a connector concat, connected in sequence; the input end of the first depth separable convolution network is connected with the input end of the connector by a residual connection; the input end of the first depth separable convolution network serves as the input end of the encoding module; the output end of the connector concat serves as the output end of the encoding module. The encoding module is used for extracting features of the preprocessed image to obtain the preliminary visual features.
Further, as shown in fig. 4, the stacked gate module includes a plurality of gate units connected in sequence followed by a second layer normalization module: the input end of the first gate unit is connected with the output end of the connector concat, and the output end of the last gate unit is connected with the input end of the second layer normalization module; the input end of the first gate unit serves as the input end of the stacked gate module, and the output end of the second layer normalization module serves as the output end of the stacked gate module. The stacked gate module is used for performing depth feature extraction on the preliminary visual features to obtain the depth visual features.
Further, as shown in fig. 5, the decoding module includes:
a second depth separable convolutional network, an exponential linear unit (ELU, Exponential Linear Unit), a third layer normalization module, and a decoder, connected in sequence; the input end of the second depth separable convolution network serves as the input end of the decoding module and is connected with the output end of the second layer normalization module; the output end of the decoder serves as the output end of the decoding module. The decoding module is used for decoding the depth visual features to obtain the recognition result of the handwritten English line.
Further, the internal network structures of the first depth separable convolutional network and the second depth separable convolutional network are identical.
Further, as shown in fig. 6, the first depth separable convolution network includes:
a first channel-wise Convolution layer (Depth-wise Convolution) and a first Point-wise Convolution layer (Point-wise Convolution) are connected in sequence.
Further, all gate units have the same internal structure.
Further, as shown in fig. 7, the internal structure of the first gate unit includes:
the input end of the first gate unit is used for inputting the characteristic diagram output by the coding module; the input ends of the other gate units except the first gate unit are used for inputting the characteristic diagram output by the previous gate unit;
the input end of the first gate unit is connected with three parallel branches, and the three parallel branches are a first branch, a second branch and a third branch in sequence;
the first branch comprises: a second channel-wise convolution layer (Depth-wise Convolution), a second point-wise convolution layer (Point-wise Convolution), an exponential linear unit (ELU, Exponential Linear Unit), and a first multiplier, connected in sequence; the input end of the second channel-by-channel convolution layer is connected with the input end of the first gate unit;
the second branch includes: the input end of the sigmoid activation function layer is connected with the input end of the first gate unit;
the third branch includes: the input end of the second multiplier is connected with the input end of the first gate unit;
the input end of the first multiplier is connected with the output end of the sigmoid activation function layer; the output end of the first multiplier is connected with the input end of the first adder; the sigmoid activation function layer outputs a weight value a;
the weight value a is processed to obtain 1-a, and 1-a is input to the input end of the second multiplier;
the output end of the second multiplier is connected with the input end of the first adder;
the output end of the first adder is connected with the input end of the second adder;
the input end of the second adder is also connected with the input end of the first gate unit;
the output of the second adder serves as the output of the first gate unit.
Further, in the first gate unit:
The first branch performs a nonlinear conversion on the input to obtain a conversion feature. The second channel-by-channel convolution layer, the second point-by-point convolution layer, and the exponential linear unit can be regarded as a combination that jointly forms the conversion $T(\cdot)$, yielding the conversion feature $T(x)$, which is taken as the input of the first multiplier. The conversion feature $T(x)$ refers to the feature obtained after the input feature $x$ passes through the nonlinear conversion $T(\cdot)$ of the first branch; the "original feature" refers to the input feature $x$ itself.
The third branch preserves the input $x$, i.e. the original feature, and takes it as the input of the second multiplier.
The second branch defines two gating signals that sum to one, used to model the relation between the input $x$ and its conversion $T(x)$ and to explore their degree of matching. A weight $a$ is learned through the activation function layer; the conversion gate takes $a$ as the weight value of the conversion feature obtained by the first branch, representing the importance of the conversion feature to the output feature and controlling the degree to which the converted feature information is carried into the output feature information. The difference between 1 and $a$ then gives the weight $1-a$; the retention gate takes $1-a$ as the weight value of the original feature obtained by the third branch, representing the importance of the original feature to the output feature and controlling the degree to which the input feature information is carried into the output feature information. The larger a weight value, the larger the influence of the corresponding feature on the output of the gate unit.
Each gate unit produces an output feature $y$ and passes it to the next gate unit as that unit's input.
Further, the first multiplier and the second multiplier both multiply the elements at corresponding positions of two matrices: the first multiplier multiplies the conversion feature of the first branch by the conversion gate of the second branch to realize weighting; the second multiplier multiplies the original feature of the third branch by the retention gate of the second branch to realize weighting.
Further, the first adder and the second adder both add the elements at corresponding positions of two matrices: the first adder sums the weighted conversion feature of the first branch and the weighted original feature of the third branch to obtain the fusion feature; the second adder can be seen as a kind of residual connection, likewise adding the original input to the output result.
It should be appreciated that the encoding module is used to extract the initial visual features of the picture. The encoding module is based on a fully convolutional network and consists of depth-wise convolution, point-wise convolution, and inter-layer residual connections. The training process is regularized using batch normalization, layer normalization, and Dropout, with ELU as the activation function.
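A sketch of this encoding module in PyTorch follows. The channel counts, kernel size, Dropout rate, the exact placement of the ELU and Dropout, and the use of GroupNorm(1, C) as a layer normalization that tolerates variable-size inputs are illustrative assumptions, not values fixed by the embodiment.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Channel-wise (depth-wise) convolution followed by a 1x1 point-wise convolution."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

class Encoder(nn.Module):
    """Encoding module: separable conv -> layer normalization -> ELU/Dropout,
    concatenated with the raw input through the residual connector."""
    def __init__(self, in_ch: int = 1, out_ch: int = 64):
        super().__init__()
        self.conv = DepthwiseSeparableConv(in_ch, out_ch)
        self.norm = nn.GroupNorm(1, out_ch)   # layer-norm variant, size-agnostic
        self.act = nn.ELU()
        self.drop = nn.Dropout(0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.drop(self.act(self.norm(self.conv(x))))
        return torch.cat([y, x], dim=1)       # connector concat with the residual input
```

A gray-scale line image of shape (N, 1, 64, 1024) would thus yield a (N, 65, 64, 1024) feature map for the stacked gate module.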
Generally, more network parameters bring better performance, but they also entail a large amount of computation, high memory requirements, often unsatisfactory time performance, and high training costs, making efficient deployment in practical application scenarios difficult. Therefore, this embodiment uses lightweight, small-scale separable convolutions instead of traditional regular convolutions, striking a compromise between recognition accuracy and recognition latency to alleviate the above problems.
Depth separable convolution (Depthwise Separable Convolution) is a decomposable convolution that can be decomposed into two atomic operations: channel-by-channel convolution and point-by-point convolution.
Channel-by-channel convolution (Depth-wise Convolution) is a convolution performed along the depth dimension. Unlike conventional convolution, which acts on all input channels at once, the depth-wise convolution uses a different convolution kernel for each input channel, and kernels are not shared between channels.
Point-by-point convolution (Point-wise Convolution) is similar to conventional convolution, except that the kernel has height and width 1x1 and its number of channels equals the number of channels of the input feature map.
In this way, the depth separable convolution first applies channel-by-channel convolution to filter each input channel spatially and separately, and then applies point-by-point convolution to combine the results across the channel dimension; the overall effect is almost the same as that of a traditional convolution, but the computation amount and model parameter count are greatly reduced.
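For intuition, the saving can be quantified. For a $k \times k$ kernel with $C_{in}$ input channels and $C_{out}$ output channels, a regular convolution uses $k^{2} C_{in} C_{out}$ weights, while the separable form uses $k^{2} C_{in} + C_{in} C_{out}$, a reduction by a factor of roughly $\frac{1}{C_{out}} + \frac{1}{k^{2}}$. For example, with $k = 3$ and $C_{out} = 64$, the separable convolution needs only about one eighth of the parameters; the concrete channel counts here are illustrative, not taken from the embodiment.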
It should be appreciated that the stacked gate module uses separable convolutions instead of conventional convolutions, together with residual connections, which reduces the parameter and computation amount, speeds up convergence, and reduces the memory requirement. Meanwhile, strong representations are extracted using multiple feature transfer functions; the network is regularized with both batch normalization and layer normalization, and Dropout is applied at the end of the whole encoding module and at the end of the whole stacked gate module, effectively alleviating overfitting and accelerating training and convergence of the model.
The entire model of the present embodiment uses a variety of feature transfer functions. Specifically, the encoding module uses an ELU activation function; the stacking gate module uses ELU and Sigmoid activation functions; the decoding module uses a Softmax activation function.
The stacked gate module is the main computation block of the model of this embodiment. It feeds the visual features extracted by the encoder into a stacked structure, where a series of gate units performs deep feature processing to extract strong feature representations and complete the modeling of the input sequence. The iterative estimation it develops over the target sequence can be regarded as a filter that perceives the feature information.
For a feed-forward neural network, each layer can be considered as applying a conversion $T(\cdot)$ to its input $x$ and finally producing an output $y$. The network can thus be understood as modeling the relationship between the input $x$ and its conversion $T(x)$. To explore this degree of matching, the present embodiment devised a novel gate unit as the basic computation block of the stacked gate module.
Unlike the GRU, which establishes a temporal relationship between the historical step and the current step to learn the importance of the historical and current features to the output feature, this embodiment establishes a relationship across layers, between the pre-conversion (original) and post-conversion features, to learn the importance of the original feature and the conversion feature to the output feature. The original feature is then combined with the converted feature using two gates that sum to one, yielding the final fused feature, which is passed on to the next gate unit. This series of repeated operations can be seen as a deep extraction of features that responds differently to different inputs.
For this purpose, the present embodiment defines two gating signals: a retention gate and a conversion gate. The retention gate controls the degree to which the input feature information is carried into the output feature information; the conversion gate controls the degree to which the converted feature information is carried into the output feature information. The larger the gate value, the more is carried in and the larger the influence on the output.
The operation formula of the structure of each gate unit can be expressed as:
$y = a \odot T(x) + (1 - a) \odot x + x$ (2.1)

where $y$ is the output of each gate unit, $x$ is its original input, $1-a$ and $a$ represent the retention gate and the conversion gate respectively, $a$ is the adaptively learned weight between the input feature and the conversion feature, and $\odot$ denotes element-wise multiplication. The present embodiment uses the combination of channel-by-channel convolution, point-by-point convolution, and the ELU activation function as the nonlinear conversion $T(\cdot)$ of the gate unit.
Just as a neural network is composed of many neurons, the stacked gate module is composed of multiple gate units. The gate mechanism provided by this embodiment helps control the flow of information between layers, so that the model automatically learns the feature importance before and after conversion, filtering out unimportant feature signals while strengthening important ones, so as to mine the optimal matching relation among them.
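A minimal PyTorch sketch of one gate unit and the stacked gate module follows. The learnable 1×1 convolution in front of the sigmoid and the number of stacked units are assumptions for illustration (fig. 7 shows only a sigmoid activation layer on the second branch, and the embodiment does not state how many units are stacked).

```python
import torch
import torch.nn as nn

class GateUnit(nn.Module):
    """One gate unit implementing y = a*T(x) + (1 - a)*x + x (equation 2.1)."""
    def __init__(self, ch: int, k: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(ch, ch, k, padding=k // 2, groups=ch)
        self.pointwise = nn.Conv2d(ch, ch, kernel_size=1)
        self.act = nn.ELU()
        self.gate = nn.Conv2d(ch, ch, kernel_size=1)  # assumed learnable layer before the sigmoid

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t = self.act(self.pointwise(self.depthwise(x)))  # first branch: conversion T(x)
        a = torch.sigmoid(self.gate(x))                  # second branch: conversion gate a
        fused = a * t + (1.0 - a) * x                    # first adder: weighted fusion
        return fused + x                                 # second adder: residual connection

# Stacked gate module: gate units in sequence, then the second layer-norm module
# (65 channels match the encoder sketch above; 4 units is an assumed depth).
stacked_gates = nn.Sequential(*[GateUnit(65) for _ in range(4)], nn.GroupNorm(1, 65))
```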
It will be appreciated that the exponential linear unit ELU has advantages as an activation function. On the one hand, the ELU is zero-centered: it pushes the average activation of network units closer to zero, reducing the inter-layer bias accumulated as layers deepen, making gradient propagation more stable and convergence faster, thus playing a role similar to batch normalization at lower computational complexity. On the other hand, for smaller inputs the ELU saturates to a small negative value; this soft saturation reduces the influence of deactivated units on the variation of feature information in forward propagation, weakens their correlation with the next layer, and strengthens the feature importance of activated units. The dependencies among units become easier to model and interpret, so the model is more robust to noise and the network is allowed to learn a more stable representation.
The function and derivative of the ELU are as follows:

$\mathrm{ELU}(x) = \begin{cases} x, & x > 0 \\ \alpha \left( e^{x} - 1 \right), & x \le 0 \end{cases}$ (1.1)

$\mathrm{ELU}'(x) = \begin{cases} 1, & x > 0 \\ \mathrm{ELU}(x) + \alpha, & x \le 0 \end{cases}$ (1.2)

where the hyperparameter $\alpha$ controls the saturation value of the ELU for negative inputs, i.e. the value at which the negative part of the ELU saturates. It can be seen that when the input is positive, the ELU mitigates gradient vanishing through its constant positive derivative; when the input is negative, the ELU saturates to a negative value, reducing the variation of deactivated units and the information they propagate to the next layer, limiting it to small fluctuations and thereby improving robustness to noise.
It should be appreciated that the decoding module performs the subsequent decoding process on the visual features obtained from the stacked gate module, using CTC as the decoder of this module and outputting the final text sequence.
Further, the training process of the trained handwriting English line recognition model comprises the following steps:
s103-1: constructing a training set, wherein the training set is a handwriting English image with known handwriting English line recognition results;
s103-2: and inputting the training set into the handwriting English line recognition model, training the model, and stopping training when the total loss function value of the model is not reduced any more, so as to obtain the trained handwriting English line recognition model.
Further, the total loss function of the model is a weighted sum of the loss function of the decoder and the statistical loss function.
Further, the total loss function of the model is:
$\mathcal{L} = \lambda_{1} \mathcal{L}_{ctc} + \lambda_{2} \mathcal{L}_{sta}$ (4.1)

The total loss function consists of two parts, where $\mathcal{L}_{ctc}$ denotes the CTC loss of the decoder, $\mathcal{L}_{sta}$ denotes the "statistical loss function" loss, and $\lambda_{1}$, $\lambda_{2}$ are the corresponding weights.

The specific formulas of $\mathcal{L}_{ctc}$ are as follows:

$p(\pi \mid x) = \prod_{t=1}^{T} y_{\pi_{t}}^{t}$ (3.1)

$p(l \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(l)} p(\pi \mid x)$ (3.2)

$\mathcal{L}_{ctc} = -\ln p(l \mid x)$ (3.3)

where $y_{\pi_{t}}^{t}$ denotes the probability predicted for character $\pi_{t}$ at time step $t$. Based on the per-frame prediction results and the character dictionary, multiplying the probabilities at every time step within a path gives the probability $p(\pi \mid x)$ of the whole path $\pi$; $\mathcal{B}^{-1}(l)$ denotes the set of all paths $\pi$ that yield the sequence $l$ after the transformation $\mathcal{B}$. Because multiple paths $\pi$ can produce the same sequence, the probability of the final sequence equals the sum over all such paths, giving the total probability $p(l \mid x)$ of the final sequence $l$. The final objective function is the negative log-likelihood of the conditional probability $p(l \mid x)$ of the tag sequence.
The calculation process of the "statistical loss function" $\mathcal{L}_{sta}$ is as follows:

$N_{k}^{l} = \operatorname{count}(c_{k}, l)$ (3.4)

$N_{k}^{p} = \sum_{t=1}^{T} y_{k}^{t}$ (3.5)

$P_{k}^{l} = \dfrac{N_{k}^{l}}{\sum_{j} N_{j}^{l}}, \quad P_{k}^{p} = \dfrac{N_{k}^{p}}{\sum_{j} N_{j}^{p}}$ (3.6)

$\mathcal{L}_{sta} = -\sum_{k=1}^{|\mathcal{D}|} P_{k}^{l} \log P_{k}^{p}$ (3.7)

where $P^{l}$ and $P^{p}$ respectively represent the statistical probability distributions over all character classes in the tag sequence $l$ and in the predicted sequence; $x$ represents an input image, $\mathcal{X}$ the training set, and $\mathcal{D}$ the character dictionary; $N_{k}^{l}$ denotes the number of occurrences of the $k$-th character category $c_{k}$ in the tag sequence $l$; $N_{k}^{p}$ denotes the predicted count of the $k$-th character category $c_{k}$ in the predicted sequence, obtained by aggregating the predicted probability of each character along the time dimension, i.e. accumulating the prediction probability $y_{k}^{t}$ over all time steps $t$.
The CTC loss function calculates the probability distribution over all possible output sequences by considering all possible alignment paths for each input frame during prediction, then greedily finds from it the output sequence with the highest probability for the input sequence, and finally uses the mapping function $\mathcal{B}$ to remove redundant characters and blanks, thereby mapping the path $\pi$ to the final sequence $l$.
During training, the whole model is trained end to end by minimizing the CTC loss against the text labels corresponding to the original pictures; in the inference stage, the final visual features obtained by the visual feature extraction module are transcribed and decoded, and the recognized text sequence is output.
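A sketch of this decoding side in PyTorch follows; the blank index 0 and the (T, N, C) tensor layout are conventions assumed for the example, not mandated by the embodiment.

```python
import torch
import torch.nn.functional as F

ctc_loss = torch.nn.CTCLoss(blank=0, zero_infinity=True)  # assumed blank index 0

def greedy_decode(log_probs: torch.Tensor, idx2char: dict, blank: int = 0) -> str:
    """Best-path decoding for one line (log_probs of shape (T, C)), followed by
    the mapping B: merge repeated characters, then drop blanks."""
    path = log_probs.argmax(dim=-1).tolist()   # most probable character per time step
    out, prev = [], blank
    for p in path:
        if p != prev and p != blank:
            out.append(idx2char[p])
        prev = p
    return "".join(out)

# Training-side usage with `logits` of shape (T, N, C) from the decoding module:
# log_probs = F.log_softmax(logits, dim=-1)
# loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
```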
"statistical-based loss function", the dataset of this example has the following problems: the text line has a plurality of special characters and spaces, and the number of spaces marked by people is not consistent with the number of real spaces at the image pixel level, so that the division among different words is not clear, the recognition result can be caused to have the phenomenon that the word interval of a label is different from the predicted word interval, the number of characters of the label and the predicted result is not corresponding, and the sequence lengths of the label and the predicted result are not consistent. Therefore, the embodiment provides statistical information based on the number of characters as additional supervision information for the model, and allows the network to learn the number of all characters in the text line, so that the generation of a predicted sequence can be better constrained by accurately predicting the number of characters of each class in the label, and more supervision information is helpful for correct recognition.
Furthermore, since the alignment between the tag characters and the model predictions may be unclear, it is challenging for CTC to estimate the conditional probability accurately and directly, and using CTC alone as the loss supervision may introduce errors. Therefore, this embodiment adds a further supervision mode to assist CTC in jointly supervising the model and optimizing the objective function, facilitating correct recognition.
For these two reasons, this embodiment provides a novel statistical loss function. Unlike CTC's probability-prediction approach, which has the network predict a probability distribution over the categories at each time step, the "statistical loss function" considers the number of occurrences of each character category: it counts how many times each category appears and learns the cumulative probability of each character category over all time steps. In essence, the network must predict the count of each category, without considering the character-order information in the tag sequence. For example, in the word "hello" the character "l" appears twice, so its cumulative prediction probability over all time steps should be exactly 2, and the two corresponding predictions should both be close to 1.
Thus, considering that the statistical prediction result and the label form two new distributions, the present embodiment uses cross entropy as the calculation method in order to measure the "distance" between the two new probability distributions.
It can be seen that the "statistical loss function" provided in this embodiment involves only simple operations and no additional parameters; its computation and memory costs are negligible, and it can be implemented with only minor modification to the original model.
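A single-sample sketch of this auxiliary loss follows. Normalizing the count vectors into probability distributions before taking the cross entropy is an assumption consistent with the description above; the embodiment does not spell out the exact normalization.

```python
import torch

def statistical_loss(log_probs: torch.Tensor, target: torch.Tensor,
                     num_classes: int) -> torch.Tensor:
    """Count-based loss (equations 3.4-3.7): cross entropy between the label's
    per-class character counts and the prediction probability mass accumulated
    over all time steps, both normalized into distributions.
    log_probs: (T, C) for one line; target: 1-D tensor of character indices."""
    probs = log_probs.exp()                        # per-frame probabilities
    pred_counts = probs.sum(dim=0)                 # N_k^p: accumulate over time steps
    label_counts = torch.bincount(target, minlength=num_classes).float()  # N_k^l
    p_pred = pred_counts / pred_counts.sum()
    p_label = label_counts / label_counts.sum()
    return -(p_label * (p_pred + 1e-8).log()).sum()

# Total loss of equation (4.1), with assumed weights w1, w2:
# loss = w1 * ctc + w2 * statistical_loss(log_probs, target, num_classes)
```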
Further, the specific process of constructing the training set comprises the following steps:
s103-11: performing size normalization processing on the handwritten English line image;
s103-12: and processing the label of the handwritten English line image.
Further, the processing of the label of the handwritten English line image specifically comprises the following steps:
s103-121: constructing a character dictionary; the character dictionary includes: one-to-one correspondence between characters and indexes; the character comprises: uppercase letters, lowercase letters, numbers, punctuation marks;
s103-122: constructing a statistical dictionary; the statistical dictionary comprises: the number of all characters in the text label and the number of each type of characters; the statistical dictionary comprises: one-to-one correspondence between characters and numbers;
s103-123: mapping the text labels according to the character dictionary, and establishing a corresponding relation between the characters and the indexes;
s103-124: and (3) after S103-121-S103-123, obtaining labels corresponding to all the handwritten English line images.
It should be appreciated that a character dictionary is constructed: the embodiment adopts a novel dictionary creation mode, and all the characters appearing in the data set are counted and repeated characters are filtered, so that the model can identify English characters containing case letters, numbers and punctuation marks, can also identify Chinese characters and special symbols under a Chinese input method, and has universality;
it should be appreciated that a statistical dictionary is built: in the embodiment, the number of all characters in the text label and the number of each type of characters are counted, and a statistical corresponding relation is established between the characters and the number;
it should be appreciated that mapping text labels according to a character dictionary establishes correspondence between "character-indices".
Example two
The embodiment provides a handwriting English line recognition system based on deep learning;
a deep learning based handwriting english line recognition system comprising:
an acquisition module configured to: acquiring a handwritten English image to be identified;
a preprocessing module configured to: preprocessing a handwritten English image to be recognized;
an identification module configured to: processing the preprocessed image by adopting a trained handwriting English line recognition model to obtain a handwriting English line recognition result;
wherein, the trained handwriting English line recognition model comprises:
extracting features of the preprocessed image to obtain preliminary visual features;
extracting depth features from the preliminary visual features to obtain depth visual features;
and decoding the depth visual characteristics to obtain the recognition result of the handwritten English line.
It should be noted that the above-mentioned obtaining module, preprocessing module and identifying module correspond to steps S101 to S103 in the first embodiment, and the above-mentioned modules are the same as examples and application scenarios implemented by the corresponding steps, but are not limited to the disclosure of the first embodiment. It should be noted that the modules described above may be implemented as part of a system in a computer system, such as a set of computer-executable instructions.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. The handwriting English line recognition method based on deep learning is characterized by comprising the following steps of:
acquiring a handwritten English image to be identified;
preprocessing a handwritten English image to be recognized;
processing the preprocessed image by adopting a trained handwriting English line recognition model to obtain a handwriting English line recognition result;
wherein, the trained handwriting English line recognition model comprises: extracting features of the preprocessed image to obtain preliminary visual features; extracting depth features from the preliminary visual features to obtain depth visual features; decoding the depth visual characteristics to obtain a recognition result of the handwritten English line;
wherein, the handwriting English line recognition model after training has a network structure comprising:
the coding module, the stacking gate module and the decoding module are connected in sequence;
the encoding module comprises: the first depth separable convolution network, the first layer standardized module and the connector are connected in sequence; the input end of the first depth separable convolution network is connected with the input end of the connector in a residual way; the input end of the first depth separable convolution network is used as the input end of the coding module; the output end of the connector is used as the output end of the coding module; the coding module is used for extracting the characteristics of the preprocessed image to obtain preliminary visual characteristics;
the stack door module includes: the input end of the first gate unit is connected with the output end of the connector, and the output end of the last gate unit is connected with the input end of the second-layer standardization module; the input end of the first gate unit is used as the input end of the stacking gate module, and the output end of the second layer of standardized module is used as the output end of the stacking gate module; the stacking door module is used for extracting depth features of the preliminary visual features to obtain the depth visual features;
the decoding module comprises: the second depth separable convolution network, the index linear unit, the third layer standardization module and the decoder are connected in sequence; the input end of the second depth separable convolution network is used as the input end of the decoding module, the input end of the second depth separable convolution network is connected with the output end of the second layer standardization module, the output end of the decoder is used as the output end of the decoding module, and the decoding module is used for decoding the depth visual characteristics to obtain the recognition result of the handwritten English line.
2. The deep learning-based handwritten english line recognition method of claim 1, wherein the first gate unit has an internal structure comprising:
the input end of the first gate unit is used for inputting the characteristic diagram output by the coding module; the input ends of the other gate units except the first gate unit are used for inputting the characteristic diagram output by the previous gate unit; the input end of the first gate unit is connected with three parallel branches, and the three parallel branches are a first branch, a second branch and a third branch in sequence;
the first branch circuit comprises: the second channel-by-channel convolution layer, the second point-by-point convolution layer, the exponential linear unit and the first multiplier are sequentially connected; the input end of the second channel-by-channel convolution layer is connected with the input end of the first gate unit;
the second branch includes: the input end of the activation function layer is connected with the input end of the first gate unit;
the third branch includes: the input end of the second multiplier is connected with the input end of the first gate unit;
the input end of the first multiplier is connected with the output end of the activation function layer; the output end of the first multiplier is connected with the input end of the first adder; the output layer of the activation function layer outputs a weight value a; processing the weight value to obtain 1-a; inputting 1-a to the input of the second multiplier; the output end of the second multiplier is connected with the input end of the first adder; the output end of the first adder is connected with the input end of the second adder; the input end of the second adder is also connected with the input end of the first gate unit; the output of the second adder serves as the output of the first gate unit.
3. The deep learning-based handwritten English line recognition method of claim 2, wherein the first branch is used for performing nonlinear conversion on input to obtain conversion characteristics; the second channel-by-channel convolution layer, the second point-by-point convolution layer and the exponential linear unit can be regarded as a combination, and the combination forms conversion together to obtain conversion characteristics which are used as the input of the first multiplier; the conversion feature refers to a feature obtained after nonlinear conversion of the input feature through the first branch.
4. The deep learning based handwritten English line recognition method of claim 2, wherein the second branch defines two gating signals that sum to one, used to model the relation between the input $x$ and its conversion $T(x)$ and to explore their degree of matching; the weight $a$ is learned through an activation function layer and used as the weight value of the conversion feature obtained by the first branch, representing the importance of the conversion feature to the output feature and controlling the degree to which the converted feature information is carried into the output feature information; the difference between 1 and $a$ then gives the weight $1-a$, used as the weight value of the original feature obtained by the third branch, representing the importance of the original feature to the output feature and controlling the degree to which the input feature information is carried into the output feature information; the weight $a$ obtained through the activation function layer is regarded as the conversion gate, and the weight $1-a$ obtained as the difference between 1 and $a$ is regarded as the retention gate.
5. The method for recognizing handwritten English line based on deep learning according to claim 4, wherein the first multiplier and the second multiplier are both used for multiplying elements at corresponding positions of two matrixes, and the first multiplier multiplies the conversion characteristics of the first branch by the conversion gate of the second branch to realize weighting; the second multiplier multiplies the original characteristic of the third branch with a retention gate of the second branch to realize weighting;
the first adder and the second adder are both used for adding elements at positions corresponding to the two matrixes, and the first adder sums the conversion characteristics of the weighted first branch and the original characteristics of the weighted third branch to obtain fusion characteristics; the second adder is seen as a residual connection, adding the original input to the output result as well.
6. The deep learning-based handwriting english line recognition method of claim 1, wherein the training process of the trained handwriting english line recognition model comprises:
constructing a training set, wherein the training set is a handwriting English image with known handwriting English line recognition results;
inputting the training set into a handwriting English line recognition model, training the model, and stopping training when the total loss function value of the model is not reduced any more, so as to obtain a trained handwriting English line recognition model;
the total loss function of the model is a weighted sum of the loss function of the decoder and the statistical loss function.
7. The handwriting English line recognition system based on deep learning is characterized by comprising:
an acquisition module configured to: acquiring a handwritten English image to be identified;
a preprocessing module configured to: preprocessing a handwritten English image to be recognized;
an identification module configured to: processing the preprocessed image by adopting a trained handwriting English line recognition model to obtain a handwriting English line recognition result;
wherein, the trained handwriting English line recognition model comprises: extracting features of the preprocessed image to obtain preliminary visual features; extracting depth features from the preliminary visual features to obtain depth visual features; decoding the depth visual characteristics to obtain a recognition result of the handwritten English line;
wherein, the handwriting English line recognition model after training has a network structure comprising:
the coding module, the stacking gate module and the decoding module are connected in sequence;
the encoding module comprises: the first depth separable convolution network, the first layer standardized module and the connector are connected in sequence; the input end of the first depth separable convolution network is connected with the input end of the connector in a residual way; the input end of the first depth separable convolution network is used as the input end of the coding module; the output end of the connector is used as the output end of the coding module; the coding module is used for extracting the characteristics of the preprocessed image to obtain preliminary visual characteristics;
the stack door module includes: the input end of the first gate unit is connected with the output end of the connector, and the output end of the last gate unit is connected with the input end of the second-layer standardization module; the input end of the first gate unit is used as the input end of the stacking gate module, and the output end of the second layer of standardized module is used as the output end of the stacking gate module; the stacking door module is used for extracting depth features of the preliminary visual features to obtain the depth visual features;
the decoding module comprises: the second depth separable convolution network, the index linear unit, the third layer standardization module and the decoder are connected in sequence; the input end of the second depth separable convolution network is used as the input end of the decoding module, the input end of the second depth separable convolution network is connected with the output end of the second layer standardization module, the output end of the decoder is used as the output end of the decoding module, and the decoding module is used for decoding the depth visual characteristics to obtain the recognition result of the handwritten English line.
CN202310084850.3A 2023-02-09 2023-02-09 Deep learning-based handwriting English line recognition method and system Active CN115797952B (en)

Priority Applications (1)

CN202310084850.3A — Priority date 2023-02-09 — Filing date 2023-02-09 — Deep learning-based handwriting English line recognition method and system

Publications (2)

CN115797952A (en) — published 2023-03-14
CN115797952B (en) — granted 2023-05-05

Family ID: 85430576

Country Status (1)

CN: CN115797952B (en)





Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination
GR01 — Patent grant