CN115797952B - Deep learning-based handwriting English line recognition method and system - Google Patents

Info

Publication number: CN115797952B (application number CN202310084850.3A)
Authority: CN (China)
Prior art keywords: module, input end, english, handwriting, depth
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN115797952A (en)
Inventors: 许信顺, 初宛晴, 马磊, 陈义学, 李溢欢
Current Assignee: SHANDONG SHANDA OUMA SOFTWARE CO Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Application filed by SHANDONG SHANDA OUMA SOFTWARE CO Ltd; priority to CN202310084850.3A; publication of CN115797952A; application granted; publication of CN115797952B
Current legal status: Active

Landscapes

  • Image Analysis (AREA)
  • Character Discrimination (AREA)
Abstract

The invention relates to the technical field of image processing, in particular to a handwriting English line recognition method and system based on deep learning. The method comprises: acquiring a handwritten English image to be recognized; preprocessing the handwritten English image to be recognized; and processing the preprocessed image with a trained handwriting English line recognition model to obtain a handwriting English line recognition result. The trained handwriting English line recognition model extracts features of the preprocessed image to obtain preliminary visual features, extracts depth features from the preliminary visual features to obtain depth visual features, and decodes the depth visual features to obtain the recognition result of the handwritten English line. The invention produces accurate recognition results for handwritten English lines.

Description

Deep learning-based handwriting English line recognition method and system
Technical Field
The invention relates to the technical field of image processing, in particular to a handwriting English line recognition method and system based on deep learning.
Background
The statements in this section merely relate to the background of the present disclosure and may not necessarily constitute prior art.
Text, as a form of conveying information between humans and as the visual coding of language, is self-evidently necessary and ubiquitous in daily human life. The growing diversity of visual text forms, coupled with the tremendous differences in human handwriting styles, gives handwritten text its complex nature.
Given this situation, processing text intelligently is increasingly important. Automatically transcribing and storing text efficiently saves time, manpower, and material resources, and facilitates subsequent processing and application.
Text recognition is an important computer vision task developed to address the above need. Current text recognition technology mostly adopts a segmentation-free, single-stage recognition approach: the text picture is treated as a whole, and a sequence transcription is sought that performs fine-grained alignment between the source sequence in the original picture and the output target sequence.
One main trend uses the "encoder-decoder" architecture with attention mechanisms: the encoder first maps the text picture as a whole into a representation vector, and the decoder then transcribes it into a sequence of consecutive characters based on this representation. A weight matrix is obtained through neural network learning, whose values represent the importance of the corresponding context information to the prediction at the current time step, thereby realizing selective alignment between the encoded representation sequence and the decoded sequence. However, this architecture has several problems:
(1) Missing or redundant characters in the text picture may cause misalignment to accumulate between the tag sequence and the attention predictions, misleading the training process, making it hard for training to learn from the beginning, and yielding poor results in long-text-line scenarios;
(2) The attention mechanism relies on a complex attention module, which introduces additional network parameters and runtime as well as a large memory requirement.
Disclosure of Invention
In order to solve the defects of the prior art, the invention provides a handwriting English line recognition method and a handwriting English line recognition system based on deep learning.
In a first aspect, the present invention provides a handwriting English line recognition method based on deep learning, which enables rapid and accurate recognition of the handwritten English line; the method includes:
acquiring a handwritten English image to be identified;
preprocessing a handwritten English image to be recognized;
processing the preprocessed image by adopting a trained handwriting English line recognition model to obtain a handwriting English line recognition result;
wherein, the trained handwriting English line recognition model comprises: extracting features of the preprocessed image to obtain preliminary visual features; extracting depth features from the preliminary visual features to obtain depth visual features; and decoding the depth visual characteristics to obtain the recognition result of the handwritten English line.
In a second aspect, the invention provides a handwriting English line recognition system based on deep learning; the invention can realize the rapid and accurate recognition of the handwritten English line, and the system comprises:
an acquisition module configured to: acquiring a handwritten English image to be identified;
a preprocessing module configured to: preprocessing a handwritten English image to be recognized;
an identification module configured to: processing the preprocessed image by adopting a trained handwriting English line recognition model to obtain a handwriting English line recognition result;
wherein, the trained handwriting English line recognition model comprises: extracting features of the preprocessed image to obtain preliminary visual features; extracting depth features from the preliminary visual features to obtain depth visual features; and decoding the depth visual characteristics to obtain the recognition result of the handwritten English line.
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention provides a simple, fully convolutional neural network architecture, which uses only efficient separable convolutions in place of traditional regular convolutions and only feedforward connections with no recurrent connections, achieving high data and computational efficiency. It can be trained on text images of variable size using line-level transcription labels of variable length, without preprocessing such as character segmentation or horizontal normalization;
(2) The invention provides an effective iterative estimation method for the target sequence and a novel gate unit, which fully extracts features through unit stacking while exploiting the advantages of the gate mechanism for feature processing and fusion, helping the model obtain more accurate recognition results;
(3) The model provided by the invention adopts multiple feature transformation modes and can extract strongly informative feature representations from the training data;
(4) The model provided by the invention regularizes the network with several normalization modes and Dropout, and accelerates model convergence with several regularization methods, thereby effectively alleviating overfitting;
(5) The model provided by the invention uses no fully connected layer; its main computation block is the separable convolution. The model obtains comparable performance by using separable convolutions instead of traditional regular convolutions, greatly reducing the parameter and computation amount, accelerating training and convergence of the model, and saving storage space;
(6) The invention provides a novel statistical loss function that gives the model more supervision information and assists CTC in jointly optimizing the objective function, which is conducive to correct recognition.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a flow chart of a method according to a first embodiment;
fig. 2 is a diagram showing an internal network structure of a trained handwriting english line recognition model according to the first embodiment;
FIG. 3 is an internal structure diagram of an encoding module according to the first embodiment;
FIG. 4 is a block diagram of the interior of the stacked gate module according to the first embodiment;
FIG. 5 is a diagram showing an internal structure of a decoding module according to the first embodiment;
FIG. 6 is an internal block diagram of a first depth separable convolutional network of the first embodiment;
fig. 7 is a diagram showing the internal structure of the first gate unit according to the first embodiment.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the present invention. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well. Furthermore, the terms "comprises" and "comprising" and any variations thereof are intended to cover non-exclusive inclusion: processes, methods, systems, products, or devices that comprise a series of steps or units are not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to such processes, methods, products, or devices.
Embodiments of the invention and features of the embodiments may be combined with each other without conflict.
All data acquisition in the embodiment is legal application of the data on the basis of meeting laws and regulations and agreements of users.
This embodiment builds on the CTC (Connectionist Temporal Classification) architecture, which computes the probability distribution over all possible output sequences by considering every possible alignment path for each input frame during prediction, and then greedily selects the output sequence with the highest probability. CTC alignment is therefore accurate, model training converges quickly, and the method suits one-dimensional sequence prediction tasks without additional processing. Considering that the application scenario of this embodiment targets long text lines, and taking the above together, CTC is adopted as the decoder of the model in this embodiment.
Currently, many deep learning techniques have been proposed for the text recognition task, but methods based on Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) have long been dominant and have become a general processing architecture. Nevertheless, this structure has a drawback: the sequential processing nature of RNNs introduces latency into model prediction, so RNNs are not always a good choice. In view of this, this embodiment uses only a fully convolutional CNN architecture as the model's feature extractor.
Following the two preceding paragraphs, this embodiment uses a CNN+CTC architecture; but unlike existing work, it uses a novel CNN architecture that can process sequences of arbitrary length without preprocessing such as character segmentation or horizontal normalization, and achieves advanced performance on handwritten English line pictures.
Example 1
The embodiment provides a handwriting English line recognition method based on deep learning;
as shown in fig. 1, the handwriting english line recognition method based on deep learning includes:
s101: acquiring a handwritten English image to be identified;
s102: preprocessing a handwritten English image to be recognized;
s103: processing the preprocessed image by adopting a trained handwriting English line recognition model to obtain a handwriting English line recognition result;
wherein, the trained handwriting English line recognition model comprises:
extracting features of the preprocessed image to obtain preliminary visual features;
extracting depth features from the preliminary visual features to obtain depth visual features;
and decoding the depth visual characteristics to obtain the recognition result of the handwritten English line.
Further, the step S101: acquiring the handwritten English image to be recognized — the handwritten English text is photographed with a camera, and the handwritten English image to be recognized is acquired by photographing.
Further, the step S102: preprocessing a handwritten English image to be recognized, specifically comprising the following steps:
s102-1: performing size normalization processing on the handwritten English image to be identified;
s102-2: and carrying out graying treatment on the handwritten English image subjected to the size normalization treatment.
It should be understood that, since English lines differ in length, the length and width of the images are inconsistent; to enable batch processing of text images, this embodiment first normalizes the length and width of all text images uniformly. Meanwhile, the handwritten text image is a three-channel color image in which the color of each pixel is determined by the three components R, G, B; given the specificity of scanned handwritten text images, the values of each pixel are equal across the three components, so this embodiment converts the image into gray-scale form. Each pixel then has only one component, which reduces the subsequent image computation without affecting the overall effect.
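For illustration, a minimal preprocessing sketch in Python is given below. The target size of 64×1024 pixels and the use of OpenCV are assumptions made for the example; the embodiment does not fix concrete values.

```python
import cv2
import numpy as np

def preprocess(image_path: str, target_h: int = 64, target_w: int = 1024) -> np.ndarray:
    """S102-1/S102-2: size-normalize a text-line image and convert it to gray scale."""
    img = cv2.imread(image_path)                  # three-channel color image (BGR)
    img = cv2.resize(img, (target_w, target_h))   # unify length and width for batching
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # one component per pixel
    return gray.astype(np.float32) / 255.0        # scale to [0, 1] for the network
```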
Further, as shown in fig. 2, the trained handwriting english line recognition model has a network structure including:
the coding module, the stacking gate module and the decoding module are connected in sequence;
as shown in fig. 3, the encoding module includes: a first depth separable convolutional network (Depthwise Separable Convolution), a first layer normalization module LN (Layer Normalization), and a connector concat, connected in sequence; the input end of the first depth separable convolution network is connected with the input end of the connector by a residual connection; the input end of the first depth separable convolution network serves as the input end of the encoding module; the output end of the connector concat serves as the output end of the encoding module. The encoding module is used for extracting features of the preprocessed image to obtain the preliminary visual features.
Further, as shown in fig. 4, the stacked gate module includes a plurality of gate units connected in sequence followed by a second layer normalization module: the input end of the first gate unit is connected with the output end of the connector concat, and the output end of the last gate unit is connected with the input end of the second layer normalization module; the input end of the first gate unit serves as the input end of the stacked gate module, and the output end of the second layer normalization module serves as the output end of the stacked gate module. The stacked gate module is used for performing depth feature extraction on the preliminary visual features to obtain the depth visual features.
Further, as shown in fig. 5, the decoding module includes:
a second depth separable convolutional network, an exponential linear unit (ELU, Exponential Linear Unit), a third layer normalization module, and a decoder, connected in sequence; the input end of the second depth separable convolution network serves as the input end of the decoding module and is connected with the output end of the second layer normalization module; the output end of the decoder serves as the output end of the decoding module. The decoding module is used for decoding the depth visual features to obtain the recognition result of the handwritten English line.
Further, the internal network structures of the first depth separable convolutional network and the second depth separable convolutional network are identical.
Further, as shown in fig. 6, the first depth separable convolution network includes:
a first channel-wise Convolution layer (Depth-wise Convolution) and a first Point-wise Convolution layer (Point-wise Convolution) are connected in sequence.
Further, all gate units have the same internal structure.
Further, as shown in fig. 7, the internal structure of the first gate unit includes:
the input end of the first gate unit is used for inputting the characteristic diagram output by the coding module; the input ends of the other gate units except the first gate unit are used for inputting the characteristic diagram output by the previous gate unit;
the input end of the first gate unit is connected with three parallel branches, and the three parallel branches are a first branch, a second branch and a third branch in sequence;
the first branch comprises: a second channel-wise convolution layer (Depth-wise Convolution), a second point-wise convolution layer (Point-wise Convolution), an exponential linear unit (ELU, Exponential Linear Unit), and a first multiplier, connected in sequence; the input end of the second channel-by-channel convolution layer is connected with the input end of the first gate unit;
the second branch includes: the input end of the sigmoid activation function layer is connected with the input end of the first gate unit;
the third branch includes: the input end of the second multiplier is connected with the input end of the first gate unit;
the input end of the first multiplier is connected with the output end of the sigmoid activation function layer; the output end of the first multiplier is connected with the input end of the first adder; the sigmoid activation function layer outputs a weight value a;
the weight value a is processed to obtain 1-a, and 1-a is input to the input end of the second multiplier;
the output end of the second multiplier is connected with the input end of the first adder;
the output end of the first adder is connected with the input end of the second adder;
the input end of the second adder is also connected with the input end of the first gate unit;
the output of the second adder serves as the output of the first gate unit.
Further, in the first gate unit:
The first branch performs a nonlinear conversion on the input to obtain a conversion feature. The second channel-by-channel convolution layer, the second point-by-point convolution layer, and the exponential linear unit can be regarded as a combination that jointly forms the conversion $T(\cdot)$, yielding the conversion feature $T(x)$, which is taken as the input of the first multiplier. The conversion feature $T(x)$ refers to the feature obtained after the input feature $x$ passes through the nonlinear conversion $T(\cdot)$ of the first branch; the "original feature" refers to the input feature $x$ itself.
The third branch preserves the input $x$, i.e. the original feature, and takes it as the input of the second multiplier.
The second branch defines two gating signals that sum to one, used to model the relation between the input $x$ and its conversion $T(x)$ and to explore their degree of matching. A weight $a$ is learned through the activation function layer; the conversion gate takes $a$ as the weight value of the conversion feature obtained by the first branch, representing the importance of the conversion feature to the output feature and controlling the degree to which the converted feature information is carried into the output feature information. The difference between 1 and $a$ then gives the weight $1-a$; the retention gate takes $1-a$ as the weight value of the original feature obtained by the third branch, representing the importance of the original feature to the output feature and controlling the degree to which the input feature information is carried into the output feature information. The larger a weight value, the larger the influence of the corresponding feature on the output of the gate unit.
Each gate unit produces an output feature $y$ and passes it to the next gate unit as that unit's input.
Further, the first multiplier and the second multiplier both multiply the elements at corresponding positions of two matrices: the first multiplier multiplies the conversion feature of the first branch by the conversion gate of the second branch to realize weighting; the second multiplier multiplies the original feature of the third branch by the retention gate of the second branch to realize weighting.
Further, the first adder and the second adder both add the elements at corresponding positions of two matrices: the first adder sums the weighted conversion feature of the first branch and the weighted original feature of the third branch to obtain the fusion feature; the second adder can be seen as a kind of residual connection, likewise adding the original input to the output result.
It should be appreciated that the encoding module is used to extract the initial visual features of the picture. The encoding module is based on a fully convolutional network and consists of depth-wise convolution, point-wise convolution, and inter-layer residual connections. The training process is regularized using batch normalization, layer normalization, and Dropout, with ELU as the activation function.
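A sketch of this encoding module in PyTorch follows. The channel counts, kernel size, Dropout rate, the exact placement of the ELU and Dropout, and the use of GroupNorm(1, C) as a layer normalization that tolerates variable-size inputs are illustrative assumptions, not values fixed by the embodiment.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Channel-wise (depth-wise) convolution followed by a 1x1 point-wise convolution."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

class Encoder(nn.Module):
    """Encoding module: separable conv -> layer normalization -> ELU/Dropout,
    concatenated with the raw input through the residual connector."""
    def __init__(self, in_ch: int = 1, out_ch: int = 64):
        super().__init__()
        self.conv = DepthwiseSeparableConv(in_ch, out_ch)
        self.norm = nn.GroupNorm(1, out_ch)   # layer-norm variant, size-agnostic
        self.act = nn.ELU()
        self.drop = nn.Dropout(0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.drop(self.act(self.norm(self.conv(x))))
        return torch.cat([y, x], dim=1)       # connector concat with the residual input
```

A gray-scale line image of shape (N, 1, 64, 1024) would thus yield a (N, 65, 64, 1024) feature map for the stacked gate module.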
Generally, more network parameters bring better performance, but they also entail a large amount of computation, high memory requirements, often unsatisfactory time performance, and high training costs, making efficient deployment in practical application scenarios difficult. Therefore, this embodiment uses lightweight, small-scale separable convolutions instead of traditional regular convolutions, striking a compromise between recognition accuracy and recognition latency to alleviate the above problems.
Depth separable convolution (Depthwise Separable Convolution) is a decomposable convolution that can be decomposed into two atomic operations: channel-by-channel convolution and point-by-point convolution.
Channel-by-channel convolution (Depth-wise Convolution) is a convolution performed along the depth dimension. Unlike conventional convolution, which acts on all input channels at once, the depth-wise convolution uses a different convolution kernel for each input channel, and kernels are not shared between channels.
Point-by-point convolution (Point-wise Convolution) is similar to conventional convolution, except that the kernel has height and width 1x1 and its number of channels equals the number of channels of the input feature map.
In this way, the depth separable convolution first applies channel-by-channel convolution to filter each input channel spatially and separately, and then applies point-by-point convolution to combine the results across the channel dimension; the overall effect is almost the same as that of a traditional convolution, but the computation amount and model parameter count are greatly reduced.
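For intuition, the saving can be quantified. For a $k \times k$ kernel with $C_{in}$ input channels and $C_{out}$ output channels, a regular convolution uses $k^{2} C_{in} C_{out}$ weights, while the separable form uses $k^{2} C_{in} + C_{in} C_{out}$, a reduction by a factor of roughly $\frac{1}{C_{out}} + \frac{1}{k^{2}}$. For example, with $k = 3$ and $C_{out} = 64$, the separable convolution needs only about one eighth of the parameters; the concrete channel counts here are illustrative, not taken from the embodiment.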
It should be appreciated that the stacked gate module uses separable convolutions instead of conventional convolutions, together with residual connections, which reduces the parameter and computation amount, speeds up convergence, and reduces the memory requirement. Meanwhile, strong representations are extracted using multiple feature transfer functions; the network is regularized with both batch normalization and layer normalization, and Dropout is applied at the end of the whole encoding module and at the end of the whole stacked gate module, effectively alleviating overfitting and accelerating training and convergence of the model.
The entire model of the present embodiment uses a variety of feature transfer functions. Specifically, the encoding module uses an ELU activation function; the stacking gate module uses ELU and Sigmoid activation functions; the decoding module uses a Softmax activation function.
The stacked gate module is the main computation block of the model of this embodiment. It feeds the visual features extracted by the encoder into a stacked structure, where a series of gate units performs deep feature processing to extract strong feature representations and complete the modeling of the input sequence. The iterative estimation it develops over the target sequence can be regarded as a filter that perceives the feature information.
For a feed-forward neural network, each layer can be considered as applying a conversion $T(\cdot)$ to its input $x$ and finally producing an output $y$. The network can thus be understood as modeling the relationship between the input $x$ and its conversion $T(x)$. To explore this degree of matching, the present embodiment devised a novel gate unit as the basic computation block of the stacked gate module.
Unlike the GRU, which establishes a temporal relationship between the historical step and the current step to learn the importance of the historical and current features to the output feature, this embodiment establishes a relationship across layers, between the pre-conversion (original) and post-conversion features, to learn the importance of the original feature and the conversion feature to the output feature. The original feature is then combined with the converted feature using two gates that sum to one, yielding the final fused feature, which is passed on to the next gate unit. This series of repeated operations can be seen as a deep extraction of features that responds differently to different inputs.
For this purpose, the present embodiment defines two gating signals: a retention gate and a conversion gate. The retention gate controls the degree to which the input feature information is carried into the output feature information; the conversion gate controls the degree to which the converted feature information is carried into the output feature information. The larger the gate value, the more is carried in and the larger the influence on the output.
The operation formula of the structure of each gate unit can be expressed as:
$y = a \odot T(x) + (1 - a) \odot x + x$ (2.1)

where $y$ is the output of each gate unit, $x$ is its original input, $1-a$ and $a$ represent the retention gate and the conversion gate respectively, $a$ is the adaptively learned weight between the input feature and the conversion feature, and $\odot$ denotes element-wise multiplication. The present embodiment uses the combination of channel-by-channel convolution, point-by-point convolution, and the ELU activation function as the nonlinear conversion $T(\cdot)$ of the gate unit.
Just as a neural network is composed of many neurons, the stacked gate module is composed of multiple gate units. The gate mechanism provided by this embodiment helps control the flow of information between layers, so that the model automatically learns the feature importance before and after conversion, filtering out unimportant feature signals while strengthening important ones, so as to mine the optimal matching relation among them.
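A minimal PyTorch sketch of one gate unit and the stacked gate module follows. The learnable 1×1 convolution in front of the sigmoid and the number of stacked units are assumptions for illustration (fig. 7 shows only a sigmoid activation layer on the second branch, and the embodiment does not state how many units are stacked).

```python
import torch
import torch.nn as nn

class GateUnit(nn.Module):
    """One gate unit implementing y = a*T(x) + (1 - a)*x + x (equation 2.1)."""
    def __init__(self, ch: int, k: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(ch, ch, k, padding=k // 2, groups=ch)
        self.pointwise = nn.Conv2d(ch, ch, kernel_size=1)
        self.act = nn.ELU()
        self.gate = nn.Conv2d(ch, ch, kernel_size=1)  # assumed learnable layer before the sigmoid

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t = self.act(self.pointwise(self.depthwise(x)))  # first branch: conversion T(x)
        a = torch.sigmoid(self.gate(x))                  # second branch: conversion gate a
        fused = a * t + (1.0 - a) * x                    # first adder: weighted fusion
        return fused + x                                 # second adder: residual connection

# Stacked gate module: gate units in sequence, then the second layer-norm module
# (65 channels match the encoder sketch above; 4 units is an assumed depth).
stacked_gates = nn.Sequential(*[GateUnit(65) for _ in range(4)], nn.GroupNorm(1, 65))
```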
It will be appreciated that the exponential linear unit ELU has advantages as an activation function. On the one hand, the ELU is zero-centered: it pushes the average activation of network units closer to zero, reducing the inter-layer bias accumulated as layers deepen, making gradient propagation more stable and convergence faster, thus playing a role similar to batch normalization at lower computational complexity. On the other hand, for smaller inputs the ELU saturates to a small negative value; this soft saturation reduces the influence of deactivated units on the variation of feature information in forward propagation, weakens their correlation with the next layer, and strengthens the feature importance of activated units. The dependencies among units become easier to model and interpret, so the model is more robust to noise and the network is allowed to learn a more stable representation.
The function and derivative of the ELU are as follows:

$\mathrm{ELU}(x) = \begin{cases} x, & x > 0 \\ \alpha \left( e^{x} - 1 \right), & x \le 0 \end{cases}$ (1.1)

$\mathrm{ELU}'(x) = \begin{cases} 1, & x > 0 \\ \mathrm{ELU}(x) + \alpha, & x \le 0 \end{cases}$ (1.2)

where the hyperparameter $\alpha$ controls the saturation value of the ELU for negative inputs, i.e. the value at which the negative part of the ELU saturates. It can be seen that when the input is positive, the ELU mitigates gradient vanishing through its constant positive derivative; when the input is negative, the ELU saturates to a negative value, reducing the variation of deactivated units and the information they propagate to the next layer, limiting it to small fluctuations and thereby improving robustness to noise.
It should be appreciated that the decoding module performs the subsequent decoding process on the visual features obtained from the stacked gate module, using CTC as the decoder of this module and outputting the final text sequence.
Further, the training process of the trained handwriting English line recognition model comprises the following steps:
s103-1: constructing a training set, wherein the training set is a handwriting English image with known handwriting English line recognition results;
s103-2: and inputting the training set into the handwriting English line recognition model, training the model, and stopping training when the total loss function value of the model is not reduced any more, so as to obtain the trained handwriting English line recognition model.
Further, the total loss function of the model is a weighted sum of the loss function of the decoder and the statistical loss function.
Further, the total loss function of the model is:
$\mathcal{L} = \lambda_{1} \mathcal{L}_{ctc} + \lambda_{2} \mathcal{L}_{sta}$ (4.1)

The total loss function consists of two parts, where $\mathcal{L}_{ctc}$ denotes the CTC loss of the decoder, $\mathcal{L}_{sta}$ denotes the "statistical loss function" loss, and $\lambda_{1}$, $\lambda_{2}$ are the corresponding weights.

The specific formulas of $\mathcal{L}_{ctc}$ are as follows:

$p(\pi \mid x) = \prod_{t=1}^{T} y_{\pi_{t}}^{t}$ (3.1)

$p(l \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(l)} p(\pi \mid x)$ (3.2)

$\mathcal{L}_{ctc} = -\ln p(l \mid x)$ (3.3)

where $y_{\pi_{t}}^{t}$ denotes the probability predicted for character $\pi_{t}$ at time step $t$. Based on the per-frame prediction results and the character dictionary, multiplying the probabilities at every time step within a path gives the probability $p(\pi \mid x)$ of the whole path $\pi$; $\mathcal{B}^{-1}(l)$ denotes the set of all paths $\pi$ that yield the sequence $l$ after the transformation $\mathcal{B}$. Because multiple paths $\pi$ can produce the same sequence, the probability of the final sequence equals the sum over all such paths, giving the total probability $p(l \mid x)$ of the final sequence $l$. The final objective function is the negative log-likelihood of the conditional probability $p(l \mid x)$ of the tag sequence.
The calculation process of the "statistical loss function" $\mathcal{L}_{sta}$ is as follows:

$N_{k}^{l} = \operatorname{count}(c_{k}, l)$ (3.4)

$N_{k}^{p} = \sum_{t=1}^{T} y_{k}^{t}$ (3.5)

$P_{k}^{l} = \dfrac{N_{k}^{l}}{\sum_{j} N_{j}^{l}}, \quad P_{k}^{p} = \dfrac{N_{k}^{p}}{\sum_{j} N_{j}^{p}}$ (3.6)

$\mathcal{L}_{sta} = -\sum_{k=1}^{|\mathcal{D}|} P_{k}^{l} \log P_{k}^{p}$ (3.7)

where $P^{l}$ and $P^{p}$ respectively represent the statistical probability distributions over all character classes in the tag sequence $l$ and in the predicted sequence; $x$ represents an input image, $\mathcal{X}$ the training set, and $\mathcal{D}$ the character dictionary; $N_{k}^{l}$ denotes the number of occurrences of the $k$-th character category $c_{k}$ in the tag sequence $l$; $N_{k}^{p}$ denotes the predicted count of the $k$-th character category $c_{k}$ in the predicted sequence, obtained by aggregating the predicted probability of each character along the time dimension, i.e. accumulating the prediction probability $y_{k}^{t}$ over all time steps $t$.
The CTC loss function calculates the probability distribution over all possible output sequences by considering all possible alignment paths for each input frame during prediction, then greedily finds from it the output sequence with the highest probability for the input sequence, and finally uses the mapping function $\mathcal{B}$ to remove redundant characters and blanks, thereby mapping the path $\pi$ to the final sequence $l$.
During training, the whole model is trained end to end by minimizing the CTC loss against the text labels corresponding to the original pictures; in the inference stage, the final visual features obtained by the visual feature extraction module are transcribed and decoded, and the recognized text sequence is output.
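A sketch of this decoding side in PyTorch follows; the blank index 0 and the (T, N, C) tensor layout are conventions assumed for the example, not mandated by the embodiment.

```python
import torch
import torch.nn.functional as F

ctc_loss = torch.nn.CTCLoss(blank=0, zero_infinity=True)  # assumed blank index 0

def greedy_decode(log_probs: torch.Tensor, idx2char: dict, blank: int = 0) -> str:
    """Best-path decoding for one line (log_probs of shape (T, C)), followed by
    the mapping B: merge repeated characters, then drop blanks."""
    path = log_probs.argmax(dim=-1).tolist()   # most probable character per time step
    out, prev = [], blank
    for p in path:
        if p != prev and p != blank:
            out.append(idx2char[p])
        prev = p
    return "".join(out)

# Training-side usage with `logits` of shape (T, N, C) from the decoding module:
# log_probs = F.log_softmax(logits, dim=-1)
# loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
```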
"statistical-based loss function", the dataset of this example has the following problems: the text line has a plurality of special characters and spaces, and the number of spaces marked by people is not consistent with the number of real spaces at the image pixel level, so that the division among different words is not clear, the recognition result can be caused to have the phenomenon that the word interval of a label is different from the predicted word interval, the number of characters of the label and the predicted result is not corresponding, and the sequence lengths of the label and the predicted result are not consistent. Therefore, the embodiment provides statistical information based on the number of characters as additional supervision information for the model, and allows the network to learn the number of all characters in the text line, so that the generation of a predicted sequence can be better constrained by accurately predicting the number of characters of each class in the label, and more supervision information is helpful for correct recognition.
Furthermore, since the alignment between the tag characters and the model predictions may be unclear, it is challenging for CTC to estimate the conditional probability accurately and directly, and using CTC alone as the loss supervision may introduce errors. Therefore, this embodiment adds a further supervision mode to assist CTC in jointly supervising the model and optimizing the objective function, facilitating correct recognition.
For these two reasons, this embodiment provides a novel statistical loss function. Unlike CTC's probability-prediction approach, which has the network predict a probability distribution over the categories at each time step, the "statistical loss function" considers the number of occurrences of each character category: it counts how many times each category appears and learns the cumulative probability of each character category over all time steps. In essence, the network must predict the count of each category, without considering the character-order information in the tag sequence. For example, in the word "hello" the character "l" appears twice, so its cumulative prediction probability over all time steps should be exactly 2, and the two corresponding predictions should both be close to 1.
Thus, considering that the statistical prediction result and the label form two new distributions, the present embodiment uses cross entropy as the calculation method in order to measure the "distance" between the two new probability distributions.
It can be seen that the "statistical loss function" provided in this embodiment involves only simple operations and no additional parameters; its computation and memory costs are negligible, and it can be implemented with only minor modification to the original model.
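A single-sample sketch of this auxiliary loss follows. Normalizing the count vectors into probability distributions before taking the cross entropy is an assumption consistent with the description above; the embodiment does not spell out the exact normalization.

```python
import torch

def statistical_loss(log_probs: torch.Tensor, target: torch.Tensor,
                     num_classes: int) -> torch.Tensor:
    """Count-based loss (equations 3.4-3.7): cross entropy between the label's
    per-class character counts and the prediction probability mass accumulated
    over all time steps, both normalized into distributions.
    log_probs: (T, C) for one line; target: 1-D tensor of character indices."""
    probs = log_probs.exp()                        # per-frame probabilities
    pred_counts = probs.sum(dim=0)                 # N_k^p: accumulate over time steps
    label_counts = torch.bincount(target, minlength=num_classes).float()  # N_k^l
    p_pred = pred_counts / pred_counts.sum()
    p_label = label_counts / label_counts.sum()
    return -(p_label * (p_pred + 1e-8).log()).sum()

# Total loss of equation (4.1), with assumed weights w1, w2:
# loss = w1 * ctc + w2 * statistical_loss(log_probs, target, num_classes)
```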
Further, the specific process of constructing the training set comprises the following steps:
s103-11: performing size normalization processing on the handwritten English line image;
s103-12: and processing the label of the handwritten English line image.
Further, the processing of the label of the handwritten English line image specifically comprises the following steps:
s103-121: constructing a character dictionary; the character dictionary includes: one-to-one correspondence between characters and indexes; the character comprises: uppercase letters, lowercase letters, numbers, punctuation marks;
s103-122: constructing a statistical dictionary; the statistical dictionary comprises: the number of all characters in the text label and the number of each type of characters; the statistical dictionary comprises: one-to-one correspondence between characters and numbers;
s103-123: mapping the text labels according to the character dictionary, and establishing a corresponding relation between the characters and the indexes;
s103-124: and (3) after S103-121-S103-123, obtaining labels corresponding to all the handwritten English line images.
It should be appreciated that a character dictionary is constructed: the embodiment adopts a novel dictionary creation mode, and all the characters appearing in the data set are counted and repeated characters are filtered, so that the model can identify English characters containing case letters, numbers and punctuation marks, can also identify Chinese characters and special symbols under a Chinese input method, and has universality;
it should be appreciated that a statistical dictionary is built: in the embodiment, the number of all characters in the text label and the number of each type of characters are counted, and a statistical corresponding relation is established between the characters and the number;
it should be appreciated that mapping text labels according to a character dictionary establishes correspondence between "character-indices".
Example two
The embodiment provides a handwriting English line recognition system based on deep learning;
a deep learning based handwriting english line recognition system comprising:
an acquisition module configured to: acquiring a handwritten English image to be identified;
a preprocessing module configured to: preprocessing a handwritten English image to be recognized;
an identification module configured to: processing the preprocessed image by adopting a trained handwriting English line recognition model to obtain a handwriting English line recognition result;
wherein, the trained handwriting English line recognition model comprises:
extracting features of the preprocessed image to obtain preliminary visual features;
extracting depth features from the preliminary visual features to obtain depth visual features;
and decoding the depth visual characteristics to obtain the recognition result of the handwritten English line.
It should be noted that the above-mentioned obtaining module, preprocessing module and identifying module correspond to steps S101 to S103 in the first embodiment, and the above-mentioned modules are the same as examples and application scenarios implemented by the corresponding steps, but are not limited to the disclosure of the first embodiment. It should be noted that the modules described above may be implemented as part of a system in a computer system, such as a set of computer-executable instructions.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. The handwriting English line recognition method based on deep learning is characterized by comprising the following steps of:
acquiring a handwritten English image to be identified;
preprocessing a handwritten English image to be recognized;
processing the preprocessed image by adopting a trained handwriting English line recognition model to obtain a handwriting English line recognition result;
wherein, the trained handwriting English line recognition model comprises: extracting features of the preprocessed image to obtain preliminary visual features; extracting depth features from the preliminary visual features to obtain depth visual features; decoding the depth visual characteristics to obtain a recognition result of the handwritten English line;
wherein, the handwriting English line recognition model after training has a network structure comprising:
the coding module, the stacking gate module and the decoding module are connected in sequence;
the encoding module comprises: the first depth separable convolution network, the first layer standardized module and the connector are connected in sequence; the input end of the first depth separable convolution network is connected with the input end of the connector in a residual way; the input end of the first depth separable convolution network is used as the input end of the coding module; the output end of the connector is used as the output end of the coding module; the coding module is used for extracting the characteristics of the preprocessed image to obtain preliminary visual characteristics;
the stack door module includes: the input end of the first gate unit is connected with the output end of the connector, and the output end of the last gate unit is connected with the input end of the second-layer standardization module; the input end of the first gate unit is used as the input end of the stacking gate module, and the output end of the second layer of standardized module is used as the output end of the stacking gate module; the stacking door module is used for extracting depth features of the preliminary visual features to obtain the depth visual features;
the decoding module comprises: the second depth separable convolution network, the index linear unit, the third layer standardization module and the decoder are connected in sequence; the input end of the second depth separable convolution network is used as the input end of the decoding module, the input end of the second depth separable convolution network is connected with the output end of the second layer standardization module, the output end of the decoder is used as the output end of the decoding module, and the decoding module is used for decoding the depth visual characteristics to obtain the recognition result of the handwritten English line.
2. The deep learning-based handwritten english line recognition method of claim 1, wherein the first gate unit has an internal structure comprising:
the input end of the first gate unit is used for inputting the characteristic diagram output by the coding module; the input ends of the other gate units except the first gate unit are used for inputting the characteristic diagram output by the previous gate unit; the input end of the first gate unit is connected with three parallel branches, and the three parallel branches are a first branch, a second branch and a third branch in sequence;
the first branch circuit comprises: the second channel-by-channel convolution layer, the second point-by-point convolution layer, the exponential linear unit and the first multiplier are sequentially connected; the input end of the second channel-by-channel convolution layer is connected with the input end of the first gate unit;
the second branch includes: the input end of the activation function layer is connected with the input end of the first gate unit;
the third branch includes: the input end of the second multiplier is connected with the input end of the first gate unit;
the input end of the first multiplier is connected with the output end of the activation function layer; the output end of the first multiplier is connected with the input end of the first adder; the output layer of the activation function layer outputs a weight value a; processing the weight value to obtain 1-a; inputting 1-a to the input of the second multiplier; the output end of the second multiplier is connected with the input end of the first adder; the output end of the first adder is connected with the input end of the second adder; the input end of the second adder is also connected with the input end of the first gate unit; the output of the second adder serves as the output of the first gate unit.
3. The deep learning-based handwritten English line recognition method of claim 2, wherein the first branch is used for performing nonlinear conversion on input to obtain conversion characteristics; the second channel-by-channel convolution layer, the second point-by-point convolution layer and the exponential linear unit can be regarded as a combination, and the combination forms conversion together to obtain conversion characteristics which are used as the input of the first multiplier; the conversion feature refers to a feature obtained after nonlinear conversion of the input feature through the first branch.
4. The deep learning based handwritten English line recognition method of claim 2, wherein the second branch defines two gating signals that sum to one, used to model the relation between the input $x$ and its conversion $T(x)$ and to explore their degree of matching; the weight $a$ is learned through an activation function layer and used as the weight value of the conversion feature obtained by the first branch, representing the importance of the conversion feature to the output feature and controlling the degree to which the converted feature information is carried into the output feature information; the difference between 1 and $a$ then gives the weight $1-a$, used as the weight value of the original feature obtained by the third branch, representing the importance of the original feature to the output feature and controlling the degree to which the input feature information is carried into the output feature information; the weight $a$ obtained through the activation function layer is regarded as the conversion gate, and the weight $1-a$ obtained as the difference between 1 and $a$ is regarded as the retention gate.
5. The method for recognizing handwritten English line based on deep learning according to claim 4, wherein the first multiplier and the second multiplier are both used for multiplying elements at corresponding positions of two matrixes, and the first multiplier multiplies the conversion characteristics of the first branch by the conversion gate of the second branch to realize weighting; the second multiplier multiplies the original characteristic of the third branch with a retention gate of the second branch to realize weighting;
the first adder and the second adder are both used for adding elements at positions corresponding to the two matrixes, and the first adder sums the conversion characteristics of the weighted first branch and the original characteristics of the weighted third branch to obtain fusion characteristics; the second adder is seen as a residual connection, adding the original input to the output result as well.
6. The deep learning-based handwriting english line recognition method of claim 1, wherein the training process of the trained handwriting english line recognition model comprises:
constructing a training set, wherein the training set is a handwriting English image with known handwriting English line recognition results;
inputting the training set into a handwriting English line recognition model, training the model, and stopping training when the total loss function value of the model is not reduced any more, so as to obtain a trained handwriting English line recognition model;
the total loss function of the model is a weighted sum of the loss function of the decoder and the statistical loss function.
7. The handwriting English line recognition system based on deep learning is characterized by comprising:
an acquisition module configured to: acquiring a handwritten English image to be identified;
a preprocessing module configured to: preprocessing a handwritten English image to be recognized;
an identification module configured to: processing the preprocessed image by adopting a trained handwriting English line recognition model to obtain a handwriting English line recognition result;
wherein, the trained handwriting English line recognition model comprises: extracting features of the preprocessed image to obtain preliminary visual features; extracting depth features from the preliminary visual features to obtain depth visual features; decoding the depth visual characteristics to obtain a recognition result of the handwritten English line;
wherein, the handwriting English line recognition model after training has a network structure comprising:
the coding module, the stacking gate module and the decoding module are connected in sequence;
the encoding module comprises: the first depth separable convolution network, the first layer standardized module and the connector are connected in sequence; the input end of the first depth separable convolution network is connected with the input end of the connector in a residual way; the input end of the first depth separable convolution network is used as the input end of the coding module; the output end of the connector is used as the output end of the coding module; the coding module is used for extracting the characteristics of the preprocessed image to obtain preliminary visual characteristics;
the stack door module includes: the input end of the first gate unit is connected with the output end of the connector, and the output end of the last gate unit is connected with the input end of the second-layer standardization module; the input end of the first gate unit is used as the input end of the stacking gate module, and the output end of the second layer of standardized module is used as the output end of the stacking gate module; the stacking door module is used for extracting depth features of the preliminary visual features to obtain the depth visual features;
the decoding module comprises: the second depth separable convolution network, the index linear unit, the third layer standardization module and the decoder are connected in sequence; the input end of the second depth separable convolution network is used as the input end of the decoding module, the input end of the second depth separable convolution network is connected with the output end of the second layer standardization module, the output end of the decoder is used as the output end of the decoding module, and the decoding module is used for decoding the depth visual characteristics to obtain the recognition result of the handwritten English line.
CN202310084850.3A 2023-02-09 2023-02-09 Deep learning-based handwriting English line recognition method and system Active CN115797952B (en)

Priority Applications (1)

CN202310084850.3A — Priority date 2023-02-09 — Filing date 2023-02-09 — Deep learning-based handwriting English line recognition method and system

Publications (2)

CN115797952A (en) — published 2023-03-14
CN115797952B (en) — granted 2023-05-05

Family ID: 85430576

Country Status (1)

CN: CN115797952B (en)





Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination
GR01 — Patent grant