CN114529908A - Offline handwritten chemical reaction type image recognition technology - Google Patents

Offline handwritten chemical reaction formula image recognition technology

Info

Publication number: CN114529908A
Application number: CN202111629716.4A
Authority: CN (China)
Prior art keywords: image, attention, chemical reaction, neural network, output
Legal status: Pending (the legal status is an assumption, not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventors: 顾志文, 许磊磊, 徐华建, 袁顺杰, 施炎, 崔文冰, 汤敏伟, 李�真
Original assignee: Tianyi Electronic Commerce Co Ltd
Current assignee: Tianyi Electronic Commerce Co Ltd (the listed assignees may be inaccurate)
Application filed by Tianyi Electronic Commerce Co Ltd

Classifications

    • G - PHYSICS
      • G06 - COMPUTING; CALCULATING OR COUNTING
        • G06F - ELECTRIC DIGITAL DATA PROCESSING
          • G06F 18/2415 - Pattern recognition; classification techniques based on parametric or probabilistic models, e.g. likelihood ratio or false acceptance rate versus false rejection rate
        • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/044 - Neural networks; recurrent networks, e.g. Hopfield networks
          • G06N 3/047 - Probabilistic or stochastic networks
          • G06N 3/048 - Activation functions
          • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses an offline handwritten chemical reaction formula image recognition technique. It provides an effective means of recognizing complex offline handwritten chemical reaction formulas, with higher accuracy and better robustness. The invention can directly recognize complicated handwritten chemical formulas, whereas traditional network structures must segment and recognize characters in advance. It also achieves higher recognition accuracy on problems common in handwritten text, such as stroke adhesion, deformation, and varied fonts, and recognizes the spatial corner marks (super- and subscripts) and special symbols in chemical reaction formulas well. By applying an attention mechanism and assigning different weights to image features, the semantic information available to the network during recognition is enhanced and recognition accuracy is improved.

Description

Offline handwritten chemical reaction formula image recognition technology
Technical Field
The invention relates to the technical field of electronic information, in particular to an offline handwritten chemical reaction formula image recognition technique.
Background
With the spread of electronic information technology in education, chemical research, and online education, demand for recognition of offline handwritten chemical reaction formulas has grown. In the field of handwritten complex chemical reaction formula recognition, the mainstream approach is currently connectionist temporal classification, i.e. the CNN (convolutional neural network) + RNN (recurrent neural network) + CTC pipeline: the convolutional neural network extracts feature information from the image to form a feature matrix, and the recurrent neural network outputs characters and symbols from the features extracted by the CNN. The CTC algorithm is a loss-function formulation; replacing a Softmax loss with CTC removes the need to strictly align samples with labels during training. However, as attention-based neural networks have emerged, the shortcomings of CTC networks have become apparent. Because chemical reaction formulas contain spatial structures such as corner marks (super- and subscripts) and reaction conditions, CTC networks suffer, compared with attention networks, from lower recognition accuracy, model complexity, long training time, and poor robustness, and cannot be applied successfully in real environments.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing an offline handwritten chemical reaction formula image recognition technique: an effective means of recognizing complex offline handwritten chemical reaction formulas with higher accuracy and better robustness.
The invention provides the following technical scheme:
the invention provides an off-line handwritten chemical reaction type image recognition technology, which comprises the following steps:
s1: performing unified preprocessing on the offline image to ensure that the offline image has a fixed size and a fixed channel after the processing is finished;
s2: using an image coding module to perform feature extraction on the image processed in the step S1 to generate an image feature matrix;
s3: a decoupling image attention generating module is used, an S2 image feature matrix is used as input, and corresponding attention features are generated through a convolutional neural network;
s4: combining the image feature matrix in S2 and the attention features in S3 to generate image feature vectors containing different attentions; corresponding to assigning corresponding importance to different image features;
s5: and (3) a decoupling characteristic decoding module is used, the module combines the image characteristics containing different attention in S4 and the output of the previous time step as input by combining a planning sampling technology on the basis of a recurrent neural network, each sequence bit character output is obtained through an RNN recurrent neural network and a full-link network, and finally the whole sequence output is obtained.
As a preferred embodiment of the present invention, step S1 includes:
S1.1: scanning the original offline image with a convolutional neural network to obtain a feature matrix containing the image information.
As a preferred embodiment of the present invention, step S2 includes:
S2.1: sending the feature matrix obtained in step S1.1 to the attention feature extraction module and obtaining the corresponding attention matrix through convolution and deconvolution operations.
As a preferred embodiment of the present invention, step S3 includes:
S3.1: taking the inner product of the image feature matrix from S1.1 and the attention matrix from S2.1 and summing over spatial positions to obtain intermediate image feature vectors with different attention;
S3.2: using scheduled sampling with a continuously decaying probability, selecting either the output of the previous time step of the recurrent neural network or the true label value for encoding, and updating the hidden state vector of the recurrent neural network with the selected value and the intermediate vector from S3.1 as the input of the current time step;
S3.3: passing the hidden state vector from S3.2 through a fully-connected neural network, outputting the probability of each character, and selecting the maximum-probability character as the current output;
S3.4: concatenating all character outputs as the final chemical equation recognition result.
Compared with the prior art, the invention has the following beneficial effects:
Compared with the prior art, the invention can directly recognize complex handwritten chemical formulas, whereas traditional network structures must segment and recognize characters in advance. It also achieves higher recognition accuracy on problems common in handwritten text, such as stroke adhesion, deformation, and varied fonts, and recognizes the spatial corner marks and special symbols in chemical reaction formulas well.
By applying an attention mechanism and assigning different weights to image features, the invention enhances the semantic information available to the network during recognition and improves accuracy. In the conventional attention framework, however, attention is computed from the similarity between the current input features and the historical outputs of the decoding unit; once a historical output is wrong, the weights computed for the current features are also wrong, so chain errors inevitably accumulate.
The present network therefore adopts a decoupled attention mechanism: the coupling between the previous time step's output and the attention computation is severed, and an attention feature extraction module that takes only image features as input generates the attention matrix. Even if the output at the current RNN time step is wrong, neither the computation of the attention matrix nor the prediction of the next character is affected.
Furthermore, a recurrent neural network normally feeds the output of the previous time step in as the input of the next, looping until the whole text line is recognized. A recognition error at the current character is then fed into the next time step, corrupting all subsequent outputs and again accumulating error. This network adopts scheduled sampling: during training, the RNN module selects, with a certain probability, either the output of the previous time step or the true value (the label) as the input at the current step, so the network learns to correct errors even when the previous step's output was wrong. This both speeds up training and improves training accuracy.
Finally, the invention combines a residual network structure, a recurrent neural network, an attention mechanism, and scheduled sampling, and applies them to the recognition of chemical reaction formula sequences. A convolution module extracts features from the original image, so no complex image processing or hand-crafted feature extraction of the character region is needed: the image only has to be scaled to a fixed size and grey-level normalized. Compared with the existing CRNN + CTC structure, the network has fewer parameters, shorter training time, higher recognition accuracy, and better stability.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a general block diagram of the network of the present invention;
FIG. 2 is a parameter diagram of an image feature extraction module of the present invention;
FIG. 3 is a block diagram of attention feature extraction of the present invention;
FIG. 4 is a block diagram of a text feature decoding module of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation. Wherein like reference numerals refer to like parts throughout.
Example 1
As shown in figs. 1 to 4, the method for recognizing offline handwritten chemical reaction formulas provided in this embodiment combines decoupled attention with scheduled sampling and treats recognition of the chemical reaction formula as a text-sequence recognition problem: a CNN convolutional neural network extracts features from the image, an RNN recurrent neural network extracts context semantics within the text sequence, and an attention mechanism assigns different weight coefficients to the image features to improve recognition accuracy. In the implementation, the corner marks with spatial relationships and the other special symbols expressing reaction conditions are expressed with ordinary symbols. For example: the subscript in "H2" is written "H_2"; the reaction condition symbol (rendered as an image in the original) is expressed as "-"; the precipitate symbol "↓" is represented as "!"; and so on.
Fig. 1 is the overall flow chart of the network. As shown, it comprises three modules: an image feature extraction module, an attention feature extraction module, and a text feature decoding module, corresponding respectively to image feature extraction, attention weight assignment over features, and feature-based sequence recognition. The specific steps are as follows:
S1: Input a chemical reaction formula picture from the training set and convert it to a single-channel image of size (w, h, 1).
S2: the input picture is scaled to ensure that it is less than 2048 long or less than 192 wide. And filling the zoomed blank area with ground color to ensure that all the pictures are (192, 2048, 1) in size.
S3: dividing the picture into a training sample and a testing sample, and randomly splitting the training sample and the testing sample into two parts, wherein the two parts of samples are marked with corresponding label values.
S4: the training sample picture is input to a CNN convolution module for encoding, and the module structure is shown in fig. 2. Compared with the popular Resnet50, the current network has no obvious difference in recognition accuracy through experimental verification, but has the advantages of simple network structure, small parameters and high network convergence speed. The input and output of the network are fixed, the input is the single-channel image processed in the step S2, the size is (192, 2048, 1), the output is the image feature matrix extracted by the network, and the size is (3, 128, 512), wherein (3, 128) is the feature matrix size, and 512 is the number of channels.
S4: the image features extracted in S3 are sent to an attention feature extraction module, and the module structure is shown in fig. 3. Except that the last layer of the module adopts an inverse convolution layer on the Sigmoid activation function, other layers are of a symmetrical network structure. In the first half section of network, the feature extraction is carried out on the input features again through a positive convolution layer with a ReLU activation function, and in the second half section of network, the output of the previous layer and the output of the first half section of network with the same size are combined to be used as the current input to carry out the deconvolution operation. And finally, obtaining a final attention matrix by applying a deconvolution layer on the Sigmoid activation function. The notice matrix size is (3, 128, maxT), where maxT refers to the length of the text in the current input image label, i.e., the step size of the maximum time step in the RNN module.
S5: combining the image feature matrix obtained in the step S3 with the attention moment matrix obtained in the step S4, specifically combining the image feature matrix with the attention moment matrix in the following manner:
Figure BDA0003439625820000061
wherein, ctThe middle "t" represents the time step in the RNN, and also represents the attention weight for recognizing the t-th character in the text. A. thet,x,yDenotes the attention matrix, F, in S4x,yThe image feature matrix obtained in S3 is represented.
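The combination c_t = Σ_{x,y} A_{t,x,y} F_{x,y} is just a weighted sum over spatial positions; a minimal NumPy sketch with random stand-in tensors (maxT = 20 is chosen arbitrarily here):

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, maxT = 3, 128, 512, 20            # feature and attention sizes from the text

F = rng.standard_normal((H, W, C))         # image feature matrix, (3, 128, 512)
A = rng.random((H, W, maxT))               # attention matrix, (3, 128, maxT)

# c_t = sum over all spatial positions (x, y) of A[t, x, y] * F[x, y]
context = np.einsum('hwt,hwc->tc', A, F)   # one C-dim context vector per time step
print(context.shape)                       # (20, 512)
```

Each row of `context` is the attention-weighted image feature handed to the decoder at one time step.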
S6: the image feature information with different weights obtained in the step S5 is sent to the RNN Recurrent neural network Unit, and specifically, the invention adopts a GRU (Gate recovery Unit) as a feature decoder, and compared with the conventional RNN structure, the GRU can effectively solve the phenomena of gradient extinction and gradient explosion existing in the RNN, thereby saving more context semantic information with longer intervals. Compared with an LSTM (Long-Short term memory), the GRU has the advantages of simple structure and high convergence rate.
Specifically, as shown in fig. 4, there are two pieces of information input to the GRU unit, one is that the output or true tag at a time step of the GRU unit is re-encoded, which is denoted as "et-1In the present invention, since the faced character categories are English characters and special symbols with fixed quantity, and the quantity is not large, the one-hot coding technique is adopted to code the characters output at the last time step. Another input is the image feature with different attention obtained in S5, denoted as "ct”。
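One-hot encoding of e_{t-1} can be sketched as follows; the vocabulary here is hypothetical, since the patent only says the character set is a fixed, modest set of English characters and special symbols:

```python
import numpy as np

# Hypothetical vocabulary: digits, letters, and the flattened chemistry
# symbols ("_" for subscript, "!" for precipitate, etc.).
VOCAB = list("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
             "abcdefghijklmnopqrstuvwxyz+=_^!-()")
CHAR_TO_ID = {ch: i for i, ch in enumerate(VOCAB)}

def one_hot(ch):
    """Encode one character as e_{t-1}: a vector with a single 1."""
    vec = np.zeros(len(VOCAB), dtype=np.float32)
    vec[CHAR_TO_ID[ch]] = 1.0
    return vec

e = one_hot("H")
print(int(e.argmax()), float(e.sum()))   # index of 'H' in VOCAB, total mass 1.0
```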
Specifically, the invention adopts scheduled sampling to decide, at the current step, between the output of the GRU unit at the previous time step and the true label. In a conventional RNN recurrent structure, the output of the previous time step serves as the input of the current one; once a previous output is wrong, all subsequent units receive wrong input and fail. To strengthen robustness, the network learns during the training phase to correct errors at the current time step even when the previous step's output was wrong: the true label value is selected with a continuously decaying probability ε, and the output of the previous time step is selected with probability 1 − ε, encoded, and used as part of the input of the next time step. This enhances robustness and eliminates part of the error accumulation. The chosen ε is the following linearly decreasing decay function:
ε_i = max(ϵ, k − c·i)

where ϵ is a value between 0 and 1 denoting the minimum probability of selecting the true label, k is the intercept, c is the decay rate of the function, and i is the iteration number of the model.
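The decay schedule ε_i = max(ϵ, k − c·i) and the per-step sampling decision can be sketched directly; the values k = 1.0, c = 0.001, ϵ = 0.1 are illustrative, not taken from the patent:

```python
import random

def epsilon(i, k=1.0, c=0.001, eps_min=0.1):
    """Linearly decaying probability of feeding the ground-truth label:
    epsilon_i = max(eps_min, k - c * i)."""
    return max(eps_min, k - c * i)

def pick_input(true_label, prev_output, i, rng=random):
    """Scheduled sampling: ground truth with probability epsilon(i),
    the model's previous output otherwise."""
    return true_label if rng.random() < epsilon(i) else prev_output

print(epsilon(0), round(epsilon(500), 3), epsilon(5000))   # 1.0 0.5 0.1
```

Early in training the decoder almost always sees the true label; late in training it mostly sees its own (possibly wrong) previous output, which is what forces it to learn error correction.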
S7: inputting the GRU unit in S6 as a hidden state vector into a fully-connected neural network, and outputting the prediction probability of each character through a softmax function, wherein the output model prediction probability is as follows:
Figure BDA0003439625820000071
wherein p iskAnd (3) representing the output probability of the current classification category k, and n represents the total number of categories. exp (x) indicates that the element in parentheses is indexed,
Figure BDA0003439625820000072
the sum of the index values representing the output scores of all classification categories over a fully connected network.
After the model's softmax probabilities for the current input character are obtained, the class with the maximum probability value is selected as its optimal output. Finally, the prediction accuracy of the model is measured through a loss function, the result is fed back to the preceding network layers through the back-propagation algorithm, and the weight parameters of the network units are updated.
Specifically, the invention uses the logarithmic loss function (log loss) in the network as the measure of distance between the true and predicted results. Mathematically:

L(θ) = − Σ_{t=1}^{T} log P(g_t | I, θ)

where θ denotes all trainable network parameters, g_t the true label at step t, T the number of characters in the current text, and I the given current input feature. Since the log-loss function is differentiable, it can be minimized by gradient descent; a smaller loss value indicates that the predicted sequence is closer to the true sequence. In the specific training process, the Adam gradient method continuously adjusts the weights and biases of each neuron so that the loss function converges rapidly to a minimum.
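A minimal sketch of the log loss over a predicted sequence; the toy per-step probability vectors are illustrative:

```python
import numpy as np

def log_loss(probs_per_step, true_ids):
    """L(theta) = -sum_t log p_t[g_t]: negative log-probability the model
    assigns to the true character at each of the T time steps."""
    return -sum(np.log(p[g]) for p, g in zip(probs_per_step, true_ids))

# Two time steps; the model puts most mass on the correct class each time.
probs = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
good = log_loss(probs, [0, 1])   # true sequence matches the probability peaks
bad = log_loss(probs, [2, 2])    # true sequence misses the peaks
print(good < bad)                # True
```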
S8: and after training of all training samples is finished, inputting a test set sample, calculating the average identification accuracy, repeating the steps S1-S7, continuously repeating training and test verification until the identification rate meets the requirement, and when the accuracy of the test sample is stable, storing the current model parameters and settings to complete model construction.
In this embodiment, a linearly decreasing decay function is selected in step S7 as the preferred implementation; in other embodiments, other decay functions may be adopted according to the requirements of the application.
Furthermore, the invention performs text recognition of offline handwritten chemical formulas using a framework that combines a decoupled attention mechanism with scheduled sampling. The key technical points of the invention are as follows:
1. In the recognition scheme for offline handwritten chemical reaction formulas, an attention mechanism lets the model not only consider the information of the original image but also give different weights to different image features, introducing extra decoding information into the model and improving its recognition accuracy and stability. To counter the error accumulation inherent in attention-mechanism networks, a decoupled attention module and scheduled sampling are added: the dependency between the attention vector and historical decoding information is severed, and with a decaying probability either the true label or the output of the previous decoding unit is taken as the input of the next decoding unit, strengthening the network against error accumulation. This core distinguishes the present offline handwritten chemical reaction formula recognition scheme from other inventions and is the basis of its better recognition results.
2. The invention is an offline end-to-end text recognition scheme that can accurately recognize superscripts and subscripts carrying spatial information and reaction-condition symbols in chemical reaction formulas, which other similar inventions do not possess.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (4)

1. An offline handwritten chemical reaction formula image recognition technique, characterized by comprising the following steps:
S1: uniformly preprocess the offline image so that after processing it has a fixed size and a fixed number of channels;
S2: use an image encoding module to extract features from the image processed in S1 and generate an image feature matrix;
S3: use a decoupled image attention generation module that takes the S2 image feature matrix as input and generates the corresponding attention features through a convolutional neural network;
S4: combine the image feature matrix from S2 with the attention features from S3 to generate image feature vectors carrying different attention, i.e. assign corresponding importance to different image features;
S5: use a decoupled feature decoding module that, on the basis of a recurrent neural network combined with scheduled sampling, takes the attention-weighted image features from S4 together with the output of the previous time step as input; each character of the sequence is obtained through the RNN and a fully-connected network, and finally the whole output sequence is produced.
2. The offline handwritten chemical reaction formula image recognition technique of claim 1, wherein step S1 includes:
S1.1: scanning the original offline image with a convolutional neural network to obtain a feature matrix containing the image information.
3. The offline handwritten chemical reaction formula image recognition technique of claim 2, wherein step S2 includes:
S2.1: sending the feature matrix obtained in step S1.1 to the attention feature extraction module and obtaining the corresponding attention matrix through convolution and deconvolution operations.
4. The offline handwritten chemical reaction formula image recognition technique of claim 3, wherein step S3 includes:
S3.1: taking the inner product of the image feature matrix from S1.1 and the attention matrix from S2.1 and summing over spatial positions to obtain intermediate image feature vectors with different attention;
S3.2: using scheduled sampling with a continuously decaying probability, selecting either the output of the previous time step of the recurrent neural network or the true label value for encoding, and updating the hidden state vector of the recurrent neural network with the selected value and the intermediate vector from S3.1 as the input of the current time step;
S3.3: passing the hidden state vector from S3.2 through a fully-connected neural network, outputting the probability of each character, and selecting the maximum-probability character as the current output;
S3.4: concatenating all character outputs as the final chemical equation recognition result.
CN202111629716.4A 2021-12-28 2021-12-28 Offline handwritten chemical reaction type image recognition technology Pending CN114529908A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111629716.4A CN114529908A (en) 2021-12-28 2021-12-28 Offline handwritten chemical reaction type image recognition technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111629716.4A CN114529908A (en) 2021-12-28 2021-12-28 Offline handwritten chemical reaction type image recognition technology

Publications (1)

Publication Number Publication Date
CN114529908A true CN114529908A (en) 2022-05-24

Family

ID=81621715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111629716.4A Pending CN114529908A (en) 2021-12-28 2021-12-28 Offline handwritten chemical reaction type image recognition technology

Country Status (1)

Country Link
CN (1) CN114529908A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842486A (en) * 2022-07-04 2022-08-02 南昌大学 Handwritten chemical structural formula recognition method, system, storage medium and equipment


Similar Documents

Publication Publication Date Title
CN110738090B (en) System and method for end-to-end handwritten text recognition using neural networks
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN111626063B (en) Text intention identification method and system based on projection gradient descent and label smoothing
WO2023024412A1 (en) Visual question answering method and apparatus based on deep learning model, and medium and device
CN112733768B (en) Natural scene text recognition method and device based on bidirectional characteristic language model
CN109933808B (en) Neural machine translation method based on dynamic configuration decoding
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
CN110188827B (en) Scene recognition method based on convolutional neural network and recursive automatic encoder model
CN112446423A (en) Fast hybrid high-order attention domain confrontation network method based on transfer learning
CN112329760A (en) Method for recognizing and translating Mongolian in printed form from end to end based on space transformation network
CN113449801B (en) Image character behavior description generation method based on multi-level image context coding and decoding
CN112861976B (en) Sensitive image identification method based on twin graph convolution hash network
CN113221852B (en) Target identification method and device
CN111428750A (en) Text recognition model training and text recognition method, device and medium
CN113626589A (en) Multi-label text classification method based on mixed attention mechanism
CN114863091A (en) Target detection training method based on pseudo label
CN111858984A (en) Image matching method based on attention mechanism Hash retrieval
CN115761764A (en) Chinese handwritten text line recognition method based on visual language joint reasoning
CN115331073A (en) Image self-supervision learning method based on TransUnnet architecture
CN113806646A (en) Sequence labeling system and training system of sequence labeling model
CN113806645A (en) Label classification system and training system of label classification model
US12008826B2 (en) Method and apparatus for customized deep learning-based text correction
CN114529908A (en) Offline handwritten chemical reaction type image recognition technology
CN114048290A (en) Text classification method and device
CN113705730B (en) Handwriting equation image recognition method based on convolution attention and label sampling

Legal Events

Date Code Title Description
PB01 Publication