CN114463760B - Character image writing track recovery method based on double-stream coding - Google Patents

Info

Publication number: CN114463760B
Authority: CN (China)
Legal status: Active (granted)
Application number: CN202210363354.7A
Original language: Chinese (zh)
Other versions: CN114463760A (application publication)
Inventor
黄双萍
陈洲楠
杨代辉
梁景麟
彭政华
Assignees: Guangdong Provincial Laboratory Of Artificial Intelligence And Digital Economy Guangzhou; South China University of Technology SCUT
Application filed by Guangdong Provincial Laboratory Of Artificial Intelligence And Digital Economy Guangzhou and South China University of Technology SCUT
Priority application: CN202210363354.7A
Application publication: CN114463760A
Granted publication: CN114463760B

Classifications

    • G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/044 — Recurrent networks, e.g. Hopfield networks
    • G06N3/045 — Combinations of networks
    • G06N3/08 — Learning methods


Abstract

The invention discloses a method for recovering the writing trajectory of a character image based on dual-stream coding, comprising the following steps: resizing the character image to a preset size and binarizing it; constructing a dual-stream coding network whose input is the character image and whose output is the dual-stream fused encoding feature C; constructing a decoding network whose input is the dual-stream fused encoding feature C and whose output is the predicted writing-trajectory sequence; jointly training the dual-stream coding network and the decoding network to obtain a writing-trajectory recovery network model; and recovering the writing trajectory with the trained model. During encoding, the method extracts character features in the vertical and horizontal directions separately and downsamples them, which reduces the parameter count while retaining the glyph information needed for the subsequent decoder to reproduce the character shape accurately, effectively improving the writing-trajectory recovery performance.

Description

Character image writing track recovery method based on double-stream coding
Technical Field
The invention relates to the field of character-image pattern recognition, and in particular to a method for recovering the writing trajectory of a character image based on dual-stream coding.
Background
Text data can be roughly divided into two modalities, image data and writing-trajectory data, and text generation technology has developed mainly around these two. Character images are usually captured by devices such as scanners or cameras and stored as bitmap images; such data displays character shapes intuitively and is commonly used for displaying and reading text. Writing trajectories are captured by trajectory-recording interactive devices such as digital pens, handwriting tablets, or touch screens, are usually stored as sequences of pen-tip coordinate points, and may additionally record auxiliary information such as pen-tip pressure and speed. Writing-trajectory recovery from character images is a cross-modal text generation technique that aims to recover writing-motion trajectory information from a character image containing no trajectory information. It is often used as an important technical means for character recognition and data augmentation, and has great application potential in judicial handwriting identification, writing robots, font generation, and text special-effect generation.
The challenge of writing-trajectory recovery algorithms comes first from the complexity of glyph structures. Taking Chinese characters as an example, the national standard GB18030 encodes as many as 70,000 characters, among which structurally complex characters and easily confused character classes abound; a slight error by the recovery model may produce blurred glyphs, confused classes, or meaningless characters. The recovery algorithm must not only cope with the complexity of glyph structures but also learn the spatial distribution of pen-tip positions and the ordering among stroke points (the stroke order of Chinese characters). Consequently, generating writing trajectories is in general harder than generating ordinary character images. Moreover, because the trajectory-recovery task spans the image and trajectory-sequence modalities of text, the characteristics of both modalities and the complex mapping between them must be considered jointly, which makes the design of trajectory-recovery algorithms highly challenging.
Disclosure of Invention
In view of this, the present invention aims to provide a method for recovering the writing trajectory of a character image based on dual-stream coding, so as to solve the problems of weak feature representation, poor generalization, and low trajectory-recovery accuracy in prior-art writing-trajectory recovery.
The invention discloses a method for recovering the writing trajectory of a character image based on dual-stream coding, comprising the following steps:
Step 1: resize the character image to a preset size and binarize it;
Step 2: construct a dual-stream coding network whose input is the character image and whose output is the dual-stream fused encoding feature C;
Step 3: construct a decoding network whose input is the dual-stream fused encoding feature C and whose output is the predicted writing-trajectory sequence;
Step 4: jointly train the dual-stream coding network and the decoding network to obtain a writing-trajectory recovery network model;
Step 5: recover the writing trajectory with the trained model.
Specifically, the dual-stream coding network comprises a vertical convolutional recurrent neural network, a horizontal convolutional recurrent neural network, and an attention module.
The vertical and horizontal convolutional recurrent neural networks are connected in parallel, and each comprises a CNN encoder and a BiLSTM encoder. The CNN encoder of the vertical branch downsamples the input character image in the vertical direction and, together with convolution operations, encodes it into a one-dimensional directional feature f_h along the horizontal direction. Splitting f_h along its direction dimension yields a feature sequence ordered in that direction, which the BiLSTM encoder of the vertical branch encodes into the stream encoding feature e_h. Symmetrically, the CNN encoder of the horizontal branch downsamples the input character image in the horizontal direction and, together with convolution operations, encodes it into a one-dimensional directional feature f_v along the vertical direction; splitting f_v along its direction dimension yields a feature sequence that the BiLSTM encoder of the horizontal branch encodes into the stream encoding feature e_v.
The attention module fuses the stream encoding features e_h and e_v into the dual-stream fused encoding feature C:
C = Σ_{i=1}^{L} α_i · g_i,
α_i = exp(φ(g_i)) / Σ_{j=1}^{L} exp(φ(g_j)),
where the merged feature g is obtained by concatenating e_h and e_v, g_i is the i-th component of g, L is the length of g, α_i is the attention weight of g_i, and φ(·) denotes a fully connected layer with learnable parameters W.
Optionally, the downsampling operation is an asymmetric pooling operation, an asymmetric convolution operation, or downsampling by a fully connected layer.
optionally, the decoding network is an LSTM decoder, and the LSTM decoder uses dual-stream fusion coding features
Figure 743111DEST_PATH_IMAGE002
Sequentially predicting track points for input; LSTM decoder based on
Figure 898149DEST_PATH_IMAGE018
Predicted value of time
Figure 370719DEST_PATH_IMAGE019
And hidden layer vector
Figure 953010DEST_PATH_IMAGE020
Prediction of
Figure 347082DEST_PATH_IMAGE021
Track point information of time
Figure 40232DEST_PATH_IMAGE022
Figure 570570DEST_PATH_IMAGE023
Wherein, in the step (A),
Figure 323763DEST_PATH_IMAGE024
and
Figure 939552DEST_PATH_IMAGE025
to represent
Figure 233130DEST_PATH_IMAGE021
The position coordinates of the time of day,
Figure 880624DEST_PATH_IMAGE026
to represent
Figure 804718DEST_PATH_IMAGE021
The meaning of the state of the pen point at any moment and 3 states is: "the pen point is contacting with the paper surface", "the current stroke is finished, the temporary pen is lifted" and "all strokes are finished", finally,
Figure 173382DEST_PATH_IMAGE027
a sequence of trajectories is written for the predicted text.
Specifically, in jointly training the dual-stream coding network and the decoding network, the encoder-decoder loss function is:
L = λ_1 · L_2 + λ_2 · L_ce + λ_3 · L_dtw,
where λ_1, λ_2, and λ_3 are preset constants that balance the respective loss weights.
L_2 is the L2 loss, computed as:
L_2 = (1/N) · Σ_{t=1}^{N} [ (x_t − x̂_t)² + (y_t − ŷ_t)² ],
where x_t and y_t are the X- and Y-coordinate predictions of the decoding network, x̂_t and ŷ_t are the X- and Y-coordinate label values, and N is the number of trajectory points.
L_ce is the cross-entropy loss, computed as:
L_ce = −(1/N) · Σ_{t=1}^{N} log P(ŝ_t),
where P(ŝ_t) is the probability the decoding network predicts for the pen-tip state and ŝ_t is the pen-tip state label.
L_dtw is the dynamic time warping loss: a dynamic time warping algorithm finds the optimal alignment path between the predicted and label trajectory sequences, and the sequence distance under the optimal alignment path is computed as the global loss of the predicted sequence.
Given a predicted trajectory sequence P = (p_1, …, p_N) and a label trajectory sequence Q = (q_1, …, q_M), with sequence lengths N and M respectively, let the Euclidean distance function d(p_i, q_j) characterize the distance between trajectory points p_i and q_j. Define an alignment path A = (a_1, …, a_K), where K is the length of the alignment path and each item a_k = (i_k, j_k) defines a correspondence between p_{i_k} and q_{j_k}, subject to:
a_1 = (1, 1),
a_K = (N, M),
(i_{k+1} − i_k, j_{k+1} − j_k) ∈ {(0, 1), (1, 0), (1, 1)},
where i_k denotes the index of the i_k-th trajectory point of P and j_k the index of the j_k-th trajectory point of Q. The dynamic time warping (DTW) algorithm is used to find the alignment path that minimizes the sequence distance as the optimal alignment path, and the corresponding sequence distance is taken as the global loss of the predicted sequence:
L_dtw = min_A Σ_{k=1}^{K} d(p_{i_k}, q_{j_k}).
Preferably, the hidden state of a BiLSTM encoder in the dual-stream coding network is used as the initial hidden state h_0 of the LSTM decoder.
Preferably, λ_1 = 0.5, λ_2 = 1.0, and λ_3 = 1/6000.
Compared with the prior art, the method extracts character features in the vertical and horizontal directions separately during encoding and downsamples them, which reduces the parameter count while retaining the necessary glyph information, helps the subsequent decoder reproduce the character shape accurately, and ultimately improves the writing-trajectory recovery performance effectively.
Drawings
FIG. 1 shows a schematic flow diagram of a method embodying the present invention;
FIG. 2 shows a schematic structural diagram of a dual-stream coding network in an embodiment of the present invention;
fig. 3 shows a schematic structural diagram of a decoding network in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
For reference and clarity, the technical terms, abbreviations, and acronyms used hereinafter are summarized as follows:
CNN: Convolutional Neural Network;
RNN: Recurrent Neural Network;
CRNN: Convolutional Recurrent Neural Network;
BiLSTM: Bi-directional Long Short-Term Memory;
DTW: Dynamic Time Warping.
Fig. 1 shows a schematic flow diagram of an embodiment of the invention. A method for recovering the writing trajectory of a character image based on dual-stream coding comprises the following steps:
Step 1: resize the character image to a preset size and binarize it;
Step 2: construct a dual-stream coding network whose input is the character image and whose output is the dual-stream fused encoding feature C;
Step 3: construct a decoding network whose input is the dual-stream fused encoding feature C and whose output is the predicted writing-trajectory sequence;
Step 4: jointly train the dual-stream coding network and the decoding network to obtain a writing-trajectory recovery network model;
Step 5: recover the writing trajectory with the trained model.
The specific operation steps of this embodiment are as follows:
(1) Preprocess the input character image: resize it to a preset size while maintaining the aspect ratio, and binarize it.
(2) Construct the dual-stream coding network.
1) As shown in Fig. 2, construct two convolutional recurrent neural network (CRNN) branches, one vertical and one horizontal. Each contains a CNN encoder and a BiLSTM encoder. The two CNN encoders use asymmetric pooling operations in the vertical or horizontal direction, respectively, to downsample in that direction and, together with convolution operations, encode the input character image into the one-dimensional directional features f_h (along the horizontal direction) and f_v (along the vertical direction). Splitting f_h and f_v along their direction dimensions yields feature sequences ordered in those directions, which the BiLSTM encoders encode into the stream encoding features e_h and e_v.
2) Fuse the two features e_h and e_v with an attention mechanism to obtain the dual-stream fused encoding feature C:
C = Σ_{i=1}^{L} α_i · g_i,
α_i = exp(φ(g_i)) / Σ_{j=1}^{L} exp(φ(g_j)),
where the merged feature g is obtained by concatenating e_h and e_v, g_i and g_j are the i-th and j-th components of g, α_i is the attention weight of g_i, φ(·) denotes a fully connected layer with learnable parameters W, and L is the length of g.
(3) Construct the decoding network, perform feature decoding, and output the predicted writing-trajectory sequence.
1) Construct an LSTM decoder that takes the dual-stream fused encoding feature C as input and predicts trajectory points in turn. As shown in Fig. 3, based on the prediction p_{t-1} and hidden-state vector h_{t-1} at time t−1, the LSTM decoder predicts the trajectory-point information p_t at time t. Finally, (p_1, p_2, …, p_N) is the predicted writing-trajectory sequence. The hidden state of a BiLSTM encoder in the dual-stream coding network is used as the initial hidden state h_0 of the LSTM decoder.
2) For the trajectory-point information at time t, set p_t = (x_t, y_t, s_t), where x_t and y_t are the position coordinates at time t and s_t is the one-hot pen-tip state at time t, whose three states respectively represent: "the pen tip is touching the paper", "the current stroke has ended and the pen is lifted temporarily", and "all strokes have ended". In particular, the initial input trajectory point p_0 is set to a fixed starting value.
(4) Construct the loss function of the encoder-decoder network and train the model formed by the dual-stream coding network and the decoding network end to end. The encoder-decoder loss function comprises an L2 loss, a cross-entropy loss, and a dynamic time warping loss.
L2 loss:
L_2 = (1/N) · Σ_{t=1}^{N} [ (x_t − x̂_t)² + (y_t − ŷ_t)² ],
where x_t and y_t are the network's coordinate predictions, x̂_t and ŷ_t are the label values, and N is the number of trajectory points.
Cross-entropy loss:
L_ce = −(1/N) · Σ_{t=1}^{N} log P(ŝ_t),
where P(ŝ_t) is the probability the network predicts for the pen-tip state and ŝ_t is the label value.
Dynamic time warping loss: a dynamic time warping algorithm searches for the optimal alignment path between the predicted and label trajectory sequences, and the sequence distance under the optimal alignment path is computed as the global loss of the predicted sequence, thereby achieving global optimization of the trajectory sequence.
Given a predicted trajectory sequence P = (p_1, …, p_N) and a label trajectory sequence Q = (q_1, …, q_M), with sequence lengths N and M respectively, let the Euclidean distance function d(p_i, q_j) characterize the distance between trajectory points p_i and q_j. Define an alignment path A = (a_1, …, a_K) (where K is the length of the alignment path); each item a_k = (i_k, j_k) of the alignment path defines a correspondence between p_{i_k} and q_{j_k}, subject to:
a_1 = (1, 1),
a_K = (N, M),
(i_{k+1} − i_k, j_{k+1} − j_k) ∈ {(0, 1), (1, 0), (1, 1)}.
The dynamic time warping (DTW) algorithm is used to find the alignment path that minimizes the sequence distance as the optimal alignment path, and the corresponding sequence distance serves as the global loss of the predicted sequence:
L_dtw = min_A Σ_{k=1}^{K} d(p_{i_k}, q_{j_k}).
Encoder-decoder loss function:
L = λ_1 · L_2 + λ_2 · L_ce + λ_3 · L_dtw,
where λ_1, λ_2, and λ_3 are constants that balance the respective loss weights. In practice, we set λ_1, λ_2, and λ_3 to 0.5, 1.0, and 1/6000, respectively.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
All possible combinations of the technical features in the above embodiments may not be described for the sake of brevity, but should be considered as being within the scope of the present disclosure as long as there is no contradiction between the combinations of the technical features.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and amendments can be made without departing from the principle of the present invention, and these modifications and amendments should also be considered as the protection scope of the present invention.

Claims (6)

1. A method for restoring writing tracks of character images based on double-stream coding is characterized by comprising the following steps:
Step 1, adjusting a character image to a preset size and carrying out binarization processing;
step 2, constructing a double-stream coding network, wherein the double-stream coding network inputs character images and outputs character images with double-stream fusion coding characteristics
Figure DEST_PATH_IMAGE001
Step 3, constructing a decoding network, wherein the input of the decoding network is the double-current fusion coding characteristic
Figure 253170DEST_PATH_IMAGE001
Outputting a predicted character writing track sequence;
step 4, training a double-flow coding network and a decoding network in a combined manner to obtain a character image writing track recovery network model;
step 5, writing track recovery is carried out by utilizing the trained character image writing track recovery network model;
the double-stream coding network comprises a vertical convolution recurrent neural network, a horizontal convolution recurrent neural network and an attention module;
the vertical convolution recurrent neural network and the horizontal convolution recurrent neural network are connected in parallel, and each comprises a CNN encoder and a BiLSTM encoder; the CNN encoder in the vertical convolution recurrent neural network performs down-sampling in the vertical direction by using a vertical down-sampling operation and, in cooperation with convolution operations, encodes the input character image to obtain a one-dimensional directional feature $F_h$ of the character in the horizontal direction; the one-dimensional directional feature $F_h$ is split along its direction dimension to obtain a feature sequence whose time axis is that direction, and the BiLSTM encoder in the vertical convolution recurrent neural network encodes this feature sequence to obtain the stream coding feature $E_h$; the CNN encoder in the horizontal convolution recurrent neural network performs down-sampling in the horizontal direction by using a horizontal down-sampling operation and, in cooperation with convolution operations, encodes the input character image to obtain a one-dimensional directional feature $F_v$ of the character in the vertical direction; the one-dimensional directional feature $F_v$ is split along its direction dimension to obtain a feature sequence whose time axis is that direction, and the BiLSTM encoder in the horizontal convolution recurrent neural network encodes this feature sequence to obtain the stream coding feature $E_v$;
the attention module fuses the double-stream coding features $E_h$ and $E_v$ to obtain the double-stream fusion coding feature $C$:

$C = \sum_{i=1}^{L} \alpha_i e_i$, where $\alpha_i = \dfrac{\exp(f(e_i))}{\sum_{j=1}^{L} \exp(f(e_j))}$

wherein $E$ is obtained by concatenating the features $E_h$ and $E_v$; $e_i$ and $e_j$ are the $i$-th and $j$-th components of $E$; $\alpha_i$ denotes the attention weight of $e_i$ and $\alpha_j$ denotes the attention weight of $e_j$; $f(\cdot)$ denotes the function of a fully connected layer, $f(e_i) = W e_i$; $L$ is the length of $E$; and $W$ is the learnable parameter of the fully connected layer.
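The attention fusion described above can be sketched in plain NumPy. The feature dimensions and the single-layer linear scoring function $f(e) = W e$ are illustrative assumptions for the sketch, not the patented configuration:

```python
import numpy as np

def attention_fuse(E_h, E_v, W):
    """Fuse two stream coding feature sequences with attention.

    E_h: (L_h, d) horizontal-direction stream features.
    E_v: (L_v, d) vertical-direction stream features.
    W:   (d,) weight of a single fully connected scoring layer.
    Returns the fused feature C of shape (d,) and the attention weights.
    """
    E = np.concatenate([E_h, E_v], axis=0)         # (L, d), L = L_h + L_v
    scores = E @ W                                  # f(e_i) = W . e_i
    scores = scores - scores.max()                  # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # softmax attention weights
    return alpha @ E, alpha                         # C = sum_i alpha_i * e_i

rng = np.random.default_rng(0)
E_h = rng.standard_normal((4, 8))
E_v = rng.standard_normal((6, 8))
W = rng.standard_normal(8)
C, alpha = attention_fuse(E_h, E_v, W)
```

The weights form a distribution over all $L$ concatenated components, so both streams compete for attention in a single softmax.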
2. The character image writing track recovery method based on double-stream coding as claimed in claim 1, wherein the down-sampling operation is an asymmetric pooling operation, an asymmetric convolution operation, or a fully connected layer operation.
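A minimal NumPy sketch of the asymmetric pooling option, assuming a 2×1 max-pool kernel that halves only the height (the concrete kernel size is a choice the claim leaves open):

```python
import numpy as np

def asym_max_pool_vertical(x, k=2):
    """Max-pool a feature map only along the height axis (kernel k x 1).

    x: (H, W) feature map with H divisible by k.
    Returns an (H // k, W) map: the width (horizontal direction) is kept
    intact while the vertical dimension is progressively collapsed,
    yielding a one-dimensional horizontal feature after enough stages.
    """
    H, W = x.shape
    return x.reshape(H // k, k, W).max(axis=1)

x = np.arange(24, dtype=float).reshape(4, 6)
y = asym_max_pool_vertical(x)
```

The horizontal-stream encoder would use the transposed variant (1×k kernel) to collapse the width instead.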
3. The character image writing track recovery method based on double-stream coding as claimed in claim 1, wherein the decoding network is an LSTM decoder, and the LSTM decoder takes the double-stream fusion coding feature $C$ as input and predicts the track points sequentially; based on the predicted value $p_{t-1}$ and the hidden layer vector $h_{t-1}$ at time $t-1$, the LSTM decoder predicts the track point information $p_t$ at time $t$:

$p_t = (x_t, y_t, s_t)$, $(p_t, h_t) = \mathrm{LSTM}(p_{t-1}, h_{t-1}, C)$

wherein $x_t$ and $y_t$ denote the position coordinates at time $t$, $s_t$ denotes the pen-tip state at time $t$, and the meanings of its 3 states are: "the pen tip is in contact with the paper surface", "the current stroke is finished and the pen is temporarily lifted", and "all strokes are finished"; finally, $P = (p_1, p_2, \ldots, p_N)$ is the predicted character writing trajectory sequence.
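The sequential decoding can be illustrated with a toy autoregressive loop. The `toy_step` function below is a hypothetical stand-in for one trained LSTM cell update, kept deliberately simple to show the feedback and stopping logic only:

```python
PEN_DOWN, PEN_UP, END = 0, 1, 2  # the three pen-tip states

def decode(step, p0, h0, max_len=50):
    """Greedy autoregressive decoding: feed the previous prediction and
    hidden vector back into the step function until the
    'all strokes are finished' state is emitted."""
    points, p, h = [], p0, h0
    for _ in range(max_len):
        p, h = step(p, h)      # p_t, h_t = LSTM(p_{t-1}, h_{t-1})
        points.append(p)
        if p[2] == END:        # s_t signals that all strokes are finished
            break
    return points

# Hypothetical stand-in for a decoder step: move diagonally, stop at x >= 3.
def toy_step(p, h):
    x, y, s = p
    x, y = x + 1.0, y + 1.0
    return (x, y, END if x >= 3 else PEN_DOWN), h

traj = decode(toy_step, p0=(0.0, 0.0, PEN_DOWN), h0=None)
```

Each emitted tuple corresponds to one track point $(x_t, y_t, s_t)$ of the recovered writing trajectory.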
4. The character image writing track recovery method based on double-stream coding as claimed in claim 3, wherein in the process of jointly training the double-stream coding network and the decoding network, the loss function of the coding-decoding network is:

$\mathcal{L} = \lambda_1 \mathcal{L}_{2} + \lambda_2 \mathcal{L}_{CE} + \lambda_3 \mathcal{L}_{DTW}$

wherein $\lambda_1$, $\lambda_2$ and $\lambda_3$ are preset constants that balance the respective loss weights;

$\mathcal{L}_{2}$ is the L2 loss, whose calculation formula is:

$\mathcal{L}_{2} = \dfrac{1}{N} \sum_{t=1}^{N} \left[ (x_t - \hat{x}_t)^2 + (y_t - \hat{y}_t)^2 \right]$

wherein $x_t$ and $y_t$ are respectively the X-coordinate and Y-coordinate predicted values of the position output by the decoding network, $\hat{x}_t$ and $\hat{y}_t$ are respectively the label values of the X coordinate and the Y coordinate of the position, and $N$ is the number of track points;

$\mathcal{L}_{CE}$ is the cross-entropy loss, whose calculation formula is:

$\mathcal{L}_{CE} = -\dfrac{1}{N} \sum_{t=1}^{N} \log q_t(\hat{s}_t)$

wherein $q_t(\hat{s}_t)$ is the probability predicted by the decoding network for the pen-tip state $\hat{s}_t$, and $\hat{s}_t$ is the label value of the pen-tip state;

$\mathcal{L}_{DTW}$ is the dynamic time warping loss: an optimal alignment path between the predicted and label trajectory sequences is found by using a dynamic time warping algorithm, and the sequence distance under the optimal alignment path is taken as the global loss of the predicted sequence:

given a predicted trajectory sequence $P$ and a label trajectory sequence $\hat{P}$ whose sequence lengths are respectively $N$ and $M$, a Euclidean distance function $d(p_i, \hat{p}_j)$ is set for characterizing the distance between the track points $p_i$ and $\hat{p}_j$; an alignment path $\pi = (\pi_1, \pi_2, \ldots, \pi_K)$ is defined, wherein $\pi_k = (i_k, j_k)$ and $K$ is the length of the alignment path; each item of the alignment path defines a correspondence between $P$ and $\hat{P}$ subject to:

$\pi_1 = (1, 1)$, $\pi_K = (N, M)$, $\pi_{k+1} - \pi_k \in \{(1, 0), (0, 1), (1, 1)\}$

wherein $p_{i_k}$ denotes the $i_k$-th track point of $P$ and $\hat{p}_{j_k}$ denotes the $j_k$-th track point of $\hat{P}$; the dynamic time warping (DTW) algorithm searches for the alignment path that minimizes the sequence distance as the optimal alignment path, and the corresponding sequence distance is taken as the global loss of the predicted sequence:

$\mathcal{L}_{DTW} = \min_{\pi} \sum_{k=1}^{K} d\left(p_{i_k}, \hat{p}_{j_k}\right)$
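The minimum over alignment paths can be computed with the standard DTW dynamic program. This NumPy sketch uses the Euclidean point distance of the claim; the trajectory contents are made-up examples:

```python
import numpy as np

def dtw_loss(P, Q):
    """Dynamic time warping distance between trajectory sequences P (N, 2)
    and Q (M, 2): the minimum total Euclidean distance over all alignment
    paths that start at (1, 1), end at (N, M) and advance by one of the
    steps (1, 0), (0, 1) or (1, 1)."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    N, M = len(P), len(Q)
    D = np.full((N + 1, M + 1), np.inf)   # D[i, j]: best cost aligning P[:i], Q[:j]
    D[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            cost = np.linalg.norm(P[i - 1] - Q[j - 1])   # d(p_i, q_j)
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[N, M]

# A repeated label point costs nothing extra: the path simply stays on it,
# which is why DTW tolerates sequences of different lengths.
loss = dtw_loss([(0, 0), (1, 1)], [(0, 0), (1, 1), (1, 1)])
```

In training, a differentiable relaxation (e.g. replacing `min` with a soft minimum) would typically be needed for gradients to flow through the alignment.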
5. The character image writing track recovery method based on double-stream coding as claimed in claim 1, wherein the hidden layer state of the BiLSTM encoders in the double-stream coding network is used as the initial hidden layer state $h_0$ of the LSTM decoder.
6. The character image writing track recovery method based on double-stream coding as claimed in claim 4, wherein $\lambda_1$ takes the value 0.5, $\lambda_2$ takes the value 1.0, and $\lambda_3$ takes the value 1/6000.
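With the weights of claim 6, the joint loss of claim 4 combines as below; the three component values are made-up placeholders standing in for one batch's losses:

```python
# Joint loss of claim 4 with the claim 6 weights.
lambda_1, lambda_2, lambda_3 = 0.5, 1.0, 1 / 6000

l_l2, l_ce, l_dtw = 0.04, 0.2, 120.0   # hypothetical per-batch loss values
total = lambda_1 * l_l2 + lambda_2 * l_ce + lambda_3 * l_dtw
```

The small 1/6000 weight keeps the unnormalized, length-dependent DTW term from dominating the per-point L2 and cross-entropy terms.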
CN202210363354.7A 2022-04-08 2022-04-08 Character image writing track recovery method based on double-stream coding Active CN114463760B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210363354.7A CN114463760B (en) 2022-04-08 2022-04-08 Character image writing track recovery method based on double-stream coding


Publications (2)

Publication Number Publication Date
CN114463760A CN114463760A (en) 2022-05-10
CN114463760B true CN114463760B (en) 2022-06-28

Family

ID=81416905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210363354.7A Active CN114463760B (en) 2022-04-08 2022-04-08 Character image writing track recovery method based on double-stream coding

Country Status (1)

Country Link
CN (1) CN114463760B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117853378A (en) * 2024-03-07 2024-04-09 湖南董因信息技术有限公司 Text handwriting display method based on metric learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109410242A (en) * 2018-09-05 2019-03-01 华南理工大学 Method for tracking target, system, equipment and medium based on double-current convolutional neural networks
CN110188669A (en) * 2019-05-29 2019-08-30 华南理工大学 A kind of aerial hand-written character track restoration methods based on attention mechanism
WO2021136144A1 (en) * 2019-12-31 2021-07-08 中兴通讯股份有限公司 Character restoration method and apparatus, storage medium, and electronic device
CN114428866A (en) * 2022-01-26 2022-05-03 杭州电子科技大学 Video question-answering method based on object-oriented double-flow attention network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11158055B2 (en) * 2019-07-26 2021-10-26 Adobe Inc. Utilizing a neural network having a two-stream encoder architecture to generate composite digital images
CN111104532B (en) * 2019-12-30 2023-04-25 华南理工大学 RGBD image joint recovery method based on double-flow network
CN111626238B (en) * 2020-05-29 2023-08-04 京东方科技集团股份有限公司 Text recognition method, electronic device and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109410242A (en) * 2018-09-05 2019-03-01 华南理工大学 Method for tracking target, system, equipment and medium based on double-current convolutional neural networks
CN110188669A (en) * 2019-05-29 2019-08-30 华南理工大学 A kind of aerial hand-written character track restoration methods based on attention mechanism
WO2021136144A1 (en) * 2019-12-31 2021-07-08 中兴通讯股份有限公司 Character restoration method and apparatus, storage medium, and electronic device
CN114428866A (en) * 2022-01-26 2022-05-03 杭州电子科技大学 Video question-answering method based on object-oriented double-flow attention network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Shuangping Huang et al.; "OBC306: A Large-Scale Oracle Bone Character Recognition Dataset"; 2019 International Conference on Document Analysis and Recognition; 2020-02-03; pp. 681-688 *

Also Published As

Publication number Publication date
CN114463760A (en) 2022-05-10

Similar Documents

Publication Publication Date Title
KR102473543B1 (en) Systems and methods for digital ink interaction
Kosmala et al. On-line handwritten formula recognition using hidden Markov models and context dependent graph grammars
Zhelezniakov et al. Online handwritten mathematical expression recognition and applications: A survey
CN111553350A (en) Attention mechanism text recognition method based on deep learning
Jain et al. Unconstrained OCR for Urdu using deep CNN-RNN hybrid networks
CN113673432A (en) Handwriting recognition method, touch display device, computer device and storage medium
CN111046771A (en) Training method of network model for recovering writing track
CN114463760B (en) Character image writing track recovery method based on double-stream coding
Gan et al. In-air handwritten Chinese text recognition with temporal convolutional recurrent network
US11837001B2 (en) Stroke attribute matrices
JP6055065B1 (en) Character recognition program and character recognition device
US20050276480A1 (en) Handwritten input for Asian languages
CN111738167A (en) Method for recognizing unconstrained handwritten text image
Choudhury et al. Trajectory-based recognition of in-air handwritten Assamese words using a hybrid classifier network
CN113435398B (en) Signature feature identification method, system, equipment and storage medium based on mask pre-training model
CN114757969B (en) Character and image writing track recovery method based on global tracking decoding
Xu et al. On-line sample generation for in-air written chinese character recognition based on leap motion controller
CN115620314A (en) Text recognition method, answer text verification method, device, equipment and medium
Bezine et al. Handwriting perceptual classification and synthesis using discriminate HMMs and progressive iterative approximation
Assaleh et al. Recognition of handwritten Arabic alphabet via hand motion tracking
Alwajih et al. DeepOnKHATT: an end-to-end Arabic online handwriting recognition system
Tan et al. An End-to-End Air Writing Recognition Method Based on Transformer
CN113673635B (en) Hand-drawn sketch understanding deep learning method based on self-supervision learning task
WO2022180725A1 (en) Character recognition device, program, and method
Shi et al. In-air Handwritten English Word Recognition Based on Corner Point Feature Fusion and Contrastive Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant