CN114782961A - Character image augmentation method based on shape transformation - Google Patents

Character image augmentation method based on shape transformation

Info

Publication number
CN114782961A
CN114782961A (application CN202210285238.8A)
Authority
CN
China
Prior art keywords
character image
layer
transformation
function
noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210285238.8A
Other languages
Chinese (zh)
Other versions
CN114782961B (en)
Inventor
黄双萍
黄鸿翔
杨代辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Provincial Laboratory Of Artificial Intelligence And Digital Economy Guangzhou
South China University of Technology SCUT
Original Assignee
Guangdong Provincial Laboratory Of Artificial Intelligence And Digital Economy Guangzhou
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Provincial Laboratory Of Artificial Intelligence And Digital Economy Guangzhou, South China University of Technology SCUT filed Critical Guangdong Provincial Laboratory Of Artificial Intelligence And Digital Economy Guangzhou
Priority to CN202210285238.8A priority Critical patent/CN114782961B/en
Publication of CN114782961A publication Critical patent/CN114782961A/en
Application granted granted Critical
Publication of CN114782961B publication Critical patent/CN114782961B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/18Image warping, e.g. rearranging pixels individually

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a character image augmentation method based on shape transformation, which comprises the following steps: constructing a shape-transformation generative adversarial network comprising a generator and a discriminator; taking an original character image as the input of the generator, which generates a deformed character image after spatial transformation; connecting the output of the generator to one input of the discriminator while feeding a target character image to the other input, the discriminator outputting the discrimination results of the deformed character image and the target character image; training the shape-transformation generative adversarial network; and generating augmented character images with the trained generator. The method combines affine-matrix and TPS sampling-grid parameters so that the STN generates global and local shape changes simultaneously, better fits the shape characteristics of characters, and improves the authenticity and diversity of the generated characters, thereby improving the classification performance of classifiers trained with the augmented data.

Description

Character image augmentation method based on shape transformation
Technical Field
The invention belongs to the technical field of image processing and artificial intelligence, and particularly relates to a character image augmentation method based on shape transformation.
Background
Text image recognition methods based on deep learning show great potential; however, training a high-performance character image recognition model requires a large amount of annotation data that is as diverse as possible. Manually collecting and labeling text image data is an extremely expensive and time-consuming task, especially for text images with complicated shapes such as ancient texts and handwritten formulas. In contrast, data augmentation is a cost-effective way to increase data diversity.
Data augmentation approaches include shape-transformation-based augmentation, such as flipping, rotating, scaling, cropping and translating the image, and non-shape-transformation-based augmentation, such as color jittering and noise injection. Conventional shape-transformation-based augmentation algorithms mainly sample deformation parameters at random from an artificially preset probability distribution to control the shape transformation of the image content. The most common approach is to use an affine matrix to generate affine transformations, as in the affine-transformation-based Spatial Transformer Network (STN) algorithm, or to generate local deformations with other non-rigid deformation algorithms such as the Thin Plate Spline (TPS) transformation. Because such algorithms rely on sampling from an artificially preset distribution, computing and selecting the distribution is complex, and the selected distribution is difficult to fit completely to the distribution of actual character shapes, which leads to high labor cost and poor realism of the deformation.
Although generative adversarial networks can remove the need to manually compute and select a distribution, existing techniques still suffer from insufficient deformation fineness. The actual character shape has both global characteristics (such as rotation angle and translation distance) and local characteristics (such as the distortion degree, length and thickness of strokes), while the prior art generates only a single global deformation or a single local deformation, so the authenticity and diversity of the transformed shapes are poor and the performance improvement that the expanded data brings to downstream tasks is limited. Neural-network-based augmentation techniques adopt the generative adversarial loss as the optimization target; however, because the adversarial loss only computes loss values for two kinds of labels, true and false, it provides merely weak supervision, and since character images have both global and local shape characteristics, the loss function needs stronger supervision to ensure the authenticity and diversity of the transformed character shapes. For example, studies have shown that a generative adversarial network may suppress the effect of the semantic vector or noise vector in the generator, which can result in too little deformation of the text (even no change) or too much (distorted shapes), breaking the balance between the diversity and reality of the character shape, so that data expansion brings only limited performance improvement or even adverse effects.
Disclosure of Invention
In view of the above, there is a need for a new character image augmentation method based on shape transformation that combines affine-matrix and TPS sampling-grid parameters so that the STN generates global and local shape changes simultaneously and the fineness of deformation is improved. A noise vector injection technique is introduced into the STN to enrich the diversity of the generated samples, and a diversity loss function and a signal-noise reconstruction loss function are designed to strengthen supervision of the network. The diversity loss function promotes the diversity of the deformation parameters and thereby increases the diversity of deformation; the signal-noise reconstruction loss function ensures signal-noise balance, so that the degree of deformation is kept within a reasonable range.
The invention discloses a character image augmentation method based on shape transformation, which comprises the following steps:
step 1, constructing a shape-transformation generative adversarial network comprising a generator and a discriminator;
step 2, taking the original character image as the input of the generator, which generates a deformed character image after spatial transformation; connecting the output of the generator to one input of the discriminator and feeding the target character image to the other input, the discriminator outputting the discrimination results of the deformed character image and the target character image;
step 3, training the shape-transformation generative adversarial network;
and step 4, generating augmented character images with the trained generator.
Specifically, the generator is a spatial transformation network and comprises an encoder, a predictor, a sampler, a noise reconstruction network and an image reconstruction network;
the encoder consists of a plurality of convolution modules connected in sequence, each convolution module comprising a two-dimensional convolution layer, a nonlinear activation layer and a pooling layer connected in sequence;
the predictor consists of a plurality of fully-connected modules connected in sequence followed by a final fully-connected layer, each fully-connected module comprising a fully-connected layer and a nonlinear activation layer, the number of output channels of the final fully-connected layer being set to the number of deformation parameters to be predicted;
the sampler maps the deformed character image pixel region to the original character image pixel region by applying matrix multiplication on a sampling grid;
the image reconstruction network consists of a plurality of fully-connected modules, a fully-connected layer and a plurality of transposed-convolution modules connected in sequence, each transposed-convolution module comprising a transposed convolution layer and a nonlinear activation layer connected in sequence;
the noise reconstruction network consists of a plurality of fully-connected modules and a fully-connected layer connected in sequence;
First, the original character image is taken as the input of the encoder, which extracts shape features from it and outputs a shape feature vector. A noise hidden vector is then randomly drawn from the standard normal distribution, the shape feature vector and the noise hidden vector are fused, and the fused hidden vector is input into the predictor, which is responsible for predicting the TPS transformation parameters and the affine transformation parameters; the TPS transformation parameters are the coordinate values of the TPS sampling-grid matching points, and the affine transformation parameters are converted into an affine transformation sampling grid. The TPS transformation sampling grid, the affine transformation sampling grid and the original character image are then input into the sampler, which outputs the deformed character image. Meanwhile, the output of the predictor is connected to the inputs of the image reconstruction network and the noise reconstruction network, which reconstruct the original character image and the noise hidden vector, respectively. Finally, the deformed character image output by the generator and the target character image are respectively input into the discriminator, which outputs the discrimination results of the deformed character image and the target character image.
The discriminator is based on the PatchGAN structure and consists of five convolution modules connected in sequence; each of the first four convolution modules comprises a two-dimensional convolution layer, an instance normalization layer and a LeakyReLU activation layer, and the last convolution module comprises a padding layer and a two-dimensional convolution layer.
Preferably, a least-squares adversarial loss function is constructed as the optimization target for training the shape-transformation generative adversarial network; its calculation formula is as follows:

$$\mathcal{L}_G = \mathbb{E}_{x}\big[(D(G(x)) - 1)^2\big] + \mathcal{L}_{SN} + \mathcal{L}_{div}$$

$$\mathcal{L}_D = \mathbb{E}_{y}\big[(D(y) - 1)^2\big] + \mathbb{E}_{x}\big[D(G(x))^2\big]$$

where $\mathcal{L}_G$ denotes the generator loss function, $D$ the discriminator, $G$ the generator, $\mathcal{L}_{SN}$ the signal-noise reconstruction loss function, $\mathcal{L}_{div}$ the diversity loss function, $\mathcal{L}_D$ the discriminator loss function, $x$ the original character image, $y$ the target character image, and $\mathbb{E}_x$ and $\mathbb{E}_y$ the corresponding mathematical expectations.
The signal-noise reconstruction loss function comprises a signal reconstruction sub-term, a noise reconstruction sub-term and a reconstruction error ratio term, and is calculated as follows:

$$\mathcal{L}_{SN} = \mathrm{MAE}(x, \hat{x}) + \mathrm{MAE}(z, \hat{z}) + \alpha \log\frac{\mathrm{MAE}(z, \hat{z})}{\mathrm{MAE}(x, \hat{x})}$$

where MAE denotes the mean absolute error, $x$ the original character image, $z$ the noise hidden vector, and $\hat{x}$ and $\hat{z}$ the original image and the noise hidden vector reconstructed by the image reconstruction network $R_x$ and the noise reconstruction network $R_z$, respectively; $\alpha$ is a dynamic coefficient: if the reconstruction error ratio term is greater than $\log M$, $\alpha = 1$; if the reconstruction error ratio term is less than $-\log M$, $\alpha = -1$; and if the reconstruction error ratio term is within the ideal range, i.e. $[-\log M, \log M]$, $\alpha = 0$, where $M$ denotes a hyperparameter.

The diversity loss function is calculated as follows:

$$\mathcal{L}_{div} = -\,\mathrm{MAE}\big(P(E(x) + z_1),\; P(E(x) + z_2)\big)$$

where $P$ denotes the predictor, $E$ denotes the encoder, and $z_1$ and $z_2$ respectively denote different noise hidden vectors drawn from the same Gaussian distribution.
Optionally, each convolution module in the encoder further includes a batch normalization layer located between the two-dimensional convolution layer and the nonlinear activation layer; the nonlinear activation function of the nonlinear activation layer in the encoder is the ReLU function, and the pooling operation of the pooling layer is max pooling.
Optionally, each fully-connected module in the predictor further comprises a batch normalization layer located between the fully-connected layer and the nonlinear activation layer; the nonlinear activation function in the predictor is the ReLU function.
Preferably, the number of output channels of the last fully-connected layer in the predictor is set to 132; 128 of the deformation parameters to be predicted are the coordinates of the 64 TPS sampling-grid matching points, and 4 are the element values of the affine transformation matrix.
Optionally, the transposed-convolution module further includes a batch normalization layer located between the transposed convolution layer and the nonlinear activation layer; the nonlinear activation function in the transposed-convolution module is the ReLU function.
Specifically, the sampler is implemented with the torch.nn.functional.grid_sample() method in PyTorch, and the affine parameters are converted into the affine transformation sampling grid with the torch.nn.functional.affine_grid() method.
More specifically, the goal of the TPS transformation is to solve a deformation function $f$ such that $f(x_i, y_i) = (x'_i, y'_i)$ and the bending energy function is minimized, where $(x_i, y_i)$ denotes the coordinates of the TPS sampling-grid matching points on the original character image, $(x'_i, y'_i)$ denotes the coordinates of the TPS sampling-grid matching points on the deformed character image, and $n$ is the number of TPS sampling-grid matching points. Assume that n matched point pairs of the two images have been acquired: $(x_1, y_1) \leftrightarrow (x'_1, y'_1)$, $(x_2, y_2) \leftrightarrow (x'_2, y'_2)$, …, $(x_n, y_n) \leftrightarrow (x'_n, y'_n)$.

The deformation function can be imagined as bending a thin metal plate so that it passes through the given n points; the energy function of bending the plate is expressed as:

$$E(f) = \iint_{\mathbb{R}^2} \left( \left(\frac{\partial^2 f}{\partial x^2}\right)^2 + 2\left(\frac{\partial^2 f}{\partial x\,\partial y}\right)^2 + \left(\frac{\partial^2 f}{\partial y^2}\right)^2 \right) dx\,dy$$

It can be proved that the thin-plate spline is the function with the minimum bending energy; the thin-plate spline function is:

$$f(x, y) = a_1 + a_2 x + a_3 y + \sum_{i=1}^{n} w_i\, U\big(\|(x_i, y_i) - (x, y)\|\big)$$

where $U$ is the basis function:

$$U(r) = r^2 \log r^2$$

The coefficients $a_1$, $a_2$, $a_3$ and $w_i$ are solved from the preset coordinates of the n TPS sampling-grid matching points and the offsets predicted by the predictor, whereby the specific expression of $f$ is obtained.
The sampling formula of the affine transformation sampling grid is as follows:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} scale \cdot \cos\theta & -scale \cdot \sin\theta & t_x \\ scale \cdot \sin\theta & scale \cdot \cos\theta & t_y \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}$$

where $scale$, $\theta$, $t_x$ and $t_y$ respectively denote the affine transformation parameters predicted by the predictor, and $(x, y)$ and $(x', y')$ are respectively the position coordinates of a pixel point before and after the transformation.
Preferably, all images have a pixel size of 64 × 64, the batch size is 64, the initial learning rate is 0.0001, the number of iterations of the adversarial network is 5000, the learning rate decays linearly to 1e-5 after 2500 iterations, and the adversarial network is optimized with the Adam optimizer.
Compared with the prior art, the invention has the beneficial effects that:
the method combines the affine matrix and the TPS to transform the sampling grid parameters, so that the STN can generate global and local shape change at the same time, the shape characteristics of the character can be better fitted, the authenticity and diversity of the generated character are better, and the classification performance of the classifier trained by using the augmented data is further improved.
The method of the invention introduces a noise vector injection technique into the STN to promote the diversity of the generated samples, and designs a signal-noise reconstruction loss function that ensures signal-noise balance together with a diversity loss function that produces rich shape transformations, providing stronger supervision for the training of the STN. The degree of deformation of the samples is thus more reasonable and rich, and using the augmented data to train a classifier improves the classifier's classification performance.
Drawings
FIG. 1 shows a schematic flow diagram of a method embodying the present invention;
fig. 2 is a schematic structural diagram illustrating operations of modules according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
For reference and clarity, the technical terms, abbreviations or acronyms used hereinafter are summarized as follows:
STN: Spatial Transformer Network.
TPS: Thin Plate Spline.
CNN: convolutional neural network.
FC network: fully-connected network.
PyTorch: a mainstream deep learning framework that encapsulates many commonly used deep-learning-related functions and classes.
ReLU/LeakyReLU: nonlinear activation functions.
Generative adversarial network: a generative network training framework based on the idea of the zero-sum game, comprising a generator and a discriminator.
Hidden vector: a vector in a random variable space.
The invention discloses a character image augmentation method based on shape transformation, which aims to solve the problems in the prior art described above.
Fig. 1 shows a schematic flow diagram of an embodiment of the invention. A character image augmentation method based on shape transformation includes the following steps:
step 1, constructing a shape-transformation generative adversarial network comprising a generator and a discriminator;
step 2, taking the original character image as the input of the generator, which generates a deformed character image after spatial transformation; connecting the output of the generator to one input of the discriminator while feeding a target character image to the other input, the discriminator outputting the discrimination results of the deformed character image and the target character image;
step 3, training the shape-transformation generative adversarial network;
and step 4, generating augmented character images with the trained generator.
As shown in fig. 2, the generator is a spatial transformation network comprising an encoder, a predictor, a sampler, a noise reconstruction network and an image reconstruction network. First, the original character image is taken as the input of the encoder, which extracts shape features from it and outputs a shape feature vector. A noise hidden vector is then randomly drawn from the standard normal distribution, the shape feature vector and the noise hidden vector are fused, and the fused hidden vector is input into the predictor, which is responsible for mapping out the TPS transformation parameters and the affine transformation parameters; the TPS transformation parameters are the coordinate values of the TPS sampling-grid matching points, and the affine transformation parameters are converted into an affine transformation sampling grid. The TPS transformation sampling grid, the affine transformation sampling grid and the original character image are then input into the sampler, which outputs the deformed character image. Meanwhile, the output of the predictor is connected to the inputs of the image reconstruction network and the noise reconstruction network, which reconstruct the original character image and the noise hidden vector, respectively. Then the deformed character image output by the generator and the target character image are input into the discriminator, which outputs the discrimination results of the deformed character image and the target character image.
Specifically, the present embodiment adopts the following steps to implement the inventive method.
1. Construct the original character data set to be augmented and the target character data set with the target shape characteristics, respectively.
2. Build a spatial transformation network to serve as the generator in the adversarial training; the specific steps are as follows:
(1) The spatial transformation network comprises three modules: an encoder, a predictor and a sampler. First, the encoder is constructed. It is composed of connected convolutional neural networks (CNNs), and the number of convolutional layers is generally more than 3; in this embodiment 4 convolution modules are connected in sequence, each comprising a two-dimensional convolution layer, a batch normalization layer, a nonlinear activation layer and a pooling layer, where the batch normalization layer is optional, the nonlinear activation function may be a ReLU or LeakyReLU function, and the pooling operation may be max pooling, average pooling or adaptive pooling; this embodiment uses the ReLU function and max pooling.
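For concreteness, the following is a minimal PyTorch sketch of such an encoder. The channel widths, the kernel size, the 128-dimensional feature vector and the final linear projection are assumptions for illustration; the patent fixes only the module structure (Conv2d, optional BatchNorm, ReLU, max pooling) and the 64 × 64 input size.

```python
import torch.nn as nn

def conv_module(in_ch, out_ch):
    # one encoder module: Conv2d -> BatchNorm (optional) -> ReLU -> max pooling
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class Encoder(nn.Module):
    def __init__(self, feat_dim=128):          # feat_dim is an assumed value
        super().__init__()
        self.features = nn.Sequential(
            conv_module(1, 32), conv_module(32, 64),
            conv_module(64, 128), conv_module(128, 128),
        )
        # four 2x pooling steps reduce the 64x64 input to 4x4
        self.fc = nn.Linear(128 * 4 * 4, feat_dim)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))
```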
(2) Next, the predictor is constructed. It is composed of a fully-connected (FC) neural network with generally more than 2 FC layers; the FC network of this embodiment comprises 3 FC modules connected in sequence, followed by one final FC layer. Each FC module comprises an FC layer, a batch normalization layer and a nonlinear activation layer, where the batch normalization layer is optional and the nonlinear activation function may be a ReLU or LeakyReLU function; this embodiment uses the ReLU function. The number of output channels of the final FC layer is set to the number of deformation parameters to be predicted, 132 in this embodiment: 128 parameters are the coordinates of the 8 × 8 = 64 TPS sampling-grid matching points, and 4 are the element values of the affine transformation matrix. Note that the number of TPS sampling-grid matching points may instead be the square of any integer smaller than the original image height or width.
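A corresponding sketch of the predictor, under the same caveats (the input dimension and hidden width are assumptions; the 132-way output split follows the text):

```python
import torch.nn as nn

class Predictor(nn.Module):
    # 3 FC modules (Linear -> BatchNorm1d -> ReLU) followed by a final Linear
    # with 132 outputs: 128 TPS control-point coordinates (8 x 8 points x 2)
    # plus 4 affine parameters.
    def __init__(self, in_dim=128, hidden=256, n_params=132):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(3):
            layers += [nn.Linear(d, hidden), nn.BatchNorm1d(hidden),
                       nn.ReLU(inplace=True)]
            d = hidden
        layers.append(nn.Linear(d, n_params))
        self.net = nn.Sequential(*layers)

    def forward(self, h):
        out = self.net(h)
        return out[:, :128], out[:, 128:]   # (TPS params, affine params)
```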
(3) Then the sampler is constructed. The sampler maps the deformed character image pixel region to the original character image pixel region by applying matrix multiplication on a sampling grid; the sampling is implemented with the torch.nn.functional.grid_sample() method in PyTorch, and the affine parameters are converted into the affine transformation sampling grid with the torch.nn.functional.affine_grid() method.
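A sketch of the sampler built on these two PyTorch methods. Interpreting the 4 affine parameters as (scale, θ, t_x, t_y) follows formula (5) below, and applying the affine grid before the TPS grid is an assumption; the patent only states that both grids and the original image enter the sampler.

```python
import torch
import torch.nn.functional as F

def warp(x, affine_params, tps_grid):
    """x: (B,1,H,W); affine_params: (B,4) = (scale, theta, tx, ty);
    tps_grid: (B,H,W,2) sampling grid produced by the TPS solver."""
    scale, theta, tx, ty = affine_params.unbind(dim=1)
    cos, sin = torch.cos(theta), torch.sin(theta)
    # 2x3 matrix for affine_grid: scaled rotation plus translation (cf. eq. 5)
    mat = torch.stack([
        torch.stack([scale * cos, -scale * sin, tx], dim=1),
        torch.stack([scale * sin,  scale * cos, ty], dim=1),
    ], dim=1)                                          # (B, 2, 3)
    grid_aff = F.affine_grid(mat, list(x.shape), align_corners=False)
    x = F.grid_sample(x, grid_aff, align_corners=False)
    return F.grid_sample(x, tps_grid, align_corners=False)
```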
(4) Finally, an image reconstruction network $R_x$ and a noise reconstruction network $R_z$ are constructed. $R_x$ consists of 3 FC modules, 1 FC layer and 4 transposed-convolution modules connected in sequence; $R_z$ comprises 3 FC modules and 1 FC layer connected in sequence. Each transposed-convolution module comprises 1 transposed convolution layer, 1 batch normalization layer and 1 nonlinear activation layer connected in sequence, where the batch normalization layer is optional and the nonlinear activation function may be a ReLU or LeakyReLU function; this embodiment uses the ReLU function. The role of the reconstruction networks is detailed in point 5 below.
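Minimal sketches of the two reconstruction networks. The patent specifies only the layer sequence; the hidden widths, the 4 × 4 starting resolution of the deconvolution stack, the sigmoid output and the input being the 132 predicted parameters (the predictor's output, per the connection described above) are assumptions.

```python
import torch
import torch.nn as nn

class NoiseReconstructor(nn.Module):
    # 3 FC modules followed by 1 FC layer; recovers the injected noise vector
    def __init__(self, in_dim=132, hidden=256, noise_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, noise_dim))

    def forward(self, p):
        return self.net(p)

class ImageReconstructor(nn.Module):
    # 3 FC modules + 1 FC layer, then 4 transposed-convolution modules that
    # upsample 4x4 feature maps back to the 64x64 original image
    def __init__(self, in_dim=132, hidden=256):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 128 * 4 * 4))

        def up(i, o):
            return nn.Sequential(
                nn.ConvTranspose2d(i, o, 4, stride=2, padding=1),
                nn.BatchNorm2d(o), nn.ReLU(inplace=True))

        self.deconv = nn.Sequential(
            up(128, 64), up(64, 32), up(32, 16),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1))

    def forward(self, p):
        return torch.sigmoid(self.deconv(self.fc(p).view(-1, 128, 4, 4)))
```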
(5) The working principle of the spatial transformation network is as follows. First, the original character image is taken as the input of the encoder, which extracts shape features from it and outputs a shape feature vector $v$. Next, a noise hidden vector $z$ is randomly drawn from the standard normal distribution, and $v$ and $z$ are fused, in this embodiment by direct summation; $v$ contains font feature information and serves to guarantee the authenticity of the output, while $z$ introduces a certain randomness and guarantees the diversity of the output. The fused hidden vector is input into the predictor, which is responsible for mapping out the TPS transformation parameters and the affine transformation parameters: the TPS transformation parameters are the coordinate values of the TPS sampling-grid matching points (the sampling grid has 8 × 8 = 64 grid matching points), and the affine transformation parameters are the element values of the affine transformation matrix (4 parameters in total). Next, the 4 affine transformation parameters are converted into an affine transformation sampling grid with the torch.nn.functional.affine_grid() method. Then, the TPS transformation sampling grid, the affine transformation sampling grid and the original image are input into the sampler, which outputs the deformed character image. We assume that n matched TPS sampling-grid point pairs of the two images have been acquired: $(x_1, y_1) \leftrightarrow (x'_1, y'_1)$, $(x_2, y_2) \leftrightarrow (x'_2, y'_2)$, …, $(x_n, y_n) \leftrightarrow (x'_n, y'_n)$.
In this embodiment, n is 64. The coordinate correspondence is calculated with the TPS transformation as follows. The objective of the TPS transformation is to solve a function $f$ such that $f(x_i, y_i) = (x'_i, y'_i)$ and the bending energy function is minimized, so that the other points of the image obtain a good transformation result through interpolation. The deformation function can be thought of as bending a thin metal plate through the given n points, and the energy function for bending the plate can be expressed as:

$$E(f) = \iint_{\mathbb{R}^2} \left( \left(\frac{\partial^2 f}{\partial x^2}\right)^2 + 2\left(\frac{\partial^2 f}{\partial x\,\partial y}\right)^2 + \left(\frac{\partial^2 f}{\partial y^2}\right)^2 \right) dx\,dy \quad (1)$$

It can be proved that the thin-plate spline is the function with the minimum bending energy; the thin-plate spline function is:

$$f(x, y) = a_1 + a_2 x + a_3 y + \sum_{i=1}^{n} w_i\, U\big(\|(x_i, y_i) - (x, y)\|\big) \quad (2)$$

where U is the basis function:

$$U(r) = r^2 \log r^2 \quad (3)$$

subject to the thin-plate spline constraints

$$\sum_{i=1}^{n} w_i = 0, \qquad \sum_{i=1}^{n} w_i x_i = \sum_{i=1}^{n} w_i y_i = 0 \quad (4)$$

In the above formulas, only $a_1$, $a_2$, $a_3$ and $w_i$ need to be obtained to determine $f$; they can be solved from the preset coordinates of the 64 TPS sampling-grid matching points and the offsets predicted by the predictor.
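The linear system behind this solution can be written compactly. The following sketch solves equations (2)–(4) for the coefficients and evaluates f on a dense grid usable with grid_sample; the normalization of coordinates to [-1, 1], the small clamp inside the basis function, and the grid_sample direction convention (for each output pixel, the grid stores the source coordinate to sample) are implementation assumptions.

```python
import torch

def tps_sampling_grid(ctrl_src, ctrl_dst, height, width):
    """ctrl_src: (n, 2) preset control points in [-1, 1]; ctrl_dst: (n, 2)
    shifted control points (preset + predicted offsets). Returns an
    (height, width, 2) sampling grid for torch.nn.functional.grid_sample."""
    n = ctrl_src.shape[0]

    def U(r2):
        # basis U(r) = r^2 log r^2, written in terms of r^2 (eq. 3)
        return r2 * torch.log(r2.clamp(min=1e-9))

    # assemble the TPS system L @ [w; a] = [ctrl_dst; 0] (eqs. 2 and 4)
    K = U(((ctrl_src[:, None] - ctrl_src[None]) ** 2).sum(-1))      # (n, n)
    P = torch.cat([torch.ones(n, 1), ctrl_src], dim=1)              # (n, 3)
    L = torch.zeros(n + 3, n + 3)
    L[:n, :n], L[:n, n:], L[n:, :n] = K, P, P.t()
    v = torch.cat([ctrl_dst, torch.zeros(3, 2)], dim=0)
    coeffs = torch.linalg.solve(L, v)        # rows 0..n-1: w_i; last 3: a

    # evaluate f at every output pixel to obtain the sampling grid
    ys, xs = torch.linspace(-1, 1, height), torch.linspace(-1, 1, width)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    pts = torch.stack([gx, gy], dim=-1).reshape(-1, 2)              # (H*W, 2)
    basis = U(((pts[:, None] - ctrl_src[None]) ** 2).sum(-1))       # (H*W, n)
    affine_part = torch.cat([torch.ones(pts.shape[0], 1), pts], 1)  # (H*W, 3)
    grid = basis @ coeffs[:n] + affine_part @ coeffs[n:]
    return grid.reshape(height, width, 2)
```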
Similarly, suppose $(x, y)$ and $(x', y')$ are respectively the positions of a pixel point before and after the transformation; the sampling formula of the affine transformation is:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} scale \cdot \cos\theta & -scale \cdot \sin\theta & t_x \\ scale \cdot \sin\theta & scale \cdot \cos\theta & t_y \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \quad (5)$$

where $scale$, $\theta$, $t_x$ and $t_y$ are the 4 affine transformation parameters predicted by the predictor.
Finally, the output of the predictor is connected to the inputs of the image reconstruction network and the noise reconstruction network, which reconstruct the original image and the noise hidden vector, respectively.
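The pieces sketched above can be tied together as follows. This is a sketch only: `base_ctrl` (the preset 8 × 8 control grid in [-1, 1] coordinates, shape (64, 2)) and feeding the concatenated 132 parameters to the reconstruction networks are assumptions.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, encoder, predictor, img_rec, noise_rec, base_ctrl):
        super().__init__()
        self.E, self.P = encoder, predictor
        self.Rx, self.Rz = img_rec, noise_rec
        self.register_buffer("base_ctrl", base_ctrl)   # preset 8x8 grid

    def forward(self, x):
        v = self.E(x)                        # shape feature vector
        z = torch.randn_like(v)              # noise hidden vector ~ N(0, I)
        tps_p, aff_p = self.P(v + z)         # fusion by direct summation
        offsets = tps_p.view(-1, 64, 2)      # predicted control-point offsets
        grids = torch.stack([
            tps_sampling_grid(self.base_ctrl, self.base_ctrl + o,
                              x.shape[2], x.shape[3])
            for o in offsets])
        x_def = warp(x, aff_p, grids)        # deformed character image
        params = torch.cat([tps_p, aff_p], dim=1)
        return x_def, self.Rx(params), self.Rz(params), z
```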
3. Build the discriminator in the adversarial training. The discriminator is based on the PatchGAN structure and consists of 5 convolution modules connected in sequence; each of the first 4 convolution modules comprises a two-dimensional convolution layer, an instance normalization layer and a LeakyReLU activation layer, and the last convolution module comprises an optional padding layer and a two-dimensional convolution layer.
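A PatchGAN-style sketch matching this description; the channel widths, kernel sizes and strides are assumptions, since the patent fixes only the module composition.

```python
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    # four Conv2d -> InstanceNorm2d -> LeakyReLU modules, then a padding
    # layer and a final Conv2d that scores overlapping image patches
    def __init__(self, in_ch=1, base=64):
        super().__init__()

        def block(i, o, stride):
            return nn.Sequential(
                nn.Conv2d(i, o, 4, stride=stride, padding=1),
                nn.InstanceNorm2d(o),
                nn.LeakyReLU(0.2, inplace=True))

        self.net = nn.Sequential(
            block(in_ch, base, 2), block(base, base * 2, 2),
            block(base * 2, base * 4, 2), block(base * 4, base * 8, 1),
            nn.ZeroPad2d(1),
            nn.Conv2d(base * 8, 1, 4, stride=1))

    def forward(self, img):
        return self.net(img)     # (B, 1, h', w') map of patch scores
```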
4. The generator takes the original character image $x$ as input and, after passing through the spatial transformation network, produces the deformed character image $G(x)$. The output of the generator is connected to one input of the discriminator while the target character image $y$ is fed to the other input, and the discriminator outputs the discrimination results of the deformed character image and the target character image.
5. Construct the signal-noise reconstruction loss function. This loss function consists of three sub-terms: a signal reconstruction sub-term, a noise reconstruction sub-term and a reconstruction error ratio term, calculated as follows:

$$\mathcal{L}_{SN} = \mathrm{MAE}(x, \hat{x}) + \mathrm{MAE}(z, \hat{z}) + \alpha \log\frac{\mathrm{MAE}(z, \hat{z})}{\mathrm{MAE}(x, \hat{x})} \quad (6)$$

where $\hat{x}$ and $\hat{z}$ respectively denote the original image and the noise hidden vector reconstructed by the networks $R_x$ and $R_z$. In the absence of strong supervision, the effect of the input information can be suppressed during neural network learning; to avoid this, reconstruction loss terms are designed for the shape information and the noise vector separately, so that neither the authenticity-preserving effect of the shape information nor the diversity-preserving effect of the noise can be suppressed. In addition, to keep the degree of transformation of the deformed font reasonable and controllable, a reconstruction error ratio is designed to balance the respective effects of shape information and noise, constrained by a hyperparameter M > 1. Here α is a dynamic coefficient. If the ratio term is greater than logM during training, we set α = 1: the noise reconstruction is then much worse than the signal reconstruction, meaning the effect of noise is being suppressed by the network, and gradient descent optimizes the positive ratio term. Conversely, when the ratio term is less than -logM, we set α = -1: the signal reconstruction is then much worse than the noise reconstruction, meaning the effect of noise is too prominent and the effect of the shape information is suppressed, and gradient descent optimizes the negative ratio term. If the ratio term is within the ideal range, i.e. [-logM, logM], α = 0 is set and no additional optimization is applied.
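A sketch of equation (6) with the dynamic coefficient. The orientation of the ratio (noise error over signal error) and the value of M are assumptions; the patent states only that M > 1.

```python
import math
import torch
import torch.nn.functional as F

def signal_noise_loss(x, x_rec, z, z_rec, M=10.0):
    l_sig = F.l1_loss(x_rec, x)        # signal reconstruction sub-term (MAE)
    l_noise = F.l1_loss(z_rec, z)      # noise reconstruction sub-term (MAE)
    ratio = torch.log(l_noise / (l_sig + 1e-8))   # reconstruction error ratio
    if ratio.item() > math.log(M):
        alpha = 1.0        # noise reconstruction lagging: descend on +ratio
    elif ratio.item() < -math.log(M):
        alpha = -1.0       # signal reconstruction lagging: descend on -ratio
    else:
        alpha = 0.0        # within [-logM, logM]: no extra optimization
    return l_sig + l_noise + alpha * ratio
```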
6. Construct the diversity loss function. Diversity is positively correlated with the difference between the transformation parameters corresponding to different noise hidden vectors, where the two sets of transformation parameters are respectively estimated from two signal-noise mixed hidden vectors injected with different noise. The diversity loss is defined as follows:

$$\mathcal{L}_{div} = -\,\mathrm{MAE}\big(P(E(x) + z_1),\; P(E(x) + z_2)\big) \quad (7)$$

where P denotes the predictor, E denotes the encoder, and $z_1$ and $z_2$ are different noise hidden vectors drawn from the same Gaussian distribution.
7. Construct the least-squares adversarial loss function as the optimization target of the adversarial training; the adversarial loss pulls the distribution of the deformed character images toward that of the target character images. The specific formula is:

$$\mathcal{L}_G = \mathbb{E}_x\big[(D(G(x)) - 1)^2\big] + \mathcal{L}_{SN} + \mathcal{L}_{div}, \qquad \mathcal{L}_D = \mathbb{E}_y\big[(D(y) - 1)^2\big] + \mathbb{E}_x\big[D(G(x))^2\big] \quad (8)$$

Through the adversarial training, the spatial transformation network can generate samples approximating the target character images.
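The least-squares terms of equation (8) in code form, a sketch following the standard least-squares GAN convention of target 1 for real images and 0 for generated ones:

```python
def discriminator_loss(D, real, fake):
    # real images pushed toward 1, generated images toward 0
    return ((D(real) - 1) ** 2).mean() + (D(fake.detach()) ** 2).mean()

def generator_adv_loss(D, fake):
    # the generator tries to make the discriminator output 1 on its samples
    return ((D(fake) - 1) ** 2).mean()
```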
8. All image pixel sizes are set to 64 × 64, the batch size is 64, the initial learning rate is 0.0001, the number of iterations of the network is 5000, the learning rate begins to decay linearly to 1e-5 after 2500 iterations, and the network is optimized with the Adam optimizer.
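The optimizers and schedule described here can be set up as below; `generator` and `discriminator` stand for the modules sketched earlier, and interpreting "iterations" as optimizer steps is an assumption.

```python
import torch

TOTAL_ITERS, DECAY_START = 5000, 2500
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

def lr_lambda(it):
    # constant 1e-4 for the first 2500 iterations,
    # then linear decay down to 1e-5 at iteration 5000
    if it < DECAY_START:
        return 1.0
    return 1.0 - 0.9 * (it - DECAY_START) / (TOTAL_ITERS - DECAY_START)

g_sched = torch.optim.lr_scheduler.LambdaLR(g_opt, lr_lambda)
d_sched = torch.optim.lr_scheduler.LambdaLR(d_opt, lr_lambda)
```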
9. Train the whole shape-transformation generative adversarial network according to the above settings; the trained generator can then be used to generate diverse augmented samples.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (10)

1. A character image augmentation method based on shape transformation is characterized by comprising the following steps:
step 1, constructing a shape-transformation generative adversarial network comprising a generator and a discriminator;
step 2, taking the original character image as the input of the generator, which generates a deformed character image after spatial transformation, and connecting the output of the generator to one input of the discriminator; inputting the target character image into the other input of the discriminator, the discriminator outputting the discrimination results of the deformed character image and the target character image;
step 3, training the shape-transformation generative adversarial network;
and step 4, generating augmented character images with the trained generator.
2. The method of claim 1, wherein the generator is a spatial transform network including an encoder, a predictor, a sampler, a noise reconstruction network, and an image reconstruction network;
the encoder consists of a plurality of convolution modules connected in sequence, each convolution module comprising a two-dimensional convolution layer, a nonlinear activation layer and a pooling layer connected in sequence;
the predictor consists of a plurality of fully-connected modules connected in sequence followed by a final fully-connected layer, each fully-connected module comprising a fully-connected layer and a nonlinear activation layer, the number of output channels of the final fully-connected layer being set to the number of deformation parameters to be predicted;
the sampler maps the deformed character image pixel region to the original character image pixel region by applying matrix multiplication on a sampling grid;
the image reconstruction network consists of a plurality of fully-connected modules, a fully-connected layer and a plurality of transposed-convolution modules connected in sequence, each transposed-convolution module comprising a transposed convolution layer and a nonlinear activation layer connected in sequence;
the noise reconstruction network consists of a plurality of fully-connected modules and a fully-connected layer connected in sequence;
the discriminator is based on the PatchGAN structure and consists of five convolution modules connected in sequence, wherein each of the first four convolution modules comprises a two-dimensional convolution layer, an instance normalization layer and a LeakyReLU activation layer, and the last convolution module comprises a padding layer and a two-dimensional convolution layer;
first, the original character image is taken as the input of the encoder, which extracts shape features from it and outputs a shape feature vector; a noise hidden vector is then randomly drawn from the standard normal distribution, the shape feature vector and the noise hidden vector are fused, and the fused hidden vector is input into the predictor, which is responsible for predicting the TPS transformation parameters and the affine transformation parameters, wherein the TPS transformation parameters are the coordinate values of the TPS sampling-grid matching points and the affine transformation parameters are converted into an affine transformation sampling grid; the TPS transformation sampling grid, the affine transformation sampling grid and the original character image are then input into the sampler, which outputs the deformed character image; meanwhile, the output of the predictor is connected to the inputs of the image reconstruction network and the noise reconstruction network, which reconstruct the original character image and the noise hidden vector, respectively; then, the deformed character image output by the generator and the target character image are respectively input into the discriminator, which outputs the discrimination results of the deformed character image and the target character image.
3. The character image augmentation method based on shape transformation as claimed in claim 2, wherein a least-squares adversarial loss function is constructed as the optimization target for training the shape-transformation generative adversarial network, and is calculated as follows:

$$\mathcal{L}_G = \mathbb{E}_x\big[(D(G(x)) - 1)^2\big] + \mathcal{L}_{SN} + \mathcal{L}_{div}$$

$$\mathcal{L}_D = \mathbb{E}_y\big[(D(y) - 1)^2\big] + \mathbb{E}_x\big[D(G(x))^2\big]$$

where $\mathcal{L}_G$ denotes the generator loss function, $D$ the discriminator, $G$ the generator, $\mathcal{L}_{SN}$ the signal-noise reconstruction loss function, $\mathcal{L}_{div}$ the diversity loss function, $\mathcal{L}_D$ the discriminator loss function, $x$ the original character image, $y$ the target character image, and $\mathbb{E}_x$ and $\mathbb{E}_y$ the corresponding mathematical expectations.
4. The method as claimed in claim 3, wherein the signal-noise reconstruction loss function includes a signal reconstruction sub-term, a noise reconstruction sub-term and a reconstruction error ratio term, and is calculated as follows:

$$\mathcal{L}_{SN} = \mathrm{MAE}(x, \hat{x}) + \mathrm{MAE}(z, \hat{z}) + \alpha \log\frac{\mathrm{MAE}(z, \hat{z})}{\mathrm{MAE}(x, \hat{x})}$$

where MAE denotes the mean absolute error, $z$ denotes the noise hidden vector, and $\hat{x}$ and $\hat{z}$ respectively denote the original image and the noise hidden vector reconstructed by the image reconstruction network $R_x$ and the noise reconstruction network $R_z$; $\alpha$ is a dynamic coefficient: if the reconstruction error ratio term is greater than $\log M$, $\alpha = 1$; if the reconstruction error ratio term is less than $-\log M$, $\alpha = -1$; and if the reconstruction error ratio term is within the ideal range, i.e. $[-\log M, \log M]$, $\alpha = 0$, where $M$ denotes a hyperparameter;

the diversity loss function is calculated as follows:

$$\mathcal{L}_{div} = -\,\mathrm{MAE}\big(P(E(x) + z_1),\; P(E(x) + z_2)\big)$$

where $P$ denotes the predictor, $E$ denotes the encoder, and $z_1$ and $z_2$ respectively denote different noise hidden vectors drawn from the same Gaussian distribution.
5. The method of claim 2, wherein each convolution module in the encoder further comprises a batch normalization layer disposed between the two-dimensional convolution layer and the nonlinear activation layer; the nonlinear activation function of the nonlinear activation layer in the encoder is the ReLU function, and the pooling operation of the pooling layer is max pooling;
each fully-connected module in the predictor further comprises a batch normalization layer located between the fully-connected layer and the nonlinear activation layer; the nonlinear activation function in the predictor is the ReLU function;
the transposed-convolution module further comprises a batch normalization layer located between the transposed convolution layer and the nonlinear activation layer; the nonlinear activation function in the transposed-convolution module is the ReLU function.
6. The method as claimed in claim 2, wherein the number of output channels of the last fully-connected layer in the predictor is set to 132; 128 of the deformation parameters to be predicted are the coordinates of the 64 TPS sampling-grid matching points, and 4 are the element values of the affine transformation matrix.
7. The character image augmentation method based on shape transformation according to claim 2 or 6, characterized in that the aim of the TPS transformation is to solve a deformation function $f$ such that $f(x_i, y_i) = (x'_i, y'_i)$ and the bending energy function is minimized, where $(x_i, y_i)$ denotes the coordinates of the TPS sampling-grid matching points on the original character image, $(x'_i, y'_i)$ denotes the coordinates of the TPS sampling-grid matching points on the deformed character image, and $n$ is the number of TPS sampling-grid matching points; assuming that n matched TPS sampling-grid coordinate pairs of the two images have been acquired, $(x_1, y_1) \leftrightarrow (x'_1, y'_1)$, $(x_2, y_2) \leftrightarrow (x'_2, y'_2)$, …, $(x_n, y_n) \leftrightarrow (x'_n, y'_n)$, the deformation function is imagined as bending a thin metal plate through the given n points, the energy function of bending the plate being expressed as:

$$E(f) = \iint_{\mathbb{R}^2} \left( \left(\frac{\partial^2 f}{\partial x^2}\right)^2 + 2\left(\frac{\partial^2 f}{\partial x\,\partial y}\right)^2 + \left(\frac{\partial^2 f}{\partial y^2}\right)^2 \right) dx\,dy$$

it can be proved that the thin-plate spline is the function with the minimum bending energy, the thin-plate spline function being:

$$f(x, y) = a_1 + a_2 x + a_3 y + \sum_{i=1}^{n} w_i\, U\big(\|(x_i, y_i) - (x, y)\|\big)$$

where $U(r) = r^2 \log r^2$ is the basis function; the coefficients $a_1$, $a_2$, $a_3$ and $w_i$ are obtained from the preset coordinates of the n TPS sampling-grid matching points and the offsets predicted by the predictor, whereby the specific expression of $f$ is obtained.
8. The method as claimed in claim 2 or 6, wherein the sampling formula of the affine transformation sampling grid is as follows:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} scale \cdot \cos\theta & -scale \cdot \sin\theta & t_x \\ scale \cdot \sin\theta & scale \cdot \cos\theta & t_y \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}$$

where $scale$, $\theta$, $t_x$ and $t_y$ respectively denote the affine transformation parameters predicted by the predictor, and $(x, y)$ and $(x', y')$ are respectively the position coordinates of a pixel point before and after the transformation.
9. The method as claimed in claim 2, wherein the sampler is implemented with the torch.nn.functional.grid_sample() method in PyTorch, and the affine parameters are converted into the affine transformation sampling grid with the torch.nn.functional.affine_grid() method.
10. The method of claim 2, wherein the pixel size of all images is 64 × 64, the batch size is 64, the initial learning rate is 0.0001, the number of iterations of the adversarial network is 5000, the learning rate begins to decay linearly to 1e-5 after 2500 iterations, and the adversarial network is optimized with the Adam optimizer.
CN202210285238.8A 2022-03-23 2022-03-23 Character image augmentation method based on shape transformation Active CN114782961B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210285238.8A CN114782961B (en) 2022-03-23 2022-03-23 Character image augmentation method based on shape transformation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210285238.8A CN114782961B (en) 2022-03-23 2022-03-23 Character image augmentation method based on shape transformation

Publications (2)

Publication Number Publication Date
CN114782961A true CN114782961A (en) 2022-07-22
CN114782961B CN114782961B (en) 2023-04-18

Family

ID=82424735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210285238.8A Active CN114782961B (en) 2022-03-23 2022-03-23 Character image augmentation method based on shape transformation

Country Status (1)

Country Link
CN (1) CN114782961B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399408A (en) * 2018-03-06 2018-08-14 李子衿 A deformed character correction method based on a deep spatial transformer network
CN111209497A (en) * 2020-01-05 2020-05-29 西安电子科技大学 DGA domain name detection method based on GAN and Char-CNN
CN111242241A (en) * 2020-02-17 2020-06-05 南京理工大学 Method for amplifying etched character recognition network training sample
CN111652332A (en) * 2020-06-09 2020-09-11 山东大学 Deep learning handwritten Chinese character recognition method and system based on two classifications
CN111915540A (en) * 2020-06-17 2020-11-10 华南理工大学 Method, system, computer device and medium for augmenting oracle character image
CN114037644A (en) * 2021-11-26 2022-02-11 重庆邮电大学 Artistic digital image synthesis system and method based on generation countermeasure network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399408A (en) * 2018-03-06 2018-08-14 李子衿 A deformed character correction method based on a deep spatial transformer network
CN111209497A (en) * 2020-01-05 2020-05-29 西安电子科技大学 DGA domain name detection method based on GAN and Char-CNN
CN111242241A (en) * 2020-02-17 2020-06-05 南京理工大学 Method for amplifying etched character recognition network training sample
CN111652332A (en) * 2020-06-09 2020-09-11 山东大学 Deep learning handwritten Chinese character recognition method and system based on two classifications
CN111915540A (en) * 2020-06-17 2020-11-10 华南理工大学 Method, system, computer device and medium for augmenting oracle character image
CN114037644A (en) * 2021-11-26 2022-02-11 重庆邮电大学 Artistic digital image synthesis system and method based on generation countermeasure network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Haobin: "Research on Oracle Bone Script Detection and Recognition Based on Deep Learning", China Masters' Theses Full-text Database, Philosophy and Humanities *

Also Published As

Publication number Publication date
CN114782961B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
Lei et al. Coupled adversarial training for remote sensing image super-resolution
CN110136063B (en) Single image super-resolution reconstruction method based on condition generation countermeasure network
Liang et al. Understanding mixup training methods
CN111563841B (en) High-resolution image generation method based on generation countermeasure network
Zhu et al. Data Augmentation using Conditional Generative Adversarial Networks for Leaf Counting in Arabidopsis Plants.
Liu et al. Very deep convolutional neural network based image classification using small training sample size
CN106447626A (en) Blurred kernel dimension estimation method and system based on deep learning
CN111080513B (en) Attention mechanism-based human face image super-resolution method
CN107590497A (en) Off-line Handwritten Chinese Recognition method based on depth convolutional neural networks
CN111695494A (en) Three-dimensional point cloud data classification method based on multi-view convolution pooling
CN112837224A (en) Super-resolution image reconstruction method based on convolutional neural network
CN114494003B (en) Ancient character generation method combining shape transformation and texture transformation
Guo et al. Multiscale semilocal interpolation with antialiasing
CN113096020B (en) Calligraphy font creation method for generating confrontation network based on average mode
CN113744136A (en) Image super-resolution reconstruction method and system based on channel constraint multi-feature fusion
CN113378949A (en) Dual-generation confrontation learning method based on capsule network and mixed attention
CN113240584A (en) Multitask gesture picture super-resolution method based on picture edge information
Lin et al. Generative adversarial image super‐resolution network for multiple degradations
CN114782961B (en) Character image augmentation method based on shape transformation
CN116797456A (en) Image super-resolution reconstruction method, system, device and storage medium
CN114155560B (en) Light weight method of high-resolution human body posture estimation model based on space dimension reduction
Jiang et al. Tcgan: Semantic-aware and structure-preserved gans with individual vision transformer for fast arbitrary one-shot image generation
CN112732943B (en) Chinese character library automatic generation method and system based on reinforcement learning
CN114140317A (en) Image animation method based on cascade generation confrontation network
CN115909045B (en) Two-stage landslide map feature intelligent recognition method based on contrast learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant