CN114782961B - Character image augmentation method based on shape transformation
- Publication number: CN114782961B
- Application number: CN202210285238.8A
- Authority: CN (China)
- Prior art keywords: character image, layer, transformation, function, discriminator
- Prior art date
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/24: Classification techniques
- G06N3/045: Combinations of networks
- G06N3/048: Activation functions
- G06N3/08: Learning methods
- G06T3/18: Image warping, e.g. rearranging pixels individually
Abstract
The invention discloses a character image augmentation method based on shape transformation, comprising the following steps: constructing a shape-transformation generative adversarial network comprising a generator and a discriminator; taking an original character image as the input of the generator, which outputs a deformed character image after spatial transformation; connecting the output of the generator to one input of the discriminator while feeding a target character image to the other input, so that the discriminator outputs discrimination results for the deformed and target character images; training the shape-transformation generative adversarial network; and generating augmented character images with the trained generator. The method combines affine-matrix and TPS sampling-grid parameters so that the STN can produce global and local shape changes simultaneously, better fitting the shape characteristics of characters; the generated characters are more realistic and diverse, which in turn improves the classification performance of classifiers trained on the augmented data.
Description
Technical Field
The invention belongs to the technical field of image processing and artificial intelligence, and particularly relates to a character image augmentation method based on shape transformation.
Background
Text image recognition methods based on deep learning exhibit great potential; however, training a high-performance character image recognition model requires a large amount of annotated data that is as diverse as possible. Manually collecting and labeling text image data is an extremely expensive and time-consuming task, especially for text images with complicated shapes such as ancient characters and handwritten formulas. In contrast, data augmentation is a cost-effective way to increase data diversity.
Data augmentation modes include augmentation based on shape transformation, such as flipping, rotating, scaling, cropping and translating the image, and augmentation based on non-shape transformation, such as color jittering and noise injection. Conventional shape-transformation augmentation algorithms mainly sample deformation parameters at random from an artificially preset probability distribution to control the shape transformation of image content. The most common approach is to use an affine matrix to generate affine transformations, as in the affine-transformation-based Spatial Transformer Network (STN), or to generate local deformations with other non-rigid deformation algorithms, such as the Thin Plate Spline (TPS) transformation. However, because these algorithms rely on an artificially preset distribution, the calculation and selection of the distribution are complex, and the selected distribution can hardly fit the actual distribution of character shapes completely, leading to high labor cost and unrealistic deformations.
Although generative adversarial networks can remove the need to manually select a distribution, existing techniques still suffer from insufficient deformation fineness. Actual character shapes have both global characteristics (such as rotation angle and translation distance) and local characteristics (such as the twist, length and thickness of strokes), whereas the prior art generates only a single global deformation or a single local deformation, so the authenticity and diversity of the transformed shapes are poor and the performance gain that the expanded data brings to downstream tasks is limited. Neural-network-based augmentation techniques adopt the adversarial loss as the optimization target, but this loss computes values for only two labels, true and false, and therefore provides only weak supervision; since character images have both global and local shape characteristics, stronger supervision is needed to ensure the authenticity and diversity of the transformed characters. For example, studies have shown that a generative adversarial network may suppress the effect of semantic or noise vectors in the generator, which can cause the text to deform too little (even not at all) or too much (distorting the shape) and break the balance between the diversity and the realism of the text shape, so that data expansion brings only limited performance improvement or even adverse effects.
Disclosure of Invention
In view of the above, there is a need for a new character image augmentation method based on shape transformation that combines affine-matrix and TPS sampling-grid parameters so that the STN generates global and local shape changes simultaneously, improving the fineness of the deformation. A noise-vector injection technique is introduced into the STN to enrich the diversity of the generated samples, and a diversity loss function and a signal-noise reconstruction loss function are designed to strengthen the supervision of the network. The diversity loss function promotes the diversity of the deformation parameters, thereby increasing the diversity of the deformation; the signal-noise reconstruction loss function maintains signal-noise balance so that the degree of deformation stays within a reasonable range.
The invention discloses a character image augmentation method based on shape transformation, comprising the following steps:
Step 1, constructing a shape-transformation generative adversarial network comprising a generator and a discriminator;
Step 2, taking the original character image as the input of the generator, which outputs a deformed character image after spatial transformation; connecting the output of the generator to one input of the discriminator while feeding the target character image to the other input, the discriminator outputting discrimination results for the deformed and target character images;
Step 3, training the shape-transformation generative adversarial network;
Step 4, generating augmented character images with the trained generator.
Specifically, the generator is a spatial transformer network comprising an encoder, a predictor, a sampler, a noise reconstruction network and an image reconstruction network.
The encoder consists of several sequentially connected convolution modules, each comprising a two-dimensional convolution layer, a nonlinear activation layer and a pooling layer connected in sequence.
The predictor consists of several fully-connected modules followed by a final fully-connected layer; each fully-connected module comprises a fully-connected layer and a nonlinear activation layer, and the number of output channels of the final fully-connected layer is set to the number of deformation parameters to be predicted.
The sampler maps the pixel region of the deformed character image to the pixel region of the original character image by applying matrix multiplication on a sampling grid.
The image reconstruction network consists of several fully-connected modules, a fully-connected layer and several transposed convolution modules connected in sequence; each transposed convolution module comprises a transposed convolution layer and a nonlinear activation layer connected in sequence.
The noise reconstruction network consists of several fully-connected modules and a fully-connected layer connected in sequence.
First, the original character image is fed to the encoder, which extracts shape features and outputs a shape feature vector. A noise latent vector is then drawn at random from the standard normal distribution and fused with the shape feature vector; the fused latent vector is input to the predictor, which predicts the TPS (thin plate spline) transformation parameters and the affine transformation parameters. The TPS parameters are the coordinate values of the TPS sampling-grid matching points, while the affine parameters are converted into an affine transformation sampling grid. The TPS sampling grid, the affine sampling grid and the original character image are then input to the sampler, which outputs the deformed character image. Meanwhile, the output of the predictor is connected to the inputs of the image reconstruction network and the noise reconstruction network, which reconstruct the original character image and the noise latent vector respectively. Finally, the deformed character image output by the generator and the target character image are input to the discriminator, which outputs their discrimination results.
The discriminator is based on the PatchGAN structure and consists of five sequentially connected convolution modules; each of the first four comprises a two-dimensional convolution layer, an instance normalization layer and a LeakyReLU activation layer, and the last comprises a padding layer and a two-dimensional convolution layer.
Preferably, a least-squares adversarial loss function is constructed as the optimization target for training the shape-transformation generative adversarial network, calculated as:

$$\mathcal{L}(G_g)=\mathbb{E}_x\big[(D_g(G_g(x))-1)^2\big]+\mathcal{L}_{snr}(G_g)+\mathcal{L}_{div}(E,P)$$
$$\mathcal{L}(D_g)=\mathbb{E}_y\big[(D_g(y)-1)^2\big]+\mathbb{E}_x\big[D_g(G_g(x))^2\big]$$

where $\mathcal{L}(G_g)$ denotes the generator loss function, $D_g$ the discriminator, $G_g$ the generator, $\mathcal{L}_{snr}(G_g)$ the signal-noise reconstruction loss function, $\mathcal{L}_{div}(E,P)$ the diversity loss function, $\mathcal{L}(D_g)$ the discriminator loss function, $x$ the original character image, $y$ the target character image, and $\mathbb{E}_x$, $\mathbb{E}_y$ the corresponding mathematical expectations.
The signal-noise reconstruction loss function comprises a signal reconstruction term, a noise reconstruction term and a reconstruction error ratio term, calculated as:

$$\mathcal{L}_{snr}(G_g)=L_1(x,x_{rec})+L_1(z,z_{rec})+\alpha\log\frac{L_1(z,z_{rec})}{L_1(x,x_{rec})}$$

where $L_1$ denotes the mean absolute error, $x$ the original character image, $z$ the noise latent vector, and $x_{rec}$, $z_{rec}$ the original image and noise latent vector reconstructed by the image reconstruction network $R_x$ and the noise reconstruction network $R_z$ respectively; $\alpha$ is a dynamic coefficient: if the reconstruction error ratio term is greater than $\log M$, set $\alpha=1$; if it is less than $-\log M$, set $\alpha=-1$; if it is within the ideal range $[-\log M,\log M]$, set $\alpha=0$, where $M$ denotes a hyperparameter.
the calculation formula of the diversity loss function is as follows:
whereinPA predictor is represented by a representation of the motion vector,Ewhich represents the encoder, is a digital representation of the encoder,and &>Respectively representing different noise hidden vectors taken from the same gaussian distribution.
Optionally, each convolution module in the encoder further includes a batch normalization layer between the two-dimensional convolution layer and the nonlinear activation layer; the nonlinear activation function of the nonlinear activation layer in the encoder is the ReLU function, and the pooling operation of the pooling layer is max pooling.
Optionally, each fully-connected module in the predictor further comprises a batch normalization layer between the fully-connected layer and the nonlinear activation layer; the nonlinear activation function in the predictor is the ReLU function.
Preferably, the number of output channels of the last fully-connected layer in the predictor is set to 132: 128 of the deformation parameters to be predicted are the coordinates of 64 TPS sampling-grid matching points, and 4 are the element values of the affine transformation matrix.
Optionally, the transposed convolution module further includes a batch normalization layer between the transposed convolution layer and the nonlinear activation layer; the nonlinear activation function in the transposed convolution module is the ReLU function.
Specifically, the sampler is implemented with the torch.nn.functional.grid_sample() method in PyTorch, and the affine parameters are converted into the affine transformation sampling grid with the torch.nn.functional.affine_grid() method in PyTorch.
More specifically, the goal of the TPS transformation is to solve a deformation function $f$ such that $f(p_i(x_i,y_i))=p_i(x_i',y_i')$ for $1\le i\le n$ while minimizing the bending energy, where $p_i(x_i,y_i)$ are the coordinates of a TPS sampling-grid matching point on the original character image, $p_i(x_i',y_i')$ the coordinates of the corresponding matching point on the deformed character image, and $n$ the number of sampling-grid matching points in the TPS transformation. Assume $n$ pairs of matching points of the two images have been acquired: $(p_1(x_1,y_1),p_1(x_1',y_1'))$, $(p_2(x_2,y_2),p_2(x_2',y_2'))$, ..., $(p_n(x_n,y_n),p_n(x_n',y_n'))$. The deformation function can be imagined as bending a thin metal plate through the given $n$ points; the energy of bending the plate is expressed as:

$$I_f=\iint\Big[\Big(\frac{\partial^2 f}{\partial x^2}\Big)^2+2\Big(\frac{\partial^2 f}{\partial x\,\partial y}\Big)^2+\Big(\frac{\partial^2 f}{\partial y^2}\Big)^2\Big]\,dx\,dy$$

It can be shown that the thin plate spline is the function with minimal bending energy:

$$f(x,y)=a_1+a_2x+a_3y+\sum_{i=1}^{n}w_i\,U\big(\|(x_i,y_i)-(x,y)\|\big),\qquad U(r)=r^2\log r^2$$

The coefficients $a_1$, $a_2$, $a_3$ and $w_i$ are solved from the preset coordinates of the $n$ TPS sampling-grid matching points and the offsets predicted by the predictor, which yields the specific expression of $f(x,y)$.
The sampling formula of the affine transformation sampling grid is:

$$\begin{pmatrix}x'\\y'\end{pmatrix}=\begin{pmatrix}scale\cdot\cos\theta & -scale\cdot\sin\theta & t_x\\ scale\cdot\sin\theta & scale\cdot\cos\theta & t_y\end{pmatrix}\begin{pmatrix}x\\y\\1\end{pmatrix}$$

where $scale$, $\theta$, $t_x$, $t_y$ are the affine transformation parameters predicted by the predictor, and $(x,y)$ and $(x',y')$ are the pixel coordinates before and after the transformation respectively.
Preferably, all images have a pixel size of 64 × 64, the batch size is 64, the initial learning rate is 0.0001 and the number of iterations of the adversarial network is 5000; the learning rate starts to decay linearly to 1e-5 after 2500 iterations, and the adversarial network is optimized with the Adam optimizer.
Compared with the prior art, the invention has the following beneficial effects:
the method combines the affine matrix and the TPS to transform the sampling grid parameters, so that the STN can generate global and local shape changes at the same time, the shape characteristics of the character can be better fitted, the authenticity and diversity of the generated character are better, and the classification performance of the classifier trained by using the augmented data is further improved.
The method introduces a noise vector injection technology into the STN to promote the diversity of generated samples, designs a signal-noise reconstruction loss function capable of ensuring signal-noise balance and a diversity loss function capable of generating rich shape transformation, provides stronger supervision for the training of the STN, ensures that the deformation degree of the samples is more reasonable and rich, uses the augmented data for training a classifier, and can improve the classification performance of the classifier.
Drawings
FIG. 1 is a schematic flow diagram of a method embodying the present invention;
FIG. 2 is a schematic structural diagram of the modules according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For reference and clarity, the technical terms, abbreviations and acronyms used hereinafter are summarized as follows:
STN: Spatial Transformer Network.
TPS: Thin Plate Spline.
CNN: Convolutional Neural Network.
FC network: fully connected network.
PyTorch: a mainstream deep learning framework that encapsulates many commonly used deep-learning functions and classes.
ReLU/LeakyReLU: nonlinear activation functions.
Generative adversarial network (GAN): a generative-network training framework based on the zero-sum game idea, comprising a generator and a discriminator.
Latent vector: a vector in a random variable space.
The invention discloses a character image augmentation method based on shape transformation, which aims to solve various problems in the prior art.
Fig. 1 shows a schematic flow diagram of an embodiment of the invention. A character image augmentation method based on shape transformation includes the following steps:
Step 1, constructing a shape-transformation generative adversarial network comprising a generator and a discriminator;
Step 2, taking the original character image as the input of the generator, which outputs a deformed character image after spatial transformation; connecting the output of the generator to one input of the discriminator while feeding the target character image to the other input, the discriminator outputting the discrimination results for the deformed and target character images;
Step 3, training the shape-transformation generative adversarial network;
Step 4, generating augmented character images with the trained generator.
As shown in fig. 2, the generator is a spatial transformer network comprising an encoder, a predictor, a sampler, a noise reconstruction network and an image reconstruction network. First, the original character image is fed to the encoder, which extracts shape features and outputs a shape feature vector. A noise latent vector is then drawn at random from the standard normal distribution and fused with the shape feature vector, and the fused latent vector is input to the predictor, which predicts the TPS transformation parameters and the affine transformation parameters: the TPS parameters are the coordinate values of the TPS sampling-grid matching points, while the affine parameters are converted into an affine transformation sampling grid. The TPS sampling grid, the affine sampling grid and the original character image are then input to the sampler, which outputs the deformed character image. Meanwhile, the output of the predictor is connected to the inputs of the image reconstruction network and the noise reconstruction network, which reconstruct the original character image and the noise latent vector respectively. Finally, the deformed character image output by the generator and the target character image are input to the discriminator, which outputs their discrimination results. A minimal sketch of this forward pass follows.
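The sketch below traces this flow in PyTorch. It is schematic and rests on assumptions: `encoder`, `predictor` and `sampler` stand for the modules built in the steps that follow, and fusing the features and noise by element-wise addition follows the embodiment described later.

```python
import torch

def generate(encoder, predictor, sampler, x):
    """One generator forward pass following the flow above. `encoder`,
    `predictor` and `sampler` are placeholders for the modules built in
    the steps below; feature/noise fusion by addition follows the embodiment."""
    h = encoder(x)                                # shape feature vector
    z = torch.randn_like(h)                       # noise latent ~ N(0, I)
    tps_points, affine_params = predictor(h + z)  # fused latent -> deformation params
    deformed = sampler(x, tps_points, affine_params)
    return deformed, z                            # z is kept for the reconstruction losses
```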
Specifically, the present embodiment adopts the following steps to implement the inventive method.
1. Construct an original character dataset to be augmented and a target character dataset with the target shape characteristics, respectively.
2. Build a spatial transformer network as the generator in the adversarial training, with the following specific steps:
(1) The spatial transformer network comprises three main modules: an encoder, a predictor and a sampler. First the encoder is constructed. It is a convolutional neural network (CNN) whose number of convolution layers is generally chosen to be greater than 3; in this embodiment, 4 convolution modules are connected in sequence, each comprising a two-dimensional convolution layer, a batch normalization layer, a nonlinear activation layer and a pooling layer. The batch normalization layer is optional, the nonlinear activation function can be the ReLU or LeakyReLU function, and the pooling operation can be max pooling, average pooling or adaptive pooling; this embodiment adopts the ReLU function and max pooling.
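A minimal PyTorch sketch of such an encoder is given below. The channel widths, grayscale input and the final flattening layer that produces the shape feature vector are assumptions; the patent fixes only the module structure.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Four conv modules (Conv2d -> BatchNorm2d -> ReLU -> MaxPool2d).
    Channel widths and the flattening Linear are assumptions."""
    def __init__(self, in_ch=1, feat_dim=128):
        super().__init__()
        chs = [in_ch, 32, 64, 128, 128]
        blocks = []
        for i in range(4):
            blocks += [nn.Conv2d(chs[i], chs[i + 1], 3, padding=1),
                       nn.BatchNorm2d(chs[i + 1]),
                       nn.ReLU(inplace=True),
                       nn.MaxPool2d(2)]
        self.body = nn.Sequential(*blocks)
        self.fc = nn.Linear(chs[-1] * 4 * 4, feat_dim)  # 64x64 input -> 4x4 after four pools

    def forward(self, x):
        return self.fc(self.body(x).flatten(1))  # shape feature vector h
```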
(2) Next the predictor is constructed. It is a fully connected (FC) neural network with generally more than 2 FC layers; in this embodiment it comprises 3 sequentially connected FC modules followed by one final FC layer. Each FC module comprises an FC layer, a batch normalization layer and a nonlinear activation layer, where the batch normalization layer is optional and the nonlinear activation function can be the ReLU or LeakyReLU function; this embodiment adopts the ReLU function. The number of output channels of the final FC layer is set to the number of deformation parameters to be predicted, 132 in this embodiment: 128 of them are the coordinates of 8 × 8 = 64 TPS sampling-grid matching points, and 4 are the element values of the affine transformation matrix. Note that the number of TPS sampling-grid matching points may instead be chosen as the square of any integer not exceeding the original image height or width.
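A sketch of the predictor follows. The hidden width is an assumption; the split of the 132 outputs into 64 control-point coordinate pairs plus 4 affine parameters mirrors the text above.

```python
import torch
import torch.nn as nn

class Predictor(nn.Module):
    """Three FC modules (Linear -> BatchNorm1d -> ReLU) plus a final Linear
    with 132 outputs: 128 TPS control-point coordinates (8x8 grid) and 4
    affine parameters. The hidden width of 256 is an assumption."""
    def __init__(self, in_dim=128, hidden=256, n_params=132):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(3):
            layers += [nn.Linear(d, hidden), nn.BatchNorm1d(hidden), nn.ReLU(inplace=True)]
            d = hidden
        layers.append(nn.Linear(d, n_params))
        self.net = nn.Sequential(*layers)

    def forward(self, h):
        out = self.net(h)
        tps_points = out[:, :128].view(-1, 64, 2)  # coordinates of the 64 matching points
        affine = out[:, 128:]                       # scale, theta, t_x, t_y
        return tps_points, affine
```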
(3) Then the sampler is constructed. The sampler maps the pixel region of the deformed character image to the pixel region of the original character image by applying matrix multiplication on a sampling grid; sampling is implemented with the torch.nn.functional.grid_sample() method in PyTorch.
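As a usage illustration, the snippet below resamples an image through grid_sample with an identity grid standing in for the dense TPS grid used in practice; grids follow PyTorch's convention of coordinates normalized to [-1, 1] with shape (N, H, W, 2).

```python
import torch
import torch.nn.functional as F

img = torch.randn(1, 1, 64, 64)  # one 64x64 character image
identity_theta = torch.tensor([[[1., 0., 0.],
                                [0., 1., 0.]]])   # identity 2x3 matrix
grid = F.affine_grid(identity_theta, list(img.size()), align_corners=False)
out = F.grid_sample(img, grid, mode='bilinear', align_corners=False)
print(out.shape)  # torch.Size([1, 1, 64, 64]); an identity grid reproduces the input
```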
(4) Finally, an image reconstruction network $R_x$ and a noise reconstruction network $R_z$ are constructed. $R_x$ consists of 3 FC modules, 1 FC layer and 4 transposed convolution modules connected in sequence; $R_z$ consists of 3 FC modules and 1 FC layer connected in sequence. Each transposed convolution module comprises 1 transposed convolution layer, 1 batch normalization layer and 1 nonlinear activation layer connected in sequence, where the batch normalization layer is optional and the nonlinear activation function can be the ReLU or LeakyReLU function; this embodiment adopts the ReLU function. The role of the reconstruction networks is detailed in point 5.
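A sketch of the noise reconstruction network $R_z$ is shown below ($R_x$ is analogous, with transposed convolution modules upsampling back to a 64 × 64 image). The input/output dimensions and hidden width are assumptions, as is taking the 132 predicted parameters as the input.

```python
import torch
import torch.nn as nn

class NoiseReconstructor(nn.Module):
    """R_z: three FC modules plus a final FC layer, mapping the predicted
    deformation parameters back to the noise latent. Widths are assumptions."""
    def __init__(self, in_dim=132, hidden=256, z_dim=128):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(3):
            layers += [nn.Linear(d, hidden), nn.BatchNorm1d(hidden), nn.ReLU(inplace=True)]
            d = hidden
        layers.append(nn.Linear(d, z_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, params):
        return self.net(params)  # reconstructed noise latent z_rec
```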
(5) The working principle of the spatial transformer network is as follows. First, the original character image is fed to the encoder, which extracts shape features and outputs a shape feature vector $h$. Next, a noise latent vector $z$ is randomly drawn from the standard normal distribution, and $h$ and $z$ are fused, in this embodiment by direct element-wise addition: $h$ contains character-shape feature information and guarantees the authenticity of the output, while $z$ brings a certain randomness and ensures the diversity of the output. The fused latent vector is input to the predictor, which predicts the TPS transformation parameters and the affine transformation parameters; the TPS parameters are the coordinate values of the TPS sampling-grid matching points (the sampling grid has 8 × 8 = 64 matching points), and the affine parameters are the element values of the affine transformation matrix, 4 parameters in total. Next, the 4 affine parameters are converted into an affine transformation sampling grid by torch.nn.functional.affine_grid(). Then the TPS sampling grid, the affine sampling grid and the original image are input to the sampler, which outputs the deformed character image. Assume that $n$ pairs of TPS sampling-grid matching points of the two images have been acquired, $(p_1(x_1,y_1),p_1(x_1',y_1'))$, $(p_2(x_2,y_2),p_2(x_2',y_2'))$, ..., $(p_n(x_n,y_n),p_n(x_n',y_n'))$, with $n = 64$ in this embodiment. The coordinate correspondence computed with the TPS transformation proceeds as follows: the goal of the TPS transformation is to solve a deformation function $f$ such that $f(p_i(x_i,y_i)) = p_i(x_i',y_i')$ while minimizing the bending energy, so that the other points on the image can be interpolated with good transformation results. The deformation function can be thought of as bending a thin metal plate through the given $n$ points, and the energy of bending the plate can be expressed as:

$$I_f=\iint\Big[\Big(\frac{\partial^2 f}{\partial x^2}\Big)^2+2\Big(\frac{\partial^2 f}{\partial x\,\partial y}\Big)^2+\Big(\frac{\partial^2 f}{\partial y^2}\Big)^2\Big]\,dx\,dy$$
the spline function of the thin plate can be proved to be the function with the minimum bending energy, and the spline function of the thin plate is as follows:
where U is the basis function:
in the above formula, only need to obtain , ,And &>Can determine-> , , ,And &>The solution can be done by sampling the preset values of the grid matching point coordinates and the offset predicted by the predictor through 64 TPS transforms.
Similarly, assume that $(x,y)$ and $(x',y')$ are the pixel positions before and after the transformation respectively; the sampling formula of the affine transformation is:

$$\begin{pmatrix}x'\\y'\end{pmatrix}=\begin{pmatrix}scale\cdot\cos\theta & -scale\cdot\sin\theta & t_x\\ scale\cdot\sin\theta & scale\cdot\cos\theta & t_y\end{pmatrix}\begin{pmatrix}x\\y\\1\end{pmatrix}$$

where $scale$, $\theta$, $t_x$ and $t_y$ are the 4 affine transformation parameters predicted by the predictor.
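In PyTorch this corresponds to assembling a 2 × 3 matrix per sample and handing it to affine_grid/grid_sample. The sketch below assumes the (scale, rotation, translation) parameterization above and PyTorch's normalized coordinate convention.

```python
import torch
import torch.nn.functional as F

def affine_sample(img, scale, rot, tx, ty):
    """Build the 2x3 similarity matrix from (scale, theta, t_x, t_y) per
    sample and resample; affine_grid works in [-1, 1] normalized coordinates."""
    cos, sin = torch.cos(rot), torch.sin(rot)
    row1 = torch.stack([scale * cos, -scale * sin, tx], dim=-1)
    row2 = torch.stack([scale * sin, scale * cos, ty], dim=-1)
    mat = torch.stack([row1, row2], dim=1)               # (N, 2, 3)
    grid = F.affine_grid(mat, list(img.size()), align_corners=False)
    return F.grid_sample(img, grid, align_corners=False)

img = torch.randn(2, 1, 64, 64)
out = affine_sample(img, scale=torch.ones(2), rot=torch.zeros(2),
                    tx=torch.zeros(2), ty=torch.zeros(2))  # identity transform
```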
Finally, the output of the predictor is connected to the inputs of the image reconstruction network and the noise reconstruction network, which reconstruct the original image and the noise latent vector respectively.
3. Construct the discriminator for the adversarial training. The discriminator is based on the PatchGAN structure and consists of 5 sequentially connected convolution modules; each of the first 4 comprises a two-dimensional convolution layer, an instance normalization layer and a LeakyReLU activation layer, and the last comprises an optional padding layer and a two-dimensional convolution layer.
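A PatchGAN-style discriminator matching this description might look as follows; kernel sizes, strides and channel widths are assumptions, since the patent fixes only the layer types.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Four Conv -> InstanceNorm -> LeakyReLU modules, then a padded conv
    producing a patch map of real/fake scores. Widths/strides are assumptions."""
    def __init__(self, in_ch=1):
        super().__init__()
        chs = [in_ch, 64, 128, 256, 512]
        layers = []
        for i in range(4):
            layers += [nn.Conv2d(chs[i], chs[i + 1], 4, stride=2, padding=1),
                       nn.InstanceNorm2d(chs[i + 1]),
                       nn.LeakyReLU(0.2, inplace=True)]
        layers += [nn.ZeroPad2d(1), nn.Conv2d(chs[-1], 1, 4)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)  # (N, 1, h, w) patch map, not a single scalar
```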
4. The generator takes the original character image $x$ as input and outputs the deformed character image after the spatial transformer network. The output of the generator is connected to one input of the discriminator while the target character image $y$ is fed to the other input; the discriminator outputs the discrimination results for the deformed and target character images.
5. Construct the signal-noise reconstruction loss function, which consists of three terms, a signal reconstruction term, a noise reconstruction term and a reconstruction error ratio term:

$$\mathcal{L}_{snr}(G_g)=L_1(x,x_{rec})+L_1(z,z_{rec})+\alpha\log\frac{L_1(z,z_{rec})}{L_1(x,x_{rec})}$$

where $x_{rec}$ and $z_{rec}$ are the original image and the noise latent vector reconstructed by the networks $R_x$ and $R_z$ respectively. In the absence of strong supervision, neural network training may suppress the effect of the input information; to avoid this, reconstruction loss terms are designed for the shape information and the noise vector separately, so that the shape information guarantees authenticity and the effect of the noise in ensuring diversity is not suppressed. In addition, to keep the degree of deformation of the transformed font reasonable and controllable, the reconstruction error ratio term is designed to balance the respective effects of shape information and noise, constrained by a hyperparameter $M > 1$. Here $\alpha$ is a dynamic coefficient. If the ratio term is greater than $\log M$ during training, set $\alpha = 1$: in this case the noise reconstruction is much worse than the signal reconstruction, meaning the effect of the noise is being suppressed by the network, and gradient descent optimizes the positive ratio term. Conversely, when the ratio term is less than $-\log M$, set $\alpha = -1$: in this case the signal reconstruction is much worse than the noise reconstruction, meaning the effect of the noise is too prominent and the shape information is suppressed, and gradient descent optimizes the negative ratio term. If the ratio term is within the ideal range $[-\log M, \log M]$, set $\alpha = 0$, i.e. no additional optimization is applied.
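A sketch of this loss in PyTorch follows. The exact placement of the two reconstruction errors inside the log ratio is a reconstruction from the text above, and M = 2.0 is an arbitrary choice, so treat the formula-level details as assumptions.

```python
import torch
import torch.nn.functional as F

def snr_loss(x, x_rec, z, z_rec, M=2.0):
    """Signal + noise reconstruction terms plus the dynamically gated
    ratio term; M > 1 is the hyperparameter from the text."""
    l_sig = F.l1_loss(x_rec, x)
    l_noise = F.l1_loss(z_rec, z)
    ratio = torch.log(l_noise / l_sig)
    log_m = torch.log(torch.tensor(M))
    if ratio.detach() > log_m:       # noise effect suppressed: push ratio down
        alpha = 1.0
    elif ratio.detach() < -log_m:    # shape effect suppressed: push ratio up
        alpha = -1.0
    else:                            # within [-log M, log M]: no extra term
        alpha = 0.0
    return l_sig + l_noise + alpha * ratio
```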
6. Construct the diversity loss function. Diversity is positively correlated with the difference between the transformation parameters corresponding to different noise latent vectors, where the two sets of parameters are estimated from two signal-noise fused latent vectors injected with different noise. The diversity loss is defined as:

$$\mathcal{L}_{div}(E,P)=-L_1\big(P(E(x),z_1),\,P(E(x),z_2)\big)$$

where $P$ denotes the predictor, $E$ the encoder, and $z_1$, $z_2$ different noise latent vectors drawn from the same Gaussian distribution.
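A corresponding sketch, assuming the predictor returns the raw 132-dimensional parameter vector and that fusion is element-wise addition as in the embodiment:

```python
import torch
import torch.nn.functional as F

def diversity_loss(encoder, predictor_net, x):
    """-L1 between parameters predicted from two different noise latents
    fused with the same shape features; `predictor_net` is assumed here to
    output the raw parameter vector rather than the (tps, affine) split."""
    h = encoder(x)
    z1, z2 = torch.randn_like(h), torch.randn_like(h)
    return -F.l1_loss(predictor_net(h + z1), predictor_net(h + z2))
```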
7. Construct the least-squares adversarial loss as the optimization target of the adversarial training; the adversarial loss pulls the distribution of the deformed character images toward that of the target character images:

$$\mathcal{L}(G_g)=\mathbb{E}_x\big[(D_g(G_g(x))-1)^2\big]+\mathcal{L}_{snr}(G_g)+\mathcal{L}_{div}(E,P)$$
$$\mathcal{L}(D_g)=\mathbb{E}_y\big[(D_g(y)-1)^2\big]+\mathbb{E}_x\big[D_g(G_g(x))^2\big]$$

Through the adversarial training, the spatial transformer network can produce samples that approximate the target character images.
8. All image pixel sizes are set to 64 × 64, the batch size is 64, the initial learning rate is 0.0001 and the number of training iterations is 5000; the learning rate starts to decay linearly to 1e-5 after 2500 iterations, and the network is optimized with the Adam optimizer.
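One way to realize this schedule in PyTorch is shown below; the LambdaLR wiring is an assumption about implementation, since the patent specifies only the rates and iteration counts.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stands in for the generator (or discriminator)
base_lr, final_lr, total_iters, decay_start = 1e-4, 1e-5, 5000, 2500

opt = torch.optim.Adam(model.parameters(), lr=base_lr)

def lr_factor(it):
    # constant for the first 2500 iterations, then linear decay to 1e-5 at 5000
    if it < decay_start:
        return 1.0
    frac = (it - decay_start) / (total_iters - decay_start)
    return 1.0 - frac * (1.0 - final_lr / base_lr)

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_factor)
# training loop: call opt.step() and then sched.step() once per iteration
```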
9. Train the whole shape-transformation generative adversarial network according to the settings in point 8 to obtain a trained generator, which can be used to generate diverse augmented samples.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and such modifications and refinements shall also fall within the protection scope of the present invention.
Claims (9)
1. A character image augmentation method based on shape transformation, characterized by comprising the following steps:
step 1, constructing a shape-transformation generative adversarial network comprising a generator and a discriminator;
step 2, taking the original character image as the input of the generator, which outputs a deformed character image after spatial transformation, and connecting the output of the generator to one input of the discriminator; inputting the target character image to the other input of the discriminator, the discriminator outputting discrimination results for the deformed character image and the target character image;
step 3, training the shape-transformation generative adversarial network;
step 4, generating augmented character images with the trained generator;
wherein the generator is a spatial transformer network comprising an encoder, a predictor, a sampler, a noise reconstruction network and an image reconstruction network;
the encoder consists of several sequentially connected convolution modules, each comprising a two-dimensional convolution layer, a nonlinear activation layer and a pooling layer connected in sequence;
the predictor consists of several fully-connected modules followed by a final fully-connected layer, each fully-connected module comprising a fully-connected layer and a nonlinear activation layer, the number of output channels of the final fully-connected layer being set to the number of deformation parameters to be predicted;
the sampler maps the pixel region of the deformed character image to the pixel region of the original character image by applying matrix multiplication on a sampling grid;
the image reconstruction network consists of several fully-connected modules, a fully-connected layer and several transposed convolution modules connected in sequence, each transposed convolution module comprising a transposed convolution layer and a nonlinear activation layer connected in sequence;
the noise reconstruction network consists of several fully-connected modules and a fully-connected layer connected in sequence;
the discriminator is based on the PatchGAN structure and consists of five sequentially connected convolution modules, each of the first four comprising a two-dimensional convolution layer, an instance normalization layer and a LeakyReLU activation layer, the last comprising a padding layer and a two-dimensional convolution layer;
first, the original character image is fed to the encoder, which extracts shape features and outputs a shape feature vector; a noise latent vector is then randomly drawn from the standard normal distribution and fused with the shape feature vector, and the fused latent vector is input to the predictor, which predicts TPS transformation parameters and affine transformation parameters, the TPS transformation parameters being the coordinate values of the TPS sampling-grid matching points and the affine transformation parameters being converted into an affine transformation sampling grid; the TPS sampling grid, the affine sampling grid and the original character image are then input to the sampler, which outputs the deformed character image; meanwhile, the output of the predictor is connected to the inputs of the image reconstruction network and the noise reconstruction network, which reconstruct the original character image and the noise latent vector respectively; then the deformed character image output by the generator and the target character image are respectively input to the discriminator, which outputs discrimination results for the deformed character image and the target character image.
2. The character image augmentation method based on shape transformation according to claim 1, characterized in that a least-squares adversarial loss function is constructed as the optimization target of the shape-transformation generative adversarial network training, calculated as:

$$\mathcal{L}(G_g)=\mathbb{E}_x\big[(D_g(G_g(x))-1)^2\big]+\mathcal{L}_{snr}(G_g)+\mathcal{L}_{div}(E,P)$$
$$\mathcal{L}(D_g)=\mathbb{E}_y\big[(D_g(y)-1)^2\big]+\mathbb{E}_x\big[D_g(G_g(x))^2\big]$$

where $\mathcal{L}(G_g)$ represents the generator loss function, $D_g$ the discriminator, $G_g$ the generator, $\mathcal{L}_{snr}(G_g)$ the signal-noise reconstruction loss function, $\mathcal{L}_{div}(E,P)$ the diversity loss function, $\mathcal{L}(D_g)$ the discriminator loss function, $x$ the original character image, $y$ the target character image, and $\mathbb{E}_x$, $\mathbb{E}_y$ the corresponding mathematical expectations.
3. The character image augmentation method based on shape transformation according to claim 2, characterized in that the signal-noise reconstruction loss function comprises a signal reconstruction term, a noise reconstruction term and a reconstruction error ratio term, calculated as:

$$\mathcal{L}_{snr}(G_g)=L_1(x,x_{rec})+L_1(z,z_{rec})+\alpha\log\frac{L_1(z,z_{rec})}{L_1(x,x_{rec})}$$

where $L_1$ represents the mean absolute error, $z$ the noise latent vector, and $x_{rec}$ and $z_{rec}$ the original image and noise latent vector reconstructed by the image reconstruction network $R_x$ and the noise reconstruction network $R_z$ respectively; $\alpha$ is a dynamic coefficient: set $\alpha=1$ if the reconstruction error ratio term is greater than $\log M$, $\alpha=-1$ if it is less than $-\log M$, and $\alpha=0$ if it is within the ideal range $[-\log M,\log M]$, where $M$ represents a hyperparameter;

the diversity loss function is calculated as:

$$\mathcal{L}_{div}(E,P)=-L_1\big(P(E(x),z_1),\,P(E(x),z_2)\big)$$

where $P$ denotes the predictor, $E$ the encoder, and $z_1$, $z_2$ different noise latent vectors drawn from the same Gaussian distribution.
4. The character image augmentation method based on shape transformation according to claim 1, characterized in that each convolution module in the encoder further includes a batch normalization layer between the two-dimensional convolution layer and the nonlinear activation layer; the nonlinear activation function in the encoder is the ReLU function, and the pooling operation of the pooling layer is max pooling;
each fully-connected module in the predictor further comprises a batch normalization layer between the fully-connected layer and the nonlinear activation layer, and the nonlinear activation function in the predictor is the ReLU function;
the transposed convolution module further comprises a batch normalization layer between the transposed convolution layer and the nonlinear activation layer, and the nonlinear activation function in the transposed convolution module is the ReLU function.
5. The character image augmentation method based on shape transformation according to claim 1, characterized in that the number of output channels of the last fully-connected layer in the predictor is set to 132: 128 of the deformation parameters to be predicted are the coordinates of 64 TPS sampling-grid matching points, and 4 are the element values of the affine transformation matrix.
6. The character image augmentation method based on shape transformation according to claim 1 or 5, characterized in that the goal of the TPS transformation is to solve a deformation function $f$ such that $f(p_i(x_i,y_i))=p_i(x_i',y_i')$ ($1\le i\le n$) with the minimal bending energy, where $p_i(x_i,y_i)$ represents the coordinates of a TPS sampling-grid matching point on the original character image, $p_i(x_i',y_i')$ the coordinates of the corresponding TPS sampling-grid matching point of the deformed character image, and $n$ the number of TPS sampling-grid matching points; supposing that $n$ pairs of TPS sampling-grid matching-point coordinates of the two images have been obtained, $(p_1(x_1,y_1),p_1(x_1',y_1'))$, $(p_2(x_2,y_2),p_2(x_2',y_2'))$, ..., $(p_n(x_n,y_n),p_n(x_n',y_n'))$, the deformation function can be imagined as bending a thin metal plate through the given $n$ points, the energy of bending the plate being expressed as:

$$I_f=\iint\Big[\Big(\frac{\partial^2 f}{\partial x^2}\Big)^2+2\Big(\frac{\partial^2 f}{\partial x\,\partial y}\Big)^2+\Big(\frac{\partial^2 f}{\partial y^2}\Big)^2\Big]\,dx\,dy$$

it can be shown that the thin plate spline is the function with minimal bending energy:

$$f(x,y)=a_1+a_2x+a_3y+\sum_{i=1}^{n}w_i\,U\big(\|(x_i,y_i)-(x,y)\|\big)$$

where $U$ is the basis function $U(r)=r^2\log r^2$; the coefficients $a_1$, $a_2$, $a_3$ and $w_i$ are solved from the preset coordinates of the $n$ TPS sampling-grid matching points and the offsets predicted by the predictor, thereby obtaining the specific expression of $f(x,y)$.
7. The character image augmentation method based on shape transformation according to claim 1 or 5, characterized in that the sampling formula of the affine transformation sampling grid is:

$$\begin{pmatrix}x'\\y'\end{pmatrix}=\begin{pmatrix}scale\cdot\cos\theta & -scale\cdot\sin\theta & t_x\\ scale\cdot\sin\theta & scale\cdot\cos\theta & t_y\end{pmatrix}\begin{pmatrix}x\\y\\1\end{pmatrix}$$

where $scale$, $\theta$, $t_x$, $t_y$ are the affine transformation parameters predicted by the predictor, and $(x,y)$ and $(x',y')$ are the position coordinates of the pixel points before and after the transformation respectively.
8. The character image augmentation method based on shape transformation according to claim 1, characterized in that the sampler is implemented with the torch.nn.functional.grid_sample() method in PyTorch, and the affine parameters are converted into the affine transformation sampling grid with the torch.nn.functional.affine_grid() method in PyTorch.
9. The character image augmentation method based on shape transformation according to claim 1, characterized in that the pixel size of all images is 64 × 64, the batch size is 64, the initial learning rate is 0.0001 and the number of iterations of the adversarial network is 5000; the learning rate starts to decay linearly to 1e-5 after 2500 iterations, and the adversarial network is optimized with the Adam optimizer.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210285238.8A | 2022-03-23 | 2022-03-23 | Character image augmentation method based on shape transformation |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN114782961A | 2022-07-22 |
| CN114782961B | 2023-04-18 |
Family
ID=82424735

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210285238.8A | Character image augmentation method based on shape transformation | 2022-03-23 | 2022-03-23 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN114782961B (en) |
Citations (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108399408A * | 2018-03-06 | 2018-08-14 | 李子衿 | A deformed-character correction method based on a deep spatial transformer network |
| CN111652332A * | 2020-06-09 | 2020-09-11 | 山东大学 | Deep learning handwritten Chinese character recognition method and system based on binary classification |

Family Cites Families (4)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111209497B * | 2020-01-05 | 2022-03-04 | 西安电子科技大学 | DGA domain name detection method based on GAN and Char-CNN |
| CN111242241A * | 2020-02-17 | 2020-06-05 | 南京理工大学 | Method for augmenting training samples for etched-character recognition networks |
| CN111915540B * | 2020-06-17 | 2023-08-18 | 华南理工大学 | Rubbing oracle-bone character image augmentation method, system, computer equipment and medium |
| CN114037644B * | 2021-11-26 | 2024-07-23 | 重庆邮电大学 | Artistic word image synthesis system and method based on generative adversarial networks |
Also Published As

| Publication number | Publication date |
|---|---|
| CN114782961A | 2022-07-22 |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |