CN114782961A - Character image augmentation method based on shape transformation - Google Patents
- Publication number
- CN114782961A CN114782961A CN202210285238.8A CN202210285238A CN114782961A CN 114782961 A CN114782961 A CN 114782961A CN 202210285238 A CN202210285238 A CN 202210285238A CN 114782961 A CN114782961 A CN 114782961A
- Authority
- CN
- China
- Prior art keywords
- character image
- layer
- transformation
- function
- noise
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/18—Image warping, e.g. rearranging pixels individually
Abstract
The invention discloses a character image augmentation method based on shape transformation, which comprises the following steps: constructing a shape-transformation generative adversarial network comprising a generator and a discriminator; taking an original character image as the input of the generator, which produces a deformed character image through spatial transformation; connecting the output of the generator to one input of the discriminator while feeding a target character image to the other input, so that the discriminator outputs discrimination results for the deformed and target character images; training the shape-transformation generative adversarial network; and generating augmented character images with the trained generator. The method combines affine-matrix and TPS sampling-grid parameters so that the STN can produce global and local shape changes simultaneously, better fits the shape characteristics of characters, and improves the realism and diversity of the generated characters, which in turn improves the classification performance of classifiers trained on the augmented data.
Description
Technical Field
The invention belongs to the technical field of image processing and artificial intelligence, and particularly relates to a character image augmentation method based on shape transformation.
Background
Text image recognition methods based on deep learning show great potential; however, training a high-performance character image recognition model requires a large amount of annotated data that is as diverse as possible. Manually collecting and labeling text image data is an extremely expensive and time-consuming task, especially for text images with complicated shapes such as ancient scripts and handwritten formulas. By contrast, data augmentation is a cost-effective way to increase data diversity.
Data augmentation methods include shape-transformation-based augmentation, such as flipping, rotating, scaling, cropping and translating the image, and non-shape-transformation-based augmentation, such as color jittering and noise injection. Conventional shape-transformation-based augmentation algorithms mainly sample deformation parameters at random from a manually preset probability distribution to control the shape transformation of the image content. The most common approach is to use an affine matrix to generate affine transformations, as in the affine-transformation-based Spatial Transformer Network (STN), or to generate local deformations with other non-rigid deformation algorithms such as the Thin Plate Spline (TPS) transformation. However, because conventional augmentation algorithms rely on sampling from a manually preset distribution, computing and selecting the distribution is complex, and the selected distribution can hardly fit the distribution of real character shapes completely, which leads to high labor cost and poor realism of the deformation.
Traditional shape-transformation-based augmentation algorithms mainly sample deformation parameters at random from a manually preset probability distribution to control the shape transformation of the image content, so that computing and selecting the distribution is complex and the selected distribution can hardly fit the actual distribution of character shapes. This method is therefore labor-intensive, and the resulting deformation is not realistic. Although generative adversarial networks can avoid manually computing and selecting a distribution, existing techniques still lack deformation fineness: real character shapes have both global characteristics (e.g., rotation angle and translation distance) and local characteristics (e.g., the curvature, length and thickness of strokes), whereas the prior art generates only a single global or local deformation. As a result, the realism and diversity of the transformed shapes are poor, and the performance gain that the augmented data brings to downstream tasks is limited. Neural-network-based augmentation techniques use the generative adversarial loss as the optimization objective, but since the adversarial loss only computes loss values for two labels, true and false, it provides only weak supervision. Because character images have both global and local shape characteristics, stronger supervision must be built into the loss function to guarantee the realism and diversity of the transformed character shapes.
For example, studies have shown that generative adversarial networks may suppress the effect of the semantic or noise vectors in the generator, which can result in too little deformation of the characters (even no change) or too much (distorted shapes). This breaks the balance between the diversity and realism of the character shapes, so that data augmentation brings only a limited performance gain, or even adverse effects.
Disclosure of Invention
In view of the above, there is a need for a new character image augmentation method based on shape transformation that combines affine-matrix and TPS sampling-grid parameters so that the STN generates global and local shape changes simultaneously and improves the fineness of the deformation. A noise-vector injection technique is introduced into the STN to enrich the diversity of the generated samples, and a diversity loss function and a signal-noise reconstruction loss function are designed to strengthen supervision of the network. The diversity loss promotes the diversity of the deformation parameters and thereby the diversity of the deformation; the signal-noise reconstruction loss ensures signal-noise balance so that the degree of deformation stays within a reasonable range.
The invention discloses a character image augmentation method based on shape transformation, which comprises the following steps:
step 1, constructing a shape-transformation generative adversarial network, which comprises a generator and a discriminator;
step 2, taking the original character image as the input of a generator, generating a deformed character image after spatial transformation, connecting the output end of the generator with the input end of a discriminator, inputting the target character image to the other input end of the discriminator, and outputting the discrimination result of the deformed character image and the target character image by the discriminator;
step 3, training the shape-transformation generative adversarial network;
and 4, generating the augmented character image by using the trained generator.
Specifically, the generator is a spatial transformer network comprising an encoder, a predictor, a sampler, a noise reconstruction network and an image reconstruction network.
The encoder consists of several convolution modules connected in sequence; each convolution module comprises a two-dimensional convolution layer, a nonlinear activation layer and a pooling layer connected in sequence.
The predictor consists of several fully-connected modules connected in sequence, followed by a final fully-connected layer; each fully-connected module comprises a fully-connected layer and a nonlinear activation layer, and the number of output channels of the final fully-connected layer is set to the number of deformation parameters to be predicted.
The sampler maps the pixel region of the deformed character image to the pixel region of the original character image by applying matrix multiplication on a sampling grid.
The image reconstruction network consists of several fully-connected modules, a fully-connected layer and several transposed-convolution modules connected in sequence; each transposed-convolution module comprises a transposed convolution layer and a nonlinear activation layer connected in sequence.
The noise reconstruction network consists of several fully-connected modules and a final fully-connected layer connected in sequence.
First, the original character image is used as the input of the encoder, which extracts shape features from it and outputs a shape feature vector. A noise latent vector is then randomly sampled from the standard normal distribution and fused with the shape feature vector, and the fused latent vector is input into the predictor, which predicts the TPS transformation parameters and the affine transformation parameters. The TPS transformation parameters are the coordinate values of the TPS sampling-grid matching points; the affine transformation parameters are converted into an affine transformation sampling grid. The TPS sampling grid, the affine sampling grid and the original character image are then input into the sampler, which outputs the transformed character image. Meanwhile, the output of the predictor is connected to the inputs of the image reconstruction network and the noise reconstruction network, which reconstruct the original character image and the noise latent vector respectively. Finally, the deformed character image output by the generator and the target character image are input to the discriminator, which outputs discrimination results for the two images.
The discriminator is based on the PatchGAN architecture and consists of five convolution modules connected in sequence; each of the first four convolution modules comprises a two-dimensional convolution layer, an instance normalization layer and a LeakyReLU activation layer, and the last convolution module comprises a padding layer and a two-dimensional convolution layer.
Preferably, a least-squares adversarial loss is constructed as the optimization objective for training the shape-transformation generative adversarial network. With G denoting the generator, D the discriminator, x the original character image, y the target character image, and E_x, E_y the corresponding mathematical expectations, the losses are:

L_G = E_x[(D(G(x)) - 1)^2] + L_rec + L_div

L_D = E_y[(D(y) - 1)^2] + E_x[D(G(x))^2]

where L_G is the generator loss, L_D is the discriminator loss, L_rec is the signal-noise reconstruction loss, and L_div is the diversity loss.
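A minimal PyTorch sketch of the least-squares adversarial objectives above (function names and tensor shapes are illustrative, not the patent's code):

```python
import torch

def lsgan_g_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # Generator pushes D(G(x)) toward 1 (least-squares form).
    return ((d_fake - 1.0) ** 2).mean()

def lsgan_d_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # Discriminator pushes D(y) toward 1 and D(G(x)) toward 0.
    return ((d_real - 1.0) ** 2).mean() + (d_fake ** 2).mean()

# Usage with PatchGAN-style score maps (see the discriminator description).
d_real = torch.full((4, 1, 6, 6), 0.9)
d_fake = torch.full((4, 1, 6, 6), 0.1)
g = lsgan_g_loss(d_fake)            # add L_rec and L_div in the full objective
d = lsgan_d_loss(d_real, d_fake)
```

The reconstruction and diversity terms described below would be added to `g` before backpropagating through the generator.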
The signal-noise reconstruction loss comprises a signal reconstruction term, a noise reconstruction term and a reconstruction-error ratio term, computed as:

L_rec = MAE(x, x̂) + MAE(z, ẑ) + α · r,   r = log( MAE(x, x̂) / MAE(z, ẑ) )

where MAE denotes the mean absolute error, x is the original character image, z is the noise latent vector, and x̂ and ẑ are the original image and the noise latent vector reconstructed by the image reconstruction network and the noise reconstruction network respectively. α is a dynamic coefficient: if the ratio term r is greater than log M, let α = 1; if r is less than -log M, let α = -1; and if r lies in the ideal range [-log M, log M], let α = 0, where M is a hyperparameter.
the calculation formula of the diversity loss function is as follows:
whereinPA predictor is represented by a representation of the motion vector,Eit is shown that the encoder is a digital video encoder,andrespectively representing different noise hidden vectors taken from the same gaussian distribution.
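This term has the shape of a mode-seeking regularizer; a sketch under that assumption, where `p1`, `p2` are deformation parameters predicted from noise vectors `z1`, `z2`:

```python
import torch

def diversity_loss(p1, p2, z1, z2, eps: float = 1e-5):
    """Diversity loss (assumed mode-seeking form): maximize the change in
    predicted deformation parameters per unit change in noise, i.e. minimize
    the negative ratio. eps avoids division by zero for near-equal noises."""
    num = (p1 - p2).abs().mean()
    den = (z1 - z2).abs().mean() + eps
    return -num / den
```

Minimizing this loss pushes the predictor to produce visibly different deformations for different noise samples instead of collapsing to a single mode.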
Optionally, each convolution module in the encoder further includes a batch normalization layer between the two-dimensional convolution layer and the nonlinear activation layer; the nonlinear activation function in the encoder is the ReLU function, and the pooling operation of the pooling layer is max pooling.
Optionally, each fully-connected module in the predictor further comprises a batch normalization layer between the fully-connected layer and the nonlinear activation layer; the nonlinear activation function in the predictor is the ReLU function.
Preferably, the number of output channels of the final fully-connected layer in the predictor is set to 132: 128 of the deformation parameters to be predicted are the coordinates of 64 TPS sampling-grid matching points, and 4 are the element values of the affine transformation matrix.
Optionally, the transposed-convolution module further includes a batch normalization layer between the transposed convolution layer and the nonlinear activation layer; the nonlinear activation function in the transposed-convolution module is the ReLU function.
Specifically, the sampler is implemented with the torch.nn.functional.grid_sample() method in PyTorch, and the affine parameters are converted into the affine transformation sampling grid with the torch.nn.functional.affine_grid() method in PyTorch.
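The two PyTorch calls can be combined as follows; the (scale, theta, tx, ty) parametrization matches the affine sampling formula used later, while the concrete values are illustrative:

```python
import math
import torch
import torch.nn.functional as F

# Build a 2x3 affine matrix per image from (scale, theta, tx, ty):
# a scaled rotation plus a translation.
scale, theta, tx, ty = 1.0, 0.1, 0.05, -0.05
cos_t = scale * math.cos(theta)
sin_t = scale * math.sin(theta)
A = torch.tensor([[[cos_t, -sin_t, tx],
                   [sin_t,  cos_t, ty]]])               # shape (1, 2, 3)

x = torch.rand(1, 1, 64, 64)                            # original character image
grid = F.affine_grid(A, size=x.shape, align_corners=False)   # (1, 64, 64, 2)
warped = F.grid_sample(x, grid, align_corners=False)    # deformed image, same size
```

Note that `align_corners` must match between `affine_grid` and `grid_sample`; with an identity matrix this pipeline returns the input unchanged, which is a useful sanity check.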
More specifically, the goal of the TPS transformation is to solve a deformation function f such that f(x_i) = y_i while the bending-energy function is minimized, where x_i denotes the coordinates of the i-th TPS sampling-grid matching point on the original character image, y_i denotes the coordinates of the corresponding matching point on the deformed character image, and n is the number of TPS sampling-grid matching points. Assume that n pairs of matching points of the two images have been acquired: (x_1, y_1), (x_2, y_2), …, (x_n, y_n). The deformation function can be imagined as bending a thin metal plate so that it passes through the given n points; the energy of bending the plate is expressed as:

I_f = ∬ [ (∂²f/∂u²)² + 2(∂²f/∂u∂v)² + (∂²f/∂v²)² ] du dv

It can be shown that the thin-plate spline is the function that minimizes this bending energy; the thin-plate spline is:

f(u, v) = a_0 + a_1 u + a_2 v + Σ_{i=1}^{n} w_i U(‖(u_i, v_i) - (u, v)‖),   U(r) = r² log r²

The coefficients a_0, a_1, a_2 and w_1, …, w_n are solved from the preset coordinates of the n TPS sampling-grid matching points and the offsets predicted by the predictor, which yields the specific expression of f.
The sampling formula of the affine transformation sampling grid is:

[u'; v'] = scale · [cos θ, -sin θ; sin θ, cos θ] · [u; v] + [t_x; t_y]

where scale, θ, t_x and t_y are the affine transformation parameters predicted by the predictor, and (u, v) and (u', v') are the pixel coordinates before and after the transformation, respectively.
Preferably, all images have a pixel size of 64 × 64, the batch size is 64, the initial learning rate is 0.0001, the adversarial network is trained for 5000 iterations, the learning rate decays linearly to 1e-5 after 2500 iterations, and the adversarial network is optimized with the Adam optimizer.
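A sketch of this schedule with `LambdaLR` (the decay endpoints follow the text above; the model is a stand-in):

```python
import torch

model = torch.nn.Linear(8, 8)                      # stand-in for G or D
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

TOTAL, DECAY_START = 5000, 2500
INIT_LR, FINAL_LR = 1e-4, 1e-5

def lr_lambda(it: int) -> float:
    # Constant for the first 2500 iterations, then linear decay to 1e-5.
    if it < DECAY_START:
        return 1.0
    frac = (it - DECAY_START) / (TOTAL - DECAY_START)
    return (INIT_LR + frac * (FINAL_LR - INIT_LR)) / INIT_LR

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
# Call sched.step() once per iteration inside the training loop.
```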
Compared with the prior art, the invention has the beneficial effects that:
the method combines the affine matrix and the TPS to transform the sampling grid parameters, so that the STN can generate global and local shape change at the same time, the shape characteristics of the character can be better fitted, the authenticity and diversity of the generated character are better, and the classification performance of the classifier trained by using the augmented data is further improved.
The method of the invention introduces a noise vector injection technology in the STN to promote the diversity of the generated samples, designs a signal-noise reconstruction loss function which can ensure the signal-noise balance and a diversity loss function which can generate rich shape transformation, provides stronger supervision for the training of the STN, ensures that the deformation degree of the samples is more reasonable and rich, uses the augmented data in the training classifier, and can improve the classification performance of the classifier.
Drawings
FIG. 1 shows a schematic flow diagram of a method embodying the present invention;
fig. 2 is a schematic structural diagram illustrating operations of modules according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
For reference and clarity, the technical terms, abbreviations and acronyms used hereinafter are summarized as follows:
STN: Spatial Transformer Network.
TPS: Thin Plate Spline.
CNN: convolutional neural network.
FC network: fully connected network.
PyTorch: a mainstream deep learning framework that encapsulates many commonly used deep-learning functions and classes.
ReLU/LeakyReLU: nonlinear activation functions.
Generative adversarial network (GAN): a generative-network training framework based on the idea of a zero-sum game, comprising a generator and a discriminator.
Latent vector: a vector in a random-variable space.
The invention discloses a character image augmentation method based on shape transformation, which aims to solve various problems in the prior art.
Fig. 1 shows a schematic flow diagram of an embodiment of the invention. A character image augmentation method based on shape transformation includes the following steps:
step 1, constructing a shape-transformation generative adversarial network, which comprises a generator and a discriminator;
step 2, taking the original character image as the input of a generator, generating a deformed character image after spatial transformation, connecting the output end of the generator with the input end of a discriminator, simultaneously inputting a target character image to the other input end of the discriminator, and outputting the discrimination result of the deformed character image and the target character image by the discriminator;
step 3, training the shape-transformation generative adversarial network;
and 4, generating the augmented character image by using the trained generator.
As shown in fig. 2, the generator is a spatial transformer network comprising an encoder, a predictor, a sampler, a noise reconstruction network and an image reconstruction network. First, the original character image is used as the input of the encoder, which extracts shape features from it and outputs a shape feature vector. A noise latent vector is then randomly sampled from the standard normal distribution and fused with the shape feature vector; the fused latent vector is input into the predictor, which maps it to the TPS transformation parameters and the affine transformation parameters. The TPS transformation parameters are the coordinate values of the TPS sampling-grid matching points, and the affine transformation parameters are converted into an affine transformation sampling grid. Next, the TPS sampling grid, the affine sampling grid and the original character image are input into the sampler, which outputs the transformed character image. Meanwhile, the output of the predictor is connected to the inputs of the image reconstruction network and the noise reconstruction network, which reconstruct the original character image and the noise latent vector respectively. Then, the deformed character image output by the generator and the target character image are input to the discriminator, which outputs the discrimination results for the two images.
Specifically, this embodiment implements the method of the invention through the following steps.
1. Construct an original character data set to be augmented and a target character data set with the target shape characteristics.
2. Build a spatial transformer network as the generator in the adversarial training, with the following specific steps:
(1) The spatial transformer network comprises three modules: an encoder, a predictor and a sampler. First, the encoder is constructed; it consists of a convolutional neural network (CNN) whose number of convolution layers is generally greater than 3. In this embodiment, 4 convolution modules are connected in sequence; each convolution module includes a two-dimensional convolution layer, a batch normalization layer, a nonlinear activation layer and a pooling layer. The batch normalization layer is optional, the nonlinear activation function can be the ReLU or LeakyReLU function, and the pooling operation can be max pooling, average pooling or adaptive pooling; this embodiment adopts the ReLU function and max pooling.
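A sketch of such an encoder (channel widths are illustrative assumptions; the patent fixes only the module layout):

```python
import torch
import torch.nn as nn

def conv_module(c_in: int, c_out: int) -> nn.Sequential:
    # conv -> (optional) batch norm -> ReLU -> max pooling
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

encoder = nn.Sequential(
    conv_module(1, 32),
    conv_module(32, 64),
    conv_module(64, 128),
    conv_module(128, 256),
)

x = torch.rand(2, 1, 64, 64)       # grayscale 64x64 character images
feat = encoder(x)                   # (2, 256, 4, 4) after four /2 poolings
```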
(2) Next, the predictor is constructed. It consists of a fully-connected (FC) neural network with generally more than 2 FC layers; the FC network of this embodiment comprises 3 FC modules connected in sequence, followed by a final FC layer. Each FC module comprises an FC layer, a batch normalization layer and a nonlinear activation layer; the batch normalization layer is optional, and the nonlinear activation function can be the ReLU or LeakyReLU function (the ReLU function is adopted in this example). The number of output channels of the final FC layer is set to the number of deformation parameters to be predicted, which is 132 in this example: 128 parameters are the coordinates of the 8 × 8 = 64 TPS sampling-grid matching points, and 4 are the element values of the affine transformation matrix. Note that the number of TPS sampling-grid matching points may instead be set to the square of any other integer not exceeding the original image height or width.
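A sketch of the predictor with the 132-way output head (hidden widths are assumptions; the split into 128 TPS coordinates plus 4 affine parameters follows the text):

```python
import torch
import torch.nn as nn

def fc_module(d_in: int, d_out: int) -> nn.Sequential:
    # FC -> (optional) batch norm -> ReLU
    return nn.Sequential(
        nn.Linear(d_in, d_out),
        nn.BatchNorm1d(d_out),
        nn.ReLU(inplace=True),
    )

predictor = nn.Sequential(
    fc_module(256, 256),
    fc_module(256, 256),
    fc_module(256, 256),
    nn.Linear(256, 132),     # 128 TPS point coordinates + 4 affine parameters
)

h = torch.rand(2, 256)               # fused (shape feature + noise) latent vector
params = predictor(h)                # (2, 132)
tps_params = params[:, :128]         # 64 matching points, (x, y) each
affine_params = params[:, 128:]      # scale, theta, tx, ty
```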
(3) Then the sampler is constructed. The sampler maps the pixel region of the deformed character image to the pixel region of the original character image by applying matrix multiplication on the sampling grid; the sampling is implemented with the torch.nn.functional.grid_sample() method in PyTorch.
(4) Finally, the image reconstruction network and the noise reconstruction network are constructed. The image reconstruction network consists of 3 FC modules, 1 FC layer and 4 transposed-convolution modules connected in sequence; the noise reconstruction network comprises 3 FC modules and 1 FC layer connected in sequence. Each transposed-convolution module comprises 1 transposed convolution layer, 1 batch normalization layer and 1 nonlinear activation layer connected in sequence, where the batch normalization layer is optional and the nonlinear activation function can be the ReLU or LeakyReLU function (the ReLU function is adopted in this embodiment). The role of the reconstruction networks is detailed in point 4.
(5) The working principle of the spatial transformer network is as follows. First, the original character image is used as the input of the encoder, which extracts shape features from it and outputs a shape feature vector. Next, a noise latent vector is randomly sampled from the standard normal distribution and fused with the shape feature vector; in this example the fusion is a direct element-wise summation. The shape feature vector contains font characteristic information and guarantees the realism of the output, while the noise latent vector brings a certain randomness and guarantees the diversity of the output. The fused latent vector is input into the predictor, which maps it to the TPS transformation parameters and the affine transformation parameters: the TPS transformation parameters are the coordinate values of the TPS sampling-grid matching points (the sampling grid has 8 × 8 = 64 matching points), and the affine transformation parameters are the 4 element values of the affine transformation matrix. Next, the 4 affine transformation parameters are converted into an affine transformation sampling grid with the torch.nn.functional.affine_grid() method. Then, the TPS sampling grid, the affine sampling grid and the original image are input into the sampler, which outputs the deformed character image. Assume that n pairs of TPS sampling-grid matching points of the two images have been acquired (n = 64 in this example). The coordinate correspondence is computed with the TPS transformation as follows: the objective of the TPS transformation is to solve a function f such that f(x_i) = y_i while the bending-energy function is minimized, so that the other points on the image obtain a good transformation result through interpolation.
The deformation function can be thought of as bending a thin metal plate through the given n points, and the energy of bending the plate can be expressed as:

$$ I_f = \iint_{\mathbb{R}^2} \left[ \left(\frac{\partial^2 f}{\partial x^2}\right)^2 + 2\left(\frac{\partial^2 f}{\partial x\,\partial y}\right)^2 + \left(\frac{\partial^2 f}{\partial y^2}\right)^2 \right] dx\,dy $$
It can be proved that the thin-plate spline is the function with minimum bending energy; the thin-plate spline is:

$$ f(x, y) = a_1 + a_x x + a_y y + \sum_{i=1}^{n} w_i\, U\!\left(\lVert (x_i, y_i) - (x, y) \rVert\right) $$
where U is the radial basis function:

$$ U(r) = r^2 \log r^2 $$
in the above formula, only need to obtain , ,Andcan determine , , ,Andthe solution can be done by pre-setting the coordinates of the matching points of the sampling grid by 64 TPS transforms and the offset predicted by the predictor.
Similarly, suppose (x, y) and (x', y') are the positions of a pixel point before and after the transformation respectively; the sampling formula of the affine transformation is (reconstructed here under the assumption that the 4 parameters are an isotropic scale s, a rotation angle θ and translations t_x, t_y):

$$ \begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} s\cos\theta & -s\sin\theta \\ s\sin\theta & s\cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix} $$
where s (scale), θ (theta), t_x and t_y each denote one of the 4 affine transformation parameters predicted by the predictor.
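Assuming that interpretation of the 4 parameters (the text names only "scale" and "theta" explicitly; the translations are an assumption), the affine sampling step can be sketched with the PyTorch methods the patent itself names, affine_grid and grid_sample:

```python
import torch
import torch.nn.functional as F

def affine_warp(img, scale, theta, tx, ty):
    """Warp img (B, C, H, W) with a 4-parameter affine map:
    isotropic scale, rotation theta, translations tx, ty (assumed layout)."""
    cos, sin = torch.cos(theta), torch.sin(theta)
    # Build one 2x3 matrix per batch element
    mat = torch.stack([
        torch.stack([scale * cos, -scale * sin, tx], dim=-1),
        torch.stack([scale * sin,  scale * cos, ty], dim=-1),
    ], dim=-2)                                            # (B, 2, 3)
    grid = F.affine_grid(mat, img.shape, align_corners=False)
    return F.grid_sample(img, grid, align_corners=False)  # bilinear resampling
```

With scale 1 and zero rotation/translation this is the identity warp, which is a convenient sanity check.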
Finally, the output end of the predictor is connected with the input ends of the image reconstruction network and the noise reconstruction network, which reconstruct the original image and the noise hidden vector respectively.
3. Construct the discriminator for the adversarial training. The discriminator is based on the PatchGAN structure and consists of 5 convolution modules connected in sequence; each of the first 4 convolution modules comprises a two-dimensional convolution layer, an instance normalization layer and a LeakyReLU activation layer, while the last convolution module comprises an optional padding layer and a two-dimensional convolution layer.
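The module counts above pin down the discriminator's shape; a PyTorch sketch follows, with channel widths and kernel sizes as assumptions (the patent does not specify them):

```python
import torch
import torch.nn as nn

def d_block(in_c, out_c):
    # conv + instance norm + LeakyReLU, halving the spatial resolution
    return nn.Sequential(
        nn.Conv2d(in_c, out_c, kernel_size=4, stride=2, padding=1),
        nn.InstanceNorm2d(out_c),
        nn.LeakyReLU(0.2))

class PatchDiscriminator(nn.Module):
    """PatchGAN-style discriminator: 4 conv modules followed by a padding
    layer and a final conv producing a map of per-patch real/fake scores."""
    def __init__(self, in_c=1):
        super().__init__()
        self.net = nn.Sequential(
            d_block(in_c, 64), d_block(64, 128),
            d_block(128, 256), d_block(256, 512),
            nn.ZeroPad2d(1),                           # the optional padding layer
            nn.Conv2d(512, 1, kernel_size=4, stride=1))
    def forward(self, x):
        return self.net(x)
```

For a 64 × 64 input this yields a 3 × 3 patch score map rather than a single scalar, which is the point of the PatchGAN design: each output cell judges one receptive-field patch.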
4. The generator takes the original character image as input and, after it passes through the spatial transformation network, produces the deformed character image. The output end of the generator is connected with one input end of the discriminator; at the same time, the target character image is input to the other input end of the discriminator, and the discriminator outputs the discrimination results for the deformed character image and the target character image.
5. Construct a signal-noise reconstruction loss function. The loss consists of three sub-terms, namely a signal reconstruction term, a noise reconstruction term and a reconstruction error ratio term, calculated as follows:
wherein the two reconstruction networks output the reconstructed original image and the reconstructed noise hidden vector, respectively. In the absence of strong supervision, the contribution of part of the input information may be suppressed during neural network learning. To avoid this, reconstruction loss terms are designed separately for the shape information and the noise vector, so that neither the authenticity-preserving effect of the shape information nor the diversity-preserving effect of the noise can be suppressed. In addition, to keep the degree of deformation of the transformed font reasonable and controllable, a reconstruction-error ratio term is designed to balance the respective effects of shape information and noise, and a hyperparameter M > 1 is used to constrain this term. Here α is a dynamic coefficient. If the ratio term is greater than log M during training, α = 1; in this case the noise reconstruction is much worse than the signal reconstruction, meaning the effect of the noise is being suppressed by the network, and gradient descent is used to optimize the positive ratio term. Conversely, when the ratio term is less than −log M, α = −1; in this case the signal reconstruction is much worse than the noise reconstruction, meaning the effect of the noise is too prominent and the shape information is suppressed, and gradient descent is used to optimize the negative ratio term. If the term lies within the ideal range [−log M, log M], α = 0 and no additional optimization is applied.
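The equation images are not reproduced in this text, but the description pins down the structure: two reconstruction terms plus a ratio term gated by a dynamic coefficient α. A hedged PyTorch sketch, assuming L1 reconstruction errors and a log-ratio of the two errors:

```python
import torch
import torch.nn.functional as F

def sn_recon_loss(x, x_rec, z, z_rec, log_m=1.0):
    """Signal-noise reconstruction loss sketch. The exact formula is an
    assumption: L1 errors plus alpha * log(noise_err / signal_err), with
    alpha chosen by the [-logM, logM] rule described in the text."""
    l_sig = F.l1_loss(x_rec, x)        # signal reconstruction sub-term
    l_noise = F.l1_loss(z_rec, z)      # noise reconstruction sub-term
    ratio = torch.log(l_noise / (l_sig + 1e-8) + 1e-8)
    if ratio.item() > log_m:           # noise reconstruction much worse
        alpha = 1.0
    elif ratio.item() < -log_m:        # signal reconstruction much worse
        alpha = -1.0
    else:                              # within the ideal range: no extra term
        alpha = 0.0
    return l_sig + l_noise + alpha * ratio
```

When both errors are comparable the ratio stays inside [-logM, logM], α is 0, and only the two plain reconstruction terms are optimized.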
6. Construct a diversity loss function. Diversity is positively correlated with the difference between the transformation parameters corresponding to different noise hidden vectors; the two sets of transformation parameters are estimated from two signal-noise mixed hidden vectors, each injected with a different noise vector. The diversity loss is defined as follows:
where P denotes the predictor, E denotes the encoder, and the two noise hidden vectors are drawn from the same Gaussian distribution.
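The exact diversity formula is likewise not reproduced in the text. A common mode-seeking form that matches the description (parameters predicted from two different noise injections should differ) is sketched below; the normalization by the noise distance is an assumption:

```python
import torch

def diversity_loss(params1, params2, z1, z2, eps=1e-8):
    """Mode-seeking diversity sketch: minimizing this loss maximizes the
    distance between the two predicted parameter sets per unit of noise
    distance, so different noise vectors yield different deformations."""
    d_param = torch.mean(torch.abs(params1 - params2))
    d_noise = torch.mean(torch.abs(z1 - z2))
    return -d_param / (d_noise + eps)
```

Here `params1 = P(E(x) + z1)` and `params2 = P(E(x) + z2)` in the notation of the text, with z1 and z2 drawn from the same Gaussian.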
7. Construct a least-squares generative adversarial loss function as the optimization target of the adversarial training; the adversarial loss pulls the distribution of the deformed character images towards that of the target character images. The specific formula is as follows:
through the countertraining, the spatial transform network can generate samples approximating the target character image.
8. All image pixel sizes are set to 64 × 64, the batch size is 64, the initial learning rate is 0.0001, the number of training iterations is 5000, the learning rate begins to decay linearly to 1e-5 after 2500 iterations, and the network is optimized with the Adam optimizer.
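These training settings translate directly into an Adam optimizer plus a piecewise-linear learning-rate schedule; `model` below is a stand-in module:

```python
import torch

# Settings from the text: lr 1e-4, 5000 iterations, linear decay to 1e-5
# starting after iteration 2500.
model = torch.nn.Linear(8, 8)          # stand-in for the generator
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def lr_at(it, total=5000, decay_start=2500, lr0=1e-4, lr_end=1e-5):
    """Constant lr until decay_start, then linear decay to lr_end at `total`."""
    if it < decay_start:
        return lr0
    frac = (it - decay_start) / (total - decay_start)
    return lr0 + frac * (lr_end - lr0)

# LambdaLR expects a multiplier relative to the base lr
sched = torch.optim.lr_scheduler.LambdaLR(opt, lambda it: lr_at(it) / 1e-4)
```

Calling `sched.step()` once per iteration reproduces the stated schedule.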
9. Train the whole shape-transformation generative adversarial network according to the settings in point 8; the trained generator obtained can then be used to generate diverse augmented samples.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, as long as a combination contains no contradiction, it should be considered within the scope of this specification.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.
Claims (10)
1. A character image augmentation method based on shape transformation is characterized by comprising the following steps:
step 1, constructing a shape transformation generation countermeasure network, which comprises a generator and a discriminator;
step 2, taking the original character image as the input of a generator, generating a deformed character image after spatial transformation, and connecting the output end of the generator with the input end of a discriminator; inputting the target character image into the other input end of the discriminator, and outputting discrimination results of the deformed character image and the target character image by the discriminator;
step 3, training the shape transformation to generate a confrontation network;
and 4, generating the augmented character image by using the trained generator.
2. The method of claim 1, wherein the generator is a spatial transform network including an encoder, a predictor, a sampler, a noise reconstruction network, and an image reconstruction network;
the encoder is formed by a plurality of convolution modules connected in sequence, each convolution module comprising a two-dimensional convolution layer, a nonlinear activation layer and a pooling layer connected in sequence;
the predictor is formed by a plurality of fully-connected modules connected in sequence followed by a last fully-connected layer, each fully-connected module comprising a fully-connected layer and a nonlinear activation layer, and the number of output channels of the last fully-connected layer is set to the number of deformation parameters to be predicted;
the sampler maps the deformed character image pixel area to the original character image pixel area by applying matrix multiplication on a sampling grid;
the image reconstruction network is sequentially connected by a plurality of full-connection modules, a full-connection layer and a plurality of transposition convolution modules, and each transposition convolution module comprises a transposition convolution layer and a nonlinear activation layer which are sequentially connected;
the noise reconstruction network is sequentially connected by a plurality of layers of full connection modules and a full connection layer;
the discriminator is based on the PatchGAN structure and consists of five convolution modules connected in sequence, wherein each of the first four convolution modules comprises a two-dimensional convolution layer, an instance normalization layer and a LeakyReLU activation layer, and the last convolution module comprises a padding layer and a two-dimensional convolution layer;
firstly, an original character image is used as the input of the encoder, which extracts shape features from the original character image and outputs a shape feature vector; next, a noise hidden vector is randomly selected from the standard normal distribution and fused with the shape feature vector; the fused hidden vector is input into the predictor, which is responsible for predicting TPS transformation parameters and affine transformation parameters, wherein the TPS transformation parameters are the coordinate values of the TPS transformation sampling-grid matching points and the affine transformation parameters are converted into an affine transformation sampling grid; then the TPS transformation sampling grid, the affine transformation sampling grid and the original character image are input into the sampler, which outputs a deformed character image; meanwhile, the output end of the predictor is connected with the input ends of the image reconstruction network and the noise reconstruction network, which reconstruct the original character image and the noise hidden vector respectively; then the deformed character image output by the generator and the target character image are respectively input to the discriminator, which outputs the discrimination results of the deformed character image and the target character image.
3. The character image augmentation method based on shape transformation as claimed in claim 2, wherein a least-squares generative adversarial loss function is constructed as the optimization target for training the shape-transformation generative adversarial network, with the following calculation formula:
wherein the formula involves the generator loss function, the discriminator, the generator, the signal-noise reconstruction loss function, the diversity loss function, the discriminator loss function, the original character image, the target character image, and the corresponding mathematical expectations.
4. The method as claimed in claim 3, wherein the signal-to-noise reconstruction loss function includes a signal reconstruction sub-term, a noise reconstruction sub-term, and a reconstruction error ratio term, and the calculation formula is as follows:
wherein the mean absolute error is used, the image reconstruction network and the noise reconstruction network yield the reconstructed original image and the reconstructed noise hidden vector respectively, and α is a dynamic coefficient: if the reconstruction error ratio term is greater than log M, α = 1; if the reconstruction error ratio term is less than −log M, α = −1; if the reconstruction error ratio term lies within the ideal range [−log M, log M], α = 0; M denotes a hyperparameter;
the calculation formula of the diversity loss function is as follows:
5. The method of claim 2, wherein each of said convolution modules in the encoder further comprises a batch normalization layer disposed between the two-dimensional convolution layer and the nonlinear activation layer; the nonlinear activation function of the nonlinear activation layer in the encoder is the ReLU function, and the pooling operation of the pooling layer is maximum pooling;
each fully-connected module in the predictor further comprises a batch normalization layer located between the fully-connected layer and the nonlinear activation layer; the nonlinear activation function in the predictor is the ReLU function, and the pooling operation of the pooling layer is maximum pooling;
the transposed convolution module further comprises a batch normalization layer located between the transposed convolution layer and the nonlinear activation layer; the nonlinear activation function in the transposed convolution module is the ReLU function, and the pooling operation of the pooling layer is maximum pooling.
6. The method as claimed in claim 2, wherein the number of output channels of the last fully-connected layer in the predictor is set to 132; of the deformation parameters to be predicted, 128 are the coordinates of the 64 TPS transformation sampling-grid matching points and 4 are the element values of the affine transformation matrix.
7. A method for augmenting a character image based on shape transformation according to claim 2 or 6, characterized in that the objective of the TPS transformation is to solve a deformation function that maps the coordinates of each TPS transformation sampling-grid matching point on the original character image to the coordinates of the corresponding matching point on the deformed character image while minimizing the bending energy; assuming that n coordinate pairs of TPS transformation sampling-grid matching points of the two images have been acquired, the deformation function is imagined as bending a thin metal plate through the given n points, and the energy function of bending the plate is expressed as:
the thin plate spline function can be proved to be the function with the minimum bending energy, and the thin plate spline function is as follows:
8. The method as claimed in claim 2 or 6, wherein the sampling formula of the affine transformation sampling grid is as follows:
9. The method as claimed in claim 2, wherein the sampler is implemented by the torch.nn.functional.grid_sample() method in PyTorch, and the affine transformation parameters are converted into the affine transformation sampling grid by the torch.nn.functional.affine_grid() method in PyTorch.
10. The method of claim 2, wherein the pixel size of all images is 64 x 64, the batch size is 64, the initial learning rate is 0.0001, the number of iterations of the adversarial network is 5000, the learning rate starts to decay linearly to 1e-5 after 2500 iterations, and the adversarial network is optimized with the Adam optimizer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210285238.8A CN114782961B (en) | 2022-03-23 | 2022-03-23 | Character image augmentation method based on shape transformation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114782961A true CN114782961A (en) | 2022-07-22 |
CN114782961B CN114782961B (en) | 2023-04-18 |
Family
ID=82424735
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210285238.8A Active CN114782961B (en) | 2022-03-23 | 2022-03-23 | Character image augmentation method based on shape transformation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114782961B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108399408A (en) * | 2018-03-06 | 2018-08-14 | 李子衿 | A kind of deformed characters antidote based on deep space converting network |
CN111209497A (en) * | 2020-01-05 | 2020-05-29 | 西安电子科技大学 | DGA domain name detection method based on GAN and Char-CNN |
CN111242241A (en) * | 2020-02-17 | 2020-06-05 | 南京理工大学 | Method for amplifying etched character recognition network training sample |
CN111652332A (en) * | 2020-06-09 | 2020-09-11 | 山东大学 | Deep learning handwritten Chinese character recognition method and system based on two classifications |
CN111915540A (en) * | 2020-06-17 | 2020-11-10 | 华南理工大学 | Method, system, computer device and medium for augmenting oracle character image |
CN114037644A (en) * | 2021-11-26 | 2022-02-11 | 重庆邮电大学 | Artistic digital image synthesis system and method based on generation countermeasure network |
Non-Patent Citations (1)
Title |
---|
王浩彬: "基于深度学习的甲骨文检测与识别研究", 《中国优秀硕士学位论文全文数据库 哲学与人文科学辑》 * |
Also Published As
Publication number | Publication date |
---|---|
CN114782961B (en) | 2023-04-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lei et al. | Coupled adversarial training for remote sensing image super-resolution | |
CN110136063B (en) | Single image super-resolution reconstruction method based on condition generation countermeasure network | |
Liang et al. | Understanding mixup training methods | |
CN111563841B (en) | High-resolution image generation method based on generation countermeasure network | |
Zhu et al. | Data Augmentation using Conditional Generative Adversarial Networks for Leaf Counting in Arabidopsis Plants. | |
Liu et al. | Very deep convolutional neural network based image classification using small training sample size | |
CN106447626A (en) | Blurred kernel dimension estimation method and system based on deep learning | |
CN111080513B (en) | Attention mechanism-based human face image super-resolution method | |
CN107590497A (en) | Off-line Handwritten Chinese Recognition method based on depth convolutional neural networks | |
CN111695494A (en) | Three-dimensional point cloud data classification method based on multi-view convolution pooling | |
CN112837224A (en) | Super-resolution image reconstruction method based on convolutional neural network | |
CN114494003B (en) | Ancient character generation method combining shape transformation and texture transformation | |
Guo et al. | Multiscale semilocal interpolation with antialiasing | |
CN113096020B (en) | Calligraphy font creation method for generating confrontation network based on average mode | |
CN113744136A (en) | Image super-resolution reconstruction method and system based on channel constraint multi-feature fusion | |
CN113378949A (en) | Dual-generation confrontation learning method based on capsule network and mixed attention | |
CN113240584A (en) | Multitask gesture picture super-resolution method based on picture edge information | |
Lin et al. | Generative adversarial image super‐resolution network for multiple degradations | |
CN114782961B (en) | Character image augmentation method based on shape transformation | |
CN116797456A (en) | Image super-resolution reconstruction method, system, device and storage medium | |
CN114155560B (en) | Light weight method of high-resolution human body posture estimation model based on space dimension reduction | |
Jiang et al. | Tcgan: Semantic-aware and structure-preserved gans with individual vision transformer for fast arbitrary one-shot image generation | |
CN112732943B (en) | Chinese character library automatic generation method and system based on reinforcement learning | |
CN114140317A (en) | Image animation method based on cascade generation confrontation network | |
CN115909045B (en) | Two-stage landslide map feature intelligent recognition method based on contrast learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||