CN114782961B - Character image augmentation method based on shape transformation - Google Patents

Character image augmentation method based on shape transformation

Info

Publication number: CN114782961B
Application number: CN202210285238.8A
Authority: CN (China)
Prior art keywords: character image, layer, transformation, function, discriminator
Legal status: Active (granted)
Other versions: CN114782961A (application publication)
Other languages: Chinese (zh)
Inventors: 黄双萍, 黄鸿翔, 杨代辉
Assignee: Guangdong Provincial Laboratory of Artificial Intelligence and Digital Economy (Guangzhou); South China University of Technology (SCUT)
Filing date: 2022-03-23
Grant publication date: 2023-04-18

Classifications

    • G06F18/214: Pattern recognition; analysing; design or setup of recognition systems or techniques; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/24: Pattern recognition; analysing; classification techniques
    • G06N3/045: Neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N3/048: Neural networks; architecture; activation functions
    • G06N3/08: Neural networks; learning methods
    • G06T3/18: Geometric image transformations in the plane of the image; image warping, e.g. rearranging pixels individually


Abstract

The invention discloses a character image augmentation method based on shape transformation, which comprises the following steps: constructing a shape-transformation generative adversarial network comprising a generator and a discriminator; taking an original character image as the input of the generator, which generates a deformed character image after spatial transformation; connecting the output of the generator to one input of the discriminator while inputting a target character image to the other input, the discriminator outputting discrimination results of the deformed character image and the target character image; training the shape-transformation generative adversarial network; and generating augmented character images with the trained generator. The method combines affine-matrix and TPS-transform sampling-grid parameters so that the STN generates global and local shape changes simultaneously, better fits the shape characteristics of characters, and achieves better realism and diversity of the generated characters, which further improves the classification performance of a classifier trained on the augmented data.

Description

Character image augmentation method based on shape transformation
Technical Field
The invention belongs to the technical field of image processing and artificial intelligence, and particularly relates to a character image augmentation method based on shape transformation.
Background
Text image recognition methods based on deep learning show great potential; however, training a high-performance character image recognition model requires a large amount of annotated data that is as diverse as possible. Manually collecting and labeling text image data is extremely expensive and time-consuming, especially for text images with complicated shapes such as ancient characters and handwritten formulas. Data augmentation, in contrast, is a cost-effective way to increase data diversity.
Data augmentation methods fall into shape-transformation-based augmentation, such as flipping, rotating, scaling, cropping, and translating the image, and non-shape-transformation-based augmentation, such as color jittering and noise injection. Conventional shape-transformation-based augmentation algorithms mainly sample deformation parameters at random from a manually preset probability distribution to control the shape transformation of the image content. The most common approach is to use an affine matrix to generate affine transformations, as in the affine-transformation-based spatial transformation network (STN) algorithm; alternatively, local deformations are generated with non-rigid deformation algorithms such as the Thin Plate Spline (TPS) transform. However, because conventional augmentation algorithms rely on sampling from a manually preset distribution, calculating and selecting the distribution is complex, and the selected distribution can hardly fit the distribution of actual character shapes completely, resulting in high labor cost and poor realism of the deformation.
Although generative adversarial networks can remove the need to manually calculate and select a distribution, existing techniques still suffer from insufficient deformation fineness. An actual character shape has both global characteristics (such as rotation angle and translation distance) and local characteristics (such as the degree of twisting, length, and thickness of strokes), whereas the prior art generates only a single kind of deformation, either global or local, so the realism and diversity of the transformed shapes are poor, and the performance gain that the augmented data brings to downstream tasks is limited. Moreover, neural-network-based augmentation techniques adopt the generative adversarial loss as the optimization target, which provides only weak supervision, since the adversarial loss computes loss values for just two labels, true and false; because character images have both global and local shape characteristics, the loss function must provide stronger supervision to guarantee the realism and diversity of the transformed character shapes. For example, studies have shown that a generative adversarial network may suppress the effect of the semantic vector or noise vector in the generator, which can cause too little deformation (even no change) or too much deformation (distorted shapes) of the text and break the balance between the diversity and realism of character shapes, so that data augmentation brings only limited performance improvement or even adverse effects.
Disclosure of Invention
In view of the above, there is a need for a new character image augmentation method based on shape transformation that combines affine-matrix and TPS-transform sampling-grid parameters so that the STN generates global and local shape changes simultaneously, improving the fineness of deformation. A noise-vector injection technique is introduced into the STN to enrich the diversity of the generated samples, and a diversity loss function and a signal-noise reconstruction loss function are designed to strengthen the supervision of the network: the diversity loss function promotes the diversity of the deformation parameters, thereby increasing the diversity of the deformation, and the signal-noise reconstruction loss function maintains the signal-noise balance so that the degree of deformation stays within a reasonable range.
The invention discloses a character image augmentation method based on shape transformation, which comprises the following steps:
step 1, constructing a shape-transformation generative adversarial network comprising a generator and a discriminator;
step 2, taking the original character image as the input of the generator, which generates a deformed character image after spatial transformation; connecting the output of the generator to one input of the discriminator and inputting the target character image to the other input of the discriminator, the discriminator outputting discrimination results of the deformed character image and the target character image;
step 3, training the shape-transformation generative adversarial network;
step 4, generating augmented character images with the trained generator.
Specifically, the generator is a spatial transformation network comprising an encoder, a predictor, a sampler, a noise reconstruction network, and an image reconstruction network.
The encoder consists of several convolution modules connected in sequence; each convolution module comprises a two-dimensional convolution layer, a nonlinear activation layer, and a pooling layer connected in sequence.
The predictor consists of several fully-connected modules connected in sequence, followed by a final fully-connected layer; each fully-connected module comprises a fully-connected layer and a nonlinear activation layer, and the number of output channels of the final fully-connected layer is set to the number of deformation parameters to be predicted.
The sampler maps the pixel region of the deformed character image to the pixel region of the original character image by applying matrix multiplication on a sampling grid.
The image reconstruction network consists of several fully-connected modules, a fully-connected layer, and several transposed-convolution modules connected in sequence; each transposed-convolution module comprises a transposed convolution layer and a nonlinear activation layer connected in sequence.
The noise reconstruction network consists of several fully-connected modules and a fully-connected layer connected in sequence.
First, the original character image is taken as the input of the encoder, which extracts shape features from it and outputs a shape feature vector. A noise hidden vector is then randomly drawn from the standard normal distribution and fused with the shape feature vector. The fused hidden vector is input into the predictor, which predicts the TPS (thin plate spline) transformation parameters and the affine transformation parameters; the TPS transformation parameters are the coordinate values of the TPS-transform sampling-grid matching points, and the affine transformation parameters are converted into an affine-transform sampling grid. The TPS-transform sampling grid, the affine-transform sampling grid, and the original character image are then input into the sampler, which outputs the deformed character image. Meanwhile, the output of the predictor is connected to the inputs of the image reconstruction network and the noise reconstruction network, which reconstruct the original character image and the noise hidden vector, respectively. The deformed character image output by the generator and the target character image are then input into the discriminator, which outputs discrimination results for both.
The discriminator is based on the PatchGAN structure and consists of five convolution modules connected in sequence; each of the first four convolution modules comprises a two-dimensional convolution layer, an instance normalization layer, and a LeakyReLU activation layer, and the last convolution module comprises a padding layer and a two-dimensional convolution layer.
Preferably, a least-squares generative adversarial loss function is constructed as the optimization target of the shape-transformation generative adversarial network training, calculated as follows:

$$\mathcal{L}_{G} = \mathbb{E}_{x \sim p(x)}\big[(D_g(G_g(x)) - 1)^2\big] + L_{snr}(G_g) + L_{div}(E, P)$$

$$\mathcal{L}_{D} = \mathbb{E}_{y \sim p(y)}\big[(D_g(y) - 1)^2\big] + \mathbb{E}_{x \sim p(x)}\big[D_g(G_g(x))^2\big]$$

where $\mathcal{L}_{G}$ represents the generator loss function, $D_g$ the discriminator, $G_g$ the generator, $L_{snr}(G_g)$ the signal-noise reconstruction loss function, $L_{div}(E, P)$ the diversity loss function, $\mathcal{L}_{D}$ the discriminator loss function, $x$ the original character image, $y$ the target character image, and $\mathbb{E}_{x \sim p(x)}$ and $\mathbb{E}_{y \sim p(y)}$ the corresponding mathematical expectations.
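For illustration, a minimal PyTorch sketch of these least-squares losses follows, assuming the discriminator returns a PatchGAN score map; the signal-noise and diversity terms are sketched separately later in the description.

```python
import torch
import torch.nn.functional as F

# A minimal sketch of the least-squares adversarial losses above, where
# `fake_score = D(G(x))` and `real_score = D(y)` are discriminator outputs.
def d_loss(real_score, fake_score):
    # LSGAN discriminator loss: push real scores to 1, fake scores to 0.
    return F.mse_loss(real_score, torch.ones_like(real_score)) + \
           F.mse_loss(fake_score, torch.zeros_like(fake_score))

def g_loss(fake_score, loss_snr, loss_div):
    # LSGAN generator loss: push fake scores to 1, plus the auxiliary terms.
    return F.mse_loss(fake_score, torch.ones_like(fake_score)) + loss_snr + loss_div
```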
The signal-noise reconstruction loss function comprises a signal reconstruction sub-term, a noise reconstruction sub-term, and a reconstruction error ratio term, calculated as follows:

$$L_{snr}(G_g) = L_1(x, x_{rec}) + L_1(z, z_{rec}) + \alpha \log\frac{L_1(z, z_{rec})}{L_1(x, x_{rec})}$$

where $L_1$ represents the mean absolute error, $x$ the original character image, $z$ the noise hidden vector, and $x_{rec}$ and $z_{rec}$ the original image and the noise hidden vector reconstructed by the image reconstruction network $R_x$ and the noise reconstruction network $R_z$, respectively. $\alpha$ is a dynamic coefficient: if the reconstruction error ratio term is greater than $\log M$, set $\alpha = 1$; if the reconstruction error ratio term is less than $-\log M$, set $\alpha = -1$; if the reconstruction error ratio term is within the ideal range $[-\log M, \log M]$, set $\alpha = 0$, where $M$ represents a hyperparameter.
the calculation formula of the diversity loss function is as follows:
Figure 687468DEST_PATH_IMAGE021
whereinPA predictor is represented by a representation of the motion vector,Ewhich represents the encoder, is a digital representation of the encoder,
Figure 252442DEST_PATH_IMAGE022
and &>
Figure 790477DEST_PATH_IMAGE023
Respectively representing different noise hidden vectors taken from the same gaussian distribution.
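For illustration, a minimal PyTorch sketch of this diversity loss follows, assuming the additive feature-noise fusion described in the detailed embodiment; `encoder` and `predictor` stand for the networks described above.

```python
import torch

# A minimal sketch of the diversity loss: the same shape feature E(x) is
# fused with two independent noise vectors, and the L1 distance between the
# two predicted deformation-parameter vectors is maximized (hence the minus).
def diversity_loss(encoder, predictor, x):
    feat = encoder(x)                          # shape feature vector E(x)
    z1 = torch.randn_like(feat)                # noise hidden vector z1
    z2 = torch.randn_like(feat)                # noise hidden vector z2
    p1 = predictor(feat + z1)                  # deformation params under z1
    p2 = predictor(feat + z2)                  # deformation params under z2
    return -torch.mean(torch.abs(p1 - p2))     # negative L1 distance
```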
Optionally, each convolution module in the encoder further includes a batch normalization layer between the two-dimensional convolution layer and the nonlinear activation layer; the nonlinear activation function of the nonlinear activation layer in the encoder is the ReLU function, and the pooling operation of the pooling layer is max pooling.
Optionally, each fully-connected module in the predictor further comprises a batch normalization layer between the fully-connected layer and the nonlinear activation layer; the nonlinear activation function in the predictor is the ReLU function.
Preferably, the number of output channels of the last fully-connected layer in the predictor is set to 132: of the deformation parameters to be predicted, 128 are the coordinates of 64 TPS-transform sampling-grid matching points and 4 are the element values of the affine transformation matrix.
Optionally, the transposed-convolution module further includes a batch normalization layer between the transposed convolution layer and the nonlinear activation layer; the nonlinear activation function in the transposed-convolution module is the ReLU function.
Specifically, the sampler is implemented with the torch.nn.functional.grid_sample() method in PyTorch, and the affine parameters are converted into the affine-transform sampling grid with the torch.nn.functional.affine_grid() method in PyTorch.
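For illustration, a minimal usage sketch of these two PyTorch functions follows; the batch size, channel count, and example affine matrix are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

# Build an affine sampling grid from a 2x3 affine matrix with affine_grid(),
# then resample the image with grid_sample().
x = torch.randn(1, 1, 64, 64)                  # original character image
theta = torch.tensor([[[1.0, 0.0, 0.1],        # 2x3 affine matrix
                       [0.0, 1.0, -0.1]]])     # (small translation here)
grid = F.affine_grid(theta, size=x.shape, align_corners=False)
warped = F.grid_sample(x, grid, align_corners=False)
```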
More specifically, the goal of the TPS transformation is to solve a deformation function $f$ such that $f(p_i(x_i, y_i)) = p_i(x_i', y_i')$ for $1 \le i \le n$ and the bending energy function is minimal, where $p_i(x_i, y_i)$ denotes the coordinates of a TPS-transform sampling-grid matching point on the original character image, $p_i(x_i', y_i')$ denotes the coordinates of the corresponding matching point of the deformed character image, and $n$ is the number of sampling-grid matching points in the TPS transform. Assume that $n$ matching point pairs of the two images have been acquired: $(p_1(x_1,y_1), p_1(x_1',y_1'))$, $(p_2(x_2,y_2), p_2(x_2',y_2'))$, ..., $(p_n(x_n,y_n), p_n(x_n',y_n'))$. The deformation function is imagined as bending a thin metal plate through the given $n$ points, and the energy of bending the sheet is expressed as:

$$I_f = \iint \left[ \left(\frac{\partial^2 f}{\partial x^2}\right)^2 + 2\left(\frac{\partial^2 f}{\partial x \partial y}\right)^2 + \left(\frac{\partial^2 f}{\partial y^2}\right)^2 \right] dx\,dy$$

The thin plate spline can be shown to be the function with the minimum bending energy; the thin plate spline function is:

$$f(x, y) = a_1 + a_2 x + a_3 y + \sum_{i=1}^{n} w_i\, U\big(\lVert (x_i, y_i) - (x, y) \rVert\big)$$

where $U$ is the basis function:

$$U(r) = r^2 \log r^2, \qquad r = \lVert (x_i, y_i) - (x, y) \rVert$$

The coefficients $a_1$, $a_2$, $a_3$ and $w_i$ are solved from the preset coordinate values of the $n$ TPS-transform sampling-grid matching points and the offsets predicted by the predictor, thereby obtaining the specific expression of $f(x, y)$.
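For illustration, a minimal PyTorch sketch of solving the TPS coefficients $w_i$, $a_1$, $a_2$, $a_3$ from $n$ matching-point pairs follows, using the standard linear system implied by the interpolation conditions; this is an assumed implementation, not code from the patent.

```python
import torch

# Solve TPS coefficients from n control points `src` (n, 2) and their
# targets `dst` (n, 2) via the linear system [[K, P], [P^T, 0]] c = [dst, 0].
def tps_coefficients(src, dst, eps=1e-9):
    n = src.shape[0]
    d2 = torch.cdist(src, src).pow(2)              # squared pairwise distances
    K = d2 * torch.log(d2 + eps)                   # U(r) = r^2 log r^2
    P = torch.cat([torch.ones(n, 1), src], dim=1)  # rows [1, x, y]
    top = torch.cat([K, P], dim=1)
    bottom = torch.cat([P.t(), torch.zeros(3, 3)], dim=1)
    A = torch.cat([top, bottom], dim=0)            # (n+3) x (n+3) system
    b = torch.cat([dst, torch.zeros(3, 2)], dim=0) # solve x' and y' maps jointly
    coef = torch.linalg.solve(A, b)                # (n+3, 2): w_i then a1,a2,a3
    return coef[:n], coef[n:]                      # weights w_i, affine part a
```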
The sampling formula of the affine-transform sampling grid is as follows:

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} scale \cdot \cos\theta & -scale \cdot \sin\theta & t_x \\ scale \cdot \sin\theta & scale \cdot \cos\theta & t_y \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$

where $scale$, $\theta$, $t_x$, $t_y$ respectively represent the affine transformation parameters predicted by the predictor, and $(x, y)$ and $(x', y')$ are respectively the position coordinates of a pixel point before and after the transformation.
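For illustration, a minimal sketch of converting the four predicted parameters into the 2 × 3 matrix expected by torch.nn.functional.affine_grid() follows; the parameter ordering in the predictor output is an assumption.

```python
import torch
import torch.nn.functional as F

# Assemble the 2x3 affine matrix from the four predicted parameters
# (scale, theta, t_x, t_y), matching the sampling formula above.
def affine_matrix(scale, theta, tx, ty):
    cos, sin = torch.cos(theta), torch.sin(theta)
    row0 = torch.stack([scale * cos, -scale * sin, tx], dim=-1)
    row1 = torch.stack([scale * sin, scale * cos, ty], dim=-1)
    return torch.stack([row0, row1], dim=-2)       # shape (..., 2, 3)

params = torch.tensor([1.05, 0.1, 0.02, -0.03])    # example predicted values
mat = affine_matrix(*params).unsqueeze(0)          # add batch dimension
grid = F.affine_grid(mat, size=(1, 1, 64, 64), align_corners=False)
```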
Preferably, all images have a pixel size of 64 × 64, the batch size is 64, the initial learning rate is 0.0001, the number of training iterations of the adversarial network is 5000, the learning rate starts to decay linearly to 1e-5 after 2500 iterations, and the adversarial network is optimized with the Adam optimizer.
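A minimal PyTorch sketch of this optimizer and linear-decay schedule follows; `model` is a placeholder for the generator or discriminator being optimized.

```python
import torch

# Adam at 1e-4, constant for the first 2500 iterations, then decaying
# linearly to 1e-5 by iteration 5000; step the scheduler once per iteration.
model = torch.nn.Linear(8, 8)                      # placeholder module
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def lr_factor(it, total=5000, start_decay=2500, final=1e-5, init=1e-4):
    if it < start_decay:
        return 1.0
    frac = (it - start_decay) / (total - start_decay)
    return (init + frac * (final - init)) / init   # linear ramp 1e-4 -> 1e-5

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lr_factor)
```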
Compared with the prior art, the invention has the following beneficial effects:
The method combines affine-matrix and TPS-transform sampling-grid parameters so that the STN generates global and local shape changes simultaneously, better fits the shape characteristics of characters, and achieves better realism and diversity of the generated characters, which further improves the classification performance of a classifier trained on the augmented data.
The method introduces a noise-vector injection technique into the STN to promote the diversity of the generated samples, and designs a signal-noise reconstruction loss function that maintains the signal-noise balance and a diversity loss function that produces rich shape transformations, providing stronger supervision for STN training; the degree of deformation of the samples is thus more reasonable and richer, and using the augmented data to train a classifier improves its classification performance.
Drawings
FIG. 1 is a schematic flow diagram of a method embodying the present invention;
FIG. 2 is a schematic structural diagram of the operation of the modules according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For reference and clarity, the technical terms and abbreviations used hereinafter are summarized as follows:
STN: Spatial Transformer Network (spatial transformation network).
TPS: Thin Plate Spline.
CNN: Convolutional Neural Network.
FC network: fully-connected network.
PyTorch: a mainstream deep learning framework that encapsulates many commonly used deep-learning-related functions and classes.
ReLU/LeakyReLU: nonlinear activation functions.
Generative adversarial network: a generative-network training framework based on the zero-sum-game idea, comprising a generator and a discriminator.
Hidden vector: a vector in a random-variable space.
The invention discloses a character image augmentation method based on shape transformation, which aims to solve various problems in the prior art.
Fig. 1 shows a schematic flow diagram of an embodiment of the invention. A character image augmentation method based on shape transformation includes the following steps:
step 1, constructing a shape-transformation generative adversarial network comprising a generator and a discriminator;
step 2, taking the original character image as the input of the generator, which generates a deformed character image after spatial transformation; connecting the output of the generator to one input of the discriminator while inputting the target character image to the other input of the discriminator, the discriminator outputting discrimination results of the deformed character image and the target character image;
step 3, training the shape-transformation generative adversarial network;
step 4, generating augmented character images with the trained generator.
As shown in fig. 2, the generator is a spatial transformation network comprising an encoder, a predictor, a sampler, a noise reconstruction network, and an image reconstruction network. First, the original character image is taken as the input of the encoder, which extracts shape features from it and outputs a shape feature vector. A noise hidden vector is then randomly drawn from the standard normal distribution and fused with the shape feature vector. The fused hidden vector is input into the predictor, which predicts the TPS transformation parameters and the affine transformation parameters; the TPS transformation parameters are the coordinate values of the TPS-transform sampling-grid matching points, and the affine transformation parameters are converted into an affine-transform sampling grid. The TPS-transform sampling grid, the affine-transform sampling grid, and the original character image are then input into the sampler, which outputs the deformed character image. Meanwhile, the output of the predictor is connected to the inputs of the image reconstruction network and the noise reconstruction network, which reconstruct the original character image and the noise hidden vector, respectively. The deformed character image output by the generator and the target character image are then input into the discriminator, which outputs the discrimination results of the deformed character image and the target character image.
Specifically, the present embodiment adopts the following steps to implement the inventive method.
1. Construct the original character dataset to be augmented and a target character dataset with the target shape characteristics, respectively.
2. Build a spatial transformation network to serve as the generator in adversarial training. The specific steps are as follows:
(1) The spatial transformation network comprises three main modules: an encoder, a predictor, and a sampler. First, construct the encoder, which is composed of a convolutional neural network (CNN); the number of convolution layers is generally chosen to be more than 3. In this embodiment, 4 convolution modules are connected in sequence, each comprising a two-dimensional convolution layer, a batch normalization layer, a nonlinear activation layer, and a pooling layer, where the batch normalization layer is optional, the nonlinear activation function may be the ReLU or LeakyReLU function, and the pooling operation may be max, average, or adaptive pooling; this embodiment adopts the ReLU function and max pooling. A sketch of such an encoder is given below.
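Below is a minimal PyTorch sketch of such an encoder under the stated choices (ReLU, max pooling, optional batch normalization); the channel widths and the final flatten-and-project step to a 128-dimensional shape feature vector are assumptions.

```python
import torch.nn as nn

# Four convolution modules, each Conv2d -> BatchNorm2d -> ReLU -> MaxPool2d.
def conv_module(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),                    # optional per the patent
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),                           # max pooling choice
    )

encoder = nn.Sequential(
    conv_module(1, 32), conv_module(32, 64),
    conv_module(64, 128), conv_module(128, 128),
    nn.Flatten(),                                  # 64x64 input -> 4x4 maps
    nn.Linear(128 * 4 * 4, 128),                   # shape feature vector
)
```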
(2) Next, construct the predictor, which is composed of a fully-connected (FC) neural network; the number of FC layers is generally more than 2. The FC network of this embodiment comprises 3 FC modules connected in sequence, followed by one final FC layer. Each FC module comprises an FC layer, a batch normalization layer, and a nonlinear activation layer, where the batch normalization layer is optional and the nonlinear activation function may be the ReLU or LeakyReLU function; this embodiment adopts the ReLU function. The number of output channels of the final FC layer is set to the number of deformation parameters to be predicted, which is 132 in this embodiment: 128 parameters are the coordinates of the 8 × 8 = 64 TPS-transform sampling-grid matching points, and 4 are the element values of the affine transformation matrix. Note that the number of TPS-transform sampling-grid matching points may instead be the square of another integer, up to the original image height or width. A sketch of such a predictor is given below.
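Below is a minimal PyTorch sketch of such a predictor; the hidden widths and the 128-dimensional input are assumptions, while the 132 output channels follow the embodiment.

```python
import torch.nn as nn

# Three FC modules (Linear -> BatchNorm1d -> ReLU) plus a final Linear layer
# whose 132 outputs are the 128 TPS matching-point coordinates (8x8 grid,
# x and y per point) plus the 4 affine parameters.
def fc_module(in_f, out_f):
    return nn.Sequential(
        nn.Linear(in_f, out_f),
        nn.BatchNorm1d(out_f),                     # optional per the patent
        nn.ReLU(inplace=True),
    )

predictor = nn.Sequential(
    fc_module(128, 256), fc_module(256, 256), fc_module(256, 256),
    nn.Linear(256, 132),                           # 128 TPS coords + 4 affine
)
```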
(3) Then construct the sampler, which maps the pixel region of the deformed character image to the pixel region of the original character image by applying matrix multiplication on a sampling grid; the sampling is implemented with the torch.nn.functional.grid_sample() method in PyTorch.
(4) Finally, construct the image reconstruction network $R_x$ and the noise reconstruction network $R_z$. $R_x$ consists of 3 FC modules, 1 FC layer, and 4 transposed-convolution modules connected in sequence; $R_z$ consists of 3 FC modules and 1 FC layer connected in sequence. Each transposed-convolution module comprises 1 transposed convolution layer, 1 batch normalization layer, and 1 nonlinear activation layer connected in sequence, where the batch normalization layer is optional and the nonlinear activation function may be the ReLU or LeakyReLU function; this embodiment adopts the ReLU function. The role of the reconstruction networks is detailed in point 5 below.
(5) The working principle of the spatial transformation network is as follows. First, the original character image is taken as the input of the encoder, which extracts shape features from it and outputs a shape feature vector $h = E(x)$. Next, a noise hidden vector $z$ is randomly drawn from the standard normal distribution, and $h$ and $z$ are fused; the fusion in this embodiment is a direct element-wise sum. Here $h$ contains character-shape feature information and ensures the realism of the output, while $z$ introduces a degree of randomness and ensures the diversity of the output. The fused hidden vector is input into the predictor, which predicts the TPS transformation parameters and the affine transformation parameters: the TPS parameters are the coordinate values of the TPS-transform sampling-grid matching points (the sampling grid has 8 × 8 = 64 matching points), and the affine parameters are the element values of the affine transformation matrix, 4 parameters in total. Next, the 4 affine transformation parameters are converted into an affine-transform sampling grid by torch.nn.functional.affine_grid(). Then the TPS-transform sampling grid, the affine-transform sampling grid, and the original image are input into the sampler, and the deformed character image is output. Assume that $n$ pairs of TPS-transform sampling-grid matching points of the two images have been acquired: $(p_1(x_1,y_1), p_1(x_1',y_1'))$, $(p_2(x_2,y_2), p_2(x_2',y_2'))$, ..., $(p_n(x_n,y_n), p_n(x_n',y_n'))$.
In this example, $n$ is 64. The coordinate correspondence is computed with the TPS transform as follows. The objective of the TPS transformation is to solve a function $f$ such that $f(p_i(x_i, y_i)) = p_i(x_i', y_i')$ for $1 \le i \le n$ and the bending energy function is minimal, so that the other points on the image can be interpolated to obtain good transformation results. The deformation function can be thought of as bending a thin metal plate through the given $n$ points, and the energy of bending the sheet can be expressed as:

$$I_f = \iint \left[ \left(\frac{\partial^2 f}{\partial x^2}\right)^2 + 2\left(\frac{\partial^2 f}{\partial x \partial y}\right)^2 + \left(\frac{\partial^2 f}{\partial y^2}\right)^2 \right] dx\,dy \quad (1)$$

The thin plate spline can be shown to be the function with the minimum bending energy; it has the form:

$$f(x, y) = a_1 + a_2 x + a_3 y + \sum_{i=1}^{n} w_i\, U\big(\lVert (x_i, y_i) - (x, y) \rVert\big) \quad (2)$$

where $U$ is the basis function:

$$U(r) = r^2 \log r^2 \quad (3)$$

$$r = \lVert (x_i, y_i) - (x, y) \rVert \quad (4)$$

In the above formulas, $f(x, y)$ is determined once $a_1$, $a_2$, $a_3$ and $w_i$ are obtained; these can be solved from the preset coordinate values of the 64 TPS-transform sampling-grid matching points and the offsets predicted by the predictor.
Similarly, assume that $(x, y)$ and $(x', y')$ are the positions of a pixel point before and after the transformation, respectively. The sampling formula of the affine transformation is:

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} scale \cdot \cos\theta & -scale \cdot \sin\theta & t_x \\ scale \cdot \sin\theta & scale \cdot \cos\theta & t_y \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \quad (5)$$

where $scale$, $\theta$, $t_x$ and $t_y$ respectively represent the 4 affine transformation parameters predicted by the predictor.
Finally, the output of the predictor is connected to the inputs of the image reconstruction network and the noise reconstruction network, which reconstruct the original image and the noise hidden vector, respectively.
3. Construct the discriminator for adversarial training. The discriminator is based on the PatchGAN structure and consists of 5 convolution modules connected in sequence; each of the first 4 convolution modules comprises a two-dimensional convolution layer, an instance normalization layer, and a LeakyReLU activation layer, and the last convolution module comprises an optional padding layer and a two-dimensional convolution layer.
4. The generator takes the original character image $x$ as input and, after passing it through the spatial transformation network, outputs the deformed character image $G(x)$. The output of the generator is connected to one input of the discriminator, while the target character image $y$ is input to the other input of the discriminator; the discriminator outputs the discrimination results of the deformed character image and the target character image.
5. Construct the signal-noise reconstruction loss function, which consists of three sub-terms: a signal reconstruction sub-term, a noise reconstruction sub-term, and a reconstruction error ratio term:

$$L_{snr}(G_g) = L_1(x, x_{rec}) + L_1(z, z_{rec}) + \alpha \log\frac{L_1(z, z_{rec})}{L_1(x, x_{rec})} \quad (6)$$

where $x_{rec}$ and $z_{rec}$ denote the original image and the noise hidden vector reconstructed by the networks $R_x$ and $R_z$, respectively. In the absence of strong supervision, the learning process of a neural network may suppress the effect of its input information. To avoid this, reconstruction loss terms are designed separately for the shape information and the noise vector, so that the shape information preserves realism and the diversity-promoting effect of the noise is not suppressed. In addition, to keep the degree of transformation of the deformed font reasonable and controllable, a reconstruction error ratio term is designed to balance the respective effects of the shape information and the noise, constrained by a hyperparameter $M > 1$. Here $\alpha$ is a dynamic coefficient. If the ratio term is greater than $\log M$ during training, we set $\alpha = 1$: in this case the noise reconstruction is much worse than the signal reconstruction, meaning the effect of the noise is suppressed by the network, and gradient descent optimizes the positive ratio term. Conversely, when the ratio term is less than $-\log M$, we set $\alpha = -1$: the signal reconstruction is much worse than the noise reconstruction, meaning the effect of the noise is too prominent and the effect of the shape information is suppressed, and gradient descent optimizes the negative ratio term. If the ratio term lies within the ideal range $[-\log M, \log M]$, we set $\alpha = 0$, i.e., no additional optimization is applied.
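For illustration, a minimal PyTorch sketch of loss (6) with the dynamic coefficient $\alpha$ follows; the small epsilon guarding the ratio is an added numerical-stability assumption.

```python
import torch

# Signal-noise reconstruction loss: L1 on signal, L1 on noise, and the
# log-ratio term gated by the dynamic coefficient alpha (log_m = log(M)).
def snr_loss(x, x_rec, z, z_rec, log_m):
    sig = torch.mean(torch.abs(x - x_rec))         # signal reconstruction L1
    noi = torch.mean(torch.abs(z - z_rec))         # noise reconstruction L1
    ratio = torch.log(noi / (sig + 1e-9))          # reconstruction error ratio
    if ratio.item() > log_m:                       # noise effect suppressed
        alpha = 1.0
    elif ratio.item() < -log_m:                    # shape effect suppressed
        alpha = -1.0
    else:                                          # balanced: no extra term
        alpha = 0.0
    return sig + noi + alpha * ratio
```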
6. Construct the diversity loss function. Diversity is positively correlated with the difference between the transformation parameters corresponding to different noise hidden vectors; the two sets of transformation parameters are estimated from two signal-noise mixed hidden vectors, each injected with its own noise. The diversity loss is defined as follows:

$$L_{div}(E, P) = -L_1\big(P(E(x), z_1),\, P(E(x), z_2)\big) \quad (7)$$

where $P$ denotes the predictor, $E$ denotes the encoder, and $z_1$ and $z_2$ respectively denote different noise hidden vectors drawn from the same Gaussian distribution.
7. Construct the least-squares adversarial loss function as the optimization target of the adversarial training; the adversarial loss pulls the distribution of the deformed character images toward that of the target character images:

$$\mathcal{L}_{G} = \mathbb{E}_{x \sim p(x)}\big[(D_g(G_g(x)) - 1)^2\big] + L_{snr}(G_g) + L_{div}(E, P)$$

$$\mathcal{L}_{D} = \mathbb{E}_{y \sim p(y)}\big[(D_g(y) - 1)^2\big] + \mathbb{E}_{x \sim p(x)}\big[D_g(G_g(x))^2\big] \quad (8)$$

Through this adversarial training, the spatial transformation network learns to produce samples that approximate the target character images.
8. Set the pixel size of all images to 64 × 64, the batch size to 64, and the initial learning rate to 0.0001; the number of training iterations of the network is 5000, the learning rate starts to decay linearly to 1e-5 after 2500 iterations, and the network is optimized with the Adam optimizer.
9. Train the whole shape-transformation generative adversarial network with the settings in step 8 to obtain a trained generator, which can then be used to generate diverse augmented samples.
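Tying the pieces together, a minimal sketch of one adversarial training iteration follows, reusing the loss sketches above; the interface of `G` (returning the deformed image and both reconstructions) and the 128-dimensional noise vector are assumptions.

```python
import torch

# One adversarial training step: update D on real/fake, then update G with
# the adversarial, signal-noise, and diversity terms.
def train_step(G, D, opt_g, opt_d, x, y, log_m):
    z = torch.randn(x.size(0), 128)                # noise hidden vector (dim assumed)
    fake, x_rec, z_rec = G(x, z)

    # Discriminator update (LSGAN).
    opt_d.zero_grad()
    loss_d = d_loss(D(y), D(fake.detach()))
    loss_d.backward()
    opt_d.step()

    # Generator update: adversarial + signal-noise + diversity terms.
    opt_g.zero_grad()
    loss_g = g_loss(D(fake), snr_loss(x, x_rec, z, z_rec, log_m),
                    diversity_loss(G.encoder, G.predictor, x))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```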
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (9)

1. A character image augmentation method based on shape transformation, characterized by comprising the following steps:
step 1, constructing a shape-transformation generative adversarial network comprising a generator and a discriminator;
step 2, taking the original character image as the input of the generator, which generates a deformed character image after spatial transformation, and connecting the output of the generator to one input of the discriminator; inputting the target character image into the other input of the discriminator, the discriminator outputting discrimination results of the deformed character image and the target character image;
step 3, training the shape-transformation generative adversarial network;
step 4, generating augmented character images with the trained generator;
wherein the generator is a spatial transformation network comprising an encoder, a predictor, a sampler, a noise reconstruction network, and an image reconstruction network;
the encoder consists of several convolution modules connected in sequence, each comprising a two-dimensional convolution layer, a nonlinear activation layer, and a pooling layer connected in sequence;
the predictor consists of several fully-connected modules connected in sequence, followed by a final fully-connected layer; each fully-connected module comprises a fully-connected layer and a nonlinear activation layer, and the number of output channels of the final fully-connected layer is set to the number of deformation parameters to be predicted;
the sampler maps the pixel region of the deformed character image to the pixel region of the original character image by applying matrix multiplication on a sampling grid;
the image reconstruction network consists of several fully-connected modules, a fully-connected layer, and several transposed-convolution modules connected in sequence; each transposed-convolution module comprises a transposed convolution layer and a nonlinear activation layer connected in sequence;
the noise reconstruction network consists of several fully-connected modules and a fully-connected layer connected in sequence;
the discriminator is based on the PatchGAN structure and consists of five convolution modules connected in sequence, wherein each of the first four convolution modules comprises a two-dimensional convolution layer, an instance normalization layer, and a LeakyReLU activation layer, and the last convolution module comprises a padding layer and a two-dimensional convolution layer;
first, the original character image is taken as the input of the encoder, which extracts shape features from it and outputs a shape feature vector; a noise hidden vector is then randomly drawn from the standard normal distribution and fused with the shape feature vector; the fused hidden vector is input into the predictor, which predicts the TPS transformation parameters and the affine transformation parameters, wherein the TPS transformation parameters are the coordinate values of the TPS-transform sampling-grid matching points and the affine transformation parameters are converted into an affine-transform sampling grid; the TPS-transform sampling grid, the affine-transform sampling grid, and the original character image are then input into the sampler, which outputs the deformed character image; meanwhile, the output of the predictor is connected to the inputs of the image reconstruction network and the noise reconstruction network, which reconstruct the original character image and the noise hidden vector, respectively; the deformed character image output by the generator and the target character image are then input into the discriminator, which outputs the discrimination results of the deformed character image and the target character image.
2. The character image augmentation method based on shape transformation as claimed in claim 1, wherein a least-squares generative adversarial loss function is constructed as the optimization target of the shape-transformation generative adversarial network training, calculated as follows:

$$\mathcal{L}_{G} = \mathbb{E}_{x \sim p(x)}\big[(D_g(G_g(x)) - 1)^2\big] + L_{snr}(G_g) + L_{div}(E, P)$$

$$\mathcal{L}_{D} = \mathbb{E}_{y \sim p(y)}\big[(D_g(y) - 1)^2\big] + \mathbb{E}_{x \sim p(x)}\big[D_g(G_g(x))^2\big]$$

where $\mathcal{L}_{G}$ represents the generator loss function, $D_g$ the discriminator, $G_g$ the generator, $L_{snr}(G_g)$ the signal-noise reconstruction loss function, $L_{div}(E, P)$ the diversity loss function, $\mathcal{L}_{D}$ the discriminator loss function, $x$ the original character image, $y$ the target character image, and $\mathbb{E}_{x \sim p(x)}$ and $\mathbb{E}_{y \sim p(y)}$ the corresponding mathematical expectations.
3. The character image augmentation method based on shape transformation as claimed in claim 2, wherein the signal-noise reconstruction loss function includes a signal reconstruction sub-term, a noise reconstruction sub-term, and a reconstruction error ratio term, calculated as follows:

$$L_{snr}(G_g) = L_1(x, x_{rec}) + L_1(z, z_{rec}) + \alpha \log\frac{L_1(z, z_{rec})}{L_1(x, x_{rec})}$$

where $L_1$ represents the mean absolute error, $z$ the noise hidden vector, and $x_{rec}$ and $z_{rec}$ the original image and the noise hidden vector reconstructed by the image reconstruction network $R_x$ and the noise reconstruction network $R_z$, respectively; $\alpha$ is a dynamic coefficient: if the reconstruction error ratio term is greater than $\log M$, set $\alpha = 1$; when the reconstruction error ratio term is less than $-\log M$, set $\alpha = -1$; if the reconstruction error ratio term is within the ideal range $[-\log M, \log M]$, set $\alpha = 0$; $M$ represents a hyperparameter;

the diversity loss function is calculated as follows:

$$L_{div}(E, P) = -L_1\big(P(E(x), z_1),\, P(E(x), z_2)\big)$$

where $P$ denotes the predictor, $E$ denotes the encoder, and $z_1$ and $z_2$ respectively denote different noise hidden vectors drawn from the same Gaussian distribution.
4. The character image augmentation method based on shape transformation as claimed in claim 1, wherein each convolution module in the encoder further includes a batch normalization layer between the two-dimensional convolution layer and the nonlinear activation layer; the nonlinear activation function in the encoder is the ReLU function, and the pooling operation of the pooling layer is max pooling;
each fully-connected module in the predictor further comprises a batch normalization layer between the fully-connected layer and the nonlinear activation layer; the nonlinear activation function in the predictor is the ReLU function;
the transposed-convolution module further comprises a batch normalization layer between the transposed convolution layer and the nonlinear activation layer; the nonlinear activation function in the transposed-convolution module is the ReLU function.
5. The character image augmentation method based on shape transformation as claimed in claim 1, wherein the number of output channels of the last fully-connected layer in the predictor is set to 132: of the deformation parameters to be predicted, 128 are the coordinates of 64 TPS-transform sampling-grid matching points and 4 are the element values of the affine transformation matrix.
6. The character image augmentation method based on shape transformation as claimed in claim 1 or 5, wherein the TPS transformation aims to solve a deformation function $f$ such that $f(p_i(x_i, y_i)) = p_i(x_i', y_i')$ ($1 \le i \le n$) with the smallest bending energy function, where $p_i(x_i, y_i)$ denotes the coordinates of a TPS-transform sampling-grid matching point on the original character image, $p_i(x_i', y_i')$ denotes the coordinates of the corresponding TPS-transform sampling-grid matching point of the deformed character image, and $n$ is the number of TPS-transform sampling-grid matching points; supposing that $n$ pairs of TPS-transform sampling-grid matching-point coordinates of the two images are obtained: $(p_1(x_1,y_1), p_1(x_1',y_1'))$, $(p_2(x_2,y_2), p_2(x_2',y_2'))$, ..., $(p_n(x_n,y_n), p_n(x_n',y_n'))$, the deformation function is imagined as bending a thin metal sheet through the given $n$ points, and the energy function for bending the sheet is expressed as:

$$I_f = \iint \left[ \left(\frac{\partial^2 f}{\partial x^2}\right)^2 + 2\left(\frac{\partial^2 f}{\partial x \partial y}\right)^2 + \left(\frac{\partial^2 f}{\partial y^2}\right)^2 \right] dx\,dy$$

the thin plate spline can be shown to be the function with the minimum bending energy, and the thin plate spline function is:

$$f(x, y) = a_1 + a_2 x + a_3 y + \sum_{i=1}^{n} w_i\, U\big(\lVert (x_i, y_i) - (x, y) \rVert\big)$$

where $U$ is the basis function:

$$U(r) = r^2 \log r^2, \qquad r = \lVert (x_i, y_i) - (x, y) \rVert$$

$a_1$, $a_2$, $a_3$ and $w_i$ are solved from the preset coordinate values of the $n$ TPS-transform sampling-grid matching points and the offsets predicted by the predictor, thereby obtaining the specific expression of $f(x, y)$.
7. The character image augmentation method based on shape transformation as claimed in claim 1 or 5, wherein the sampling formula of the affine-transform sampling grid is as follows:

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} scale \cdot \cos\theta & -scale \cdot \sin\theta & t_x \\ scale \cdot \sin\theta & scale \cdot \cos\theta & t_y \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$

where $scale$, $\theta$, $t_x$, $t_y$ are the affine transformation parameters predicted by the predictor, and $(x, y)$ and $(x', y')$ are respectively the position coordinates of a pixel point before and after the transformation.
8. The character image augmentation method based on shape transformation as claimed in claim 1, wherein the sampler is implemented by the torch.nn.functional.grid_sample() method in PyTorch, and the affine parameters are converted into the affine-transform sampling grid by the torch.nn.functional.affine_grid() method in PyTorch.
9. The character image augmentation method based on shape transformation as claimed in claim 1, wherein the pixel size of all images is 64 × 64, the batch size is 64, the initial learning rate is 0.0001, the number of iterations of the adversarial network is 5000, the learning rate decays linearly to 1e-5 after 2500 iterations, and the adversarial network is optimized with the Adam optimizer.

Priority Applications (1)

Application number | Priority date | Filing date | Title
CN202210285238.8A | 2022-03-23 | 2022-03-23 | Character image augmentation method based on shape transformation

Publications (2)

Publication number | Publication date
CN114782961A | 2022-07-22
CN114782961B | 2023-04-18

Family

ID: 82424735
Country: CN

Citations (2)

Publication number | Priority date | Publication date | Assignee | Title
CN108399408A * | 2018-03-06 | 2018-08-14 | 李子衿 | Deformed character correction method based on a deep spatial transformation network
CN111652332A * | 2020-06-09 | 2020-09-11 | 山东大学 | Deep learning handwritten Chinese character recognition method and system based on binary classification

* Cited by examiner, † Cited by third party

Family Cites Families (4)

Publication number | Priority date | Publication date | Assignee | Title
CN111209497B * | 2020-01-05 | 2022-03-04 | 西安电子科技大学 | DGA domain name detection method based on GAN and Char-CNN
CN111242241A * | 2020-02-17 | 2020-06-05 | 南京理工大学 | Method for augmenting etched-character recognition network training samples
CN111915540B * | 2020-06-17 | 2023-08-18 | 华南理工大学 | Rubbing oracle-bone character image augmentation method, system, computer equipment and medium
CN114037644B * | 2021-11-26 | 2024-07-23 | 重庆邮电大学 | Artistic word image synthesis system and method based on a generative adversarial network




Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant