CN114782961A - Character image augmentation method based on shape transformation - Google Patents

Character image augmentation method based on shape transformation

Info

Publication number
CN114782961A
CN114782961A (application CN202210285238.8A)
Authority
CN
China
Prior art keywords
character image
layer
transformation
function
noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210285238.8A
Other languages
Chinese (zh)
Other versions
CN114782961B (en)
Inventor
黄双萍
黄鸿翔
杨代辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Provincial Laboratory Of Artificial Intelligence And Digital Economy Guangzhou
South China University of Technology SCUT
Original Assignee
Guangdong Provincial Laboratory Of Artificial Intelligence And Digital Economy Guangzhou
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Provincial Laboratory Of Artificial Intelligence And Digital Economy Guangzhou, South China University of Technology SCUT filed Critical Guangdong Provincial Laboratory Of Artificial Intelligence And Digital Economy Guangzhou
Priority to CN202210285238.8A priority Critical patent/CN114782961B/en
Publication of CN114782961A publication Critical patent/CN114782961A/en
Application granted granted Critical
Publication of CN114782961B publication Critical patent/CN114782961B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/18Image warping, e.g. rearranging pixels individually

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a character image augmentation method based on shape transformation, which comprises the following steps: constructing a shape-transformation generative adversarial network comprising a generator and a discriminator; taking an original character image as the input of the generator, which generates a deformed character image after spatial transformation; connecting the output of the generator to one input of the discriminator while feeding a target character image to the other input, the discriminator outputting the discrimination results of the deformed character image and the target character image; training the shape-transformation generative adversarial network; and generating augmented character images with the trained generator. The method combines affine-matrix and TPS sampling-grid parameters so that the STN generates global and local shape changes simultaneously, better fits the shape characteristics of characters, and improves the authenticity and diversity of the generated characters, thereby improving the classification performance of classifiers trained with the augmented data.

Description

Character image augmentation method based on shape transformation
Technical Field
The invention belongs to the technical field of image processing and artificial intelligence, and particularly relates to a character image augmentation method based on shape transformation.
Background
Text image recognition methods based on deep learning show great potential; however, training a high-performance character image recognition model requires a large amount of annotation data that is as diverse as possible. Manually collecting and labeling text image data is an extremely expensive and time-consuming task, especially for text images with complicated shapes such as ancient texts and handwritten formulas. In contrast, data augmentation is a cost-effective way to increase data diversity.
Data augmentation approaches include shape-transformation-based augmentation, such as flipping, rotating, scaling, cropping and translating the image, and non-shape-transformation-based augmentation, such as color jittering and noise injection. Conventional shape-transformation-based augmentation algorithms mainly sample deformation parameters at random from an artificially preset probability distribution to control the shape transformation of the image content. The most common approach is to use an affine matrix to generate affine transformations, as in the affine-transformation-based Spatial Transformer Network (STN) algorithm, or to generate local deformations with other non-rigid deformation algorithms such as the Thin Plate Spline (TPS) transformation. Because such algorithms rely on sampling from an artificially preset distribution, computing and selecting the distribution is complex, and the selected distribution is difficult to fit completely to the distribution of actual character shapes, which leads to high labor cost and poor realism of the deformation.
Although generative adversarial networks can remove the need to manually compute and select a distribution, existing techniques still suffer from insufficient deformation fineness. The actual character shape has both global characteristics (such as rotation angle and translation distance) and local characteristics (such as the distortion degree, length and thickness of strokes), while the prior art generates only a single global deformation or a single local deformation, so the authenticity and diversity of the transformed shapes are poor and the performance improvement that the expanded data brings to downstream tasks is limited. Neural-network-based augmentation techniques adopt the generative adversarial loss as the optimization target; however, because the adversarial loss only computes loss values for two kinds of labels, true and false, it provides merely weak supervision, and since character images have both global and local shape characteristics, the loss function needs stronger supervision to ensure the authenticity and diversity of the transformed character shapes. For example, studies have shown that a generative adversarial network may suppress the effect of the semantic vector or noise vector in the generator, which can result in too little deformation of the text (even no change) or too much (distorted shapes), breaking the balance between the diversity and reality of the character shape, so that data expansion brings only limited performance improvement or even adverse effects.
Disclosure of Invention
In view of the above, there is a need for a new character image augmentation method based on shape transformation that combines affine-matrix and TPS sampling-grid parameters so that the STN generates global and local shape changes simultaneously and the fineness of deformation is improved. A noise vector injection technique is introduced into the STN to enrich the diversity of the generated samples, and a diversity loss function and a signal-noise reconstruction loss function are designed to strengthen supervision of the network. The diversity loss function promotes the diversity of the deformation parameters and thereby increases the diversity of deformation; the signal-noise reconstruction loss function ensures signal-noise balance, so that the degree of deformation is kept within a reasonable range.
The invention discloses a character image augmentation method based on shape transformation, which comprises the following steps:
step 1, constructing a shape-transformation generative adversarial network comprising a generator and a discriminator;
step 2, taking the original character image as the input of the generator, which generates a deformed character image after spatial transformation; connecting the output of the generator to one input of the discriminator and feeding the target character image to the other input, the discriminator outputting the discrimination results of the deformed character image and the target character image;
step 3, training the shape-transformation generative adversarial network;
and step 4, generating augmented character images with the trained generator.
Specifically, the generator is a spatial transformation network and comprises an encoder, a predictor, a sampler, a noise reconstruction network and an image reconstruction network;
the encoder consists of a plurality of convolution modules connected in sequence, each convolution module comprising a two-dimensional convolution layer, a nonlinear activation layer and a pooling layer connected in sequence;
the predictor consists of a plurality of fully-connected modules connected in sequence followed by a final fully-connected layer, each fully-connected module comprising a fully-connected layer and a nonlinear activation layer, the number of output channels of the final fully-connected layer being set to the number of deformation parameters to be predicted;
the sampler maps the deformed character image pixel region to the original character image pixel region by applying matrix multiplication on a sampling grid;
the image reconstruction network consists of a plurality of fully-connected modules, a fully-connected layer and a plurality of transposed-convolution modules connected in sequence, each transposed-convolution module comprising a transposed convolution layer and a nonlinear activation layer connected in sequence;
the noise reconstruction network consists of a plurality of fully-connected modules and a fully-connected layer connected in sequence;
First, the original character image is taken as the input of the encoder, which extracts shape features from it and outputs a shape feature vector. A noise hidden vector is then randomly drawn from the standard normal distribution, the shape feature vector and the noise hidden vector are fused, and the fused hidden vector is input into the predictor, which is responsible for predicting the TPS transformation parameters and the affine transformation parameters; the TPS transformation parameters are the coordinate values of the TPS sampling-grid matching points, and the affine transformation parameters are converted into an affine transformation sampling grid. The TPS transformation sampling grid, the affine transformation sampling grid and the original character image are then input into the sampler, which outputs the deformed character image. Meanwhile, the output of the predictor is connected to the inputs of the image reconstruction network and the noise reconstruction network, which reconstruct the original character image and the noise hidden vector, respectively. Finally, the deformed character image output by the generator and the target character image are respectively input into the discriminator, which outputs the discrimination results of the deformed character image and the target character image.
The discriminator is based on the PatchGAN structure and consists of five convolution modules connected in sequence; each of the first four convolution modules comprises a two-dimensional convolution layer, an instance normalization layer and a LeakyReLU activation layer, and the last convolution module comprises a padding layer and a two-dimensional convolution layer.
Preferably, a least-squares adversarial loss function is constructed as the optimization target for training the shape-transformation generative adversarial network; its calculation formula is as follows:

$$\mathcal{L}_G = \mathbb{E}_{x}\big[(D(G(x)) - 1)^2\big] + \mathcal{L}_{SN} + \mathcal{L}_{div}$$

$$\mathcal{L}_D = \mathbb{E}_{y}\big[(D(y) - 1)^2\big] + \mathbb{E}_{x}\big[D(G(x))^2\big]$$

where $\mathcal{L}_G$ denotes the generator loss function, $D$ the discriminator, $G$ the generator, $\mathcal{L}_{SN}$ the signal-noise reconstruction loss function, $\mathcal{L}_{div}$ the diversity loss function, $\mathcal{L}_D$ the discriminator loss function, $x$ the original character image, $y$ the target character image, and $\mathbb{E}_x$ and $\mathbb{E}_y$ the corresponding mathematical expectations.
The signal-noise reconstruction loss function comprises a signal reconstruction sub-term, a noise reconstruction sub-term and a reconstruction error ratio term, and is calculated as follows:

$$\mathcal{L}_{SN} = \mathrm{MAE}(x, \hat{x}) + \mathrm{MAE}(z, \hat{z}) + \alpha \log\frac{\mathrm{MAE}(z, \hat{z})}{\mathrm{MAE}(x, \hat{x})}$$

where MAE denotes the mean absolute error, $x$ the original character image, $z$ the noise hidden vector, and $\hat{x}$ and $\hat{z}$ the original image and the noise hidden vector reconstructed by the image reconstruction network $R_x$ and the noise reconstruction network $R_z$, respectively; $\alpha$ is a dynamic coefficient: if the reconstruction error ratio term is greater than $\log M$, $\alpha = 1$; if the reconstruction error ratio term is less than $-\log M$, $\alpha = -1$; and if the reconstruction error ratio term is within the ideal range, i.e. $[-\log M, \log M]$, $\alpha = 0$, where $M$ denotes a hyperparameter.

The diversity loss function is calculated as follows:

$$\mathcal{L}_{div} = -\,\mathrm{MAE}\big(P(E(x) + z_1),\; P(E(x) + z_2)\big)$$

where $P$ denotes the predictor, $E$ denotes the encoder, and $z_1$ and $z_2$ respectively denote different noise hidden vectors drawn from the same Gaussian distribution.
Optionally, each convolution module in the encoder further includes a batch normalization layer located between the two-dimensional convolution layer and the nonlinear activation layer; the nonlinear activation function of the nonlinear activation layer in the encoder is the ReLU function, and the pooling operation of the pooling layer is max pooling.
Optionally, each fully-connected module in the predictor further comprises a batch normalization layer located between the fully-connected layer and the nonlinear activation layer; the nonlinear activation function in the predictor is the ReLU function.
Preferably, the number of output channels of the last fully-connected layer in the predictor is set to 132; 128 of the deformation parameters to be predicted are the coordinates of the 64 TPS sampling-grid matching points, and 4 are the element values of the affine transformation matrix.
Optionally, the transposed-convolution module further includes a batch normalization layer located between the transposed convolution layer and the nonlinear activation layer; the nonlinear activation function in the transposed-convolution module is the ReLU function.
Specifically, the sampler is implemented with the torch.nn.functional.grid_sample() method in PyTorch, and the affine parameters are converted into the affine transformation sampling grid with the torch.nn.functional.affine_grid() method.
More specifically, the goal of the TPS transformation is to solve a deformation function $f$ such that $f(x_i, y_i) = (x'_i, y'_i)$ and the bending energy function is minimized, where $(x_i, y_i)$ denotes the coordinates of the TPS sampling-grid matching points on the original character image, $(x'_i, y'_i)$ denotes the coordinates of the TPS sampling-grid matching points on the deformed character image, and $n$ is the number of TPS sampling-grid matching points. Assume that n matched point pairs of the two images have been acquired: $(x_1, y_1) \leftrightarrow (x'_1, y'_1)$, $(x_2, y_2) \leftrightarrow (x'_2, y'_2)$, …, $(x_n, y_n) \leftrightarrow (x'_n, y'_n)$.

The deformation function can be imagined as bending a thin metal plate so that it passes through the given n points; the energy function of bending the plate is expressed as:

$$E(f) = \iint_{\mathbb{R}^2} \left( \left(\frac{\partial^2 f}{\partial x^2}\right)^2 + 2\left(\frac{\partial^2 f}{\partial x\,\partial y}\right)^2 + \left(\frac{\partial^2 f}{\partial y^2}\right)^2 \right) dx\,dy$$

It can be proved that the thin-plate spline is the function with the minimum bending energy; the thin-plate spline function is:

$$f(x, y) = a_1 + a_2 x + a_3 y + \sum_{i=1}^{n} w_i\, U\big(\|(x_i, y_i) - (x, y)\|\big)$$

where $U$ is the basis function:

$$U(r) = r^2 \log r^2$$

The coefficients $a_1$, $a_2$, $a_3$ and $w_i$ are solved from the preset coordinates of the n TPS sampling-grid matching points and the offsets predicted by the predictor, whereby the specific expression of $f$ is obtained.
The sampling formula of the affine transformation sampling grid is as follows:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} scale \cdot \cos\theta & -scale \cdot \sin\theta & t_x \\ scale \cdot \sin\theta & scale \cdot \cos\theta & t_y \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}$$

where $scale$, $\theta$, $t_x$ and $t_y$ respectively denote the affine transformation parameters predicted by the predictor, and $(x, y)$ and $(x', y')$ are respectively the position coordinates of a pixel point before and after the transformation.
Preferably, all images have a pixel size of 64 × 64, the batch size is 64, the initial learning rate is 0.0001, the number of iterations of the adversarial network is 5000, the learning rate decays linearly to 1e-5 after 2500 iterations, and the adversarial network is optimized with the Adam optimizer.
Compared with the prior art, the invention has the beneficial effects that:
the method combines the affine matrix and the TPS to transform the sampling grid parameters, so that the STN can generate global and local shape change at the same time, the shape characteristics of the character can be better fitted, the authenticity and diversity of the generated character are better, and the classification performance of the classifier trained by using the augmented data is further improved.
The method of the invention introduces a noise vector injection technique into the STN to promote the diversity of the generated samples, and designs a signal-noise reconstruction loss function that ensures signal-noise balance together with a diversity loss function that produces rich shape transformations, providing stronger supervision for the training of the STN. The degree of deformation of the samples is thus more reasonable and rich, and using the augmented data to train a classifier improves the classifier's classification performance.
Drawings
FIG. 1 shows a schematic flow diagram of a method embodying the present invention;
fig. 2 is a schematic structural diagram illustrating operations of modules according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
For reference and clarity, the technical terms, abbreviations or acronyms used hereinafter are summarized as follows:
STN: Spatial Transformer Network.
TPS: Thin Plate Spline.
CNN: convolutional neural network.
FC network: fully-connected network.
PyTorch: a mainstream deep learning framework that encapsulates many commonly used deep-learning-related functions and classes.
ReLU/LeakyReLU: nonlinear activation functions.
Generative adversarial network: a generative network training framework based on the idea of the zero-sum game, comprising a generator and a discriminator.
Hidden vector: a vector in a random variable space.
The invention discloses a character image augmentation method based on shape transformation, which aims to solve the problems in the prior art described above.
Fig. 1 shows a schematic flow diagram of an embodiment of the invention. A character image augmentation method based on shape transformation includes the following steps:
step 1, constructing a shape-transformation generative adversarial network comprising a generator and a discriminator;
step 2, taking the original character image as the input of the generator, which generates a deformed character image after spatial transformation; connecting the output of the generator to one input of the discriminator while feeding a target character image to the other input, the discriminator outputting the discrimination results of the deformed character image and the target character image;
step 3, training the shape-transformation generative adversarial network;
and step 4, generating augmented character images with the trained generator.
As shown in fig. 2, the generator is a spatial transformation network comprising an encoder, a predictor, a sampler, a noise reconstruction network and an image reconstruction network. First, the original character image is taken as the input of the encoder, which extracts shape features from it and outputs a shape feature vector. A noise hidden vector is then randomly drawn from the standard normal distribution, the shape feature vector and the noise hidden vector are fused, and the fused hidden vector is input into the predictor, which is responsible for mapping out the TPS transformation parameters and the affine transformation parameters; the TPS transformation parameters are the coordinate values of the TPS sampling-grid matching points, and the affine transformation parameters are converted into an affine transformation sampling grid. The TPS transformation sampling grid, the affine transformation sampling grid and the original character image are then input into the sampler, which outputs the deformed character image. Meanwhile, the output of the predictor is connected to the inputs of the image reconstruction network and the noise reconstruction network, which reconstruct the original character image and the noise hidden vector, respectively. Then the deformed character image output by the generator and the target character image are input into the discriminator, which outputs the discrimination results of the deformed character image and the target character image.
Specifically, the present embodiment adopts the following steps to implement the inventive method.
1. Construct the original character data set to be augmented and the target character data set with the target shape characteristics, respectively.
2. Build a spatial transformation network to serve as the generator in the adversarial training; the specific steps are as follows:
(1) The spatial transformation network comprises three modules: an encoder, a predictor and a sampler. First, the encoder is constructed. It is composed of connected convolutional neural networks (CNNs), and the number of convolutional layers is generally more than 3; in this embodiment 4 convolution modules are connected in sequence, each comprising a two-dimensional convolution layer, a batch normalization layer, a nonlinear activation layer and a pooling layer, where the batch normalization layer is optional, the nonlinear activation function may be a ReLU or LeakyReLU function, and the pooling operation may be max pooling, average pooling or adaptive pooling; this embodiment uses the ReLU function and max pooling.
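For concreteness, the following is a minimal PyTorch sketch of such an encoder. The channel widths, the kernel size, the 128-dimensional feature vector and the final linear projection are assumptions for illustration; the patent fixes only the module structure (Conv2d, optional BatchNorm, ReLU, max pooling) and the 64 × 64 input size.

```python
import torch.nn as nn

def conv_module(in_ch, out_ch):
    # one encoder module: Conv2d -> BatchNorm (optional) -> ReLU -> max pooling
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class Encoder(nn.Module):
    def __init__(self, feat_dim=128):          # feat_dim is an assumed value
        super().__init__()
        self.features = nn.Sequential(
            conv_module(1, 32), conv_module(32, 64),
            conv_module(64, 128), conv_module(128, 128),
        )
        # four 2x pooling steps reduce the 64x64 input to 4x4
        self.fc = nn.Linear(128 * 4 * 4, feat_dim)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))
```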
(2) Next, the predictor is constructed. It is composed of a fully-connected (FC) neural network with generally more than 2 FC layers; the FC network of this embodiment comprises 3 FC modules connected in sequence, followed by one final FC layer. Each FC module comprises an FC layer, a batch normalization layer and a nonlinear activation layer, where the batch normalization layer is optional and the nonlinear activation function may be a ReLU or LeakyReLU function; this embodiment uses the ReLU function. The number of output channels of the final FC layer is set to the number of deformation parameters to be predicted, 132 in this embodiment: 128 parameters are the coordinates of the 8 × 8 = 64 TPS sampling-grid matching points, and 4 are the element values of the affine transformation matrix. Note that the number of TPS sampling-grid matching points may instead be the square of any integer smaller than the original image height or width.
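A corresponding sketch of the predictor, under the same caveats (the input dimension and hidden width are assumptions; the 132-way output split follows the text):

```python
import torch.nn as nn

class Predictor(nn.Module):
    # 3 FC modules (Linear -> BatchNorm1d -> ReLU) followed by a final Linear
    # with 132 outputs: 128 TPS control-point coordinates (8 x 8 points x 2)
    # plus 4 affine parameters.
    def __init__(self, in_dim=128, hidden=256, n_params=132):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(3):
            layers += [nn.Linear(d, hidden), nn.BatchNorm1d(hidden),
                       nn.ReLU(inplace=True)]
            d = hidden
        layers.append(nn.Linear(d, n_params))
        self.net = nn.Sequential(*layers)

    def forward(self, h):
        out = self.net(h)
        return out[:, :128], out[:, 128:]   # (TPS params, affine params)
```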
(3) Then the sampler is constructed. The sampler maps the deformed character image pixel region to the original character image pixel region by applying matrix multiplication on a sampling grid; the sampling is implemented with the torch.nn.functional.grid_sample() method in PyTorch, and the affine parameters are converted into the affine transformation sampling grid with the torch.nn.functional.affine_grid() method.
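A sketch of the sampler built on these two PyTorch methods. Interpreting the 4 affine parameters as (scale, θ, t_x, t_y) follows formula (5) below, and applying the affine grid before the TPS grid is an assumption; the patent only states that both grids and the original image enter the sampler.

```python
import torch
import torch.nn.functional as F

def warp(x, affine_params, tps_grid):
    """x: (B,1,H,W); affine_params: (B,4) = (scale, theta, tx, ty);
    tps_grid: (B,H,W,2) sampling grid produced by the TPS solver."""
    scale, theta, tx, ty = affine_params.unbind(dim=1)
    cos, sin = torch.cos(theta), torch.sin(theta)
    # 2x3 matrix for affine_grid: scaled rotation plus translation (cf. eq. 5)
    mat = torch.stack([
        torch.stack([scale * cos, -scale * sin, tx], dim=1),
        torch.stack([scale * sin,  scale * cos, ty], dim=1),
    ], dim=1)                                          # (B, 2, 3)
    grid_aff = F.affine_grid(mat, list(x.shape), align_corners=False)
    x = F.grid_sample(x, grid_aff, align_corners=False)
    return F.grid_sample(x, tps_grid, align_corners=False)
```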
(4) Finally, an image reconstruction network $R_x$ and a noise reconstruction network $R_z$ are constructed. $R_x$ consists of 3 FC modules, 1 FC layer and 4 transposed-convolution modules connected in sequence; $R_z$ comprises 3 FC modules and 1 FC layer connected in sequence. Each transposed-convolution module comprises 1 transposed convolution layer, 1 batch normalization layer and 1 nonlinear activation layer connected in sequence, where the batch normalization layer is optional and the nonlinear activation function may be a ReLU or LeakyReLU function; this embodiment uses the ReLU function. The role of the reconstruction networks is detailed in point 5 below.
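Minimal sketches of the two reconstruction networks. The patent specifies only the layer sequence; the hidden widths, the 4 × 4 starting resolution of the deconvolution stack, the sigmoid output and the input being the 132 predicted parameters (the predictor's output, per the connection described above) are assumptions.

```python
import torch
import torch.nn as nn

class NoiseReconstructor(nn.Module):
    # 3 FC modules followed by 1 FC layer; recovers the injected noise vector
    def __init__(self, in_dim=132, hidden=256, noise_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, noise_dim))

    def forward(self, p):
        return self.net(p)

class ImageReconstructor(nn.Module):
    # 3 FC modules + 1 FC layer, then 4 transposed-convolution modules that
    # upsample 4x4 feature maps back to the 64x64 original image
    def __init__(self, in_dim=132, hidden=256):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 128 * 4 * 4))

        def up(i, o):
            return nn.Sequential(
                nn.ConvTranspose2d(i, o, 4, stride=2, padding=1),
                nn.BatchNorm2d(o), nn.ReLU(inplace=True))

        self.deconv = nn.Sequential(
            up(128, 64), up(64, 32), up(32, 16),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1))

    def forward(self, p):
        return torch.sigmoid(self.deconv(self.fc(p).view(-1, 128, 4, 4)))
```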
(5) The working principle of the spatial transformation network is as follows. First, the original character image is taken as the input of the encoder, which extracts shape features from it and outputs a shape feature vector $v$. Next, a noise hidden vector $z$ is randomly drawn from the standard normal distribution, and $v$ and $z$ are fused, in this embodiment by direct summation; $v$ contains font feature information and serves to guarantee the authenticity of the output, while $z$ introduces a certain randomness and guarantees the diversity of the output. The fused hidden vector is input into the predictor, which is responsible for mapping out the TPS transformation parameters and the affine transformation parameters: the TPS transformation parameters are the coordinate values of the TPS sampling-grid matching points (the sampling grid has 8 × 8 = 64 grid matching points), and the affine transformation parameters are the element values of the affine transformation matrix (4 parameters in total). Next, the 4 affine transformation parameters are converted into an affine transformation sampling grid with the torch.nn.functional.affine_grid() method. Then, the TPS transformation sampling grid, the affine transformation sampling grid and the original image are input into the sampler, which outputs the deformed character image. We assume that n matched TPS sampling-grid point pairs of the two images have been acquired: $(x_1, y_1) \leftrightarrow (x'_1, y'_1)$, $(x_2, y_2) \leftrightarrow (x'_2, y'_2)$, …, $(x_n, y_n) \leftrightarrow (x'_n, y'_n)$.
In this embodiment, n is 64. The coordinate correspondence is calculated with the TPS transformation as follows. The objective of the TPS transformation is to solve a function $f$ such that $f(x_i, y_i) = (x'_i, y'_i)$ and the bending energy function is minimized, so that the other points of the image obtain a good transformation result through interpolation. The deformation function can be thought of as bending a thin metal plate through the given n points, and the energy function for bending the plate can be expressed as:

$$E(f) = \iint_{\mathbb{R}^2} \left( \left(\frac{\partial^2 f}{\partial x^2}\right)^2 + 2\left(\frac{\partial^2 f}{\partial x\,\partial y}\right)^2 + \left(\frac{\partial^2 f}{\partial y^2}\right)^2 \right) dx\,dy \quad (1)$$

It can be proved that the thin-plate spline is the function with the minimum bending energy; the thin-plate spline function is:

$$f(x, y) = a_1 + a_2 x + a_3 y + \sum_{i=1}^{n} w_i\, U\big(\|(x_i, y_i) - (x, y)\|\big) \quad (2)$$

where U is the basis function:

$$U(r) = r^2 \log r^2 \quad (3)$$

subject to the thin-plate spline constraints

$$\sum_{i=1}^{n} w_i = 0, \qquad \sum_{i=1}^{n} w_i x_i = \sum_{i=1}^{n} w_i y_i = 0 \quad (4)$$

In the above formulas, only $a_1$, $a_2$, $a_3$ and $w_i$ need to be obtained to determine $f$; they can be solved from the preset coordinates of the 64 TPS sampling-grid matching points and the offsets predicted by the predictor.
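The linear system behind this solution can be written compactly. The following sketch solves equations (2)–(4) for the coefficients and evaluates f on a dense grid usable with grid_sample; the normalization of coordinates to [-1, 1], the small clamp inside the basis function, and the grid_sample direction convention (for each output pixel, the grid stores the source coordinate to sample) are implementation assumptions.

```python
import torch

def tps_sampling_grid(ctrl_src, ctrl_dst, height, width):
    """ctrl_src: (n, 2) preset control points in [-1, 1]; ctrl_dst: (n, 2)
    shifted control points (preset + predicted offsets). Returns an
    (height, width, 2) sampling grid for torch.nn.functional.grid_sample."""
    n = ctrl_src.shape[0]

    def U(r2):
        # basis U(r) = r^2 log r^2, written in terms of r^2 (eq. 3)
        return r2 * torch.log(r2.clamp(min=1e-9))

    # assemble the TPS system L @ [w; a] = [ctrl_dst; 0] (eqs. 2 and 4)
    K = U(((ctrl_src[:, None] - ctrl_src[None]) ** 2).sum(-1))      # (n, n)
    P = torch.cat([torch.ones(n, 1), ctrl_src], dim=1)              # (n, 3)
    L = torch.zeros(n + 3, n + 3)
    L[:n, :n], L[:n, n:], L[n:, :n] = K, P, P.t()
    v = torch.cat([ctrl_dst, torch.zeros(3, 2)], dim=0)
    coeffs = torch.linalg.solve(L, v)        # rows 0..n-1: w_i; last 3: a

    # evaluate f at every output pixel to obtain the sampling grid
    ys, xs = torch.linspace(-1, 1, height), torch.linspace(-1, 1, width)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    pts = torch.stack([gx, gy], dim=-1).reshape(-1, 2)              # (H*W, 2)
    basis = U(((pts[:, None] - ctrl_src[None]) ** 2).sum(-1))       # (H*W, n)
    affine_part = torch.cat([torch.ones(pts.shape[0], 1), pts], 1)  # (H*W, 3)
    grid = basis @ coeffs[:n] + affine_part @ coeffs[n:]
    return grid.reshape(height, width, 2)
```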
Similarly, suppose $(x, y)$ and $(x', y')$ are respectively the positions of a pixel point before and after the transformation; the sampling formula of the affine transformation is:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} scale \cdot \cos\theta & -scale \cdot \sin\theta & t_x \\ scale \cdot \sin\theta & scale \cdot \cos\theta & t_y \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \quad (5)$$

where $scale$, $\theta$, $t_x$ and $t_y$ are the 4 affine transformation parameters predicted by the predictor.
Finally, the output of the predictor is connected to the inputs of the image reconstruction network and the noise reconstruction network, which reconstruct the original image and the noise hidden vector, respectively.
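The pieces sketched above can be tied together as follows. This is a sketch only: `base_ctrl` (the preset 8 × 8 control grid in [-1, 1] coordinates, shape (64, 2)) and feeding the concatenated 132 parameters to the reconstruction networks are assumptions.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, encoder, predictor, img_rec, noise_rec, base_ctrl):
        super().__init__()
        self.E, self.P = encoder, predictor
        self.Rx, self.Rz = img_rec, noise_rec
        self.register_buffer("base_ctrl", base_ctrl)   # preset 8x8 grid

    def forward(self, x):
        v = self.E(x)                        # shape feature vector
        z = torch.randn_like(v)              # noise hidden vector ~ N(0, I)
        tps_p, aff_p = self.P(v + z)         # fusion by direct summation
        offsets = tps_p.view(-1, 64, 2)      # predicted control-point offsets
        grids = torch.stack([
            tps_sampling_grid(self.base_ctrl, self.base_ctrl + o,
                              x.shape[2], x.shape[3])
            for o in offsets])
        x_def = warp(x, aff_p, grids)        # deformed character image
        params = torch.cat([tps_p, aff_p], dim=1)
        return x_def, self.Rx(params), self.Rz(params), z
```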
3. Build the discriminator in the adversarial training. The discriminator is based on the PatchGAN structure and consists of 5 convolution modules connected in sequence; each of the first 4 convolution modules comprises a two-dimensional convolution layer, an instance normalization layer and a LeakyReLU activation layer, and the last convolution module comprises an optional padding layer and a two-dimensional convolution layer.
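A PatchGAN-style sketch matching this description; the channel widths, kernel sizes and strides are assumptions, since the patent fixes only the module composition.

```python
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    # four Conv2d -> InstanceNorm2d -> LeakyReLU modules, then a padding
    # layer and a final Conv2d that scores overlapping image patches
    def __init__(self, in_ch=1, base=64):
        super().__init__()

        def block(i, o, stride):
            return nn.Sequential(
                nn.Conv2d(i, o, 4, stride=stride, padding=1),
                nn.InstanceNorm2d(o),
                nn.LeakyReLU(0.2, inplace=True))

        self.net = nn.Sequential(
            block(in_ch, base, 2), block(base, base * 2, 2),
            block(base * 2, base * 4, 2), block(base * 4, base * 8, 1),
            nn.ZeroPad2d(1),
            nn.Conv2d(base * 8, 1, 4, stride=1))

    def forward(self, img):
        return self.net(img)     # (B, 1, h', w') map of patch scores
```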
4. The generator takes the original character image $x$ as input and, after passing through the spatial transformation network, produces the deformed character image $G(x)$. The output of the generator is connected to one input of the discriminator while the target character image $y$ is fed to the other input, and the discriminator outputs the discrimination results of the deformed character image and the target character image.
5. Construct the signal-noise reconstruction loss function. This loss function consists of three sub-terms: a signal reconstruction sub-term, a noise reconstruction sub-term and a reconstruction error ratio term, calculated as follows:

$$\mathcal{L}_{SN} = \mathrm{MAE}(x, \hat{x}) + \mathrm{MAE}(z, \hat{z}) + \alpha \log\frac{\mathrm{MAE}(z, \hat{z})}{\mathrm{MAE}(x, \hat{x})} \quad (6)$$

where $\hat{x}$ and $\hat{z}$ respectively denote the original image and the noise hidden vector reconstructed by the networks $R_x$ and $R_z$. In the absence of strong supervision, the effect of the input information can be suppressed during neural network learning; to avoid this, reconstruction loss terms are designed for the shape information and the noise vector separately, so that neither the authenticity-preserving effect of the shape information nor the diversity-preserving effect of the noise can be suppressed. In addition, to keep the degree of transformation of the deformed font reasonable and controllable, a reconstruction error ratio is designed to balance the respective effects of shape information and noise, constrained by a hyperparameter M > 1. Here α is a dynamic coefficient. If the ratio term is greater than logM during training, we set α = 1: the noise reconstruction is then much worse than the signal reconstruction, meaning the effect of noise is being suppressed by the network, and gradient descent optimizes the positive ratio term. Conversely, when the ratio term is less than -logM, we set α = -1: the signal reconstruction is then much worse than the noise reconstruction, meaning the effect of noise is too prominent and the effect of the shape information is suppressed, and gradient descent optimizes the negative ratio term. If the ratio term is within the ideal range, i.e. [-logM, logM], α = 0 is set and no additional optimization is applied.
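A sketch of equation (6) with the dynamic coefficient. The orientation of the ratio (noise error over signal error) and the value of M are assumptions; the patent states only that M > 1.

```python
import math
import torch
import torch.nn.functional as F

def signal_noise_loss(x, x_rec, z, z_rec, M=10.0):
    l_sig = F.l1_loss(x_rec, x)        # signal reconstruction sub-term (MAE)
    l_noise = F.l1_loss(z_rec, z)      # noise reconstruction sub-term (MAE)
    ratio = torch.log(l_noise / (l_sig + 1e-8))   # reconstruction error ratio
    if ratio.item() > math.log(M):
        alpha = 1.0        # noise reconstruction lagging: descend on +ratio
    elif ratio.item() < -math.log(M):
        alpha = -1.0       # signal reconstruction lagging: descend on -ratio
    else:
        alpha = 0.0        # within [-logM, logM]: no extra optimization
    return l_sig + l_noise + alpha * ratio
```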
6. Construct the diversity loss function. Diversity is positively correlated with the difference between the transformation parameters corresponding to different noise hidden vectors, where the two sets of transformation parameters are respectively estimated from two signal-noise mixed hidden vectors injected with different noise. The diversity loss is defined as follows:

$$\mathcal{L}_{div} = -\,\mathrm{MAE}\big(P(E(x) + z_1),\; P(E(x) + z_2)\big) \quad (7)$$

where P denotes the predictor, E denotes the encoder, and $z_1$ and $z_2$ are different noise hidden vectors drawn from the same Gaussian distribution.
7. Construct the least-squares adversarial loss function as the optimization target of the adversarial training; the adversarial loss pulls the distribution of the deformed character images toward that of the target character images. The specific formula is:

$$\mathcal{L}_G = \mathbb{E}_x\big[(D(G(x)) - 1)^2\big] + \mathcal{L}_{SN} + \mathcal{L}_{div}, \qquad \mathcal{L}_D = \mathbb{E}_y\big[(D(y) - 1)^2\big] + \mathbb{E}_x\big[D(G(x))^2\big] \quad (8)$$

Through the adversarial training, the spatial transformation network can generate samples approximating the target character images.
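The least-squares terms of equation (8) in code form, a sketch following the standard least-squares GAN convention of target 1 for real images and 0 for generated ones:

```python
def discriminator_loss(D, real, fake):
    # real images pushed toward 1, generated images toward 0
    return ((D(real) - 1) ** 2).mean() + (D(fake.detach()) ** 2).mean()

def generator_adv_loss(D, fake):
    # the generator tries to make the discriminator output 1 on its samples
    return ((D(fake) - 1) ** 2).mean()
```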
8. All image pixel sizes are set to 64 × 64, the batch size is 64, the initial learning rate is 0.0001, the number of iterations of the network is 5000, the learning rate begins to decay linearly to 1e-5 after 2500 iterations, and the network is optimized with the Adam optimizer.
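The optimizers and schedule described here can be set up as below; `generator` and `discriminator` stand for the modules sketched earlier, and interpreting "iterations" as optimizer steps is an assumption.

```python
import torch

TOTAL_ITERS, DECAY_START = 5000, 2500
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

def lr_lambda(it):
    # constant 1e-4 for the first 2500 iterations,
    # then linear decay down to 1e-5 at iteration 5000
    if it < DECAY_START:
        return 1.0
    return 1.0 - 0.9 * (it - DECAY_START) / (TOTAL_ITERS - DECAY_START)

g_sched = torch.optim.lr_scheduler.LambdaLR(g_opt, lr_lambda)
d_sched = torch.optim.lr_scheduler.LambdaLR(d_opt, lr_lambda)
```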
9. Train the whole shape-transformation generative adversarial network according to the above settings; the trained generator can then be used to generate diverse augmented samples.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (10)

1. A character image augmentation method based on shape transformation is characterized by comprising the following steps:
step 1, constructing a shape-transformation generative adversarial network comprising a generator and a discriminator;
step 2, taking the original character image as the input of the generator, which generates a deformed character image after spatial transformation, and connecting the output of the generator to one input of the discriminator; inputting the target character image into the other input of the discriminator, the discriminator outputting the discrimination results of the deformed character image and the target character image;
step 3, training the shape-transformation generative adversarial network;
and step 4, generating augmented character images with the trained generator.
2. The method of claim 1, wherein the generator is a spatial transform network including an encoder, a predictor, a sampler, a noise reconstruction network, and an image reconstruction network;
the encoder consists of a plurality of convolution modules connected in sequence, each convolution module comprising a two-dimensional convolution layer, a nonlinear activation layer and a pooling layer connected in sequence;
the predictor consists of a plurality of fully-connected modules connected in sequence followed by a final fully-connected layer, each fully-connected module comprising a fully-connected layer and a nonlinear activation layer, the number of output channels of the final fully-connected layer being set to the number of deformation parameters to be predicted;
the sampler maps the deformed character image pixel region to the original character image pixel region by applying matrix multiplication on a sampling grid;
the image reconstruction network consists of a plurality of fully-connected modules, a fully-connected layer and a plurality of transposed-convolution modules connected in sequence, each transposed-convolution module comprising a transposed convolution layer and a nonlinear activation layer connected in sequence;
the noise reconstruction network consists of a plurality of fully-connected modules and a fully-connected layer connected in sequence;
the discriminator is based on the PatchGAN structure and consists of five convolution modules connected in sequence, wherein each of the first four convolution modules comprises a two-dimensional convolution layer, an instance normalization layer and a LeakyReLU activation layer, and the last convolution module comprises a padding layer and a two-dimensional convolution layer;
first, the original character image is taken as the input of the encoder, which extracts shape features from it and outputs a shape feature vector; a noise hidden vector is then randomly drawn from the standard normal distribution, the shape feature vector and the noise hidden vector are fused, and the fused hidden vector is input into the predictor, which is responsible for predicting the TPS transformation parameters and the affine transformation parameters, wherein the TPS transformation parameters are the coordinate values of the TPS sampling-grid matching points and the affine transformation parameters are converted into an affine transformation sampling grid; the TPS transformation sampling grid, the affine transformation sampling grid and the original character image are then input into the sampler, which outputs the deformed character image; meanwhile, the output of the predictor is connected to the inputs of the image reconstruction network and the noise reconstruction network, which reconstruct the original character image and the noise hidden vector, respectively; then, the deformed character image output by the generator and the target character image are respectively input into the discriminator, which outputs the discrimination results of the deformed character image and the target character image.
3. The character image augmentation method based on shape transformation as claimed in claim 2, wherein a least-squares adversarial loss function is constructed as the optimization target for training the shape-transformation generative adversarial network, and is calculated as follows:

$$\mathcal{L}_G = \mathbb{E}_x\big[(D(G(x)) - 1)^2\big] + \mathcal{L}_{SN} + \mathcal{L}_{div}$$

$$\mathcal{L}_D = \mathbb{E}_y\big[(D(y) - 1)^2\big] + \mathbb{E}_x\big[D(G(x))^2\big]$$

where $\mathcal{L}_G$ denotes the generator loss function, $D$ the discriminator, $G$ the generator, $\mathcal{L}_{SN}$ the signal-noise reconstruction loss function, $\mathcal{L}_{div}$ the diversity loss function, $\mathcal{L}_D$ the discriminator loss function, $x$ the original character image, $y$ the target character image, and $\mathbb{E}_x$ and $\mathbb{E}_y$ the corresponding mathematical expectations.
4. The method as claimed in claim 3, wherein the signal-noise reconstruction loss function includes a signal reconstruction sub-term, a noise reconstruction sub-term and a reconstruction error ratio term, and is calculated as follows:

$$\mathcal{L}_{SN} = \mathrm{MAE}(x, \hat{x}) + \mathrm{MAE}(z, \hat{z}) + \alpha \log\frac{\mathrm{MAE}(z, \hat{z})}{\mathrm{MAE}(x, \hat{x})}$$

where MAE denotes the mean absolute error, $z$ denotes the noise hidden vector, and $\hat{x}$ and $\hat{z}$ respectively denote the original image and the noise hidden vector reconstructed by the image reconstruction network $R_x$ and the noise reconstruction network $R_z$; $\alpha$ is a dynamic coefficient: if the reconstruction error ratio term is greater than $\log M$, $\alpha = 1$; if the reconstruction error ratio term is less than $-\log M$, $\alpha = -1$; and if the reconstruction error ratio term is within the ideal range, i.e. $[-\log M, \log M]$, $\alpha = 0$, where $M$ denotes a hyperparameter;

the diversity loss function is calculated as follows:

$$\mathcal{L}_{div} = -\,\mathrm{MAE}\big(P(E(x) + z_1),\; P(E(x) + z_2)\big)$$

where $P$ denotes the predictor, $E$ denotes the encoder, and $z_1$ and $z_2$ respectively denote different noise hidden vectors drawn from the same Gaussian distribution.
5. The method of claim 2, wherein each convolution module in the encoder further comprises a batch normalization layer disposed between the two-dimensional convolution layer and the nonlinear activation layer; the nonlinear activation function of the nonlinear activation layer in the encoder is the ReLU function, and the pooling operation of the pooling layer is max pooling;
each fully-connected module in the predictor further comprises a batch normalization layer located between the fully-connected layer and the nonlinear activation layer; the nonlinear activation function in the predictor is the ReLU function;
the transposed-convolution module further comprises a batch normalization layer located between the transposed convolution layer and the nonlinear activation layer; the nonlinear activation function in the transposed-convolution module is the ReLU function.
6. The method as claimed in claim 2, wherein the number of output channels of the last fully-connected layer in the predictor is set to 132; 128 of the deformation parameters to be predicted are the coordinates of the 64 TPS sampling-grid matching points, and 4 are the element values of the affine transformation matrix.
7. The character image augmentation method based on shape transformation according to claim 2 or 6, characterized in that the aim of the TPS transformation is to solve a deformation function $f$ such that $f(x_i, y_i) = (x'_i, y'_i)$ and the bending energy function is minimized, where $(x_i, y_i)$ denotes the coordinates of the TPS sampling-grid matching points on the original character image, $(x'_i, y'_i)$ denotes the coordinates of the TPS sampling-grid matching points on the deformed character image, and $n$ is the number of TPS sampling-grid matching points; assuming that n matched TPS sampling-grid coordinate pairs of the two images have been acquired, $(x_1, y_1) \leftrightarrow (x'_1, y'_1)$, $(x_2, y_2) \leftrightarrow (x'_2, y'_2)$, …, $(x_n, y_n) \leftrightarrow (x'_n, y'_n)$, the deformation function is imagined as bending a thin metal plate through the given n points, the energy function of bending the plate being expressed as:

$$E(f) = \iint_{\mathbb{R}^2} \left( \left(\frac{\partial^2 f}{\partial x^2}\right)^2 + 2\left(\frac{\partial^2 f}{\partial x\,\partial y}\right)^2 + \left(\frac{\partial^2 f}{\partial y^2}\right)^2 \right) dx\,dy$$

it can be proved that the thin-plate spline is the function with the minimum bending energy, the thin-plate spline function being:

$$f(x, y) = a_1 + a_2 x + a_3 y + \sum_{i=1}^{n} w_i\, U\big(\|(x_i, y_i) - (x, y)\|\big)$$

where $U(r) = r^2 \log r^2$ is the basis function; the coefficients $a_1$, $a_2$, $a_3$ and $w_i$ are obtained from the preset coordinates of the n TPS sampling-grid matching points and the offsets predicted by the predictor, whereby the specific expression of $f$ is obtained.
8. The method as claimed in claim 2 or 6, wherein the sampling formula of the affine transformation sampling grid is as follows:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} scale \cdot \cos\theta & -scale \cdot \sin\theta & t_x \\ scale \cdot \sin\theta & scale \cdot \cos\theta & t_y \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}$$

where $scale$, $\theta$, $t_x$ and $t_y$ respectively denote the affine transformation parameters predicted by the predictor, and $(x, y)$ and $(x', y')$ are respectively the position coordinates of a pixel point before and after the transformation.
9. The method as claimed in claim 2, wherein the sampler is implemented with the torch.nn.functional.grid_sample() method in PyTorch, and the affine parameters are converted into the affine transformation sampling grid with the torch.nn.functional.affine_grid() method.
10. The method of claim 2, wherein the pixel size of all images is 64 × 64, the batch size is 64, the initial learning rate is 0.0001, the number of iterations of the adversarial network is 5000, the learning rate begins to decay linearly to 1e-5 after 2500 iterations, and the adversarial network is optimized with the Adam optimizer.
CN202210285238.8A 2022-03-23 2022-03-23 Character image augmentation method based on shape transformation Active CN114782961B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210285238.8A CN114782961B (en) 2022-03-23 2022-03-23 Character image augmentation method based on shape transformation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210285238.8A CN114782961B (en) 2022-03-23 2022-03-23 Character image augmentation method based on shape transformation

Publications (2)

Publication Number Publication Date
CN114782961A true CN114782961A (en) 2022-07-22
CN114782961B CN114782961B (en) 2023-04-18

Family

ID=82424735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210285238.8A Active CN114782961B (en) 2022-03-23 2022-03-23 Character image augmentation method based on shape transformation

Country Status (1)

Country Link
CN (1) CN114782961B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399408A (en) * 2018-03-06 2018-08-14 李子衿 A deformed character correction method based on a deep spatial transformer network
CN111209497A (en) * 2020-01-05 2020-05-29 西安电子科技大学 DGA domain name detection method based on GAN and Char-CNN
CN111242241A (en) * 2020-02-17 2020-06-05 南京理工大学 Method for amplifying etched character recognition network training sample
CN111652332A (en) * 2020-06-09 2020-09-11 山东大学 Deep learning handwritten Chinese character recognition method and system based on two classifications
CN111915540A (en) * 2020-06-17 2020-11-10 华南理工大学 Method, system, computer device and medium for augmenting oracle character image
CN114037644A (en) * 2021-11-26 2022-02-11 重庆邮电大学 Artistic digital image synthesis system and method based on generation countermeasure network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399408A (en) * 2018-03-06 2018-08-14 李子衿 A deformed character correction method based on a deep spatial transformer network
CN111209497A (en) * 2020-01-05 2020-05-29 西安电子科技大学 DGA domain name detection method based on GAN and Char-CNN
CN111242241A (en) * 2020-02-17 2020-06-05 南京理工大学 Method for amplifying etched character recognition network training sample
CN111652332A (en) * 2020-06-09 2020-09-11 山东大学 Deep learning handwritten Chinese character recognition method and system based on two classifications
CN111915540A (en) * 2020-06-17 2020-11-10 华南理工大学 Method, system, computer device and medium for augmenting oracle character image
CN114037644A (en) * 2021-11-26 2022-02-11 重庆邮电大学 Artistic digital image synthesis system and method based on generation countermeasure network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Haobin: "Research on Oracle Bone Script Detection and Recognition Based on Deep Learning", China Masters' Theses Full-text Database, Philosophy and Humanities *

Also Published As

Publication number Publication date
CN114782961B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
Lei et al. Coupled adversarial training for remote sensing image super-resolution
CN110136063B (en) Single image super-resolution reconstruction method based on condition generation countermeasure network
Liang et al. Understanding mixup training methods
CN111563841B (en) High-resolution image generation method based on generation countermeasure network
Zhu et al. Data Augmentation using Conditional Generative Adversarial Networks for Leaf Counting in Arabidopsis Plants.
Liu et al. Very deep convolutional neural network based image classification using small training sample size
CN106447626A (en) Blurred kernel dimension estimation method and system based on deep learning
CN111080513B (en) Attention mechanism-based human face image super-resolution method
CN107590497A (en) Off-line Handwritten Chinese Recognition method based on depth convolutional neural networks
CN111695494A (en) Three-dimensional point cloud data classification method based on multi-view convolution pooling
CN112837224A (en) Super-resolution image reconstruction method based on convolutional neural network
CN114494003B (en) Ancient character generation method combining shape transformation and texture transformation
Guo et al. Multiscale semilocal interpolation with antialiasing
CN113096020B (en) Calligraphy font creation method for generating confrontation network based on average mode
CN113744136A (en) Image super-resolution reconstruction method and system based on channel constraint multi-feature fusion
CN113378949A (en) Dual-generation confrontation learning method based on capsule network and mixed attention
CN113240584A (en) Multitask gesture picture super-resolution method based on picture edge information
Lin et al. Generative adversarial image super‐resolution network for multiple degradations
CN114782961B (en) Character image augmentation method based on shape transformation
CN116797456A (en) Image super-resolution reconstruction method, system, device and storage medium
CN114155560B (en) Light weight method of high-resolution human body posture estimation model based on space dimension reduction
Jiang et al. Tcgan: Semantic-aware and structure-preserved gans with individual vision transformer for fast arbitrary one-shot image generation
CN112732943B (en) Chinese character library automatic generation method and system based on reinforcement learning
CN114140317A (en) Image animation method based on cascade generation confrontation network
CN115909045B (en) Two-stage landslide map feature intelligent recognition method based on contrast learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant