CN115482544A - Adaptive fitting model training method and device, computer equipment and storage medium - Google Patents

Adaptive fitting model training method and device, computer equipment and storage medium

Info

Publication number
CN115482544A
CN115482544A (application number CN202211135932.8A)
Authority
CN
China
Prior art keywords
text
real
synthetic
synthesized
normalization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211135932.8A
Other languages
Chinese (zh)
Inventor
姚旭峰
沈小勇
吕江波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Smartmore Technology Co Ltd
Original Assignee
Shenzhen Smartmore Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Smartmore Technology Co Ltd filed Critical Shenzhen Smartmore Technology Co Ltd
Priority to CN202211135932.8A priority Critical patent/CN115482544A/en
Publication of CN115482544A publication Critical patent/CN115482544A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19147Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/18Extraction of features or characteristics of the image

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a method and a device for training a self-adaptive fitting model, computer equipment, and a storage medium. The method comprises the following steps: acquiring a scene text image for training; performing batch normalization processing on each text in the scene text image for training through a text self-adaptive model to be trained to obtain synthesized text normalization features and real text normalization features; performing feature weight sorting on each real text normalization feature to obtain a first sorting result, and determining real text loss information according to the first sorting result; performing feature weight sorting on each synthesized text normalization feature to obtain a second sorting result, and determining synthesized text loss information according to the second sorting result; and adjusting the model parameters of the text self-adaptive model to be trained according to the real text loss information and the synthetic text loss information until a preset training end condition is met, obtaining the trained text self-adaptive model. The method can bridge the gap between synthesized text and real text in scene text.

Description

Adaptive fitting model training method and device, computer equipment and storage medium
Technical Field
The application relates to the technical field of artificial intelligence, and relates to a method and a device for training a self-adaptive fitting model, computer equipment and a storage medium.
Background
With the development of artificial intelligence technology, character recognition has emerged; it is a common computer vision task and has been well studied from an algorithmic point of view. Meanwhile, scene text detection techniques based on deep learning have developed tremendously. Most traditional obstacles, such as text of various sizes and shapes, curved text, complex lighting, and perspective distortion, have been solved by DNN models to a large extent. One common requirement of such methods is that sufficient labeled data be available for training to handle the various situations described above.
However, the amount of labeled data is usually less than desirable, and using synthetic data is a common way to alleviate the shortage of labeled real data. Aided by advanced text generation technology, computers can generate text in large quantities over any background, not only at low cost but also with accurate labels. However, synthetic data cannot simply be considered identical to real data. One of the main reasons is the distribution gap between real text and synthetic text, which results in poor generalization. Another potential reason is that the synthetic data set is generated on backgrounds unrelated to the scene text data and may therefore contain some bias toward those backgrounds compared with the scene text data. This results in inefficient and less accurate fitting.
Disclosure of Invention
The application provides a method and a device for training a self-adaptive fitting model, computer equipment, and a storage medium, which can bridge the gap between synthetic text and real text in scene text.
In a first aspect, the present application provides a method for training an adaptive fitting model, including:
acquiring a scene text image for training; the scene text image for training is obtained by filling synthetic text into at least one preset scene text image; the synthetic text is obtained by performing style adjustment on at least one training text; the preset scene text image is a scene image comprising at least one real text;
respectively carrying out batch normalization processing on synthetic texts and real texts in scene text images for training through a text self-adaptive model to be trained to obtain synthetic text normalization features corresponding to at least two synthetic texts and real text normalization features corresponding to at least two real texts;
performing feature weight sorting on each real text normalization feature to obtain a first sorting result, and determining real text loss information corresponding to the real text according to the first sorting result; performing feature weight sorting on each synthesized text normalized feature to obtain a second sorting result, and determining synthesized text loss information corresponding to the synthesized text according to the second sorting result;
according to the real text loss information and the synthetic text loss information, adjusting model parameters of the text self-adaptive model to be trained until a preset training end condition is met, and obtaining the trained text self-adaptive model; the trained text self-adaptive model is used for generating a target text meeting the preset text effect.
In a second aspect, the present application further provides an adaptive fitting model training apparatus, including:
the acquisition module is used for acquiring scene text images for training; the scene text image for training is obtained by filling synthetic text into at least one preset scene text image; the synthetic text is obtained by performing style adjustment on at least one training text; the preset scene text image is a scene image comprising at least one real text;
the processing module is used for respectively carrying out batch normalization processing on the synthetic texts and the real texts in the scene text images for training through the text self-adaptive model to be trained to obtain synthetic text normalization features corresponding to at least two synthetic texts and real text normalization features corresponding to at least two real texts;
the sorting module is used for carrying out feature weight sorting on each real text normalization feature to obtain a first sorting result, and determining real text loss information corresponding to the real text according to the first sorting result; performing feature weight sorting on each synthesized text normalized feature to obtain a second sorting result, and determining synthesized text loss information corresponding to the synthesized text according to the second sorting result;
the training module is used for adjusting model parameters of the text self-adaptive model to be trained until a preset training end condition is met according to the real text loss information and the synthesized text loss information to obtain the trained text self-adaptive model; the trained text self-adaptive model is used for generating a target text meeting the preset text effect.
In a third aspect, the present application further provides a computer device, where the computer device includes a memory and a processor, the memory stores a computer program, and the processor implements the steps in the adaptive fitting model training method when executing the computer program.
In a fourth aspect, the present application further provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, performs the steps of the above-mentioned adaptive fitting model training method.
In a fifth aspect, the present application further provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of the above-mentioned method for training an adaptive fitting model.
According to the self-adaptive fitting model training method and device, the computer equipment, and the storage medium, a convolutional neural network is used to extract features from the real text and the synthesized text, and the different responses that the batch normalization features exhibit at different positions of the synthesized text and the real text are observed. The batch normalization features of the real text are then ranked by importance, and a weighting mechanism is introduced on this basis, so that the model can better focus on the active features of the real text while optimizing on both the synthesized text and the real text. The model is trained with a large number of real and synthetic texts, so the self-adaptive fitting model is effective at bridging the gap between synthetic text and real text in scene text detection, and the accuracy and efficiency of generating synthetic text can be effectively improved.
Drawings
Fig. 1 is an application environment diagram of a method for training an adaptive fitting model according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of a method for training an adaptive fitting model according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of a first method for obtaining text normalization features according to an embodiment of the present disclosure;
fig. 4 is a schematic flowchart of a second method for obtaining text normalization features according to an embodiment of the present application;
fig. 5 is a schematic flowchart of a third method for obtaining text normalization features according to an embodiment of the present application;
fig. 6 is a schematic flowchart of a method for determining loss information of a real text according to an embodiment of the present application;
fig. 7 is a schematic flowchart of a method for determining loss information of a synthesized text according to an embodiment of the present application;
fig. 8 is a flowchart illustrating a method for obtaining a second sorting result according to an embodiment of the present application;
fig. 9 is an exemplary diagram of a synthesized text and an actual text provided in an embodiment of the present application;
fig. 10 is a schematic flowchart illustrating a process of calculating a loss value of a synthesized text and an actual text according to an embodiment of the present application;
fig. 11 is a schematic flowchart of text adaptive model training according to an embodiment of the present disclosure;
FIG. 12 is a diagram illustrating a comparison of different text adaptation models provided by an embodiment of the present application;
fig. 13 is a block diagram of a structure of an adaptive fitting model training apparatus according to an embodiment of the present application;
fig. 14 is an internal structural diagram of a computer device according to an embodiment of the present application;
fig. 15 is an internal structural diagram of a computer-readable storage medium according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present application more clearly understood, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and are not intended to limit the application.
The adaptive fitting model training method provided by the embodiments of the application can be applied in the application environment shown in fig. 1. The computer device 102 acquires data; the server 104, in response to an instruction from the computer device 102, receives the data from the computer device 102 and performs calculations on it, then transmits the calculation result back to the computer device 102, which displays it. The computer device 102 communicates with the server 104 over a communication network. The data storage system may store data that the server 104 needs to process; it may be integrated on the server 104 or placed on the cloud or another network server. The server 104 acquires scene text images for training through the computer device 102; the scene text image for training is obtained by filling synthetic text into at least one preset scene text image; the synthetic text is obtained by performing style adjustment on at least one training text; the preset scene text image is a scene image comprising at least one real text. Batch normalization processing is performed on the synthetic texts and the real texts in the scene text images for training through the text self-adaptive model to be trained, obtaining synthetic text normalization features corresponding to at least two synthetic texts and real text normalization features corresponding to at least two real texts. Feature weight sorting is performed on each real text normalization feature to obtain a first sorting result, and real text loss information corresponding to the real text is determined according to the first sorting result; feature weight sorting is performed on each synthesized text normalization feature to obtain a second sorting result, and synthesized text loss information corresponding to the synthesized text is determined according to the second sorting result. According to the real text loss information and the synthesized text loss information, the model parameters of the text self-adaptive model to be trained are adjusted until a preset training end condition is met, obtaining the trained text self-adaptive model; the trained text self-adaptive model is used for generating target texts meeting preset text effects. The computer device 102 may be, but is not limited to, various personal computers, notebook computers, smartphones, tablet computers, Internet-of-Things devices, and portable wearable devices; the Internet-of-Things devices may be smart speakers, smart televisions, smart air conditioners, smart in-vehicle devices, and the like. The portable wearable device may be a smart watch, a smart bracelet, a head-mounted device, and the like. The server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
In some embodiments, as shown in fig. 2, an adaptive fitting model training method is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
step 202, obtaining a scene text image for training.
The scene text image may be a sample image for training; it contains a synthesized text, a real text, and a real scene image, where the synthesized text and the real text are overlaid on the scene image but do not overlap each other.
Specifically, the server responds to an instruction of the computer device, acquires a plurality of training scene text images from the computer device, stores them in the storage unit, and, when any image information in the training scene text images needs to be processed, calls volatile storage resources from the storage unit for the central processing unit to perform the calculation. The training scene text image is obtained by filling synthetic text into at least one preset scene text image; the synthetic text is obtained by performing style adjustment on at least one training text; the preset scene text image is a scene image comprising at least one real text.
For example, in response to an instruction from the computer device 102, the server 104 acquires a plurality of training scene text images from the computer device 102 and stores the training scene text images in a storage unit in the server 104, wherein 10 training scene text images acquired by the server 104 can be sequentially input to the server 104 for processing.
Step 204, performing batch normalization processing on the synthetic texts and the real texts in the scene text images for training respectively through the text adaptive models to be trained to obtain synthetic text normalization features corresponding to at least two synthetic texts and real text normalization features corresponding to at least two real texts.
The text adaptive model to be trained may be an untrained text adaptive model, and the model may be used to render the synthesized text in the scene text image, that is, to make the synthesized text as close as possible to the scene of the real text.
The synthesized text can be a text synthesized by a computer, and the expression form of the synthesized text can be characters, character strings and synthesized pictures with semantics.
The real text may be a text obtained by taking a picture under a real condition, and the representation form of the real text may be a word, a character string, and a real picture with semantics.
Batch normalization processing may be used to eliminate the influence of differing scales among indicators; normalizing the data is necessary to make the data indicators comparable. After the raw data undergo standardization, all indicators are on the same order of magnitude, which makes them suitable for comprehensive comparison and evaluation. Data normalization is generally [0, 1] min-max normalization or normal-distribution (z-score) standardization.
The synthesized text normalization feature may be a normalization feature obtained by batch normalization processing of feature values corresponding to the synthesized text.
The real text normalization feature can be a normalization feature obtained by batch normalization processing of feature values corresponding to the real text.
Specifically, the generation process of the scene text image for training is as follows, the first step: generating a synthetic text instance using an open source text generation tool, wherein words of the text instance are to be randomly selected from a dictionary in a particular language, and then generating images of the words according to specified fonts, backgrounds, and modifications (skew, blur, etc.); the second step: acquiring a scene text image from a real data set as a background, wherein all backgrounds are real and have real text data labels; the third step: pasting the instance of the synthesized text onto the image of the scene, wherein the location of the synthesized text is randomly selected. The real text and the synthesized text are shown in fig. 9.
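The three-step generation process above can be sketched as follows. This is a minimal illustration and not the API of any actual open-source text generation tool; the function and variable names are hypothetical, and the "text patch" stands in for a rendered word image.

```python
import numpy as np

def paste_synthetic_text(background, text_patch, rng=None):
    """Paste a synthetic text patch onto a real scene image at a
    randomly selected location (a simplified stand-in for step three)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    H, W = background.shape[:2]
    h, w = text_patch.shape[:2]
    top = int(rng.integers(0, H - h + 1))   # random row for the patch
    left = int(rng.integers(0, W - w + 1))  # random column for the patch
    out = background.copy()
    out[top:top + h, left:left + w] = text_patch
    return out, (top, left, h, w)

# usage: a gray "scene" background and a white "rendered word" patch
bg = np.full((64, 128), 128, dtype=np.uint8)
patch = np.full((16, 48), 255, dtype=np.uint8)
img, box = paste_synthetic_text(bg, patch)
```

Because the patch location is returned, the synthetic text carries an accurate label for free, which is the point of using synthetic data here.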
The generated scene text images for training contain real text, synthetic text, and a scene image. The scene text image for training is input into the text self-adaptive model to be trained; a convolutional neural network in the model extracts features of the synthesized text and the real text in the scene text image to obtain a synthesized text extraction feature value corresponding to the synthesized text and a real text extraction feature value corresponding to the real text, and then the two extraction feature values are input into a batch normalization processing layer in the text self-adaptive model to be trained for batch normalization calculation. The statistics of the batch normalization processing layer include the characteristics of the data domain; therefore, by replacing the batch normalization statistics of the source data with those of the target data, the domain shift problem can be solved to a large extent. The conversion of the input features of a given batch normalization layer into that layer's output features can be expressed as:
z_i = γ_i · (x_i − E[X_i]) / √(Var[X_i] + ε) + β_i
wherein γ_i and β_i are learnable parameters learned by the network itself during training, x_i is the input feature value, z_i is the output feature value, E[X_i] is the batch mean, Var[X_i] is the batch variance, and ε is a small positive constant.
Given the batch mean and variance, the batch normalization features are defined as:
X̂_i = (X_i − E[X_i]) / √(Var[X_i] + ε)
wherein X_i is the feature value, E[X_i] is the batch mean, and Var[X_i] is the batch variance.
The procedure for the batch normalization process is as follows: 1. calculate the mean of each training batch of data; 2. calculate the variance of each training batch of data; 3. normalize the training data of the batch using the obtained mean and variance to obtain a distribution with mean 0 and variance 1, where a small positive number is introduced to avoid division by zero; 4. apply the scaling and offset. Based on the above process, the synthesized text normalization features corresponding to the at least two synthesized texts and the real text normalization features corresponding to the at least two real texts can be calculated.
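The four steps above can be sketched as follows; this is a minimal per-feature illustration with hypothetical names, where gamma and beta correspond to the learnable scale and shift parameters γ_i and β_i.

```python
import numpy as np

def batch_normalize(x, gamma, beta, eps=1e-5):
    """Batch normalization following the four steps above."""
    mean = x.mean(axis=0)                    # step 1: batch mean E[X_i]
    var = x.var(axis=0)                      # step 2: batch variance Var[X_i]
    x_hat = (x - mean) / np.sqrt(var + eps)  # step 3: normalize (eps avoids /0)
    return gamma * x_hat + beta              # step 4: scale and shift

# usage: a batch of 3 samples with 2 features each
x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
z = batch_normalize(x, gamma=np.ones(2), beta=np.zeros(2))
```

With gamma = 1 and beta = 0 the output is the batch normalization feature X̂_i itself: each column has mean 0 and variance (almost exactly) 1.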
It is worth noting that, because an online synthetic text generation strategy is employed, real text and synthetic text may be present in one image at the same time. Extracting pairs of real text and synthetic text also has an important implication: the batch normalization statistics of the source and the target are separable, meaning that the source and target data can be distinguished by their mean and variance. Likewise, after the batch normalization features are computed with domain-specific batch normalization statistics, the responses to the source and target data should be similar.
Step 206, performing feature weight sorting on each real text normalization feature to obtain a first sorting result, and determining real text loss information corresponding to the real text according to the first sorting result; and performing feature weight sorting on each synthesized text normalized feature to obtain a second sorting result, and determining synthesized text loss information corresponding to the synthesized text according to the second sorting result.
The feature weight sorting may be sorting the real text normalized features or the synthesized text normalized features according to a preset weight rule.
The first ranking result may be a ranking result formed by combining a plurality of real text normalized features ranked by the real text normalized features through the feature weights.
The second ranking result may be a ranking result formed by combining a plurality of synthesized text normalized features ranked by feature weights.
The real text loss information may be a loss value obtained by calculating a real text normalization feature using a loss function.
The synthesized text loss information may be a loss value obtained by calculating the normalized feature of the synthesized text using a loss function.
Specifically, the responses of the batch normalized features to the real text and the synthetic text are similar, which means that the neural network learns useful features of the text. Because one image contains both synthetic text and real text, it is difficult to extract the corresponding features; therefore, the importances of the batch normalization features of the real text and the synthetic text are ranked, and the rankings of the real text and the synthetic text are then compared. Following the practice of filter pruning, a method based on the Taylor expansion criterion is adopted to identify and evaluate the importance of neurons. Accordingly, for each real text normalization feature and each synthesized text normalization feature, the feature weight is calculated using a Taylor expansion. The weight values corresponding to the real text normalization features are sorted to obtain the first sorting result; the weight values corresponding to the synthesized text normalization features are sorted to obtain an intermediate sorting result, which is multiplied by the first sorting result of the real text normalization features to obtain the second sorting result corresponding to the synthesized text normalization features.
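The importance ranking can be sketched with the standard first-order Taylor pruning criterion, which scores a feature by the batch average of |activation × gradient of the loss|. The exact weighting used in the application is not spelled out here, so the form below is an assumed, illustrative one.

```python
import numpy as np

def taylor_importance(activations, gradients):
    """First-order Taylor criterion: importance of each feature is
    |activation * gradient|, averaged over the batch (axis 0)."""
    return np.abs(activations * gradients).mean(axis=0)

def rank_features(activations, gradients):
    """Return feature indices sorted from most to least important."""
    scores = taylor_importance(activations, gradients)
    return np.argsort(scores)[::-1], scores

# usage: 2 samples, 3 batch-normalized features
acts = np.array([[0.1, 2.0, 0.5], [0.2, 1.5, 0.4]])
grads = np.array([[1.0, 0.3, 0.1], [0.8, 0.4, 0.2]])
order, scores = rank_features(acts, grads)
```

Running this ranking once on the real text features gives the first sorting result; the analogous ranking on the synthetic text features gives the intermediate result that is then combined with it.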
For the real text, inputting the first sequencing result, the probability of any pixel in the scene image, the height and the width of the scene image into a classification layer in the text adaptive model to be trained for calculating a loss value, and obtaining real text loss information corresponding to the real text, wherein a calculation formula of the real text loss information is as follows:
[The real-text loss formula appears only as an image in the original document.]
where H and W represent the height and width, respectively, of the corresponding scene image, and y represents the probability of real text at a pixel.
For the synthetic text, detailed label information such as character level and word level can be obtained, the second sequencing result, the region corresponding to the scene image and the region corresponding to the synthetic text are input to a classification layer in the text adaptive model to be trained for calculating loss values, and synthetic text loss information corresponding to the synthetic text is obtained; the calculation formula of the synthetic text loss information is as follows:
[The synthetic-text loss formula appears only as an image in the original document.]
wherein U is the region corresponding to the scene image, and V is the predicted region of the synthesized text in the scene image. The calculation procedure for the real text loss information and the synthetic text loss information is shown in fig. 10.
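Since both loss formulas appear only as images in the original, the sketch below uses assumed, hypothetical forms that are merely consistent with the surrounding descriptions: a per-pixel negative log-probability averaged over the H × W map for the real text, and a region-overlap loss built from U and V for the synthetic text. Neither should be read as the patent's actual formulas.

```python
import numpy as np

def real_text_loss(y):
    """Hypothetical real-text loss: mean negative log-probability of
    real text over the H x W per-pixel probability map y."""
    H, W = y.shape
    return -np.log(np.clip(y, 1e-7, 1.0)).sum() / (H * W)

def synthetic_text_loss(U, V):
    """Hypothetical synthetic-text loss: one minus the overlap ratio of
    the image region U and the predicted synthetic-text region V
    (boolean masks)."""
    inter = np.logical_and(U, V).sum()
    union = np.logical_or(U, V).sum()
    return 1.0 - inter / max(union, 1)

# usage: a confident 4x4 probability map and two identical region masks
y = np.full((4, 4), 0.9)
U = np.zeros((4, 4), dtype=bool); U[:2, :2] = True
V = np.zeros((4, 4), dtype=bool); V[:2, :2] = True
L_real = real_text_loss(y)
L_syn = synthetic_text_loss(U, V)
```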
And step 208, adjusting model parameters of the text adaptive model to be trained according to the real text loss information and the synthetic text loss information until a preset training end condition is met, and obtaining the trained text adaptive model.
The trained text self-adaptive model is used for generating a target text meeting a preset text effect; it can be used to render the synthesized text in the scene text image, i.e., to make the synthesized text as close as possible to the scene of the real text.
Specifically, using the real text loss information and the synthetic text loss information of the text self-adaptive model to be trained, calculated through the real text loss function and the synthetic text loss function, as references, the model parameters of the text self-adaptive model to be trained are adjusted; the loss calculation and parameter adjustment are repeated multiple times until the preset neural network convergence condition for ending training is met, and the trained text self-adaptive model is obtained. The overall training process of the text self-adaptive model is shown in fig. 11.
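The adjust-until-convergence loop described above can be sketched as follows. The concrete convergence condition (loss change below a tolerance) and all names are illustrative assumptions, since the patent only states that a preset training-end condition must be met.

```python
def train_adaptive_model(step_fn, max_steps=100, tol=1e-4):
    """Skeleton of the training loop: repeatedly compute the combined
    real-text + synthetic-text loss and adjust parameters, stopping
    once the loss change falls below tol (assumed end condition)."""
    prev = float("inf")
    for step in range(max_steps):
        loss = step_fn()             # compute losses and update parameters
        if abs(prev - loss) < tol:   # preset training-end condition
            return step + 1, loss
        prev = loss
    return max_steps, prev

# toy step_fn: the loss halves on each call (stand-in for a real update)
state = {"loss": 1.0}
def toy_step():
    state["loss"] *= 0.5
    return state["loss"]

steps, final = train_adaptive_model(toy_step)
```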
In the self-adaptive fitting model training method, a convolutional neural network is used to extract features of the real text and the synthesized text, and the different responses that the batch normalization features exhibit at different positions of the synthesized text and the real text are observed. The batch normalization features of the real text are then ranked by importance, and a weighting mechanism is introduced on this basis, so that the model can better focus on the active features of the real text while optimizing on both the synthesized text and the real text. The model is trained with a large number of real and synthetic texts, so the self-adaptive fitting model is effective at bridging the gap between synthetic text and real text in scene text detection, and the accuracy and efficiency of generating synthetic text can be effectively improved.
In some embodiments, as shown in fig. 3, the performing, by using a text adaptive model to be trained, batch normalization processing on a synthesized text and a real text in a scene text image for training respectively to obtain synthesized text normalized features corresponding to at least two synthesized texts and real text normalized features corresponding to at least two real texts includes:
step 302, respectively performing feature extraction on the synthetic text and the real text in the scene text image for training through the text adaptive model to be trained, so as to obtain a synthetic text extraction feature value corresponding to the synthetic text and a real text extraction feature value corresponding to the real text.
The feature value extracted from the synthesized text may be a feature vector obtained by extracting features of the synthesized text in the scene text image through a convolutional neural network.
The feature value of the real text extraction may be a feature vector obtained after the real text in the scene text image is subjected to feature extraction by the convolutional neural network.
Specifically, the generated scene text image for training contains real text, synthetic text, and a scene background. The scene text image for training is input into the text adaptive model to be trained, and a convolutional neural network in the model performs feature extraction on the synthetic text and the real text to obtain a synthetic text extraction feature value corresponding to the synthetic text and a real text extraction feature value corresponding to the real text. Because the convolutional neural network learns by a gradient descent algorithm, its input features need to be standardized. Specifically, before the learning data is fed into the network, the input data is normalized in the channel or time/frequency dimension; if the input data is pixels, the raw pixel values distributed in [0, 255] can be normalized to the [0, 1] interval. The hidden layers used for feature extraction include three common structures, namely convolutional layers, pooling layers and fully-connected layers, and may also contain more complicated structures such as Inception modules and residual blocks. Generally, convolutional layers and pooling layers are specific to convolutional neural networks; the convolution kernels in convolutional layers contain weight coefficients, whereas pooling layers do not. The common order of hidden layers is usually: input - convolutional layer - pooling layer - fully-connected layer - output.
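The input normalization and a single convolutional feature-extraction step described above can be sketched as follows. This is a minimal illustration; the 3x3 kernel is an arbitrary stand-in for a trained weight, not a parameter from the patent.

```python
import numpy as np

def normalize_pixels(img):
    # Map raw pixel values distributed in [0, 255] to the [0, 1] interval.
    return img.astype(np.float64) / 255.0

def conv2d_valid(img, kernel):
    # Minimal "valid" 2-D convolution (no padding, stride 1), the basic
    # operation of a convolutional layer.
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out
```

In a real network the kernel weights would be learned by gradient descent and followed by pooling and fully-connected layers, as described above.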
And 304, respectively carrying out batch normalization processing on the extracted feature values of the synthetic texts and the extracted feature values of the real texts to obtain the normalized features of the synthetic texts corresponding to at least two synthetic texts and the normalized features of the real texts corresponding to at least two real texts.
Step 304 refers to the related description of step 204, and will not be described here.
In this embodiment, features of the synthetic text and the real text are extracted by the convolutional neural network and then input to the batch normalization layer for batch normalization processing, so as to obtain the synthetic text normalization features and the real text normalization features. This can accelerate the training of the text adaptive model to be trained and may even improve its precision.
In some embodiments, as shown in fig. 4, the performing batch normalization on the extracted feature values of the synthesized text and the extracted feature values of the real text to obtain normalized features of the synthesized text corresponding to the at least two synthesized texts and normalized features of the real text corresponding to the at least two real texts includes:
step 402, respectively performing feature value conversion on the extracted feature value of the synthesized text and the extracted feature value of the real text to obtain a converted feature value of the synthesized text corresponding to the extracted feature value of the synthesized text and a converted feature value of the real text corresponding to the extracted feature value of the real text.
The feature value of the synthetic text conversion may be a feature value obtained by extracting a feature value from the synthetic text through adjustment of learning parameters, batch mean values and variances, and used for batch normalization calculation.
The real text conversion characteristic value can be a characteristic value obtained by real text extraction characteristic value through adjustment of learning parameters, batch mean values and variances and used for batch normalization calculation.
Specifically, the extracted feature values of the synthesized text and the extracted feature values of the real text are input to a batch normalization processing layer in the text adaptive model to be trained for batch normalization calculation, where the statistics of the batch normalization layer reflect the features of a data domain. Therefore, by replacing the batch normalization statistics of the source data with those of the target data, the domain shift problem can be solved to a large extent. The conversion from the input features of a given batch normalization layer to its output features can be expressed as:

z_i = γ_i · (x_i − E[X_i]) / √(Var[X_i]) + β_i

where γ_i and β_i are learnable parameters obtained by the network itself during training, x_i is the input feature value, z_i is the output feature value, E[X_i] is the batch mean, and Var[X_i] is the variance.
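A minimal sketch of this batch normalization computation over a batch of feature values. The small `eps` term is a common numerical-stability addition found in standard implementations, not shown in the expression above.

```python
import numpy as np

def batch_normalize(x, gamma, beta, eps=1e-5):
    """Batch normalization: z = gamma * (x - E[X]) / sqrt(Var[X] + eps) + beta.
    `x` has shape (batch, features); statistics are taken over the batch axis."""
    mean = x.mean(axis=0)            # batch mean E[X_i]
    var = x.var(axis=0)              # batch variance Var[X_i]
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

With `gamma = 1` and `beta = 0` the output has zero mean and (approximately) unit variance per feature, which is the normalization step the learnable parameters then rescale and shift.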
Step 404, performing batch normalization calculation on the synthetic text conversion characteristic values and the real text conversion characteristic values respectively to obtain synthetic text normalization characteristics corresponding to the at least two synthetic texts and real text normalization characteristics corresponding to the at least two real texts.
Step 404 is described in relation to step 204 and will not be described here.
In this embodiment, the feature value conversion is performed on the extracted feature value of the synthetic text and the extracted feature value of the real text, and the converted feature value of the synthetic text and the converted feature value of the real text are input to the batch normalization layer as input feature values to perform batch normalization calculation, so that the problem of domain shift can be solved to a great extent, and the precision of the text adaptive model to be trained is improved.
In some embodiments, as shown in fig. 5, performing batch normalization calculation on the synthesized text conversion feature values and the real text conversion feature values respectively to obtain synthesized text normalization features corresponding to at least two synthesized texts and real text normalization features corresponding to at least two real texts includes:
step 502, performing batch mean calculation on the synthetic text conversion characteristic value and the real text conversion characteristic value respectively to obtain a synthetic text batch mean corresponding to the synthetic text conversion characteristic value and a real text batch mean corresponding to the real text conversion characteristic value.
The synthesized text batch average value may be a calculated value obtained by calculating each synthesized text conversion characteristic value by a batch average value method.
The real text batch average value may be a calculated value obtained by performing batch average value calculation on each real text conversion characteristic value.
Specifically, batch mean calculation is performed on the synthetic text conversion feature values and the real text conversion feature values respectively to obtain a synthetic text batch mean corresponding to the synthetic text conversion feature values and a real text batch mean corresponding to the real text conversion feature values, where the batch mean is denoted E[X_i]. The batch mean calculation divides the synthetic text conversion feature values and the real text conversion feature values into several equal-length intervals and performs one independent simulation run per interval, yielding a sequence of random variables that approximates mutually independent, identically distributed variables, from which the synthetic text batch mean and the real text batch mean are obtained. Factors such as how many intervals the data is divided into, how many observations each interval contains, and whether the resulting random variable sequences are mutually independent need to be considered. Generally, the number of samples in each interval and the number of segments should both be large enough to ensure the mutual independence of the random variable sequences.
Step 504, performing variance calculation on the synthesized text conversion characteristic value and the real text conversion characteristic value respectively to obtain a synthesized text variance corresponding to the synthesized text conversion characteristic value and a real text variance corresponding to the real text conversion characteristic value.
The synthetic text variance may be a calculated value obtained by performing variance calculation on each synthetic text conversion feature value.
The true text variance may be a calculated value obtained by performing variance calculation on each true text conversion feature value.
Specifically, variance calculation is performed on the synthetic text conversion feature values and the real text conversion feature values respectively to obtain a synthetic text variance corresponding to the synthetic text conversion feature values and a real text variance corresponding to the real text conversion feature values. The variance is denoted Var[X_i]. Let X be a discrete random variable; if E[(X − E[X])²] exists, it is called the variance of X, namely Var[X_i], where E[X] is the expected value of X and E abbreviates "expected value". The variance calculation formula is then:

Var[X_i] = E[(X − E[X])²] = E[X²] − (E[X])²
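The batch mean and the two equivalent forms of the variance formula above can be checked with a short sketch (illustrative only; in practice these statistics come from the batch normalization layer):

```python
import numpy as np

def batch_mean(values):
    # E[X]: arithmetic mean over the batch of conversion feature values.
    return float(np.mean(values))

def batch_variance(values):
    # Var[X] = E[(X - E[X])^2] = E[X^2] - (E[X])^2  (second form used here).
    v = np.asarray(values, dtype=float)
    return float(np.mean(v ** 2) - np.mean(v) ** 2)
```

Both forms of the variance agree, which is why implementations are free to accumulate E[X] and E[X²] in a single pass over the batch.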
step 506, performing batch normalization calculation on the synthetic text batch mean value and the synthetic text variance to obtain synthetic text normalization characteristics corresponding to at least two synthetic texts; and carrying out batch normalization calculation on the real text batch mean value and the real text variance to obtain real text normalization characteristics corresponding to at least two real texts.
Step 506 is described in relation to step 204, and will not be described here.
In this embodiment, the batch mean and variance corresponding to the synthetic text and the real text are calculated, so that after batch normalization calculation the synthetic text normalization features and the real text normalization features can be distinguished for the source data and the target data through their means and variances. Meanwhile, the batch normalization statistics can serve as a measure of channel importance in deep learning, achieving the aim of improving the precision of the text adaptive model to be trained.
In some embodiments, as shown in fig. 6, performing feature weight sorting on each real text normalized feature to obtain a first sorting result, and determining real text loss information corresponding to a real text according to the first sorting result, includes:
step 602, calculating the feature weight corresponding to each real text normalized feature according to taylor expansion to obtain a first feature weight calculation result, and ranking each real text normalized feature based on the first feature weight calculation result to obtain a first ranking result.
The first feature weight calculation result may be weights obtained by calculating normalized features of each real text by using taylor expansion.
Specifically, for the real text normalization features, the feature weight of each real text normalization feature is calculated using a Taylor expansion, so that a weight value corresponding to each real text normalization feature can be obtained, and the real text normalization features are sorted according to a preset sorting method to obtain a first sorting result. If the function f corresponding to the real text normalization features has derivatives of every order on the open interval I (for example, a polynomial, power function, trigonometric function, exponential function, or logarithmic function), the function can be approximated by a polynomial, and in a certain sense, the larger the total number of terms N, the more accurate the approximation. Specifically, for any x₀ belonging to I, there exists a unique sequence {cₙ} such that for any positive integer N:

f(x) = Σₙ₌₀ᴺ cₙ (x − x₀)ⁿ + o((x − x₀)ᴺ)

where (x − x₀)⁰ is always taken as 1, even when x = x₀. Each coefficient cₙ is given by the n-th order derivative of the function at x₀:

cₙ = f⁽ⁿ⁾(x₀) / n!

where the factorial of zero is 0! = 1. When x = x₀, the function value equals the polynomial value. When the number of terms N is finite, the smaller |x − x₀| is, the closer the polynomial is to the function.
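The coefficient formula cₙ = f⁽ⁿ⁾(x₀)/n! and the behavior of the truncated expansion can be illustrated with a small sketch, using f(x) = eˣ at x₀ = 0 (where every derivative equals 1) purely as an example; the patent does not specify the function being expanded.

```python
import math

def taylor_coeff(n, deriv_at_x0):
    # c_n = f^(n)(x0) / n!   (with 0! = 1)
    return deriv_at_x0 / math.factorial(n)

def taylor_approx(x, x0, derivs):
    # Sum of c_n * (x - x0)^n for n = 0..N; (x - x0)^0 is taken as 1
    # even when x == x0.
    return sum(taylor_coeff(n, d) * (x - x0) ** n
               for n, d in enumerate(derivs))
```

With ten terms the approximation of e at x = 1 is already accurate to about 3e-7, showing how accuracy improves with the total number of terms N.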
And step 604, determining real text loss information corresponding to the real text according to the first sequencing result, the probability of any pixel of the real text in the training scene text image, and the height and width corresponding to the training scene text image.
Step 604 refers to the related description of step 206, and will not be described herein.
In this embodiment, the feature weights corresponding to the real text normalization features are calculated through a Taylor expansion and then sorted to obtain the first sorting result, and the real text loss information is calculated according to the first sorting result. The importance of different real text batch normalization features can thus be evaluated based on the Taylor expansion criterion, which in turn influences the weights used in calculating the real text loss information, improving both the training speed and the accuracy of the text adaptive model.
In some embodiments, as shown in fig. 7, performing feature weight sorting on each normalized feature of the synthesized text to obtain a second sorting result, and determining loss information of the synthesized text corresponding to the synthesized text according to the second sorting result, includes:
step 702, calculating the feature weight corresponding to each synthesized text normalized feature according to the taylor expansion to obtain a second feature weight calculation result, and ranking each synthesized text normalized feature based on the second feature weight calculation result to obtain a second ranking result.
Specifically, for the synthesized text normalization features, the feature weight of each synthesized text normalization feature is calculated using a Taylor expansion, so that a weight value corresponding to each synthesized text normalization feature can be obtained, and the features are sorted according to a preset sorting method to obtain a second sorting result. If the function f corresponding to the synthesized text normalization features has derivatives of every order on the open interval I (for example, a polynomial, power function, trigonometric function, exponential function, or logarithmic function), the function can be approximated by a polynomial, and in a certain sense, the larger the total number of terms N, the more accurate the approximation. Specifically, for any x₀ belonging to I, there exists a unique sequence {cₙ} such that for any positive integer N:

f(x) = Σₙ₌₀ᴺ cₙ (x − x₀)ⁿ + o((x − x₀)ᴺ)

where (x − x₀)⁰ is always taken as 1, even when x = x₀. Each coefficient cₙ is given by the n-th order derivative of the function at x₀:

cₙ = f⁽ⁿ⁾(x₀) / n!

where the factorial of zero is 0! = 1. When x = x₀, the function value equals the polynomial value. When the number of terms N is finite, the smaller |x − x₀| is, the closer the polynomial is to the function.
Step 704, determining synthetic text loss information corresponding to the synthetic text according to the second sorting result, the region corresponding to the training scene text image, and the region corresponding to the synthetic text.
The region corresponding to the scene image may be the region enclosed by the connected pixels in the scene image.

The region corresponding to the synthesized text may be the region enclosed by the connected pixels corresponding to the synthesized text in the scene image.
Specifically, for the synthesized text, detailed label information, such as character-level and word-level labels, may be obtained. The second sorting result, the region corresponding to the scene image, and the region corresponding to the synthesized text are input to a classification layer in the text adaptive model to be trained to perform loss value calculation, so as to obtain the synthetic text loss information corresponding to the synthesized text. The synthetic text loss information is calculated with a Dice loss:

L_dice = 1 − 2|U ∩ V| / (|U| + |V|)

where U is the region of the image corresponding to the scene image, and V is the region of the synthesized text predicted in the scene image.
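A minimal sketch of a Dice-style synthetic text loss on regions represented as sets of pixel coordinates. The 1 − 2|U ∩ V| / (|U| + |V|) form is the common Dice loss and is assumed here; the patent's exact expression is rendered as an image in the original.

```python
def dice_loss(pred_region, true_region):
    """Dice loss between a predicted region V and a reference region U,
    each given as a set of (row, col) pixel coordinates:
    1 - 2|U ∩ V| / (|U| + |V|)."""
    u, v = set(true_region), set(pred_region)
    if not u and not v:
        return 0.0      # degenerate case: two empty regions match
    return 1.0 - 2.0 * len(u & v) / (len(u) + len(v))
```

The loss is 0 for identical regions, 1 for disjoint regions, and decreases smoothly with overlap, which is what makes it suitable for region-based text losses.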
In this embodiment, the feature weights corresponding to the synthetic text normalization features are calculated through a Taylor expansion and then sorted to obtain the second sorting result, and the synthetic text loss information is calculated according to the second sorting result. The importance of different synthetic text batch normalization features can thus be evaluated based on the Taylor expansion criterion, which in turn influences the weights used in calculating the synthetic text loss information, improving both the training speed and the accuracy of the text adaptive model.
In some embodiments, as shown in fig. 8, ranking the synthesized text normalized features based on the second feature weight calculation result to obtain a second ranking result, including:
and step 802, ranking each synthesized text normalized feature based on the second feature weight calculation result to obtain an intermediate ranking result.
Wherein the intermediate ranking result may be a ranking result obtained by reordering the synthesized text normalized features using the second feature weight calculation result.
Specifically, aiming at the synthesized text normalization features, the feature weight of each synthesized text normalization feature is calculated by using Taylor expansion, so that the weight value corresponding to each synthesized text normalization feature can be obtained, and the synthesized text normalization features are sorted according to a preset sorting method to obtain a middle sorting result.
And 804, performing weighted calculation on the intermediate sorting result and the first sorting result to obtain a second sorting result.
Specifically, the intermediate ranking result is multiplied by the first ranking result of the real text normalization features, so that the first ranking result influences the feature weight ranking obtained by the Taylor expansion calculation for the synthesized text, yielding a second ranking result corresponding to each synthesized text normalization feature, as shown in the right drawing of fig. 10.
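The weighted combination of step 804 can be sketched as follows. The element-wise product and the one-to-one channel alignment between the two weight lists are assumptions about the combination described above, not the patent's exact procedure.

```python
def weighted_ranking(synth_weights, real_weights):
    """Multiply the synthetic-text feature weights (intermediate result)
    element-wise by the real-text feature weights (first result), then
    re-rank channels by the combined weight."""
    products = [s * r for s, r in zip(synth_weights, real_weights)]
    # Channel indices sorted by descending combined weight (second ranking).
    order = sorted(range(len(products)),
                   key=lambda i: products[i], reverse=True)
    return products, order
```

A channel that is strong only in the synthetic text but weak in the real text is demoted by the product, which is how the first ranking result pulls the second ranking toward the real-text-active features.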
In this embodiment, a ranking result obtained by ranking the normalized features of the synthesized text by using the second feature weight calculation result is combined with the weight influence of the first ranking result to generate a second ranking result having similar properties to the first ranking result, so that the difference between the synthesized text and the real text can be reduced, and the accuracy of model training can be improved.
In some embodiments, results are shown in fig. 12, where "dice", "ent", and "Ada" represent the dice loss of the synthetic text, the entropy loss, and model adaptation, respectively, and "P", "R", and "F" denote precision, recall, and F-measure, respectively. The SynthText 800k pre-trained model is used as the baseline, which can be obtained from the official DBNet source code (note that the baseline results do not access any real data). The pre-trained model is then fine-tuned on real data, covering the entropy, dice, and adaptive variants.
As can be seen from fig. 12, the proposed adaptation method is a great improvement over the baseline, i.e., no adaptation. The entropy loss, dice loss, and adaptive components all contribute to the overall performance. Without synthetic text, training with the entropy loss alone is an important benchmark, and real text data alone already improves performance. On the other hand, optimizing the synthetic text loss, i.e., the dice loss, also improves performance without any real text loss, which demonstrates the effectiveness of the online synthetic text generation strategy. It can also be noted from fig. 12 that the real text loss (i.e., the entropy loss) and the synthetic text loss (i.e., the dice loss) are complementary, and performance can be further improved by an affine combination of the two losses.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in these flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides a self-adaptive fitting model training device. The implementation scheme for solving the problem provided by the device is similar to the implementation scheme recorded in the method, so the specific limitations in the following embodiment of the adaptive fitting model training device can be referred to the limitations of the adaptive fitting model training method in the above, and details are not repeated here.
In some embodiments, as shown in fig. 13, there is provided an adaptive fitting model training apparatus including:
an obtaining module 1302, configured to obtain scene text images for training; the scene text image for training is obtained by filling the synthetic text into at least one preset scene text image; the synthetic text is obtained by performing style adjustment on at least one training text; presetting a scene text image as a scene image comprising at least one real text;
the processing module 1304 is configured to perform batch normalization processing on the synthetic texts and the real texts in the scene text images for training respectively through a text adaptive model to be trained to obtain synthetic text normalization features corresponding to at least two synthetic texts and real text normalization features corresponding to at least two real texts;
the sorting module 1306 is configured to perform feature weight sorting on each real text normalization feature to obtain a first sorting result, and determine real text loss information corresponding to the real text according to the first sorting result; performing feature weight sorting on each synthesized text normalization feature to obtain a second sorting result, and determining synthesized text loss information corresponding to the synthesized text according to the second sorting result;
a training module 1308, configured to adjust a model parameter of the text adaptive model to be trained until a preset training end condition is met according to the real text loss information and the synthesized text loss information, so as to obtain a trained text adaptive model; the trained text self-adaptive model is used for generating a target text meeting the preset text effect.
In some embodiments, in terms of performing batch normalization processing on the synthesized text and the real text in the scene text image for training respectively to obtain synthesized text normalized features corresponding to at least two synthesized texts and real text normalized features corresponding to at least two real texts, the processing module 1304 is specifically configured to:
respectively extracting the characteristics of a synthetic text and a real text in a scene text image for training through a text self-adaptive model to be trained to obtain a synthetic text extraction characteristic value corresponding to the synthetic text and a real text extraction characteristic value corresponding to the real text;
and respectively carrying out batch normalization processing on the extracted feature values of the synthetic texts and the extracted feature values of the real texts to obtain the normalized features of the synthetic texts corresponding to at least two synthetic texts and the normalized features of the real texts corresponding to at least two real texts.
In some embodiments, in terms of performing batch normalization on the extracted feature values of the synthesized text and the extracted feature values of the real text respectively to obtain normalized features of the synthesized text corresponding to the at least two synthesized texts and normalized features of the real text corresponding to the at least two real texts, the processing module 1304 is specifically configured to:
respectively carrying out characteristic value conversion on the extracted characteristic value of the synthetic text and the extracted characteristic value of the real text to obtain a converted characteristic value of the synthetic text corresponding to the extracted characteristic value of the synthetic text and a converted characteristic value of the real text corresponding to the extracted characteristic value of the real text;
and respectively carrying out batch normalization calculation on the synthetic text conversion characteristic values and the real text conversion characteristic values to obtain synthetic text normalization characteristics corresponding to at least two synthetic texts and real text normalization characteristics corresponding to at least two real texts.
In some embodiments, in terms of performing batch normalization calculation on the synthesized text conversion feature values and the real text conversion feature values respectively to obtain synthesized text normalization features corresponding to at least two synthesized texts and real text normalization features corresponding to at least two real texts, the processing module 1304 is specifically configured to:
respectively carrying out batch mean calculation on the synthetic text conversion characteristic value and the real text conversion characteristic value to obtain a synthetic text batch mean corresponding to the synthetic text conversion characteristic value and a real text batch mean corresponding to the real text conversion characteristic value;
respectively carrying out variance calculation on the synthetic text conversion characteristic value and the real text conversion characteristic value to obtain a synthetic text variance corresponding to the synthetic text conversion characteristic value and a real text variance corresponding to the real text conversion characteristic value;
performing batch normalization calculation on the batch mean value and the variance of the synthesized text to obtain normalized characteristics of the synthesized text corresponding to at least two synthesized texts; and carrying out batch normalization calculation on the real text batch mean value and the real text variance to obtain real text normalization characteristics corresponding to at least two real texts.
In some embodiments, in the aspect of performing feature weight sorting on each real text normalized feature to obtain a first sorting result, and determining real text loss information corresponding to a real text according to the first sorting result, the sorting module 1306 is specifically configured to:
calculating the feature weight corresponding to each real text normalized feature according to Taylor expansion to obtain a first feature weight calculation result, and sequencing each real text normalized feature based on the first feature weight calculation result to obtain a first sequencing result;
and determining real text loss information corresponding to the real text according to the first sequencing result, the probability of any pixel of the real text in the training scene text image and the height and width corresponding to the training scene text image.
In some embodiments, in the aspect of performing feature weight sorting on each normalized feature of the synthesized text to obtain a second sorting result, and determining loss information of the synthesized text corresponding to the synthesized text according to the second sorting result, the sorting module 1306 is specifically configured to:
calculating the feature weight corresponding to each synthesized text normalized feature according to Taylor expansion to obtain a second feature weight calculation result, and sequencing each synthesized text normalized feature based on the second feature weight calculation result to obtain a second sequencing result;
and determining synthetic text loss information corresponding to the synthetic text according to the second sequencing result, the region corresponding to the scene text image for training and the region corresponding to the synthetic text.
In some embodiments, in terms of ranking each synthesized text normalized feature based on the second feature weight calculation result to obtain a second ranking result, the ranking module 1306 is specifically configured to:
ranking each synthesized text normalized feature based on the second feature weight calculation result to obtain a middle ranking result;
and performing weighted calculation on the intermediate sorting result and the first sorting result to obtain a second sorting result.
The modules in the adaptive fitting model training device can be wholly or partially realized by software, hardware and a combination thereof. The modules may be embedded in hardware or independent of a processor in the computer device, or may be stored in a memory in the computer device in software, so that the processor calls and executes operations corresponding to the modules.
In some embodiments, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 14. The computer device includes a processor, a memory, an Input/Output (I/O) interface, and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operating system and the computer program to run on the non-volatile storage medium. The database of the computer device is used for storing server data. The communication interface of the computer device is used for connecting and communicating with an external terminal through a network. The computer program is executed by a processor to implement the steps in the above-described adaptive fitting model training method.
Those skilled in the art will appreciate that the architecture shown in fig. 14 is merely a block diagram of some of the structures relevant to the present disclosure and does not limit the computer devices to which the present disclosure applies; a particular computer device may include more or fewer components than shown, combine certain components, or arrange the components differently.
In some embodiments, there is further provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the above method embodiments when executing the computer program.
In some embodiments, a computer-readable storage medium is provided, whose structure is shown in fig. 15, storing a computer program which, when executed by a processor, implements the steps of the above method embodiments.
In some embodiments, a computer program product is provided, comprising a computer program which, when executed by a processor, carries out the steps in the above-described method embodiments.
It should be noted that the user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) referred to in the present application are information and data authorized by the user or fully authorized by all parties, and the collection, use, and processing of such data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, databases, or other media used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of relational and non-relational databases. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, and the like.
For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; nevertheless, as long as there is no contradiction among them, such combinations should be considered within the scope of the present disclosure.
The above embodiments express only several implementations of the present application and are described in relative detail, but they are not to be construed as limiting the scope of the present application. For a person skilled in the art, several variations and improvements can be made without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A method for training an adaptive fitting model, comprising:
acquiring a scene text image for training; the scene text image for training is obtained by filling a synthetic text into at least one preset scene text image; the synthetic text is obtained by performing style adjustment on at least one training text; the preset scene text image is a scene image comprising at least one real text;
performing, by a text adaptive model to be trained, batch normalization processing on the synthetic texts and the real texts in the scene text image for training, respectively, to obtain synthetic text normalization features corresponding to at least two synthetic texts and real text normalization features corresponding to at least two real texts;
performing feature weight ranking on each real text normalization feature to obtain a first ranking result, and determining real text loss information corresponding to the real texts according to the first ranking result; performing feature weight ranking on each synthetic text normalization feature to obtain a second ranking result, and determining synthetic text loss information corresponding to the synthetic texts according to the second ranking result;
and adjusting, according to the real text loss information and the synthetic text loss information, model parameters of the text adaptive model to be trained until a preset training end condition is met, to obtain a trained text adaptive model; the trained text adaptive model is used for generating a target text meeting a preset text effect.
2. The method according to claim 1, wherein the performing, by the text adaptive model to be trained, batch normalization processing on the synthetic texts and the real texts in the scene text image for training, respectively, to obtain the synthetic text normalization features corresponding to the at least two synthetic texts and the real text normalization features corresponding to the at least two real texts comprises:
performing, by the text adaptive model to be trained, feature extraction on the synthetic texts and the real texts in the scene text image for training, respectively, to obtain a synthetic text extraction feature value corresponding to the synthetic text and a real text extraction feature value corresponding to the real text;
and performing batch normalization processing on the synthetic text extraction feature values and the real text extraction feature values, respectively, to obtain the synthetic text normalization features corresponding to the at least two synthetic texts and the real text normalization features corresponding to the at least two real texts.
3. The method according to claim 2, wherein the performing batch normalization processing on the synthetic text extraction feature values and the real text extraction feature values, respectively, to obtain the synthetic text normalization features corresponding to the at least two synthetic texts and the real text normalization features corresponding to the at least two real texts comprises:
performing feature value conversion on the synthetic text extraction feature value and the real text extraction feature value, respectively, to obtain a synthetic text conversion feature value corresponding to the synthetic text extraction feature value and a real text conversion feature value corresponding to the real text extraction feature value;
and performing batch normalization calculation on the synthetic text conversion feature values and the real text conversion feature values, respectively, to obtain the synthetic text normalization features corresponding to the at least two synthetic texts and the real text normalization features corresponding to the at least two real texts.
4. The method according to claim 3, wherein the performing batch normalization calculation on the synthetic text conversion feature values and the real text conversion feature values, respectively, to obtain the synthetic text normalization features corresponding to the at least two synthetic texts and the real text normalization features corresponding to the at least two real texts comprises:
performing batch mean calculation on the synthetic text conversion feature value and the real text conversion feature value, respectively, to obtain a synthetic text batch mean corresponding to the synthetic text conversion feature value and a real text batch mean corresponding to the real text conversion feature value;
performing variance calculation on the synthetic text conversion feature value and the real text conversion feature value, respectively, to obtain a synthetic text variance corresponding to the synthetic text conversion feature value and a real text variance corresponding to the real text conversion feature value;
and performing batch normalization calculation on the synthetic text batch mean and the synthetic text variance to obtain the synthetic text normalization features corresponding to the at least two synthetic texts; and performing batch normalization calculation on the real text batch mean and the real text variance to obtain the real text normalization features corresponding to the at least two real texts.
5. The method according to claim 1, wherein the performing feature weight ranking on each real text normalization feature to obtain a first ranking result, and determining real text loss information corresponding to the real text according to the first ranking result comprises:
calculating, by Taylor expansion, the feature weight corresponding to each real text normalization feature to obtain a first feature weight calculation result, and ranking each real text normalization feature based on the first feature weight calculation result to obtain the first ranking result;
and determining the real text loss information corresponding to the real text according to the first ranking result, the probability of any pixel of the real text in the scene text image for training, and the height and width corresponding to the scene text image for training.
6. The method according to claim 1, wherein the performing feature weight ranking on each synthetic text normalization feature to obtain a second ranking result, and determining synthetic text loss information corresponding to the synthetic text according to the second ranking result comprises:
calculating, by Taylor expansion, the feature weight corresponding to each synthetic text normalization feature to obtain a second feature weight calculation result, and ranking each synthetic text normalization feature based on the second feature weight calculation result to obtain the second ranking result;
and determining the synthetic text loss information corresponding to the synthetic text according to the second ranking result, the region corresponding to the scene text image for training, and the region corresponding to the synthetic text.
7. The method according to claim 6, wherein the ranking each synthetic text normalization feature based on the second feature weight calculation result to obtain the second ranking result comprises:
ranking each synthetic text normalization feature based on the second feature weight calculation result to obtain an intermediate ranking result;
and performing weighted calculation on the intermediate ranking result and the first ranking result to obtain the second ranking result.
8. An adaptive fitting model training device, comprising:
the acquisition module is used for acquiring scene text images for training; the scene text image for training is obtained by filling a synthetic text into at least one preset scene text image; the synthetic text is obtained by performing style adjustment on at least one training text; the preset scene text image is a scene image comprising at least one real text;
the processing module is used for performing, by a text adaptive model to be trained, batch normalization processing on the synthetic texts and the real texts in the scene text image for training, respectively, to obtain synthetic text normalization features corresponding to at least two synthetic texts and real text normalization features corresponding to at least two real texts;
the ranking module is used for performing feature weight ranking on each real text normalization feature to obtain a first ranking result, and determining real text loss information corresponding to the real text according to the first ranking result; and performing feature weight ranking on each synthetic text normalization feature to obtain a second ranking result, and determining synthetic text loss information corresponding to the synthetic text according to the second ranking result;
and the training module is used for adjusting, according to the real text loss information and the synthetic text loss information, model parameters of the text adaptive model to be trained until a preset training end condition is met, to obtain a trained text adaptive model; the trained text adaptive model is used for generating a target text meeting a preset text effect.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
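The batch normalization recited in claims 3 and 4 follows the standard batch-norm recipe: compute a batch mean, compute a batch variance, then normalize each value, applied separately to the synthetic text and real text feature values. A minimal sketch follows; the function name, the scalar representation of the conversion feature values, and the epsilon stabilizer are assumptions made for illustration and do not appear in the claims:

```python
import math


def batch_normalize(conversion_feature_values, eps=1e-5):
    """Batch mean, then batch variance, then per-value normalization,
    mirroring the three calculations recited in claim 4.

    `eps` is an assumed numerical stabilizer; the claims do not specify one.
    """
    n = len(conversion_feature_values)
    mean = sum(conversion_feature_values) / n                       # batch mean
    var = sum((v - mean) ** 2 for v in conversion_feature_values) / n  # variance
    return [(v - mean) / math.sqrt(var + eps)
            for v in conversion_feature_values]


# Applied separately to the two text populations, as the claims require:
synthetic_text_features = batch_normalize([1.0, 2.0, 3.0])
real_text_features = batch_normalize([10.0, 20.0, 30.0])
```

Normalizing the two populations with their own batch statistics, rather than shared ones, is what lets the adaptive model compare synthetic and real text features on a common scale despite their different value ranges.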
CN202211135932.8A 2022-09-19 2022-09-19 Adaptive fitting model training method and device, computer equipment and storage medium Pending CN115482544A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211135932.8A CN115482544A (en) 2022-09-19 2022-09-19 Adaptive fitting model training method and device, computer equipment and storage medium


Publications (1)

Publication Number Publication Date
CN115482544A (en)

Family

ID=84393012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211135932.8A Pending CN115482544A (en) 2022-09-19 2022-09-19 Adaptive fitting model training method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115482544A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination