CN113516125B - Model training method, using method, device, equipment and storage medium

Model training method, using method, device, equipment and storage medium

Info

Publication number
CN113516125B
CN113516125B
Authority
CN
China
Prior art keywords
image
generated
model
training
text
Prior art date
Legal status
Active
Application number
CN202110700875.2A
Other languages
Chinese (zh)
Other versions
CN113516125A
Inventor
李盼盼
秦勇
Current Assignee
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd filed Critical Beijing Century TAL Education Technology Co Ltd
Priority to CN202110700875.2A priority Critical patent/CN113516125B/en
Publication of CN113516125A publication Critical patent/CN113516125A/en
Application granted granted Critical
Publication of CN113516125B publication Critical patent/CN113516125B/en

Classifications

    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N3/044 Neural network architectures: recurrent networks, e.g. Hopfield networks
    • G06N3/045 Neural network architectures: combinations of networks
    • G06N3/08 Neural networks: learning methods

Abstract

The application provides a model training method, a model using method, an apparatus, a device, and a storage medium. A specific implementation is as follows: acquiring at least two groups of generated images produced by a generative model, wherein the first group of images comprises at least one first generated image, the first generated image being an abnormal text image with interference information generated from a real normal text image, and the second group of images comprises at least one second generated image, the second generated image being a normal text image generated from a real abnormal text image with interference information; taking at least a first generated image contained in the first group of images and a second generated image contained in the second group of images as first training images; and training a preset model at least on the first training images to obtain a trained target model, the target model being capable of at least performing text recognition on text images with interference information.

Description

Model training method, using method, device, equipment and storage medium
Technical Field
The present application relates to image processing technologies, and in particular to a model training method, a model using method, an apparatus, a device, and a storage medium.
Background
With the continuous development of computing technology and artificial intelligence, artificial intelligence has gradually been applied in education and teaching scenarios. In current education and teaching scenarios, students' mathematics homework and examination papers are still mostly reviewed manually, which places a heavy grading burden on parents and teachers. In response, various methods and systems for automatic question judging and automatic paper scoring have been deployed in some large-scale education scenarios. However, for various reasons such as writing habits or shooting conditions, the captured text images can exhibit problems such as bleed-through (content written on one side of a page showing through to the other side), which reduces the robustness of the models.
Disclosure of Invention
The embodiments of the application provide a model training method, a model using method, an apparatus, a device, and a storage medium to solve the problems in the related art. The technical solution is as follows:
in a first aspect, an embodiment of the present application provides a model training method, including:
acquiring at least two groups of generated images produced by a generative model, wherein the first group of images comprises at least one first generated image, the first generated image being an abnormal text image with interference information generated from a real normal text image; the second group of images comprises at least one second generated image, the second generated image being a normal text image generated from a real abnormal text image with interference information;
using at least a first generated image contained in a first group of images generated by the generative model and a second generated image contained in a second group of images as a first training image;
and training a preset model at least based on the first training image to obtain a trained target model, wherein the target model can at least perform text recognition on the text image with the interference information.
In a second aspect, an embodiment of the present application provides a model using method, including:
acquiring an image to be processed, wherein the image to be processed is an abnormal text image with interference information or a normal text image;
inputting the image to be processed into a target model, wherein the target model is obtained by the model training method;
and obtaining a recognition result, wherein the recognition result represents the text content of the image to be processed.
In a third aspect, an embodiment of the present application provides a model training apparatus, including:
a generated image acquisition unit, configured to acquire at least two groups of generated images produced by a generative model, where the first group of images includes at least one first generated image, the first generated image being an abnormal text image with interference information generated from a real normal text image; the second group of images includes at least one second generated image, the second generated image being a normal text image generated from a real abnormal text image with interference information;
a training image acquisition unit configured to take at least a first generated image included in a first group of images generated by the generative model and a second generated image included in a second group of images as a first training image;
and the model training unit is used for training a preset model at least based on the first training image to obtain a trained target model, wherein the target model at least can perform text recognition on a text image with interference information.
In a fourth aspect, an embodiment of the present application provides a model using apparatus, including:
the device comprises a to-be-processed image acquisition unit, a processing unit and a processing unit, wherein the to-be-processed image acquisition unit is used for acquiring a to-be-processed image, and the to-be-processed image is an abnormal text image with interference information or a normal text image;
the model processing unit is used for inputting the image to be processed into a target model to obtain an identification result, and the identification result represents the text content of the image to be processed;
the target model is obtained by the model training method.
In a fifth aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, the memory and the processor communicating with each other via an internal connection path. The memory is configured to store instructions, and the processor is configured to execute the instructions stored in the memory so as to perform the method of any one of the above aspects.
In a sixth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program; when the computer program runs on a computer, the method of any one of the above aspects is executed.
The advantages or beneficial effects of the above technical solution include at least the following: the trained target model can perform text recognition on text images with interference information, thereby improving the robustness of the model.
The foregoing summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present application will be readily apparent by reference to the drawings and following detailed description.
Drawings
In the drawings, like reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are therefore not to be considered limiting of its scope.
FIG. 1 is a schematic flow chart diagram of an implementation of a model training method according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart diagram of a method of using a model according to an embodiment of the present disclosure;
FIG. 3 is a flow chart illustrating an implementation of a model training method and a model using method in a specific example according to an embodiment of the disclosure;
FIG. 4 is a schematic diagram of a model training apparatus according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a model using apparatus according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of an electronic device for implementing a model training method or a model using method of embodiments of the present disclosure.
Detailed Description
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present application. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
Fig. 1 is a schematic flow chart of an implementation of a model training method according to an embodiment of the present disclosure. As shown in fig. 1, the method may include:
step S101: acquiring at least two groups of generated images generated by a generated model, wherein the first group of images comprises at least one first generated image which is an abnormal text image with interference information generated based on a real normal text image; the second group of images includes at least one second generated image which is a normal text image generated based on a real abnormal text image with interference information. That is, the first generated image and the second generated image are both images which are not normally acquired, but are output after the generating model processes the existing real normal text image or abnormal text image.
In the present application, the interference information may specifically refer to interference caused by uneven illumination, by the bleed-through phenomenon arising from writing, or by photocopying problems. Here, when both the front and back sides of the same sheet of paper are written on, the content on one side may affect the other; for example, excessive writing pressure may cause the front-side content to affect the back side, so that a mirror image of the front-side content appears on the back. This is referred to as the bleed-through phenomenon. The photocopying problem refers to a printed text image that does not meet the text definition or image definition requirement. Any of these situations can introduce interference information into an otherwise normal text image.
It should be noted that the above interference information is only exemplary; in practical applications, other problems may also make an image or its text content unclear, causing interference information to appear in a normal text image.
In the present application, a normal text image can be understood as a text image that meets the image definition or text definition requirement.
Step S102: at least a first generated image contained in the first set of images generated by the generative model and a second generated image contained in the second set of images are taken as first training images.
Step S103: and training a preset model at least based on the first training image to obtain a trained target model, wherein the target model can at least perform text recognition on the text image with the interference information.
In this way, the training images in the present solution include not only normal text images but also abnormal text images, which enriches the diversity of the training samples. The trained target model can therefore recognize both normal text images and text images with interference information, i.e., abnormal text images, improving the robustness of the target model.
Moreover, because the images used as training data are generated by the generative model, i.e., produced automatically, the present solution enlarges the number of samples and reduces the cost of acquiring training images, which in turn reduces the cost of training the model and improves training efficiency.
In addition, the use of a generative model makes it convenient to adjust the training images, laying a foundation for improving the robustness of the target model.
In a specific example, the preset model may include a text detection model for detecting a text position, and a text recognition model for recognizing a text content of a result (e.g., a text box containing the text content) output by the text detection model.
In a specific example of the present disclosure, the generative model includes at least a first generator and a second generator, wherein the first generator is configured to generate a first generated image of the first set of images; the second generator is to generate a second generated image of the second set of images.
Here, in a specific example, the generative model may specifically be a CycleGAN (Cycle-Consistent Generative Adversarial Network), which enables style transfer of an image, where style transfer can be roughly understood as changing the style of an image without changing its content. The principle is roughly as follows: a first model transforms X into an approximation of Y, and a second model transforms that approximation back into an approximation of X; symmetrically, the second model transforms Y into an approximation of X, and the first model transforms that back into an approximation of Y, thereby realizing the style change of the image. In practical applications, training can be accomplished using a cycle-consistency loss function. In this way, generated images are obtained from real images (such as abnormal and normal text images).
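To make the cycle idea concrete, a minimal PyTorch sketch of the generator-side loss follows; the least-squares adversarial term, the L1 cycle term, and the weight lam are standard CycleGAN choices assumed here for illustration, not details given in the text:

```python
import torch
import torch.nn.functional as F

def cycle_gan_generator_loss(G1, G2, D1, D2, real_x, real_y, lam=10.0):
    """One generator-side loss for a CycleGAN-style model (illustrative).

    G1: normal -> abnormal generator; G2: abnormal -> normal generator.
    D1/D2 score how real the outputs of G1/G2 look. real_x is a batch of
    real normal images, real_y a batch of real abnormal images.
    """
    fake_y = G1(real_x)                      # approximate abnormal image from X
    fake_x = G2(real_y)                      # approximate normal image from Y
    pred_y, pred_x = D1(fake_y), D2(fake_x)
    # Adversarial terms: generators try to make the discriminators output "real".
    adv = F.mse_loss(pred_y, torch.ones_like(pred_y)) + \
          F.mse_loss(pred_x, torch.ones_like(pred_x))
    # Cycle-consistency terms: X -> Y' -> X'' should reconstruct X, and vice versa.
    cyc = F.l1_loss(G2(fake_y), real_x) + F.l1_loss(G1(fake_x), real_y)
    return adv + lam * cyc
```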
In a specific example, the generators may use a cross-layer skip-connection structure similar to a U-Net. For example, the first generator includes 8 convolutional layers and 8 deconvolution layers, where each deconvolution layer continues the deconvolution operation after superposing its input with the skip-connected output of the corresponding convolutional layer; the second generator likewise includes 8 convolutional layers and 8 deconvolution layers with the same skip-connection structure. Of course, the numbers of convolutional and deconvolution layers above are only exemplary and may be adjusted as required in practical applications; the present application is not limited in this respect.
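A minimal PyTorch sketch of such a generator is given below; only the layer counts and the skip-connection pattern come from the text, while the channel widths, 4x4 stride-2 kernels, normalization, and activations are illustrative assumptions:

```python
import torch
import torch.nn as nn

class UNetGenerator(nn.Module):
    """8 conv (down) + 8 deconv (up) layers with skip connections.

    Input height and width must be divisible by 2**8 = 256 for the
    shapes to line up; all hyperparameters here are assumptions.
    """
    def __init__(self, in_ch=3, out_ch=3):
        super().__init__()
        widths = [64, 128, 256, 512, 512, 512, 512, 512]
        downs, prev = [], in_ch
        for w in widths:
            downs.append(nn.Sequential(
                nn.Conv2d(prev, w, 4, stride=2, padding=1),
                nn.InstanceNorm2d(w), nn.LeakyReLU(0.2)))
            prev = w
        self.downs = nn.ModuleList(downs)
        ups = []
        for i, w in enumerate(reversed(widths[:-1])):  # 7 inner up layers
            in_c = prev if i == 0 else prev * 2        # skip concat doubles channels
            ups.append(nn.Sequential(
                nn.ConvTranspose2d(in_c, w, 4, stride=2, padding=1),
                nn.InstanceNorm2d(w), nn.ReLU()))
            prev = w
        self.ups = nn.ModuleList(ups)
        self.final = nn.ConvTranspose2d(prev * 2, out_ch, 4, stride=2, padding=1)

    def forward(self, x):
        skips = []
        for down in self.downs:
            x = down(x)
            skips.append(x)
        skips = skips[:-1][::-1]                 # all but the bottleneck, deepest first
        for up, skip in zip(self.ups, skips):
            x = up(x)
            x = torch.cat([x, skip], dim=1)      # cross-layer skip superposition
        return torch.tanh(self.final(x))
```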
In a specific example of the present solution, the generative model further includes at least a first discriminator, where the first discriminator is configured to discriminate the first generated images produced by the first generator, so as to improve the realism of the generated abnormal text images.
Or, in another specific example, the generative model further includes a second discriminator, where the second discriminator is configured to discriminate the second generated images produced by the second generator, so as to improve the realism of the generated normal text images.
Or, in yet another specific example, the generative model further includes at least a first discriminator and a second discriminator, where the first discriminator is configured to discriminate the first generated images produced by the first generator to improve the realism of the generated abnormal text images, and the second discriminator is configured to discriminate the second generated images produced by the second generator to improve the realism of the generated normal text images.
For example, in one example, the first discriminator may include 5 convolutional layers and 2 fully connected layers, with the second fully connected layer having a single node. Further, the first generator in the generative model is configured to generate, from a collected normal text image, an image with bleed-through or photocopy artifacts, referred to in this example as a generated image; accordingly, the first discriminator is configured to judge whether the generated image produced by the first generator is real or fake. Here, the input of the first discriminator includes the generated image produced by the first generator together with the real normal text image from which it was generated, which facilitates the real/fake discrimination.
In another example, the second discriminator may likewise include 5 convolutional layers and 2 fully connected layers, with the second fully connected layer having a single node. Further, the second generator in the generative model is responsible for generating a normal image from a collected abnormal text image with bleed-through or photocopy artifacts, referred to in this example as a generated image; the second discriminator is used to judge whether the generated image produced by the second generator is real or fake. Here, the input of the second discriminator includes the generated image produced by the second generator together with a real normal text image, which facilitates the discrimination.
Of course, the numbers of convolutional layers, fully connected layers, and nodes in the above discriminators, such as the first and second discriminators, are merely exemplary; in practical applications they may be adjusted as required, and the present application is not limited in this respect.
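A sketch of a discriminator matching this description follows, assuming 4x4 stride-2 convolutions, LeakyReLU activations, and a fixed input size so that the fully connected input dimension is known; none of these specifics come from the text:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """5 convolutional layers + 2 fully connected layers; the second FC
    layer has a single output node, as described. Channel widths, strides,
    and the fixed input size are illustrative assumptions."""
    def __init__(self, in_ch=3, input_hw=(256, 256)):
        super().__init__()
        chs = [64, 128, 256, 512, 512]
        layers, prev = [], in_ch
        for c in chs:
            layers += [nn.Conv2d(prev, c, 4, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]
            prev = c
        self.conv = nn.Sequential(*layers)
        h, w = input_hw[0] // 2**5, input_hw[1] // 2**5
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(prev * h * w, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1))                  # single node: real/fake score

    def forward(self, x):
        return self.fc(self.conv(x))
```

If the generated image and its source real image are fed jointly, as the text suggests, one plausible reading is to concatenate them along the channel axis and set in_ch=6; that pairing scheme is also an assumption.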
In this way, multiple rounds of training can bring the generators and discriminators into adversarial (game) equilibrium, i.e., the first generator with the first discriminator and the second generator with the second discriminator. At that point the generators (the first and second generators) perform so well that the discriminators can no longer reliably tell whether their generated images are real or fake. This lays the foundation for using the generated images as training data for subsequent text detection and text recognition, improving the efficiency of training-data acquisition while reducing labor cost.
In a specific example of the present disclosure, the taking at least a first generated image included in a first group of images generated by the generative model and a second generated image included in a second group of images as a first training image includes:
performing image concatenation on the first generated image and the real normal text image from which it was generated to obtain a first tandem image, and taking the first tandem image as a first training image; and/or,
performing image concatenation on the second generated image and the real abnormal text image from which it was generated to obtain a second tandem image, and taking the second tandem image as a first training image.
That is, one part of each tandem image is a real image, and the other part is a generated image produced from that real image. For example, the tandem image may be completed by concatenation, so that part of it is the real image and the other part is the image generated from it. This lays a foundation for subsequently improving the robustness of the target model.
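A minimal sketch of this tandem-image construction follows; the concatenation axis is not specified in the text, so width-wise concatenation is assumed here (channel-wise would be an equally plausible reading):

```python
import torch

def make_tandem_image(real_img, generated_img, dim=3):
    """Concatenate a real image with the image generated from it.

    Both tensors are (N, C, H, W); dim=3 concatenates along the width,
    which is an assumption, not a detail from the text.
    """
    assert real_img.shape == generated_img.shape
    return torch.cat([real_img, generated_img], dim=dim)

# Example: a first tandem image from a real normal image and the abnormal
# image that the first generator produced from it (shapes illustrative).
real_normal = torch.rand(1, 3, 256, 256)
fake_abnormal = torch.rand(1, 3, 256, 256)          # stand-in for G1(real_normal)
tandem = make_tandem_image(real_normal, fake_abnormal)  # (1, 3, 256, 512)
```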
In a specific example of the present disclosure, training the preset model based on at least the first training image includes: inputting the first training image into a text detection model contained in the preset model to obtain a detection result, and inputting the detection result into a text recognition model contained in the preset model, so as to train both the text detection model and the text recognition model contained in the preset model. That is, in this example, the preset model includes a text detection model for detecting text positions and a text recognition model for recognizing the text content of the detection model's output (e.g., a text box containing text content). For example, the text recognition model is built on CRNN and the text detection model on CenterNet; the CRNN-based recognition model is a single-line text recognizer whose input depends on the detection result, i.e., the output, of the CenterNet-based detection model. In other words, the input of the text detection model is the tandem images obtained above, and its output is the input of the text recognition model trained in the first stage. This lays a foundation for subsequently improving the robustness of the target model.
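The detection-to-recognition handoff could be sketched as follows; the box format, the cropping scheme, and the model call signatures are assumptions for illustration, since the text only states that the recognizer consumes the detector's output:

```python
def recognize_page(detector, recognizer, image):
    """Two-stage pipeline: the detector's output boxes become the
    recognizer's inputs. All interfaces here are illustrative."""
    boxes = detector(image)                       # e.g. list of (x0, y0, x1, y1)
    results = []
    for (x0, y0, x1, y1) in boxes:
        crop = image[..., int(y0):int(y1), int(x0):int(x1)]  # one text line
        results.append(recognizer(crop))
    return results
```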
In a specific example of the present solution, to enrich sample diversity, a real abnormal text image with interference information and the corresponding real normal text image may also be acquired, and the real abnormal and normal text images used as second training images to train the preset model. This improves the recognition accuracy and robustness of the target model.
In a specific example of the present application, the first training images and the second training images are used in different training phases: the first training images are used in one training phase, training until the preset model converges; the second training images are used in the next phase, serving as input to the preset model after the previous phase has converged. For example, in one stage the preset model, e.g., the text detection model and the text recognition model, is trained on the first training images until convergence, yielding detection and recognition models able to handle normal text images as well as abnormal text images with bleed-through, photocopy artifacts, and the like. In the next stage, normal and abnormal text images from real scenes can be used to continue training the models obtained in the previous stage, further adjusting or correcting the trained parameters; this further improves the robustness of the text detection and recognition models and lays a foundation for better user experience.
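A schematic sketch of this staged schedule follows; the loader, per-epoch training, and convergence-test interfaces are illustrative assumptions:

```python
def staged_training(model, loaders, train_epoch, has_converged):
    """Train through successive stages, each starting from the parameters
    left by the previous stage. `loaders` is ordered, e.g.
    [generated_image_loader, real_image_loader]; all interfaces are
    assumptions, not the original implementation."""
    for loader in loaders:
        epoch = 0
        while True:
            loss = train_epoch(model, loader)   # one pass over this stage's data
            epoch += 1
            if has_converged(loss, epoch):      # e.g. loss plateau or epoch cap
                break
    return model
```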
The application scheme further provides a model using method, as shown in fig. 2, including:
step S201: acquiring an image to be processed, wherein the image to be processed is an abnormal text image with interference information or a normal text image;
step S202: inputting the image to be processed into a target model, wherein the target model is obtained by the model training method;
step S203: and obtaining a recognition result, wherein the recognition result represents the text content of the image to be processed.
In this way, the target model trained by the present solution can recognize not only normal text images but also text images with interference information, i.e., abnormal text images, which improves the robustness of the target model, enhances user experience, and enriches usage scenarios.
The scheme of the application is further described in detail with specific examples. In current intelligent question-judging scenarios, for example primary-school mathematics exercises, the captured text images may exhibit bleed-through (when both sides of a page are written on, the content on one side may affect the other; for instance, heavy writing pressure can make a mirror image of the front-side content appear on the back, a phenomenon called bleed-through), uneven illumination, and other problems, owing to writing habits, shooting conditions, and other factors. These problems severely degrade the robustness of the text detection and text recognition models and make their performance hard to guarantee, which in turn severely reduces the question-judging accuracy of intelligent applications such as photo-based question judging and degrades their user experience. The present solution therefore uses a generative model to construct and adjust the training data, strengthening the training process so that the text detection and recognition models are more robust and better serve intelligent applications such as photo-based question judging for primary-school mathematics.
Further, the models adopted in the present solution are first briefly introduced:
the scheme of the application adopts CenterNet and Convolution Recurrent Neural Networks (CRNN); here, the CenterNet is a method of Anchor-free for general purpose target detection, which can be regarded as a regression-based method, and the general idea includes: setting the overall class N of an object to be predicted (such as an image to be predicted), wherein the number of output channels is N +2+ 2; here, the centret predicts the center point of the object, and outputs a score map (the value of each pixel in the score map is between 0 and 1, and represents the probability that the pixel belongs to the center point of the object of a certain class) for each class, thereby obtaining N score maps. In practical application, in the prediction process, it cannot be guaranteed that the predicted central point is a real central point and usually deviates, based on which, two channels are used for predicting the deviation amount of the central point (for example, one channel is used for predicting the deviation amount of an x axis, and the other channel is used for predicting the deviation amount of a y axis), one of the remaining two channels is used for predicting the width corresponding to the central point, and can also be understood as being used for predicting the distance from the central point to the left frame of the text box, and the other channel is used for predicting the height corresponding to the central point, and can also be understood as being used for predicting the distance from the central point to the upper frame of the text box, so that the height and the width corresponding to the central point can predict the text box; and then finding a possible central point (namely a suspected central point) of the object in the score map by setting a threshold, correcting the possible central point according to the xy offset corresponding to the central point, and then directly obtaining a text box, such as a rectangular box, through the central point and by combining the predicted width and height.
The CRNN consists, from bottom to top, of a convolutional neural network, a recurrent neural network, and a transcription layer. The convolutional network extracts features from images containing characters; the recurrent network performs sequence prediction on the features extracted by the convolutional network; and the transcription layer translates the sequence obtained by the recurrent network into a character sequence. The objective function may be the Connectionist Temporal Classification (CTC) loss. In practical applications, although the CRNN combines different types of network structures, it can still be trained end to end, and its performance on various datasets is superior to that of other models.
The present solution also adopts CycleGAN (Cycle-Consistent Generative Adversarial Networks), an important generative model that enables style transfer of an image, i.e., converting the style of an image without changing its content. The principle is roughly as follows: a first model transforms X into an approximation of Y, and a second model transforms that approximation back into an approximation of X; symmetrically, the second model transforms Y into an approximation of X, and the first model transforms that back into an approximation of Y, thereby realizing the style change of the image. In practical applications, training can be accomplished using a cycle-consistency loss function.
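For reference, a standard way to write these losses (following the original CycleGAN formulation, with G1: X to Y, G2: Y to X, and a weighting hyperparameter lambda; this notation is not given in the patent text) is:

```latex
\mathcal{L}(G_1,G_2,D_1,D_2)
  = \mathcal{L}_{\mathrm{GAN}}(G_1,D_1,X,Y)
  + \mathcal{L}_{\mathrm{GAN}}(G_2,D_2,Y,X)
  + \lambda\,\mathcal{L}_{\mathrm{cyc}}(G_1,G_2),
\quad
\mathcal{L}_{\mathrm{cyc}}(G_1,G_2)
  = \mathbb{E}_{x}\!\left[\lVert G_2(G_1(x)) - x \rVert_1\right]
  + \mathbb{E}_{y}\!\left[\lVert G_1(G_2(y)) - y \rVert_1\right]
```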
Next, the scheme of the present application is explained in detail, and as shown in fig. 3, the specific technical scheme is as follows:
the first step is as follows: collecting a large number of text images, wherein the text images include abnormal text images, such as text images with a back-penetration phenomenon, text images with uneven illumination, and text images with a photocopy, and normal text images corresponding to the abnormal text images. Here, the photocopied text image may specifically refer to a printed text image that does not meet the text definition requirement or the image definition requirement, in other words, may be a printed text image that is not clear.
The second step: construct the CycleGAN model, for example with two generators and two discriminators, denoted generator G1, generator G2, discriminator D1, and discriminator D2.
In an example, each generator may use a cross-layer skip-connection structure similar to a U-Net, comprising 8 convolutional layers and 8 deconvolution layers, where each deconvolution layer continues the deconvolution operation after superposing its input with the skip-connected output of the corresponding convolutional layer.
In one example, each discriminator may include 5 convolutional layers and 2 fully connected layers, with the second fully connected layer having a single node.
In an example, the generator G1 in the CycleGAN model is used to generate, from a collected normal text image, an image with bleed-through or photocopy artifacts, referred to in this example as a generated image; accordingly, the discriminator D1 is used to judge whether the generated image produced by G1 is real or fake. Here, the input of D1 includes the generated image produced by G1 and the real normal text image from which it was generated, which facilitates the real/fake discrimination.
In an example, the generator G2 in the CycleGAN model is responsible for generating a normal image from a collected abnormal text image with bleed-through or photocopy artifacts, referred to in this example as a generated image; the discriminator D2 is used to judge whether the generated image produced by G2 is real or fake. Here, the input of D2 includes the generated image produced by G2 and a real normal text image, which facilitates the discrimination.
In this way, multiple rounds of training can bring the generators and discriminators into adversarial equilibrium, at which point the generators perform so well that the discriminators can no longer reliably tell whether the generated images are real or fake. This lays the foundation for using the generated images as training data for subsequent text detection and recognition, improving training-data acquisition efficiency while reducing labor cost.
The third step: construct a text detection model based on CenterNet (other models could also be used; this is not limiting), using ResNet18 as the backbone network. The ResNet18 network contains 4 serially connected blocks, each comprising several convolutional layers. The first block outputs a feature map at 1/4 the size of the input image, the second at 1/8, the third at 1/16, and the fourth at 1/32; as in the DB detection model, the number of feature maps output by each block is set to 128. The four groups of feature maps are resized by interpolation to 1/4 of the original image size and concatenated, giving one group of feature maps with 512 channels. One convolution operation and two deconvolution operations are then applied to the 512-channel feature maps to obtain a 6-channel (1+2+2+1) output matching the original image size: the first channel is the score map for text-box center points (each pixel value lies between 0 and 1 and represents the probability that the pixel is a text-box center point); the second and third channels output the x and y offsets of the center point; the fourth and fifth channels output the predicted width and height corresponding to the center point, from which the text box can be predicted; and the sixth channel outputs the rotation angle of the text box.
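A PyTorch sketch of this detection network follows; only the block/channel counts, the 1/4-scale fusion, and the 6-channel output come from the text, while the 1x1 reduction convolutions, bilinear interpolation, and head kernel sizes are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class CenterNetTextDetector(nn.Module):
    """ResNet18 backbone: four blocks give 1/4, 1/8, 1/16, 1/32 feature
    maps, each reduced to 128 channels, resized to 1/4 scale, and
    concatenated (512 channels); one convolution and two deconvolutions
    then produce a 6-channel, full-resolution output."""
    def __init__(self):
        super().__init__()
        r = resnet18(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.blocks = nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])
        self.reduce = nn.ModuleList([nn.Conv2d(c, 128, 1)
                                     for c in (64, 128, 256, 512)])
        self.conv = nn.Conv2d(512, 256, 3, padding=1)
        self.up1 = nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1)
        self.up2 = nn.ConvTranspose2d(128, 6, 4, stride=2, padding=1)

    def forward(self, x):
        feats, y = [], self.stem(x)              # stem output is at 1/4 scale
        for block, red in zip(self.blocks, self.reduce):
            y = block(y)
            feats.append(red(y))                 # 128 channels per block
        h, w = feats[0].shape[-2:]               # 1/4 of the input size
        feats = [F.interpolate(f, size=(h, w), mode='bilinear',
                               align_corners=False) for f in feats]
        y = torch.cat(feats, dim=1)              # 4 x 128 = 512 channels
        y = F.relu(self.conv(y))
        return self.up2(F.relu(self.up1(y)))     # 6 ch: score, x/y offsets, w/h, angle
```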
The fourth step: construct a text recognition model based on CRNN (other models may also be used; this example is not limiting), comprising three parts: a convolutional neural network, a recurrent neural network, and a transcription layer. The convolutional network contains several convolutional layers for extracting features of the input text image; the recurrent network uses two layers of bidirectional LSTM (long short-term memory) to model the temporal relationship between characters; and the final transcription layer uses a decoding algorithm to derive character strings from the probability matrix.
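A PyTorch sketch of such a recognizer follows; only the two-layer bidirectional LSTM and the CNN/RNN/transcription structure come from the text, while the fixed 32-pixel input height, channel widths, and pooling scheme are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Convolutional feature extractor, two bidirectional LSTM layers,
    and a linear transcription layer producing per-timestep class logits
    for CTC decoding."""
    def __init__(self, num_classes, in_ch=1):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1), (2, 1)),        # halve height only, keep width
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((4, 1), (4, 1)))        # collapse remaining height
        self.rnn = nn.LSTM(256, 256, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)    # transcription layer

    def forward(self, x):                        # x: (N, in_ch, 32, W)
        f = self.cnn(x)                          # (N, 256, 1, W/4)
        f = f.squeeze(2).permute(0, 2, 1)        # (N, W/4, 256) timestep sequence
        seq, _ = self.rnn(f)                     # (N, W/4, 512)
        return self.fc(seq)                      # per-timestep logits for CTC
```

Training would pair the log-softmax of these per-timestep logits with nn.CTCLoss, matching the CTC objective mentioned earlier.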
It should be noted that after the models are constructed in the above stages, the constructed models are trained.
The fifth step: train the models. The process can be divided into three stages. In the first stage, the constructed CycleGAN is trained with its corresponding loss function, the constructed text detection model is trained with the loss function corresponding to CenterNet, and the constructed text recognition model is trained with the loss function corresponding to CRNN. It should be noted that the text images used in the first-stage training of the constructed CycleGAN and the constructed text detection model are real normal text images.
The sixth step: after the models are trained to convergence, the second stage begins. A large number of normal text images and abnormal text images (i.e., fake images) with bleed-through, photocopy artifacts, and the like are generated with the CycleGAN generators. The text detection model from the first stage is then further trained on the generator-produced images; the initial parameters of this stage are those obtained by the first-stage training. The input is a tandem image formed by superposing a real normal text image with the image the generator produced from it, or a tandem image formed by superposing a real abnormal text image with bleed-through with the normal image the generator produced from it. That is, one part of each tandem image is a real image, and the other part is a generated image produced from that real image.
Similarly, the text recognition model from the first stage is further trained on the generator-produced images, with this stage's initial parameters being those obtained in the first stage. The input here parallels that of the detection model, i.e., the tandem images obtained above. It should be noted that the CRNN-based text recognition model is a single-line recognizer whose input depends on the detection result, i.e., the output, of the CenterNet-based text detection model: the input of the text detection model is the tandem images obtained above, and its output is the input of the text recognition model trained in the first stage.
The seventh step: after the text detection and recognition models in the sixth step are trained to convergence, the third stage begins, with initial parameters being those obtained by the second-stage training. As before, the output of the text detection model serves as the input of the text recognition model, and both models are trained until convergence, yielding a text detection model and a text recognition model that can handle normal text images as well as abnormal text images with bleed-through, photocopy artifacts, and the like. In specific application scenarios, the resulting models are highly robust. In the prediction stage, any text image to be detected or recognized is input, and the trained models produce the specific detection or recognition result.
It should be noted that the third stage uses normal and abnormal text images from real scenes to continue training the text detection and recognition models, adjusting the parameters obtained in the second stage; this further improves the robustness of both models and lays a foundation for better user experience.
Here, the above process completes the model training step, yielding the trained target text detection model and target text recognition model.
The eighth step: enter the prediction stage, i.e., the model using stage. Any text image to be detected or recognized is input, and the text detection model and text recognition model obtained by the above training produce the specific detection or recognition result, completing the text recognition process.
In this way, the text detection and recognition models obtained by the present solution can recognize any text image, whether a normal text image or an abnormal text image containing interference information, improving the robustness of the models.
The present application further provides a model training apparatus, as shown in fig. 4, including:
a generated image acquiring unit 401, configured to acquire at least two groups of generated images produced by a generative model, where the first group of images includes at least one first generated image, the first generated image being an abnormal text image with interference information generated from a real normal text image; the second group of images includes at least one second generated image, the second generated image being a normal text image generated from a real abnormal text image with interference information;
a training image acquisition unit 402 configured to take at least a first generated image included in the first group of images generated by the generative model and a second generated image included in the second group of images as a first training image;
a model training unit 403, configured to train a preset model based on at least the first training image to obtain a trained target model, where the target model is capable of performing text recognition on at least a text image with interference information.
In a specific example of the present disclosure, the generative model includes at least a first generator and a second generator, wherein the first generator is configured to generate a first generated image of the first set of images; the second generator is to generate a second generated image of the second set of images.
In a specific example of the present solution, the generative model further includes at least a first discriminator, where the first discriminator is configured to discriminate the first generated images produced by the first generator, so as to improve the realism of the generated abnormal text images; and/or,
the generative model further includes at least a second discriminator, where the second discriminator is configured to discriminate the second generated images produced by the second generator, so as to improve the realism of the generated normal text images.
In a specific example of the present solution, the training image acquisition unit is further configured to perform image concatenation on the first generated image and the real normal text image from which it was generated to obtain a first tandem image, and take the first tandem image as a first training image; and/or,
to perform image concatenation on the second generated image and the real abnormal text image from which it was generated to obtain a second tandem image, and take the second tandem image as a first training image.
In a specific example of the scheme of the application, the model training unit is further configured to input the first training image to a text detection model included in the preset model, and obtain a detection result; and inputting the detection result into a text recognition model contained in the preset model so as to train the text detection model and the text recognition model contained in the preset model.
In a specific example of the present application, wherein,
the training image acquisition unit is also used for acquiring a real abnormal text image with interference information and a real normal text image corresponding to the abnormal text image; taking the real abnormal text image and the real normal text image as a second training image;
the model training unit is further configured to train the preset model based on the second training image.
In a specific example of the present application, the first training image and the second training image are used in different training phases.
In a specific example of the present application, the first training image is used in one training phase and trained until the preset model converges; the second training image is used in the next training phase, as input to the preset model after the previous phase has converged.
The functions of each module in each apparatus in the embodiments of the present invention may refer to the corresponding description in the above method, and are not described herein again.
The present application further provides a model using apparatus, as shown in fig. 5, including:
a to-be-processed image obtaining unit 501, configured to obtain an image to be processed, where the image to be processed is an abnormal text image with interference information or a normal text image;
the model processing unit 502 is configured to input the image to be processed to a target model to obtain an identification result, where the identification result represents text content of the image to be processed;
here, the target model is a model obtained by the above-described model training method.
FIG. 6 is a block diagram of an electronic device for implementing a model training method or a model using method of embodiments of the present disclosure. As shown in fig. 6, the electronic apparatus includes: a memory 610 and a processor 620, the memory 610 having stored therein computer programs executable on the processor 620. The processor 620, when executing the computer program, implements the model training method or the model using method in the above-described embodiments. The number of the memory 610 and the processor 620 may be one or more.
The electronic device further includes:
the communication interface 630 is used for communicating with an external device to perform data interactive transmission.
If the memory 610, the processor 620 and the communication interface 630 are implemented independently, the memory 610, the processor 620 and the communication interface 630 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 6, but this is not intended to represent only one bus or type of bus.
Optionally, in an implementation, if the memory 610, the processor 620, and the communication interface 630 are integrated on a chip, the memory 610, the processor 620, and the communication interface 630 may complete communication with each other through an internal interface.
Embodiments of the present invention provide a computer-readable storage medium, which stores a computer program, and when the program is executed by a processor, the computer program implements the method provided in the embodiments of the present application.
The embodiment of the present application further provides a chip, where the chip includes a processor, and is configured to call and execute the instruction stored in the memory from the memory, so that the communication device in which the chip is installed executes the method provided in the embodiment of the present application.
An embodiment of the present application further provides a chip, including: the system comprises an input interface, an output interface, a processor and a memory, wherein the input interface, the output interface, the processor and the memory are connected through an internal connection path, the processor is used for executing codes in the memory, and when the codes are executed, the processor is used for executing the method provided by the embodiment of the application.
It should be understood that the processor may be a Central Processing Unit (CPU), other general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or any conventional processor or the like. It is noted that the processor may be a processor supporting the Advanced RISC Machines (ARM) architecture.
Further, optionally, the memory may include a read-only memory and a random access memory, and may further include a nonvolatile random access memory. The memory may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile memory may include a read-only memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an electrically Erasable EPROM (EEPROM), or a flash memory. Volatile memory can include Random Access Memory (RAM), which acts as external cache memory. By way of example, and not limitation, many forms of RAM are available. For example, Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), double data rate synchronous SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct memory bus RAM (DR RAM).
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the present application are generated in whole or in part when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process. And the scope of the preferred embodiments of the present application includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system. All or part of the steps of the methods of the above embodiments may also be carried out by hardware instructed by a program; the program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If implemented in the form of a software functional module and sold or used as an independent product, the integrated module may also be stored in a computer-readable storage medium. The storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto; any person skilled in the art can readily conceive of changes or substitutions within the technical scope disclosed by the present application, and these should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (18)

1. A method of model training, comprising:
acquiring at least two groups of generated images generated by a generative model, wherein a first group of images comprises at least one first generated image, the first generated image being an abnormal text image with interference information generated based on a real normal text image; and a second group of images comprises at least one second generated image, the second generated image being a normal text image generated based on a real abnormal text image with interference information;
using, as a first training image, at least the first generated image contained in the first group of images generated by the generative model and the second generated image contained in the second group of images;
training a preset model based at least on the first training image to obtain a trained target model, wherein the target model is capable of at least performing text recognition on text images with interference information;
wherein the generative model comprises at least a first generator and a second generator, the first generator being configured to generate the first generated image of the first group of images, and the second generator being configured to generate the second generated image of the second group of images;
the generative model further comprises at least a first discriminator and a second discriminator, the first generator being in game equilibrium with the first discriminator, and the second generator being in game equilibrium with the second discriminator; at game equilibrium, the first discriminator cannot determine whether a first generated image generated by the first generator is real or fake, and the second discriminator cannot determine whether a second generated image generated by the second generator is real or fake.
2. The method of claim 1, wherein the first discriminator is configured to discriminate the first generated image generated by the first generator, so as to improve the realism of the generated abnormal text image; and the second discriminator is configured to discriminate the second generated image generated by the second generator, so as to improve the realism of the generated normal text image.
3. The method according to claim 1 or 2, wherein the using, as the first training image, at least the first generated image contained in the first group of images generated by the generative model and the second generated image contained in the second group of images comprises:
concatenating the first generated image with the real normal text image from which the first generated image was generated to obtain a first concatenated image, and using the first concatenated image as the first training image; and/or
concatenating the second generated image with the real abnormal text image from which the second generated image was generated to obtain a second concatenated image, and using the second concatenated image as the first training image.
4. The method of claim 1, wherein the training a preset model based at least on the first training image comprises:
inputting the first training image into a text detection model contained in the preset model to obtain a detection result; and
inputting the detection result into a text recognition model contained in the preset model, so as to train the text detection model and the text recognition model contained in the preset model.
5. The method of claim 1, further comprising:
acquiring a real abnormal text image with interference information and a real normal text image corresponding to the abnormal text image;
and taking the real abnormal text image and the real normal text image as a second training image, and training the preset model based on the second training image.
6. The method of claim 5, wherein the first training image and the second training image are used in different training phases.
7. The method of claim 6, wherein the first training image is used in a first training phase, in which the preset model is trained to convergence; and the second training image is used in the next training phase, serving as the input of the preset model after the preceding training phase has converged.
8. A method of using a model, comprising:
acquiring an image to be processed, wherein the image to be processed is an abnormal text image with interference information or a normal text image;
inputting the image to be processed into a target model, wherein the target model is obtained by the model training method according to any one of claims 1 to 7;
and obtaining a recognition result, wherein the recognition result represents the text content of the image to be processed.
9. A model training apparatus, comprising:
a generated image acquisition unit, configured to acquire at least two groups of generated images generated by a generative model, wherein a first group of images comprises at least one first generated image, the first generated image being an abnormal text image with interference information generated based on a real normal text image; and a second group of images comprises at least one second generated image, the second generated image being a normal text image generated based on a real abnormal text image with interference information;
a training image acquisition unit, configured to use, as a first training image, at least the first generated image contained in the first group of images generated by the generative model and the second generated image contained in the second group of images; and
a model training unit, configured to train a preset model based at least on the first training image to obtain a trained target model, wherein the target model is capable of at least performing text recognition on text images with interference information;
wherein the generative model comprises at least a first generator and a second generator, the first generator being configured to generate the first generated image of the first group of images, and the second generator being configured to generate the second generated image of the second group of images;
the generative model further comprises at least a first discriminator and a second discriminator, the first generator being in game equilibrium with the first discriminator, and the second generator being in game equilibrium with the second discriminator; at game equilibrium, the first discriminator cannot determine whether a first generated image generated by the first generator is real or fake, and the second discriminator cannot determine whether a second generated image generated by the second generator is real or fake.
10. The apparatus of claim 9, wherein the first discriminator is configured to discriminate the first generated image generated by the first generator, so as to improve the realism of the generated abnormal text image; and the second discriminator is configured to discriminate the second generated image generated by the second generator, so as to improve the realism of the generated normal text image.
11. The apparatus according to claim 9 or 10, wherein the training image acquisition unit is further configured to concatenate the first generated image with the real normal text image from which the first generated image was generated to obtain a first concatenated image, and to use the first concatenated image as the first training image; and/or
to concatenate the second generated image with the real abnormal text image from which the second generated image was generated to obtain a second concatenated image, and to use the second concatenated image as the first training image.
12. The apparatus according to claim 9, wherein the model training unit is further configured to input the first training image into a text detection model contained in the preset model to obtain a detection result, and to input the detection result into a text recognition model contained in the preset model, so as to train the text detection model and the text recognition model contained in the preset model.
13. The apparatus of claim 9, wherein,
the training image acquisition unit is further configured to acquire a real abnormal text image with interference information and a real normal text image corresponding to the real abnormal text image, and to take the real abnormal text image and the real normal text image as a second training image; and
the model training unit is further configured to train the preset model based on the second training image.
14. The apparatus of claim 13, wherein the first training image and the second training image are used in different training phases.
15. The apparatus of claim 14, wherein the first training image is used in a first training phase, in which the preset model is trained to convergence; and the second training image is used in the next training phase, serving as the input of the preset model after the preceding training phase has converged.
16. An apparatus for using a model, comprising:
the device comprises a to-be-processed image acquisition unit, a processing unit and a processing unit, wherein the to-be-processed image acquisition unit is used for acquiring a to-be-processed image, and the to-be-processed image is an abnormal text image with interference information or a normal text image;
a model processing unit, configured to input the image to be processed into a target model and obtain a recognition result, wherein the recognition result represents the text content of the image to be processed;
wherein the target model is obtained by the model training method of any one of claims 1 to 7.
17. An electronic device, comprising a processor and a memory, wherein the memory stores instructions that are loaded and executed by the processor so as to implement the method of any one of claims 1 to 8.
18. A computer-readable storage medium having a computer program stored therein which, when executed by a processor, implements the method of any one of claims 1 to 8.
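
Read as an implementation recipe, claims 1 to 7 combine a two-generator, two-discriminator adversarial setup with staged training of a detection-plus-recognition model. The following Python (PyTorch) sketch is an illustrative reading only, not the patented implementation: the claims disclose no network internals, so the Generator and Discriminator bodies below are minimal placeholders, and every identifier (G_normal_to_abnormal, make_first_training_image, fit_to_convergence, and so on) is hypothetical.

import torch
import torch.nn as nn

class Generator(nn.Module):
    # Minimal image-to-image network standing in for an undisclosed generator;
    # assumes 3-channel text images in NCHW layout.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    # Emits one realism logit per image: high for real, low for generated.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.net(x).mean(dim=(1, 2, 3))

# Claims 1 and 9: the first generator maps real normal -> generated abnormal,
# the second maps real abnormal -> generated normal; each generator is paired
# with its own discriminator and trained adversarially.
G_normal_to_abnormal, G_abnormal_to_normal = Generator(), Generator()
D_abnormal, D_normal = Discriminator(), Discriminator()
bce = nn.BCEWithLogitsLoss()

def adversarial_step(G, D, real_src, real_tgt, opt_g, opt_d):
    fake_tgt = G(real_src)
    # Discriminator step: label real target-domain images 1, generated ones 0.
    d_loss = (bce(D(real_tgt), torch.ones(real_tgt.size(0)))
              + bce(D(fake_tgt.detach()), torch.zeros(real_src.size(0))))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator step: try to make the discriminator score the fakes as real.
    g_loss = bce(D(fake_tgt), torch.ones(real_src.size(0)))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return fake_tgt.detach()

# Claims 3 and 11: each generated image is concatenated with the real image
# from which it was generated, and the result is the first training image.
def make_first_training_image(generated, source_real):
    return torch.cat([generated, source_real], dim=3)  # join along the width

# Claims 4 and 12: the preset model chains a text detector and a text
# recognizer, with the detector's result fed to the recognizer.
def preset_model_forward(detector, recognizer, image):
    detection_result = detector(image)
    return recognizer(detection_result)

# Claims 5-7 and 13-15: two training phases. Phase one trains the preset
# model to convergence on generated (first) training images; phase two
# continues from those weights using real abnormal/normal (second) images.
def train_preset_model(preset_model, first_images, second_images, fit_to_convergence):
    fit_to_convergence(preset_model, first_images)   # phase one: generated data
    fit_to_convergence(preset_model, second_images)  # phase two: real data

At the game equilibrium recited in claims 1 and 9, this loop has reached the usual GAN fixed point: the discriminator assigns generated images essentially the same realism score as real ones, which is the condition that it "cannot determine whether a generated image is real or fake". The two-phase schedule of claims 5 to 7 then reads as pre-training the detection-plus-recognition model on generated pairs and continuing training on real pairs.
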
CN202110700875.2A 2021-06-24 2021-06-24 Model training method, using method, device, equipment and storage medium Active CN113516125B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110700875.2A CN113516125B (en) 2021-06-24 2021-06-24 Model training method, using method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113516125A (en) 2021-10-19
CN113516125B (en) 2022-03-11

Family

ID=78066282

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110700875.2A Active CN113516125B (en) 2021-06-24 2021-06-24 Model training method, using method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113516125B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10997463B2 (en) * 2018-11-08 2021-05-04 Adobe Inc. Training text recognition systems
CN111275051A (en) * 2020-02-28 2020-06-12 上海眼控科技股份有限公司 Character recognition method, character recognition device, computer equipment and computer-readable storage medium
CN112580623B (en) * 2020-12-25 2023-07-25 北京百度网讯科技有限公司 Image generation method, model training method, related device and electronic equipment
CN112949706B (en) * 2021-02-25 2024-01-05 平安科技(深圳)有限公司 OCR training data generation method, device, computer equipment and storage medium
CN113011349A (en) * 2021-03-24 2021-06-22 中国工商银行股份有限公司 Element identification method and device of bill and storage medium

Similar Documents

Publication Publication Date Title
CN111310624B (en) Occlusion recognition method, occlusion recognition device, computer equipment and storage medium
CN111738269B (en) Model training method, image processing device, model training apparatus, and storage medium
CN108399406A (en) The method and system of Weakly supervised conspicuousness object detection based on deep learning
CN109492596B (en) Pedestrian detection method and system based on K-means clustering and regional recommendation network
CN111738249B (en) Image detection method, image detection device, electronic equipment and storage medium
CN112183537B (en) Model training method and device, and text region detection method and device
CN112257703B (en) Image recognition method, device, equipment and readable storage medium
CN112990180A (en) Question judging method, device, equipment and storage medium
CN113239914B (en) Classroom student expression recognition and classroom state evaluation method and device
CN116561347B (en) Question recommending method and system based on user learning portrait analysis
WO2023130915A1 (en) Table recognition method and apparatus
CN110135446A (en) Method for text detection and computer storage medium
CN111738270B (en) Model generation method, device, equipment and readable storage medium
CN112036260A (en) Expression recognition method and system for multi-scale sub-block aggregation in natural environment
CN114049515A (en) Image classification method, system, electronic device and storage medium
CN114758199A (en) Training method, device, equipment and storage medium for detection model
CN113570540A (en) Image tampering blind evidence obtaining method based on detection-segmentation architecture
CN113516125B (en) Model training method, using method, device, equipment and storage medium
CN111027551B (en) Image processing method, apparatus and medium
WO2023024898A1 (en) Problem assistance method, problem assistance apparatus and problem assistance system
CN116343007A (en) Target detection method, device, equipment and storage medium
CN112633323B (en) Gesture detection method and system for classroom
CN115311550A (en) Method and device for detecting semantic change of remote sensing image, electronic equipment and storage medium
CN114818859A (en) Method and device for diagnosing condition of heat distribution pipe network, terminal equipment and storage medium
CN114241279A (en) Image-text combined error correction method and device, storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant