CN114462489A - Training method of character recognition model, character recognition method and equipment, electronic equipment and medium - Google Patents

Training method of character recognition model, character recognition method and equipment, electronic equipment and medium

Info

Publication number
CN114462489A
CN114462489A (application CN202111633893.XA)
Authority
CN
China
Prior art keywords
loss
data
training
character recognition
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111633893.XA
Other languages
Chinese (zh)
Inventor
孟闯 (Meng Chuang)
熊剑平 (Xiong Jianping)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202111633893.XA
Publication of CN114462489A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent

Abstract

The application discloses a training method for a character recognition model, a character recognition method and device, an electronic device, and a medium. The method comprises: acquiring labeled data, unlabeled data, and a feedback joint loss of the labeled and unlabeled data, wherein the feedback joint loss is calculated based on the labeled data, the unlabeled data, and a loss function; performing random character perturbation enhancement on the unlabeled data to obtain perturbed unlabeled data; and performing supervised joint semi-supervised training on the character recognition model in training using the labeled data, the feedback joint loss, and the perturbed unlabeled data until the loss function converges, to obtain the trained character recognition model. In this way, the labeled data, the feedback joint loss, and the perturbed unlabeled data realize supervised joint semi-supervised training of the character recognition model in training, improving the model's character recognition capability in general scenes.

Description

Training method of character recognition model, character recognition method and equipment, electronic equipment and medium
Technical Field
The present application relates to the field of text recognition technologies, and in particular to a method for training a text recognition model, a text recognition method and apparatus, an electronic device, and a medium.
Background
Generally, as people's demands on product devices grow, users who employ such devices for character recognition expect the recognition to remain both timely and accurate.
Optical Character Recognition (OCR) has become one of the more important technologies in the field of artificial intelligence, and performing character recognition in general scenes based on OCR is of great significance. Text recognition accuracy in general scenes is related to the size of the data sample, and most scene images are captured by devices such as smartphones and cameras.
At present, when text is recognized in captured images, scene text screening filters out some difficult text samples, further reducing the number of available text samples. Moreover, the text, or the characters within it, often carries no regular semantic information, so serial numbers in scene text cannot be modeled semantically. Real scenes typically require a large amount of manual labeling, recognition of text in each scene is limited and prone to misrecognition, and robustness and generalization are poor.
Disclosure of Invention
In order to solve the above technical problem, the technical solution adopted in a first aspect of the present application is to provide a method for training a character recognition model, including: acquiring labeled data, unlabeled data, and a feedback joint loss of the labeled and unlabeled data, wherein the feedback joint loss is calculated based on the labeled data, the unlabeled data, and a loss function; performing random character perturbation enhancement on the unlabeled data to obtain perturbed unlabeled data; and performing supervised joint semi-supervised training on the character recognition model in training using the labeled data, the feedback joint loss, and the perturbed unlabeled data until the loss function converges, to obtain the trained character recognition model.
In order to solve the above technical problem, a second aspect of the present application provides a recognition device, including: an acquisition module, configured to acquire labeled data, unlabeled data, and a feedback joint loss of the labeled and unlabeled data, wherein the feedback joint loss is calculated based on the labeled data, the unlabeled data, and a loss function; a perturbation enhancement module, configured to perform random character perturbation enhancement on the unlabeled data to obtain perturbed unlabeled data; and a supervised training module, configured to perform supervised joint semi-supervised training on the character recognition model in training using the labeled data, the feedback joint loss, and the perturbed unlabeled data until the loss function converges, to obtain the trained character recognition model.
In order to solve the above technical problem, a third aspect of the present application provides an electronic device, including: a processor and a memory, the memory having stored therein a computer program, the processor being adapted to execute the computer program to implement the method according to the first aspect of the application.
In order to solve the above technical problem, a fourth aspect of the present application provides a computer-readable storage medium, where a computer program is stored, and the computer program is capable of implementing the method of the first aspect of the present application when being executed by a processor.
The beneficial effects of the present application are as follows: to accurately recognize character content in general scenes, the character recognition model is designed, perturbation enhancement is applied to the unlabeled data, and the labeled data, the feedback joint loss, and the perturbed unlabeled data are used to realize supervised joint semi-supervised training of the character recognition model in training, thereby improving the model's text recognition capability in general scenes.
Drawings
To more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a schematic flow chart of a training method for a text recognition model of the present application;
FIG. 2 is a flow chart of a specific framework of the training method for text recognition models according to the present application;
FIG. 3 is a flow chart illustrating a specific structure of the student model of FIG. 2 according to the present application;
FIG. 4 is a flow chart illustrating the structure of the three-branch residual block in FIG. 3 according to the present application;
FIG. 5 is a flowchart illustrating an embodiment of step S12 of FIG. 1;
FIG. 6 is a flow diagram of one specific embodiment of random character perturbation enhancement of FIG. 5;
FIG. 7 is a flowchart illustrating an embodiment of step S13 of FIG. 1;
FIG. 8 is a block diagram illustrating the structure of an embodiment of the text recognition device of the present application;
FIG. 9 is a block diagram illustrating the structure of an embodiment of the electronic device of the present application;
FIG. 10 is a schematic block circuit diagram of an embodiment of a computer-readable storage medium of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
To illustrate the technical solutions of the present application, specific embodiments are described below. The present application provides a method applied to a character recognition model; referring to FIG. 1, a schematic flow chart of the training method for the character recognition model, the method specifically includes the following steps:
s11: acquiring labeled data, unlabeled data and feedback joint loss of the labeled data and the unlabeled data;
generally, text images of various scenes are input into a character recognition model, and if a person uses a label to label the text image in a sample manner, the text image and the label corresponding to the text image can be determined to be collectively referred to as labeled data, while only the text image is determined to be unlabeled data.
In real scenes a large number of unlabeled samples exist, and both labeled and unlabeled data can be input to the character recognition model. This expands the range of input samples and makes full, effective use of the large amount of unlabeled sample data; compared with other scene text recognition methods, a better recognition effect can be achieved.
The character recognition model is designed to accurately recognize character content in general scenes. The model can process the labeled and unlabeled data to obtain the feedback joint loss, which is produced during the back-propagation process. Specifically, the feedback joint loss is calculated based on the labeled data, the unlabeled data, and a loss function, and is used to adjust the character recognition model through feedback.
S12: performing random character perturbation enhancement on the unlabeled data to obtain perturbed unlabeled data;
In general scenes, text recognition accuracy is related to the size of the data sample. When the character recognition model runs, the data volume is sometimes too small and the training result is inaccurate. In that case, the character data can be expanded through random perturbation, for example by floating characters up and down within a range or applying Gaussian perturbation, thereby increasing the amount of character data.
For unlabeled samples, random character perturbation enhancement is added with the aim of increasing the diversity of each character of the text string in the text image. Adding random character perturbation enhancement gives the character recognition model resistance to noise and improves the robustness of its operation.
S13: performing supervised joint semi-supervised training on the character recognition model in training using the labeled data, the feedback joint loss, and the perturbed unlabeled data until the loss function converges, to obtain the trained character recognition model.
Specifically, the labeled data, the feedback joint loss, and the perturbed unlabeled data are input into the character recognition model in training for supervised joint semi-supervised training; once the loss function converges, the trained character recognition model is obtained.
Thus, in the small-sample case where the number of labeled samples is limited, unlabeled samples can be used efficiently. A general-scene character recognition method based on semi-supervised learning is designed, and running the method with the corresponding character recognition model yields an accurate predicted text sequence.
In this way, to accurately recognize character content in general scenes, the character recognition model is designed, perturbation enhancement is applied to the unlabeled data, and the labeled data, the feedback joint loss, and the perturbed unlabeled data are used for supervised joint semi-supervised training of the character recognition model in training, improving the model's text recognition capability in general scenes.
Further, the character recognition model in training includes a student model in training and a teacher model in training. Referring to FIG. 2, a specific framework flow diagram of the training method of the present application: the overall architecture comprises two networks, the student model in training and the teacher model in training, whose network structures are identical. The teacher network's parameters are computed from the student network, while the student network's parameters are obtained through gradient descent on the loss function and back-propagation updates.
As shown in FIG. 2, in the training stage the labeled data and the unlabeled data are jointly input into the character recognition model in training, which comprises the student model in training and the teacher model in training. The student model is trained with supervision, and a Connectionist Temporal Classification (CTC) loss is applied to the labeled data. The CTC loss function resolves the question of whether input and output are aligned, so character-by-character labeling is avoided and samples need only be labeled line by line.
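As an illustration, the following is a minimal sketch of this line-level CTC supervision, assuming PyTorch; the tensor shapes, charset size, and blank index are illustrative assumptions rather than values taken from this application.

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

T, B, C = 32, 4, 100  # time steps, batch size, charset size incl. blank (all illustrative)
log_probs = torch.randn(T, B, C, requires_grad=True).log_softmax(2)  # stand-in for student outputs
targets = torch.randint(1, C, (B, 10))               # line-level labels, no per-character alignment
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 10, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradient for the supervised branch of the student model
```

Note that the targets carry no per-character positions: CTC marginalizes over all alignments, which is what allows line-by-line labeling.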
Mean square error loss training is performed on the unlabeled samples, and the importance of the unlabeled samples is balanced through a suitable coefficient, namely a supervision weight on the mean square error that can be preset manually. When training begins, the weighting coefficient of the mean square error loss between the student model and the teacher model is 0, and the student model must first be supervised-trained on the labeled data to obtain a student model with good recognition performance. As the number of training iterations increases, the weighting coefficient of the mean square error loss grows larger and larger, making full and effective use of the large amount of unlabeled sample data; compared with other scene text recognition methods, a better recognition effect can be achieved using only the limited labeled sample data.
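The application states only that this weighting coefficient starts at 0 and grows with the number of training iterations; the sigmoid-shaped ramp below, common in mean-teacher-style training, is therefore an assumption, as are the function and parameter names.

```python
import math

def consistency_weight(step: int, ramp_steps: int, w_max: float = 1.0) -> float:
    """Weight for the student-teacher mean square error loss: 0 at the
    start, so training is purely supervised, then growing toward w_max
    as training proceeds (the ramp shape is an assumed choice)."""
    t = min(step / float(ramp_steps), 1.0)
    return w_max * math.exp(-5.0 * (1.0 - t) ** 2)
```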
Further, performing supervised joint semi-supervised training on the character recognition model in training using the labeled data, the feedback joint loss, and the perturbed unlabeled data specifically includes the following steps:
inputting the feedback joint loss, the labeled data, and the perturbed unlabeled data into the student model in training for supervised joint semi-supervised training to obtain a first predicted value; and inputting the perturbed unlabeled data into the teacher model in training for semi-supervised training to obtain a second predicted value.
The teacher model's network structure is copied from the student model. The student model participates in both supervised and semi-supervised training, while the teacher model participates only in semi-supervised training.
Referring to FIG. 3 and FIG. 4: FIG. 3 is a schematic flow chart of a specific structure of the student model in FIG. 2 of the present application; FIG. 4 is a flow chart illustrating the structure of the three-branch residual block in FIG. 3 according to the present application.
further, the network structure of the student model includes a three-branch residual block, a pooling layer, and a recurrent neural network (LSTM).
The three-branch residual block builds on a 3 x 3 convolutional neural network by adding a residual branch with a 1 x 1 convolutional neural network and a cross-layer first residual (identity) connection. The three-branch residual block is used to independently represent character sequence features, while the convolutional neural network acquires image information of scene text. As shown in FIG. 4, the input is split into three branches: on the left, the first residual structure, a direct cross-layer weighted skip connection; in the middle, the input to a 3 x 3 convolutional neural network; on the right, the input to a 1 x 1 convolutional neural network. The three branches are weighted and then merged into a single output.
The recurrent neural network models the sequential information of the character sequence characteristics to learn the relationship between characters.
Because the residual structure has multiple branches, multiple gradient-flow paths are added to the character recognition model. As a result, the model can learn more distinctive feature representations in complex general character scenes, problems such as vanishing and exploding gradients in deep character recognition models are alleviated, and convergence of the character recognition model is accelerated.
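The following PyTorch sketch shows one plausible reading of the three-branch residual block described above; the channel counts, batch normalization, activation, and learnable branch weights are assumptions not specified by the application.

```python
import torch
import torch.nn as nn

class ThreeBranchResidualBlock(nn.Module):
    """Three branches merged by a weighted sum: a cross-layer identity
    (first residual) connection, a 3x3 convolution, and a 1x1
    convolution, as in FIG. 4."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv3 = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.conv1 = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.branch_weights = nn.Parameter(torch.ones(3))  # one weight per branch
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.branch_weights
        out = w[0] * x + w[1] * self.conv3(x) + w[2] * self.conv1(x)  # weighted merge
        return self.act(out)
```

Each of the three terms in the merge provides its own gradient path, which is the multi-path property credited above with easing vanishing and exploding gradients.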
Further, the image information includes at least one of the texture, spatial layout, and local details of the scene text image. In scene character recognition, the convolutional neural network can acquire texture, spatial, local-detail, and other information from the scene text image, where the spatial information refers to position information, and the local details refer to the receptive field, that is, the local feature information of the text image captured by the convolution window.
Furthermore, to further optimize the feature learning capability of the scene character recognition model, a second residual structure is added between the last convolutional layer and the first recurrent layer of the student model. It combines the character sequence features extracted by the convolutional neural network, reinforcing the recurrent neural network's learning of semantic information between characters and improving character recognition in general scenes.
Further, referring to FIG. 5 and FIG. 6, where FIG. 5 is a flowchart of an embodiment of step S12 in FIG. 1 and FIG. 6 is a flow diagram of an embodiment of the random character perturbation enhancement in FIG. 5, the enhancement includes the following steps:
s21: acquiring a text image input into a student model as label-free data;
for the unlabeled sample, random character perturbation enhancement is added, the purpose is to increase the diversity of each character of a text character string in an image, specifically, a text image is input, wherein the text image + label becomes labeled data, and only the text image is unlabeled data.
S22: equally dividing the text image into N image sub-blocks, wherein N is a positive integer greater than or equal to 1;
As shown in FIG. 6, the text image is equally divided into N image sub-blocks; specifically, when N is 4, there are 4 image sub-blocks.
S23: initializing the N image sub-blocks along the boundary of the text image to form 2(N+1) reference points, wherein each reference point is given a range circle of radius R with the point's center as the initial origin;
then 2(N +1) reference points are initialized along the upper, lower, left, and right boundaries of the image, each of which sets a range circle having a radius R with the center of the circle as an initial origin. Specifically, when there are 4 image sub-blocks, there are 10 reference points 2(N +1), where R may be preset artificially, and may be set to be greater than or equal to 0, and is set according to requirements, and is not limited herein.
S24: randomly perturbing pixel points within the range circle according to a Gaussian distribution, so as to change the shape and/or distortion of each character in the unlabeled data.
Specifically, the Gaussian distribution adopted is given by formula (1):

p(x) = (1 / (σ·√(2π))) · exp(-(x - μ)² / (2σ²))  (1)

where μ is the mean, σ² is the variance, x is the abscissa of a pixel point, and p(x) is its ordinate. The application adopts the standard Gaussian distribution, i.e., μ = 0 and σ² = 1, which reduces formula (1) to formula (2):

p(x) = (1 / √(2π)) · exp(-x² / 2)  (2)
for example, each character in fig. 6 shows different degrees of distortion and different shapes, but does not affect the semantic information represented by the character, so that the network can have the anti-noise capability and is more robust by adding random character perturbation enhancement.
Further, before the random character perturbation enhancement is performed on the unlabeled data, the method further includes: when the labels are encoded and preprocessed, inserting preset characters between repeated characters, wherein the preset characters are different from the label characters.
Specifically, since a text image may contain repeated characters, during training of the recognition model preset characters can be inserted between repeated characters when the labels are encoded and preprocessed, before the random character perturbation enhancement is applied to the unlabeled data. For example, a blank character may be added between repeated characters, although any other character distinct from the label characters could be used.
In this way, the best path is computed during decoding by selecting the most likely character at each time step; repeated characters are collapsed, and then all special characters are removed from the path, leaving the recognized text.
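A short sketch of this best-path (greedy) decoding follows; the function and argument names are illustrative.

```python
def ctc_best_path_decode(log_probs, blank: int = 0):
    """Pick the most likely class at each time step, collapse runs of
    repeated characters, then drop the blank (the preset character
    inserted between repeats). `log_probs` is a (T, C) NumPy array of
    per-time-step log-probabilities."""
    best = log_probs.argmax(axis=-1)        # most likely character per time step
    decoded, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:    # skip blanks, collapse repeats
            decoded.append(int(idx))
        prev = idx
    return decoded
```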
Further, after the supervised joint semi-supervised training is performed with the perturbed unlabeled data, the labeled data, and the feedback joint loss, referring to FIG. 7, which is a flowchart of an embodiment of step S13 in FIG. 1, the method further includes:
s31: extracting a label with label data;
in the reality scene, often artificially carry out the sample mark to the character one by one, these sample marks the label of data of having the time of the meter, through extracting these labels, can be in student's model, to having the standard that the label data provided the contrast for student's model can have the supervised training to having the label data.
S32: inputting the label, the first predicted value, and the second predicted value into the loss function for loss calculation;
A loss function is built into the character recognition model. Performing loss calculation on the input label, first predicted value, and second predicted value yields a preset loss result, upon which the loss function converges and the trained character recognition model is obtained.
S33: determining the first predicted value or the second predicted value corresponding to the obtained preset loss result as the predicted text sequence.
Specifically, through comparison and selection, the first or second predicted value corresponding to the obtained preset loss result can be determined as the predicted text sequence. The preset loss result may be the minimum loss result, so that the predicted text sequence is closer to the real text sequence.
The preset loss result may correspond to either the first predicted value or the second predicted value.
Further, the loss functions include supervised loss functions and unsupervised loss functions; the supervised loss functions include at least the CTC loss function.
Inputting the label, the first predicted value, and the second predicted value into the loss function for loss calculation includes the following steps:
calling the supervised loss function and fitting the label with the first predicted value to obtain a CTC loss value, so that the CTC loss function converges and a trained student model is obtained;
calling the unsupervised loss function and processing the second predicted value and the first predicted value to obtain a mean square error loss value, so that the unsupervised loss function converges and a trained teacher model is obtained. The difference between the second and first predicted values is smaller than a preset difference and the loss value approaches 0; the main goal is to keep the teacher model's predictions as close as possible to those of the student network.
The feedback joint loss is determined based on the sum of the CTC loss value and the mean square error loss value, as shown in FIG. 2.
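As a sketch, the feedback joint loss can be computed as that sum, assuming PyTorch and the tensor layouts of the earlier CTC example; the argument names are ours, and the teacher's prediction is detached because the teacher is not updated by gradient descent.

```python
import torch.nn.functional as F

def feedback_joint_loss(student_labeled_logp, targets, in_lens, tgt_lens,
                        student_unlabeled_pred, teacher_unlabeled_pred,
                        mse_weight: float):
    """CTC loss on labeled data plus the weighted mean square error
    between student and teacher predictions on perturbed unlabeled data."""
    ctc = F.ctc_loss(student_labeled_logp, targets, in_lens, tgt_lens,
                     blank=0, zero_infinity=True)
    mse = F.mse_loss(student_unlabeled_pred, teacher_unlabeled_pred.detach())
    return ctc + mse_weight * mse
```

Here `mse_weight` is the ramped supervision weight discussed earlier, 0 at the start of training and growing thereafter.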
Still further, the method includes updating the parameters of the student model using loss-function gradient descent and an optimizer; the student model's parameters are obtained through gradient descent on the loss function and back propagation. The teacher model's parameters, in turn, are obtained by updating a moving average function with the student model's parameters, so no historical training results need to be stored.
Specifically, the moving average function is given by formula (3):

θ′_t = α·θ′_{t-1} + (1 - α)·θ_t  (3)

where α is the moving average weight balance coefficient, θ_t is the parameter of the student model at time t, θ′_{t-1} is the parameter of the teacher model at time t-1, and θ′_t is the parameter of the teacher model at time t.
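Formula (3) maps directly onto an in-place parameter update, as sketched below; PyTorch modules for the student and teacher are assumed, and alpha = 0.99 is an illustrative value rather than one given in the application.

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               alpha: float = 0.99):
    """theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t, applied
    in place, so no historical training results need to be stored."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)
```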
Further, updating the parameters of the student model using loss-function gradient descent and an optimizer includes: during back propagation, adjusting the weights and bias terms of the convolutional neural network and of the recurrent neural network in the student model's network structure through the optimizer's algorithm to update the student model's parameters, so that the smaller the CTC loss, the closer the text sequence predicted by the student model is to the real text sequence. That is, the magnitude of the CTC loss is inversely related to the fidelity of the predicted text sequence.
To illustrate the technical solution of the present application, a text recognition method is further provided, including: acquiring a text image; and calling the trained character recognition model to recognize the text image to obtain a predicted text sequence.
Thus, the application provides a general-scene character recognition method based on semi-supervised learning: a student-teacher scene character recognition model is designed based on the three-branch residual block and the residual LSTM, the student model is supervised-trained with labeled data, and the student and teacher models are jointly semi-supervised-trained with unlabeled data.
Therefore, compared with purely supervised learning, the joint semi-supervised approach makes effective use of unlabeled samples given the same number of labeled samples and can greatly improve the character recognition rate in general scenes. For unlabeled samples, adding random character perturbation enhancement strengthens the anti-interference capability of the general-scene text recognition model, greatly improving its robustness.
To explain a technical solution of the present application, a text recognition apparatus is further provided. Referring to FIG. 8, a schematic block diagram of an embodiment of the text recognition apparatus, the apparatus 60 includes: an acquisition module 61, a perturbation enhancement module 62, a supervised training module 63, and a semi-supervised training module 64.
The acquisition module 61 is configured to obtain labeled data, unlabeled data, and the feedback joint loss of the labeled and unlabeled data, where the feedback joint loss is calculated based on the labeled data, the unlabeled data, and a loss function;
the perturbation enhancement module 62 is configured to perform random character perturbation enhancement on the unlabeled data to obtain perturbed unlabeled data;
and the supervised training module 63, working jointly with the semi-supervised training module 64, is configured to perform supervised joint semi-supervised training on the character recognition model in training using the labeled data, the feedback joint loss, and the perturbed unlabeled data until the loss function converges, to obtain the trained character recognition model.
Thus, to accurately recognize character content in general scenes, the character recognition model is designed; the perturbation enhancement module 62 applies perturbation enhancement to the unlabeled data, and the supervised training module 63, combined with the semi-supervised training module 64, uses the labeled data, the feedback joint loss, and the perturbed unlabeled data to realize supervised joint semi-supervised training of the character recognition model in training, improving the trained model's text recognition capability in general scenes.
To explain a technical solution of the present application, an electronic device is further provided, which may be, without limitation, a computer, a mobile phone, or the like. Referring to FIG. 9, a schematic block diagram of an embodiment of the electronic device, the device 7 includes a processor 71 and a memory 72; the memory 72 stores a computer program 721, and the processor 71 is configured to execute the computer program 721 to implement the methods of the embodiments above, which are not repeated here.
In addition, referring to fig. 10, fig. 10 is a schematic circuit block diagram of an embodiment of a computer-readable storage medium of the present application, where the computer-readable storage medium 8 stores a computer program 81, and the computer program 81 can be executed by a processor to implement the method according to the embodiment of the present application, which is not described herein again.
If the functions are implemented as software functional units and sold or used as stand-alone products, they may also be stored in a device having a storage function. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes beyond the prior art, may be embodied wholly or partly as a software product. The software product is stored in a storage device and includes instructions (program data) for causing a computer (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage device includes media such as a USB flash drive, a portable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disc, as well as electronic devices such as computers, mobile phones, notebook computers, tablet computers, and cameras equipped with such storage media.
The description of the execution process of the program data in the device with the storage function may refer to the above description of the method embodiments of the present application, and will not be described herein again.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings, or which are directly or indirectly applied to other related technical fields, are intended to be included within the scope of the present application.

Claims (15)

1. A method for training a character recognition model, the method comprising:
acquiring labeled data, unlabeled data and feedback joint loss of the labeled data and the unlabeled data, wherein the feedback joint loss is calculated based on the labeled data, the unlabeled data and a loss function;
performing character perturbation enhancement on the unlabeled data to obtain perturbed unlabeled data;
and performing supervised joint semi-supervised training on the character recognition model in training by using the labeled data, the feedback joint loss and the perturbed unlabeled data until the loss function converges, to obtain the trained character recognition model.
2. The method of claim 1,
the character recognition model in training comprises a student model in training and a teacher model in training, and the network structure of the teacher model is the same as that of the student model;
and the performing supervised joint semi-supervised training on the character recognition model in training by using the labeled data, the feedback joint loss and the perturbed unlabeled data comprises the following steps:
inputting the feedback joint loss, the labeled data and the perturbed unlabeled data into the student model in training for supervised joint semi-supervised training to obtain a first predicted value;
inputting the perturbed unlabeled data into the teacher model in training for semi-supervised training to obtain a second predicted value.
3. The method of claim 2,
when the loss function is converged, obtaining a trained character recognition model, including:
extracting the label of the labeled data;
and inputting the label, the first predicted value and the second predicted value into the loss function to perform loss calculation, so as to obtain a preset loss result, so that the loss function is converged, and the trained character recognition model is obtained.
4. The method of claim 3,
the loss functions include supervised loss functions and unsupervised loss functions, the supervised loss functions including at least a connection time classification loss function;
the inputting the label, the first predicted value and the second predicted value into the loss function for loss calculation includes:
calling the supervised loss function, fitting the label and the first predicted value to obtain a connection time classification loss value so as to enable the connection time classification loss function to be converged, and obtaining a trained student model;
calling the unsupervised loss function, processing the second predicted value and the first predicted value to obtain a mean square error loss value so as to enable the unsupervised loss function to be converged to obtain a trained teacher model, wherein a difference value between the second predicted value and the first predicted value is smaller than a preset difference value, so that the trained student model and the trained teacher model form the trained character recognition model;
wherein the feedback joint loss is determined based on a sum of the connection time classification loss value and the mean square error loss value.
5. The method according to any one of claims 2 to 4,
the character perturbation enhancement of the unlabeled data comprises the following steps:
acquiring a text image input into the student model as the unlabeled data;
equally dividing the text image into N image sub-blocks, wherein N is a positive integer greater than or equal to 1;
initializing the N image sub-blocks to form 2(N+1) reference points along the boundary of the text image, wherein each reference point sets a range circle with a radius of R, the center of the circle serves as an initial origin, and R is greater than or equal to 0;
and randomly perturbing pixel points in the range circle according to a Gaussian distribution to change the shape and/or distortion of each character in the unlabeled data.
6. The method of claim 5,
before the performing character perturbation enhancement on the unlabeled data, the method further comprises:
and when the label is subjected to encoding preprocessing, inserting preset characters between repeated characters, wherein the preset characters are different from the label.
7. The method of claim 1,
the network structure of the student model comprises a three-branch residual block, a pooling layer and a recurrent neural network;
the three-branch residual block adds, on the basis of a 3 x 3 convolutional neural network, a residual branch of a 1 x 1 convolutional neural network and a cross-layer first residual connection; the three-branch residual block is used for independently representing character sequence features, and the convolutional neural network is used for acquiring image information of scene characters;
and the recurrent neural network performs sequential information modeling on the character sequence features so as to learn the association relationship between characters.
8. The method of claim 7,
and a second residual structure is arranged between the last convolutional layer and the first recurrent layer of the student model, and the second residual structure is used for combining the character sequence features extracted by the convolutional neural network so as to strengthen the recurrent neural network's learning of semantic information between characters.
9. The method of claim 8,
the image information comprises at least one of texture, space and local detail of the scene text image.
10. The method of claim 1,
the method further comprises the following steps:
and updating the parameters of the student model by using a loss function gradient descent and optimizer, wherein the parameters of the teacher model are obtained by updating a moving average function through the parameters of the student model.
11. The method of claim 10,
the updating of the parameters of the student model by using loss function gradient descent and an optimizer comprises:
in the back propagation process, the weights and bias terms of the convolutional neural network and the weights and bias terms of the recurrent neural network in the network structure of the student model are adjusted by the algorithm of the optimizer to update the parameters of the student model.
12. A method for recognizing a character, the method comprising:
acquiring a text image;
invoking the trained character recognition model of any one of claims 1-11 to recognize the text image to obtain a predicted text sequence.
13. A character recognition apparatus characterized by comprising:
an acquisition module, used for acquiring labeled data, unlabeled data and a feedback joint loss of the labeled data and the unlabeled data, wherein the feedback joint loss is calculated based on the labeled data, the unlabeled data and a loss function;
a perturbation enhancement module, used for performing character perturbation enhancement on the unlabeled data to obtain perturbed unlabeled data;
and a supervised training module, working jointly with a semi-supervised training module, used for performing supervised joint semi-supervised training on the character recognition model in training by using the labeled data, the feedback joint loss and the perturbed unlabeled data until the loss function converges, to obtain the trained character recognition model.
14. An electronic device, comprising: a processor and a memory, the memory having stored therein a computer program for execution by the processor to implement the method of any of claims 1-12.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when being executed by a processor, is adapted to carry out the method according to any one of claims 1-12.
CN202111633893.XA 2021-12-29 2021-12-29 Training method of character recognition model, character recognition method and equipment, electronic equipment and medium Pending CN114462489A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111633893.XA CN114462489A (en) 2021-12-29 2021-12-29 Training method of character recognition model, character recognition method and equipment, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111633893.XA CN114462489A (en) 2021-12-29 2021-12-29 Training method of character recognition model, character recognition method and equipment, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN114462489A 2022-05-10

Family

ID=81408048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111633893.XA Pending CN114462489A (en) 2021-12-29 2021-12-29 Training method of character recognition model, character recognition method and equipment, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN114462489A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115115914A (en) * 2022-06-07 2022-09-27 腾讯科技(深圳)有限公司 Information identification method, device and computer readable storage medium
CN115115914B (en) * 2022-06-07 2024-02-27 腾讯科技(深圳)有限公司 Information identification method, apparatus and computer readable storage medium
CN114898186A (en) * 2022-07-12 2022-08-12 中科视语(北京)科技有限公司 Fine-grained image recognition model training method, fine-grained image recognition model training device, image recognition method and image recognition device
CN115081643A (en) * 2022-07-20 2022-09-20 北京瑞莱智慧科技有限公司 Countermeasure sample generation method, related device and storage medium
CN115081643B (en) * 2022-07-20 2022-11-08 北京瑞莱智慧科技有限公司 Confrontation sample generation method, related device and storage medium
CN115565177A (en) * 2022-08-16 2023-01-03 北京百度网讯科技有限公司 Character recognition model training method, character recognition device, character recognition equipment and medium
CN116596916A (en) * 2023-06-09 2023-08-15 北京百度网讯科技有限公司 Training of defect detection model and defect detection method and device
CN117009534A (en) * 2023-10-07 2023-11-07 之江实验室 Text classification method, apparatus, computer device and storage medium
CN117009534B (en) * 2023-10-07 2024-02-13 之江实验室 Text classification method, apparatus, computer device and storage medium
CN117423116A (en) * 2023-12-18 2024-01-19 杭州恒生聚源信息技术有限公司 Training method of text detection model, text detection method and device
CN117423116B (en) * 2023-12-18 2024-03-22 杭州恒生聚源信息技术有限公司 Training method of text detection model, text detection method and device

Similar Documents

Publication Publication Date Title
CN114462489A (en) Training method of character recognition model, character recognition method and equipment, electronic equipment and medium
CN109961034B (en) Video target detection method based on convolution gating cyclic neural unit
CN107506799B (en) Deep neural network-based mining and expanding method and device for categories of development
CN110502976B (en) Training method of text recognition model and related product
CN109948149B (en) Text classification method and device
CN110334589B (en) High-time-sequence 3D neural network action identification method based on hole convolution
CN111061843A (en) Knowledge graph guided false news detection method
CN110555881A (en) Visual SLAM testing method based on convolutional neural network
CN110222178A (en) Text sentiment classification method, device, electronic equipment and readable storage medium storing program for executing
CN109829414B (en) Pedestrian re-identification method based on label uncertainty and human body component model
CN114332578A (en) Image anomaly detection model training method, image anomaly detection method and device
CN114842267A (en) Image classification method and system based on label noise domain self-adaption
CN114139676A (en) Training method of domain adaptive neural network
CN115035418A (en) Remote sensing image semantic segmentation method and system based on improved deep LabV3+ network
CN113723083A (en) Weighted negative supervision text emotion analysis method based on BERT model
CN112749737A (en) Image classification method and device, electronic equipment and storage medium
CN114445684A (en) Method, device and equipment for training lane line segmentation model and storage medium
CN113870863A (en) Voiceprint recognition method and device, storage medium and electronic equipment
CN111783688B (en) Remote sensing image scene classification method based on convolutional neural network
CN117058716A (en) Cross-domain behavior recognition method and device based on image pre-fusion
CN112329830A (en) Passive positioning track data identification method and system based on convolutional neural network and transfer learning
CN107392246A (en) A kind of background modeling method of feature based model to background model distance
CN114495114B (en) Text sequence recognition model calibration method based on CTC decoder
CN115713082A (en) Named entity identification method, device, equipment and storage medium
CN113705317B (en) Image processing model training method, image processing method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination