CN115620304A - Training method of text recognition model, text recognition method and related device - Google Patents

Training method of text recognition model, text recognition method and related device

Info

Publication number
CN115620304A
CN115620304A (Application CN202211256325.7A)
Authority
CN
China
Prior art keywords
image
mask
text
target
text recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211256325.7A
Other languages
Chinese (zh)
Inventor
孟闯
曹莹
陈媛媛
熊剑平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202211256325.7A
Publication of CN115620304A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19147Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1918Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a training method of a text recognition model, a text recognition method and a related device. The method includes: performing mask processing on a first sample text image to obtain a first mask feature of a first mask region image in the first sample text image and a first non-mask region image; encoding the first non-mask region image of the first sample text image by using an encoder of the text recognition model to obtain a first encoding feature; predicting the first mask feature and the first encoding feature to obtain a first text recognition result of the first sample text image; and adjusting parameters of the encoder of the text recognition model based at least on the first text recognition result. In this way, the text recognition effect of the text recognition model can be improved.

Description

Training method of text recognition model, text recognition method and related device
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a training method for a text recognition model, a text recognition method, and a related apparatus.
Background
Natural scenes contain rich text information, for example in scenarios such as card recognition, intelligent short-video subtitle review and industrial serial-number recognition. If this text can be extracted and further processed, it provides a very useful basis and rich information for understanding image semantics.
A precondition for text extraction is acquiring a natural scene image. At present, most natural scene images are captured with hand-held electronic devices such as mobile phones and tablets. Hand shake easily occurs during manual shooting, so the captured images are blurred and the recognition effect on natural scene images is poor.
Disclosure of Invention
The main technical problem solved by the present application is to provide a training method of a text recognition model, a text recognition method and a related device, which can improve the text recognition effect of the text recognition model.
In order to solve the above technical problem, a first aspect of the present application provides a method for training a text recognition model, the method including: performing mask processing on a first sample text image to obtain a first mask feature of a first mask region image in the first sample text image and a first non-mask region image; encoding the first non-mask region image of the first sample text image by using an encoder of the text recognition model to obtain a first encoding feature; predicting the first mask feature and the first encoding feature to obtain a first text recognition result of the first sample text image; and adjusting parameters of the encoder of the text recognition model based at least on the first text recognition result.
In order to solve the above technical problem, a second aspect of the present application provides a text recognition method, including: acquiring a target image; encoding the target image by using an encoder of the text recognition model to obtain target encoding features of the target image; and predicting the target encoding features of the target image by using a prediction module of the text recognition model to obtain a target text in the target image; wherein the text recognition model is obtained by training with the method of the first aspect.
To solve the above technical problem, a third aspect of the present application provides an electronic device, which includes a memory and a processor coupled to each other, wherein the memory stores program instructions; the processor is configured to execute the program instructions stored in the memory to implement the method for training the text recognition model according to the first aspect or to implement the method for text recognition according to the second aspect.
In order to solve the above technical problem, a fourth aspect of the present application provides a computer-readable storage medium for storing program instructions, the program instructions being executable to implement the method for training a text recognition model according to the first aspect or to implement the method for text recognition according to the second aspect.
The beneficial effects of the present application are as follows. Different from the prior art, the present application performs mask processing on the first sample text image during training of the text recognition model to obtain a first mask feature of a first mask region image in the first sample text image and a first non-mask region image; encodes the first non-mask region image of the first sample text image with an encoder of the text recognition model to obtain a first encoding feature; predicts the first mask feature and the first encoding feature to obtain a first text recognition result of the first sample text image; and adjusts parameters of the encoder of the text recognition model based at least on the first text recognition result. Because the parameters of the encoder are adjusted with the first text recognition result obtained by predicting the first mask feature and the first encoding feature, the encoder of the text recognition model can extract features of a text image more accurately even when the image is blurred, and accurate text content is then obtained from the extracted features, thereby improving the recognition effect of the text recognition model.
Drawings
FIG. 1 is a flowchart illustrating a first embodiment of a training method for a text recognition model provided in the present application;
FIG. 2 is a schematic diagram of the position masker provided in the present application determining a fusion feature of a first sample text image;
FIG. 3 is a schematic diagram of an encoder provided herein obtaining a first encoding characteristic;
FIG. 4 is a flowchart illustrating a second embodiment of a method for training a text recognition model provided in the present application;
FIG. 5 is a schematic diagram of an overall framework of a second embodiment of a training method of a text recognition model provided by the present application;
FIG. 6 is a flowchart illustrating a third embodiment of a training method for a text recognition model provided in the present application;
FIG. 7 is a schematic diagram of an overall framework of a third embodiment of a training method of a text recognition model provided by the present application;
FIG. 8 is a flowchart illustrating an embodiment of a text recognition method provided in the present application;
FIG. 9 is a schematic diagram of a frame structure of an embodiment of an electronic device provided in the present application;
FIG. 10 is a block diagram of one embodiment of a computer-readable storage medium provided herein.
Detailed Description
The technical solutions in the embodiments of the present application are clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
It should be noted that, in the embodiments of the present application, there are descriptions related to "first", "second", etc., and the descriptions of "first", "second", etc. are only used for descriptive purposes and are not to be interpreted as indicating or implying relative importance or implicitly indicating the number of indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of the feature.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein may be combined with other embodiments.
Referring to fig. 1-3 in combination, fig. 1 is a schematic flowchart illustrating a first embodiment of a training method for a text recognition model provided by the present application, fig. 2 is a schematic diagram illustrating a position masker determining a fusion feature of a first sample text image provided by the present application, and fig. 3 is a schematic diagram illustrating an encoder obtaining a first encoding feature provided by the present application; the training method of the text recognition model comprises the following steps:
S11: Performing mask processing on the first sample text image to obtain a first mask feature of a first mask region image in the first sample text image and a first non-mask region image.
In one embodiment, step S11 may be performed by a position masker included in the text recognition model, and the first sample text image is labeled with the real text recognition result. The first mask region image in the first sample text image may be determined according to a mask ratio. In a specific embodiment, a mask ratio may be preset, the first sample text image is divided into a plurality of image blocks along a preset direction according to the mask ratio, and at least one image block is randomly masked to obtain the first mask region image. For example, if the preset mask ratio is three fifths, the first sample text image may be divided into five image blocks, and three of the image blocks may be randomly masked. In another specific embodiment, the first sample text image may first be randomly divided into a plurality of image blocks, and at least one image block is then masked based on the mask ratio to obtain the first mask region image. The region of the first sample text image other than the first mask region image is the first non-mask region image.
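For illustration only, a minimal Python/PyTorch sketch of the block division and random masking described above is given below; it is not part of the original disclosure, and the function name split_and_mask, the (C, H, W) tensor layout and the default block count are assumptions:

    import torch

    def split_and_mask(image: torch.Tensor, mask_ratio: float = 3 / 5, num_blocks: int = 5):
        # Divide a sample text image of shape (C, H, W) into image blocks along the width
        # (the preset direction) and randomly mask the fraction given by the mask ratio.
        c, h, w = image.shape
        block_w = w // num_blocks
        blocks = [image[:, :, i * block_w:(i + 1) * block_w] for i in range(num_blocks)]
        num_masked = max(1, int(round(mask_ratio * num_blocks)))
        masked_idx = sorted(torch.randperm(num_blocks)[:num_masked].tolist())
        mask_region = [blocks[i] for i in masked_idx]                                     # mask region image
        non_mask_region = [blocks[i] for i in range(num_blocks) if i not in masked_idx]   # non-mask region image
        return mask_region, non_mask_region, masked_idx

    # Example: with a preset mask ratio of three fifths, three of five blocks are masked.
    sample = torch.rand(3, 32, 100)
    masked_blocks, kept_blocks, masked_positions = split_and_mask(sample)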
While determining the first mask region image and the first non-mask region image in the first sample text image based on the preset mask ratio, the word embedding vector of the first mask region image may also be determined. In a specific embodiment, after the mask ratio is determined, mask ratio information may be obtained. The mask ratio information may include information of multiple dimensions, for example, the mask ratio and the text information corresponding to the first mask region image, and according to the mask ratio information, the embedding layer in the position masker may return the corresponding word embedding vector. In another specific embodiment, the first mask region image may be determined first, the text information contained in that region is obtained, and the corresponding word embedding vector is obtained according to the text information. After the word embedding vector of the first mask region image is obtained, the word embedding vector of the first mask region image and the region feature of the first mask region image are fused to obtain the first mask feature of the first mask region image. The region feature of the first mask region image may be obtained by the position masker performing feature extraction on the first mask region image; alternatively, another device may perform feature extraction on the first sample text image in advance to obtain the image features of the first sample text image, and the features corresponding to the first mask region image are then taken from the image features of the first sample text image to obtain the region feature of the first mask region image. As shown in fig. 2, in a specific embodiment, the word embedding vector of the first mask region image and the region feature of the first mask region image of the first sample text image have different dimensions; the word embedding vector of the first mask region image is mapped to a preset dimension through a fully connected layer, and the word embedding vector of the preset dimension and the region feature of the first mask region image are fused to obtain the first mask feature of the first mask region image. The preset dimension is the dimension of the region feature of the first mask region image. The region feature of the first mask region image may be extracted by a deep convolutional neural network; in an embodiment, the image features of the first sample text image are extracted with a deep convolutional neural network, and the region feature of the first mask region image is obtained from the image features of the first sample text image. In this embodiment, the first sample text image may be the target sample image, and correspondingly, the first mask region image and the first non-mask region image may be the target mask region image and the target non-mask region image.
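As an illustrative sketch of the fusion step above (not from the original disclosure; the class name, the addition-based fusion and the dimensions are assumptions), the word embedding vector can be mapped to the dimension of the region feature by a fully connected layer and then fused with it:

    import torch
    import torch.nn as nn

    class MaskFeatureFusion(nn.Module):
        """Fuses the word embedding vector of a mask region image with its region feature."""
        def __init__(self, embed_dim: int = 128, region_dim: int = 256):
            super().__init__()
            self.proj = nn.Linear(embed_dim, region_dim)  # map embedding to the region-feature dimension

        def forward(self, word_embedding: torch.Tensor, region_feature: torch.Tensor) -> torch.Tensor:
            # word_embedding: (N, embed_dim); region_feature: (N, region_dim)
            return self.proj(word_embedding) + region_feature  # mask feature of the mask region image

    fusion = MaskFeatureFusion()
    first_mask_feature = fusion(torch.rand(4, 128), torch.rand(4, 256))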
S12: Encoding the first non-mask region image of the first sample text image by using an encoder of the text recognition model to obtain a first encoding feature.
In one embodiment, encoding the first non-mask region image of the first sample text image with the encoder of the text recognition model to obtain the first encoding feature includes extracting features of the first non-mask region image with the encoder to obtain a target non-mask feature, which, for ease of distinction, is referred to as the first non-mask feature. As shown in fig. 3, a self-attention calculation is performed on the first non-mask feature to obtain a self-attention feature, and the self-attention feature and the first non-mask feature are fused to obtain the first encoding feature. Specifically, the first non-mask feature is passed through a self-attention layer to obtain the self-attention feature. The self-attention layer may include three fully connected layers: the first non-mask feature is mapped by the three fully connected layers to obtain a query vector, a key vector and a value vector; a dot product of the query vector and the key vector gives a score value; after normalization (for example, by a SoftMax activation function), the score value is multiplied with the value vector to obtain the self-attention coefficient of the first non-mask feature, and the self-attention feature is obtained based on the self-attention coefficient. The self-attention feature and the first non-mask feature are then summed and normalized to obtain the first encoding feature. The self-attention feature obtained here may be multi-dimensional or one-dimensional.
If the first non-mask feature has multiple dimensions, the self-attention coefficient of the first non-mask feature obtained through the self-attention layer may also have multiple dimensions. The self-attention coefficient of each dimension is multiplied by the corresponding first non-mask feature to obtain the self-attention feature of that dimension, and the self-attention feature of that dimension is summed with the corresponding first non-mask feature to obtain a third encoding feature of that dimension. The third encoding features of the multiple dimensions are normalized and then passed through a feedforward neural network to obtain a fourth encoding feature of each dimension; the fourth encoding feature of each dimension and the third encoding feature of each dimension are summed and normalized to obtain the first encoding feature of each dimension. The feedforward neural network may be composed of a plurality of fully connected layers. In this embodiment, the first non-mask region image may be the target non-mask region image, and the first encoding feature may be the target encoding feature.
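The encoder behaviour described above (query/key/value projections from fully connected layers, SoftMax-normalized scores, residual summation and normalization, and a feedforward network) can be sketched as follows; this is an assumed single-head, single-block implementation for illustration only, not the patented encoder itself:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SelfAttentionEncoderBlock(nn.Module):
        def __init__(self, dim: int = 256, ffn_dim: int = 1024):
            super().__init__()
            self.q = nn.Linear(dim, dim)   # three fully connected layers produce
            self.k = nn.Linear(dim, dim)   # the query, key and value vectors
            self.v = nn.Linear(dim, dim)
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)
            self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (B, T, dim) non-mask features extracted from the non-mask region image
            q, k, v = self.q(x), self.k(x), self.v(x)
            scores = q @ k.transpose(-2, -1) / (x.shape[-1] ** 0.5)  # dot-product score values
            attn = F.softmax(scores, dim=-1) @ v                      # self-attention feature
            x = self.norm1(x + attn)                                  # sum and normalize
            return self.norm2(x + self.ffn(x))                        # encoding feature

    encoder_block = SelfAttentionEncoderBlock()
    first_encoding_feature = encoder_block(torch.rand(2, 20, 256))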
S13: Predicting the first mask feature and the first encoding feature to obtain a first text recognition result of the first sample text image.
In an embodiment, a Long Short-Term Memory network (LSTM) may be used to predict the first mask feature and the first encoding feature, so as to obtain the first text recognition result of the first sample text image. It is to be understood that, in other embodiments, other neural networks may be used to predict the first mask feature and the first encoding feature, which is not limited herein. In this embodiment, the first mask feature is added to the text prediction process, so that the text recognition model whose parameters have been adjusted has stronger anti-interference capability; that is, a more accurate recognition result can be obtained when a degraded text image is recognized with the trained text recognition model.
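A hedged sketch of the prediction step is given below: the mask features and the encoding features are concatenated along the sequence dimension and decoded by an LSTM followed by a character classifier; the concatenation order, hidden size and character-set size are assumptions for illustration:

    import torch
    import torch.nn as nn

    class LSTMPredictor(nn.Module):
        def __init__(self, dim: int = 256, hidden: int = 256, num_classes: int = 37):
            super().__init__()
            self.lstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
            self.fc = nn.Linear(2 * hidden, num_classes)  # per-step character logits (class 0 = CTC blank)

        def forward(self, mask_feature: torch.Tensor, encoding_feature: torch.Tensor) -> torch.Tensor:
            sequence = torch.cat([encoding_feature, mask_feature], dim=1)  # (B, T_total, dim)
            out, _ = self.lstm(sequence)
            return self.fc(out)  # logits used to form the text recognition result

    predictor = LSTMPredictor()
    logits = predictor(torch.rand(2, 5, 256), torch.rand(2, 20, 256))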
S14: based on at least the first text recognition result, parameters of an encoder of the text recognition model are adjusted.
In an embodiment, parameters of the encoder may be adjusted based on the first text recognition result. Specifically, the first recognition loss may be obtained from the difference between the first text recognition result and the real text recognition result, and parameters of the encoder of the text recognition model are adjusted based on the first recognition loss. The first recognition loss may be a CTC (Connectionist Temporal Classification) loss. The CTC loss function handles the alignment between input and output, avoiding character-by-character annotation: samples only need to be annotated line by line. During training, a special character is inserted between repeated characters when the label text is encoded, and the weights and bias terms of the network are continuously adjusted by the Adam optimizer (a stochastic optimization method with adaptive momentum) during back propagation; the smaller the CTC loss, the closer the text sequence predicted by the model is to the real text sequence. During decoding, the best path is obtained by selecting the most probable character at each time step; repeated characters are merged, all special characters are then removed from the path, and what remains is the first text recognition result.
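The CTC training and best-path decoding just described can be sketched with PyTorch's built-in CTC loss as follows; the alphabet, the blank index and the helper names are assumptions, not part of the disclosure:

    import torch
    import torch.nn.functional as F

    ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"  # index 0 is reserved for the CTC blank

    def ctc_loss(logits, targets, target_lengths):
        # logits: (B, T, C); F.ctc_loss expects (T, B, C) log-probabilities
        log_probs = F.log_softmax(logits, dim=-1).transpose(0, 1)
        input_lengths = torch.full((logits.size(0),), logits.size(1), dtype=torch.long)
        return F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)

    def greedy_decode(logits):
        # Best path: take the most probable class at each time step, merge repeats, drop blanks.
        best = logits.argmax(dim=-1)  # (B, T)
        texts = []
        for path in best:
            chars, prev = [], -1
            for idx in path.tolist():
                if idx != prev and idx != 0:
                    chars.append(ALPHABET[idx - 1])
                prev = idx
            texts.append("".join(chars))
        return texts

    # Minimal usage with dummy logits and label indices (1..36 map to ALPHABET):
    dummy_logits = torch.rand(2, 25, 37)
    labels = torch.randint(1, 37, (2, 6))
    loss = ctc_loss(dummy_logits, labels, torch.tensor([6, 6]))
    predictions = greedy_decode(dummy_logits)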
In another embodiment, parameters of the encoder may be adjusted based on the first text recognition result and a second text recognition result. Specifically, a first recognition loss is obtained according to the difference between the first text recognition result and the real text recognition result; a second recognition loss is obtained according to the difference between the second text recognition result and the real text recognition result; and parameters of the encoder are adjusted based on the first recognition loss and the second recognition loss. The second text recognition result is obtained by predicting a second encoding feature with a first prediction module of the text recognition model, and the second encoding feature is obtained by encoding the first sample text image with the encoder.
In the above manner, the first sample text image is subjected to mask processing during training of the text recognition model, so that a first mask feature of a first mask region image in the first sample text image and a first non-mask region image are obtained; the first non-mask region image of the first sample text image is encoded with the encoder of the text recognition model to obtain a first encoding feature; the first mask feature and the first encoding feature are predicted to obtain a first text recognition result of the first sample text image; and parameters of the encoder of the text recognition model are adjusted based at least on the first text recognition result. Because parameters of the encoder are adjusted with the first text recognition result obtained by predicting the first mask feature and the first encoding feature, the encoder of the text recognition model can extract features of a text image more accurately even when the image is blurred, and accurate text content is then obtained from the extracted features, thereby improving the recognition effect of the text recognition model.
Referring to fig. 4 and 5 in combination, fig. 4 is a schematic flowchart of a second embodiment of a training method of a text recognition model provided by the present application, and fig. 5 is a schematic overall framework diagram of the second embodiment of the training method of the text recognition model provided by the present application; the method comprises the following steps:
S41: Performing mask processing on the first sample text image to obtain a first mask feature of a first mask region image in the first sample text image and a first non-mask region image.
S42: Encoding the first non-mask region image of the first sample text image by using the encoder of the text recognition model to obtain a first encoding feature.
For the specific implementation of steps S41 to S42, please refer to steps S11 to S12 of the first embodiment of the training method for the text recognition model, which are not described herein again.
S43: Predicting the first mask feature and the first encoding feature to obtain a first text recognition result of the first sample text image.
In one embodiment, step S43 may be performed by the second prediction module. Specifically, the second prediction module predicts the first mask feature and the first encoding feature with the LSTM to obtain the first text recognition result of the first sample text image.
S44: Encoding the first sample text image by using the encoder to obtain a second encoding feature.
In an embodiment, the manner of encoding the first sample text image to obtain the second encoding feature may be the same as the manner of encoding the first non-mask region image of the first sample text image to obtain the first encoding feature, and details thereof are not repeated herein. It is understood that in other embodiments the second encoding feature may be obtained in other manners, which is not limited in detail herein.
S45: Predicting the second encoding feature by using a first prediction module of the text recognition model to obtain a second text recognition result of the first sample text image.
In an embodiment, the first prediction module predicts the second encoding feature with the LSTM to obtain the second text recognition result of the first sample text image.
S46: parameters of the first prediction module are adjusted based on the second text recognition result.
S47: Adjusting parameters of the encoder based on the first text recognition result and the second text recognition result.
In one embodiment, a first recognition loss is obtained based on the difference between the first text recognition result and the real text recognition result, and a second recognition loss is obtained based on the difference between the second text recognition result and the real text recognition result; parameters of the encoder are adjusted based on the first recognition loss and the second recognition loss. Further, after the first recognition loss is obtained, parameters of the position masker and the second prediction module can be adjusted according to the first recognition loss; after the second recognition loss is obtained, parameters of the first prediction module may be adjusted based on the second recognition loss. Both the first recognition loss and the second recognition loss may be CTC losses.
In this embodiment, two branches may be used to train the text recognition model. As shown in fig. 5, the first branch directly uses the encoder to encode the first sample text image to obtain the second encoding feature, and uses the first prediction module of the text recognition model to predict the second encoding feature to obtain the second text recognition result of the first sample text image. A second recognition loss is calculated according to the difference between the second text recognition result and the real text recognition result, and parameters of the encoder and the first prediction module are adjusted according to the second recognition loss. In the second branch, a position masker first applies a random mask to the first sample text image to obtain a first mask feature of a first mask region and a first non-mask region; the encoder encodes the first non-mask region image of the first sample text image to obtain a first encoding feature, and a second prediction module predicts the first mask feature and the first encoding feature to obtain a first text recognition result; a first recognition loss is calculated from the difference between the first text recognition result and the real text recognition result, and parameters of the position masker, the second prediction module and the encoder are adjusted according to the first recognition loss. The encoders of the first branch and the second branch may be the same encoder, that is, the two branches share one encoder. Training the text recognition model with two branches allows the model to use not only the visual texture features of the natural scene image in which the text content is located, but also the language information in the visual context, which implicitly guides the model to accurately recognize text content in complicated scenes such as occlusion and noise.
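A hedged sketch of one dual-branch training step follows. It reuses the ctc_loss helper sketched earlier; the module interfaces (feature_extractor, position_masker, encoder, predictor_1 taking only the encoding feature, predictor_2 taking the mask and encoding features) are hypothetical and not defined by the patent:

    def dual_branch_step(image, targets, target_lengths,
                         feature_extractor, position_masker, encoder,
                         predictor_1, predictor_2, optimizer):
        # Branch 1: encode the full sample text image and predict with the first prediction module.
        second_logits = predictor_1(encoder(feature_extractor(image)))
        second_loss = ctc_loss(second_logits, targets, target_lengths)   # second recognition loss

        # Branch 2: mask the image, encode only the non-mask region, predict with the second
        # prediction module; the third return value (original mask-region pixels) is unused here.
        mask_feature, non_mask_feature, _ = position_masker(image)
        first_logits = predictor_2(mask_feature, encoder(non_mask_feature))
        first_loss = ctc_loss(first_logits, targets, target_lengths)     # first recognition loss

        # Both losses back-propagate into the shared encoder; the first loss also reaches the
        # position masker and second prediction module, the second loss the first prediction module.
        optimizer.zero_grad()
        (first_loss + second_loss).backward()
        optimizer.step()
        return first_loss.item(), second_loss.item()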
Referring to fig. 6 and fig. 7 in combination, fig. 6 is a schematic flowchart of a third embodiment of a training method of a text recognition model provided by the present application, and fig. 7 is a schematic overall framework diagram of the third embodiment of the training method of the text recognition model provided by the present application; the method comprises the following steps:
S61: The position masker and the encoder are pre-trained using the second sample text image.
The second sample text image may be an unlabeled image or a labeled image. In the actual training process, the number of labeled sample text images is limited; in this case, the position masker and the encoder can first be pre-trained with unlabeled sample text images so that they acquire a certain feature extraction capability, and a second, supervised training is then performed with a small number of labeled sample text images so that the text recognition model acquires better text recognition capability.
In one embodiment, the second sample text image is an unlabeled image. The position masker is used to perform mask processing on the second sample text image to obtain a second mask feature of a second mask region image in the second sample text image and a second non-mask region image; the encoder is used to encode the second non-mask region image of the second sample text image to obtain a second encoding feature; a decoder reconstructs pixel information of the second mask region image based on the second mask feature and the second encoding feature to obtain reconstructed pixel information of the second mask region image; and parameters of the position masker, the encoder and the decoder are adjusted based on the original pixel information and the reconstructed pixel information of the second mask region image.
Specifically, a mask ratio may be preset, the second sample text image is divided into a plurality of image blocks along a preset direction according to the mask ratio, and at least one image block is randomly masked to obtain the second mask region image; the region of the second sample text image other than the second mask region image is the second non-mask region image. While determining the second mask region image and the second non-mask region image in the second sample text image based on the preset mask ratio, a word embedding vector corresponding to the second mask region image may be determined. In a specific embodiment, after the mask ratio is determined, mask ratio information may be obtained. The mask ratio information may include information of multiple dimensions, for example, the mask ratio and the text information of the second mask region image, and according to the mask ratio information, the embedding layer in the position masker may return the corresponding word embedding vector. In another specific embodiment, the second mask region image may be determined first, the text information contained in that region is obtained, and the corresponding word embedding vector is obtained according to the text information.
After the word embedding vector of the second mask region image is obtained, the word embedding vector of the second mask region image and the region feature of the second mask region image are fused to obtain the second mask feature of the second mask region image. In a specific embodiment, the word embedding vector of the second mask region image and the region feature of the second mask region image (obtained from the image features of the second sample text image) have different dimensions; the word embedding vector of the second mask region image is mapped to a preset dimension through a fully connected layer, and the word embedding vector of the preset dimension and the region feature of the second mask region image are fused to obtain the second mask feature of the second mask region image. The preset dimension is the dimension of the region feature of the second mask region image. The region feature of the second mask region image may be extracted by a deep convolutional neural network.
Further, the encoder performs feature extraction on the second non-mask region image to obtain a target non-mask feature, which is referred to as the second non-mask feature, and a self-attention calculation is performed on the second non-mask feature to obtain a self-attention feature; the self-attention feature and the second non-mask feature are fused to obtain the second encoding feature. Specifically, the second non-mask feature is mapped through three fully connected layers to obtain a query vector, a key vector and a value vector; a dot product of the query vector and the key vector gives a score value; the score value is normalized by a SoftMax activation function and multiplied with the value vector to obtain the self-attention coefficient of the second non-mask feature; the self-attention coefficient is multiplied by the second non-mask feature to obtain the self-attention feature; and the self-attention feature and the second non-mask feature are summed and normalized to obtain the second encoding feature. The second non-mask feature, the self-attention feature and the second encoding feature may be features of multiple dimensions. In a specific embodiment, the self-attention coefficient of each dimension may be multiplied by the corresponding second non-mask feature to obtain the self-attention feature of that dimension, and the self-attention feature of that dimension is summed with the corresponding second non-mask feature to obtain a fifth encoding feature of that dimension. The fifth encoding features of the multiple dimensions are normalized and then passed through a feedforward neural network to obtain a sixth encoding feature of each dimension, and the sixth encoding feature of each dimension and the fifth encoding feature of each dimension are summed and normalized to obtain the second encoding feature of each dimension.
The second mask feature and the second encoding feature are merged, and the decoder reconstructs pixel information of the second mask region image based on the merged feature to obtain reconstructed pixel information of the second mask region image. A mean square error loss is calculated from the original pixel information and the reconstructed pixel information of the second mask region image, and parameters of the position masker, the encoder and the decoder are adjusted according to the mean square error loss.
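The self-supervised pre-training objective can be sketched as follows; the assumption that the masking step also returns the original pixels of the mask region, and the module interfaces themselves, are illustrative only:

    import torch
    import torch.nn.functional as F

    def pretrain_step(image, position_masker, encoder, decoder, optimizer):
        # The masking step is assumed to return the mask feature, the non-mask feature and the
        # original pixels of the mask region, which serve as the reconstruction target.
        mask_feature, non_mask_feature, mask_pixels = position_masker(image)
        encoding_feature = encoder(non_mask_feature)
        merged = torch.cat([encoding_feature, mask_feature], dim=1)  # merge mask and encoding features
        reconstructed = decoder(merged)                               # reconstructed pixel information
        loss = F.mse_loss(reconstructed, mask_pixels)                 # mean square error loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()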
In this embodiment, the second sample text image may be a target sample image, and correspondingly, the second mask area image and the second non-mask area image may be a target mask area image and a target non-mask area image; the second coding feature may be a target coding feature.
S62: Performing mask processing on the first sample text image to obtain a first mask feature of a first mask region image in the first sample text image and a first non-mask region image.
In an embodiment, the position masker trained in step S61 is used to perform mask processing on the first sample text image, so as to obtain the first mask feature of the first mask region image in the first sample text image and the first non-mask region image. The first sample text image is labeled with a real text recognition result. If the second sample text image is also labeled with a real text recognition result, the first sample text image may be the same as the second sample text image, for example, both may be the target sample image.
S63: Encoding the first non-mask region image in the first sample text image by using the encoder of the text recognition model to obtain a first encoding feature.
S64: Predicting the first mask feature and the first encoding feature to obtain a first text recognition result of the first sample text image.
S65: based on at least the first text recognition result, parameters of an encoder of the text recognition model are adjusted.
For the specific implementation of steps S62 to S65, please refer to steps S11 to S14 of the first embodiment of the training method for the text recognition model, which are not described herein again.
In a specific embodiment, as shown in fig. 7, the first sample text image and the second sample text image are the same and both are labeled with the real text recognition result. The text recognition model is first trained in a self-supervised manner: the position masker of the text recognition model masks the sample text image to obtain a second mask feature of a second mask region image (for example, the regions where the characters L, d and y are located in fig. 7), and the region of the sample text image other than the second mask region image is the second non-mask region image (for example, the regions where the characters i, n, s and a are located in fig. 7); the encoder encodes the second non-mask region image to obtain a second encoding feature, and the second encoding feature and the second mask feature are merged to obtain a merged feature; the decoder performs pixel reconstruction on the second mask region image based on the merged feature to obtain reconstructed pixel information of the second mask region image; a mean square error loss is calculated based on the original pixel information and the reconstructed pixel information of the second mask region image, and parameters of the position masker, the encoder and the decoder are adjusted according to the mean square error loss. The text recognition model is then trained in a supervised manner, for which two branches may be adopted. The first branch uses the trained encoder to encode the sample text image and uses the first prediction module to predict the encoding feature, obtaining a second text recognition result. The second branch uses the position masker to mask the sample text image to obtain a first mask feature corresponding to a first mask region image of the sample text image and a first non-mask region image (which may be the same as or different from the second mask region image and the second non-mask region image of the self-supervised training stage); the encoder encodes the first non-mask region image to obtain a corresponding first encoding feature, and the second prediction module predicts the first encoding feature output by the encoder and the first mask feature output by the position masker to obtain a first text recognition result. Parameters of the encoder are adjusted according to the first text recognition result and the second text recognition result; parameters of the position masker and the second prediction module are adjusted according to the first text recognition result; and parameters of the first prediction module are adjusted according to the second text recognition result. A text recognition model trained in this way can accurately recognize text content even when the text content in a natural scene image is occluded.
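The overall two-stage schedule of fig. 7 (self-supervised pixel-reconstruction pre-training followed by supervised dual-branch training) could be orchestrated roughly as below, reusing the pretrain_step and dual_branch_step sketches above; the data loaders, epoch counts and learning rate are placeholders, not values from the patent:

    import torch

    def train_text_recognition_model(unlabeled_loader, labeled_loader,
                                     feature_extractor, position_masker, encoder, decoder,
                                     predictor_1, predictor_2,
                                     pretrain_epochs=10, finetune_epochs=20, lr=1e-4):
        # Stage 1: self-supervised pre-training of the position masker, encoder and decoder.
        pre_opt = torch.optim.Adam(list(position_masker.parameters())
                                   + list(encoder.parameters())
                                   + list(decoder.parameters()), lr=lr)
        for _ in range(pretrain_epochs):
            for images in unlabeled_loader:
                pretrain_step(images, position_masker, encoder, decoder, pre_opt)

        # Stage 2: supervised dual-branch training with CTC losses on labeled sample text images.
        fine_opt = torch.optim.Adam(list(feature_extractor.parameters())
                                    + list(position_masker.parameters())
                                    + list(encoder.parameters())
                                    + list(predictor_1.parameters())
                                    + list(predictor_2.parameters()), lr=lr)
        for _ in range(finetune_epochs):
            for images, targets, target_lengths in labeled_loader:
                dual_branch_step(images, targets, target_lengths,
                                 feature_extractor, position_masker, encoder,
                                 predictor_1, predictor_2, fine_opt)
        return encoder, predictor_1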
Referring to fig. 8, fig. 8 is a schematic flowchart illustrating an embodiment of a text recognition method provided in the present application, where the method includes:
S81: Acquiring a target image.
The target image may be an image obtained by shooting any natural scene. In one embodiment, the target image includes a text region.
S82: Encoding the target image by using the encoder of the text recognition model to obtain the target encoding features of the target image.
For this step, reference may be made to the above description of encoding the first non-mask region image of the first sample text image with the encoder, which is not repeated here.
S83: Predicting the target encoding features of the target image by using a prediction module of the text recognition model to obtain a target text in the target image.
The prediction module of the text recognition model may be the first prediction module or the second prediction module, and in this embodiment, the first prediction module is used as the prediction module of the text recognition model. The text recognition model is obtained through training by any one of the above-mentioned training method embodiments of the text recognition model. For a specific training method, please refer to any of the above embodiments, which are not repeated herein.
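For illustration, inference with the trained model could look as follows, reusing the greedy_decode helper and the hypothetical component names from the sketches above; it is not an implementation from the disclosure:

    import torch

    @torch.no_grad()
    def recognize_text(target_image, feature_extractor, encoder, predictor_1):
        features = feature_extractor(target_image.unsqueeze(0))  # add a batch dimension
        target_encoding_feature = encoder(features)               # target encoding features of the target image
        logits = predictor_1(target_encoding_feature)             # per-step character logits
        return greedy_decode(logits)[0]                           # target text in the target image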
Referring to fig. 9, fig. 9 is a schematic diagram of a frame structure of an embodiment of an electronic device provided in the present application.
The electronic device 90 comprises a memory 91 and a processor 92 coupled to each other, the memory 91 storing program instructions, and the processor 92 being configured to execute the program instructions stored in the memory 91 to implement the steps of any of the above-described embodiments of the training method for a text recognition model or to implement the steps of the above-described embodiments of the text recognition method. In one particular implementation scenario, the electronic device 90 may include, but is not limited to: a microcomputer, a server, and the electronic device 90 may also include a mobile device such as a notebook computer, a tablet computer, and the like, which is not limited herein.
In particular, the processor 92 is adapted to control itself and the memory 91 to implement the steps of any of the above-described method embodiments. The processor 92 may also be referred to as a CPU (Central Processing Unit). The processor 92 may be an integrated circuit chip having signal processing capabilities. The Processor 92 may also be a general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. In addition, the processor 92 may be collectively implemented by an integrated circuit chip.
Referring to fig. 10, fig. 10 is a block diagram illustrating an embodiment of a computer-readable storage medium provided in the present application.
The computer-readable storage medium 100 stores program instructions 101, and the program instructions 101, when executed by a processor, are configured to implement the steps of any of the above-described embodiments of the text recognition model training method, or implement the steps of the above-described embodiments of the text recognition method.
The computer-readable storage medium 100 may be a medium that can store a computer program, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, or may be a server that stores the computer program, and the server may transmit the stored computer program to another device for operation, or may self-operate the stored computer program.
The foregoing description of the various embodiments is intended to highlight various differences between the embodiments, and the same or similar parts may be referred to each other, and for brevity, will not be described again herein.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, and various media capable of storing program codes.
If the technical scheme of the application relates to personal information, a product applying the technical scheme of the application clearly informs personal information processing rules before processing the personal information, and obtains personal independent consent. If the technical scheme of the application relates to sensitive personal information, before the sensitive personal information is processed, a product applying the technical scheme of the application obtains individual consent and simultaneously meets the requirement of 'explicit consent'. For example, at a personal information collection device such as a camera, a clear and significant identifier is set to inform that the personal information collection range is entered, the personal information is collected, and if the person voluntarily enters the collection range, the person is considered as agreeing to collect the personal information; or on the device for processing the personal information, under the condition of informing the personal information processing rule by using obvious identification/information, obtaining personal authorization in the modes of pop-up window information or asking the person to upload personal information thereof and the like; the personal information processing rule may include information such as a personal information processor, a personal information processing purpose, a processing method, and a type of personal information to be processed.
The above description is only an embodiment of the present application, and is not intended to limit the scope of the present application, and all equivalent structures or equivalent processes performed by the present application and the contents of the attached drawings, which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.

Claims (12)

1. A method for training a text recognition model, the method comprising:
performing mask processing on a first sample text image to obtain a first mask feature and a first non-mask region image of a first mask region image in the first sample text image;
encoding the first non-mask region image of the first sample text image by using an encoder of the text recognition model to obtain a first encoding feature;
predicting the first mask feature and the first encoding feature to obtain a first text recognition result of the first sample text image;
adjusting parameters of the encoder of the text recognition model based at least on the first text recognition result.
2. The method of claim 1, wherein prior to said adjusting parameters of said encoder of said text recognition model based at least on said first text recognition result, said method further comprises:
encoding the first sample text image by using the encoder to obtain a second encoding feature;
predicting the second encoding feature by utilizing a first prediction module of the text recognition model to obtain a second text recognition result of the first sample text image;
adjusting parameters of the first prediction module based on the second text recognition result;
the adjusting parameters of the encoder of the text recognition model based at least on the first text recognition result comprises:
adjusting parameters of the encoder based on the first text recognition result and the second text recognition result.
3. The method of claim 2, wherein the first sample text image is tagged with a real text recognition result;
the adjusting parameters of the encoder based on the first text recognition result and the second text recognition result, and the adjusting parameters of the first prediction module based on the second text recognition result, comprising:
obtaining a first recognition loss based on a difference between the first text recognition result and the real text recognition result, and obtaining a second recognition loss based on a difference between the second text recognition result and the real text recognition result;
adjusting parameters of the encoder based on the first recognition loss and the second recognition loss; and
adjusting parameters of the first prediction module based on the second recognition loss.
4. The method according to claim 3, wherein the masking the first sample text image to obtain the first mask feature and the first non-mask region image of the first mask region image in the first sample text image is performed by using a position masker;
the step of predicting the first mask feature and the first encoding feature to obtain a first text recognition result of the first sample text image is performed by using a second prediction module;
after the deriving a first recognition loss based on the difference between the first text recognition result and the real text recognition result, the method further comprises:
adjusting parameters of the position masker and the second prediction module based on the first recognition loss.
5. The method according to claim 1, wherein the masking the first sample text image to obtain the first mask feature and the first non-mask region image of the first mask region image in the first sample text image is performed by using a position masker;
before the masking the first sample text image to obtain the first mask feature and the first non-mask region image of the first mask region image in the first sample text image, the method further includes:
pre-training the position masker and the encoder with a second sample text image, wherein the second sample text image is an unlabeled image.
6. The method of claim 5, wherein the pre-training the position masker and the encoder with the second sample text image comprises:
masking the second sample text image by using the position masker to obtain a second mask feature of a second mask region image and a second non-mask region image in the second sample text image;
encoding the second non-mask region image of the second sample text image by using the encoder to obtain a second encoding characteristic;
reconstructing pixel information of the second mask region image by using a decoder based on the second mask feature and the second coding feature to obtain reconstructed pixel information of the second mask region image;
adjusting parameters of the position masker, the encoder, and the decoder based on original pixel information and the reconstructed pixel information of the second mask region image.
7. The method according to claim 6, wherein the masking the first sample text image to obtain a first mask feature and a first non-mask region image of a first mask region image in the first sample text image, or the masking the second sample text image to obtain a second mask feature and a second non-mask region image of a second mask region image in the second sample text image, comprises:
determining a target mask area image and a target non-mask area image in a target sample image based on a preset mask proportion, and determining a word embedding vector corresponding to the target mask area image;
fusing a word embedding vector corresponding to the target mask area image and the area characteristics of the target mask area image to obtain the target mask characteristics of the target mask area image;
the target sample image is a first sample text image, the target mask area image is a first mask area image, the target non-mask area image is a first non-mask area image, and the target mask feature is a first mask feature; or the target sample image is a second sample text image, the target mask area image is a second mask area image, the target non-mask area image is a second non-mask area image, and the target mask feature is a second mask feature.
8. The method according to claim 7, wherein the determining the target mask region image and the target non-mask region image in the target sample image based on the preset mask ratio comprises:
dividing the target sample image into a plurality of image blocks along a preset direction based on the preset mask proportion, randomly selecting at least one image block from the plurality of image blocks as the target mask area image, and using the rest image blocks as the target non-mask area image;
the fusing the word embedding vector corresponding to the target mask area image and the area characteristic of the target mask area image to obtain the target mask characteristic of the target mask area image includes:
mapping the word embedding vector into a preset dimension, wherein the preset dimension is the dimension of the region feature;
and fusing the word embedding vector with the preset dimensionality and the region characteristic to obtain the target mask characteristic.
9. The method of claim 6, wherein the encoding, with an encoder of the text recognition model, a first non-masked region image of the first sample text image to obtain first encoding features or encoding, with the encoder, a second non-masked region image of the second sample text image to obtain second encoding features comprises:
performing feature extraction on the target non-mask area image to obtain target non-mask features;
performing self-attention processing on the target non-mask features to obtain self-attention features;
fusing the target non-mask features and the self-attention features to obtain target coding features;
the target non-mask area image is a first non-mask area image, and the target coding feature is a first coding feature, or the target non-mask area image is a second non-mask area image, and the target coding feature is a second coding feature.
10. A method of text recognition, the method comprising:
acquiring a target image;
encoding the target image by using an encoder of a text recognition model to obtain target encoding characteristics of the target image;
predicting the target coding characteristics of the target image by using a prediction module of a text recognition model to obtain a target text in the target image; wherein the text recognition model is a text recognition model trained by the method of any one of claims 1-9.
11. An electronic device comprising a memory and a processor coupled to each other,
the memory stores program instructions;
the processor is configured to execute the program instructions stored in the memory to implement the training method of the text recognition model according to any one of claims 1 to 9 or to implement the text recognition method according to claim 10.
12. A computer-readable storage medium for storing program instructions executable to implement a method of training a text recognition model according to any one of claims 1 to 9 or to implement a method of text recognition according to claim 10.
CN202211256325.7A 2022-10-11 2022-10-11 Training method of text recognition model, text recognition method and related device Pending CN115620304A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211256325.7A CN115620304A (en) 2022-10-11 2022-10-11 Training method of text recognition model, text recognition method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211256325.7A CN115620304A (en) 2022-10-11 2022-10-11 Training method of text recognition model, text recognition method and related device

Publications (1)

Publication Number Publication Date
CN115620304A true CN115620304A (en) 2023-01-17

Family

ID=84862485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211256325.7A Pending CN115620304A (en) 2022-10-11 2022-10-11 Training method of text recognition model, text recognition method and related device

Country Status (1)

Country Link
CN (1) CN115620304A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116383428A (en) * 2023-03-31 2023-07-04 北京百度网讯科技有限公司 Graphic encoder training method, graphic matching method and device
CN116383428B (en) * 2023-03-31 2024-04-05 北京百度网讯科技有限公司 Graphic encoder training method, graphic matching method and device
CN116912636A (en) * 2023-09-12 2023-10-20 深圳须弥云图空间科技有限公司 Target identification method and device
CN116912636B (en) * 2023-09-12 2023-12-12 深圳须弥云图空间科技有限公司 Target identification method and device

Similar Documents

Publication Publication Date Title
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
CN115620304A (en) Training method of text recognition model, text recognition method and related device
CN112084331A (en) Text processing method, text processing device, model training method, model training device, computer equipment and storage medium
Shi et al. Image manipulation detection and localization based on the dual-domain convolutional neural networks
CN109063611B (en) Face recognition result processing method and device based on video semantics
CN111783712A (en) Video processing method, device, equipment and medium
Kohli et al. CNN based localisation of forged region in object‐based forgery for HD videos
CN115204886A (en) Account identification method and device, electronic equipment and storage medium
Mishra Video shot boundary detection using hybrid dual tree complex wavelet transform with Walsh Hadamard transform
Sreeja et al. A unified model for egocentric video summarization: an instance-based approach
CN115062709A (en) Model optimization method, device, equipment, storage medium and program product
Kumar et al. Multiple forgery detection in video using inter-frame correlation distance with dual-threshold
CN112580616B (en) Crowd quantity determination method, device, equipment and storage medium
CN114329004A (en) Digital fingerprint generation method, digital fingerprint generation device, data push method, data push device and storage medium
CN115640449A (en) Media object recommendation method and device, computer equipment and storage medium
CN111797247A (en) Case pushing method and device based on artificial intelligence, electronic equipment and medium
Mishra Hybrid feature extraction and optimized deep convolutional neural network based video shot boundary detection
CN113160987B (en) Health state prediction method, apparatus, computer device and storage medium
CN115269998A (en) Information recommendation method and device, electronic equipment and storage medium
CN114817627A (en) Text-to-video cross-modal retrieval method based on multi-face video representation learning
CN115705756A (en) Motion detection method, motion detection device, computer equipment and storage medium
Gao et al. Crowd counting considering network flow constraints in videos
Hsia et al. Fast search real‐time face recognition based on DCT coefficients distribution
Wang et al. Classifying Video based on Automatic Content Detection Overview
Goh et al. Recognizing hidden emotions from difference image using mean local mapped pattern

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination